Chapter 14. Conclusion and Next Steps

In the Preface, I stated the following learning objective:

By the end of this book, you should be able to conduct exploratory data analysis and hypothesis testing using a programming language.

I sincerely hope you feel this objective has been met, and that you feel confident advancing into further areas of analytics. To end this leg of your analytics journey, I’d like to share some topics to help round out and expand upon what you now know.

Further Slices of the Stack

Chapter 5 covered four major categories of software applications used in data analytics: spreadsheets, programming languages, databases, and BI tools. Because of our focus on the statistically based elements of analytics, we emphasized the first two slices of the stack. Refer back to that chapter for ideas on how the other slices tie in and what to learn about them.

Research Design and Business Experiments

You learned in Chapter 3 that sound data analysis can only follow from sound data collection: as the saying goes, “garbage in, garbage out.” In this book, we’ve assumed our data was collected accurately, was the right data for our analysis, and contained a representative sample. And we’ve been working with well-known datasets often taken from peer-reviewed research, so this is a safe assumption.

But often you can’t be so sure about your data; you may be responsible for both collecting and analyzing it. It’s worth learning more, then, about research design and methods. This field can get quite sophisticated and academic, but it has found practical applications in business experiments. Check out Stefan H. Thomke’s Experimentation Works: The Surprising Power of Business Experiments (Harvard Business Review Press) for an overview of how and why to apply sound research methods to business.

Further Statistical Methods

As Chapter 4 mentioned, we’ve only scratched the surface of the types of statistical tests available, although many of them rest on the same framework of hypothesis testing covered in Chapter 3.

For a conceptual overview of other statistical methods, check out Sarah Boslaugh’s Statistics in a Nutshell, 2nd edition (O’Reilly). Then, head to Practical Statistics for Data Scientists, 2nd edition (O’Reilly) by Peter Bruce et al. to apply them using R and Python. As its title suggests, the latter book straddles the line between statistics and data science.

Data Science and Machine Learning

Chapter 5 reviewed the differences between statistics, data analytics, and data science and concluded that although there are differences in methods, more unites these fields than divides them.

If you are keenly interested in data science and machine learning, focus your learning efforts on R and Python, with some SQL and database knowledge too. To see how R is used in data science, check out R for Data Science by Hadley Wickham and Garrett Grolemund (O’Reilly). For Python, check out Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd edition by Aurélien Géron (O’Reilly).

Version Control

Chapter 5 also mentioned the importance of reproducibility. Let’s look at a key application in that fight. You’ve likely run into a set of files like the following before:

  • proposal.txt

  • proposal-v2.txt

  • proposal-Feb23.txt

  • proposal-final.txt

  • proposal-FINAL-final.txt

Maybe one user created proposal-v2.txt; another started proposal-Feb23.txt. Then there’s the difference between proposal-final.txt and proposal-FINAL-final.txt to contend with. It can be quite difficult to sort out which is the “main” copy, and how to reconstruct and migrate all changes to that copy while keeping a record of who contributed what.

A version control system can come to the rescue here. This is a way to track projects over time, such as the contributions and changes made by different users. Version control is a game changer for collaboration and tracking revisions but has a relatively steep learning curve.
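To make the idea of tracking changes concrete, here is a minimal Python sketch that compares two drafts line by line using the standard library's difflib module. It is only an illustration of the kind of record a version control system keeps, not a picture of how Git works internally. The draft contents and the summarize_changes helper are hypothetical, invented for this example.

```python
import difflib

# Hypothetical contents standing in for proposal.txt and proposal-v2.txt.
proposal_v1 = [
    "Project goal: analyze sales data.",
    "Deadline: March 1.",
]
proposal_v2 = [
    "Project goal: analyze sales data.",
    "Deadline: March 15.",
    "Owner: analytics team.",
]


def summarize_changes(old_lines, new_lines, old_name, new_name):
    """Return unified-diff lines describing how old_lines became new_lines."""
    return list(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile=old_name,
            tofile=new_name,
            lineterm="",  # inputs have no trailing newlines, so add none
        )
    )


changes = summarize_changes(
    proposal_v1, proposal_v2, "proposal.txt", "proposal-v2.txt"
)

# Lines starting with "-" were removed; lines starting with "+" were added.
print("\n".join(changes))
```

A version control system stores this sort of change history for every revision, along with who made each change and when, so there is no need to keep separately named copies of the same file.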

Git is a dominant version control system that is quite popular among data scientists, software engineers, and other technical professionals. In particular, they often use the cloud-based hosting service GitHub to manage Git projects. For an overview of Git and GitHub, check out Version Control with Git, 2nd edition by Jon Loeliger and Matthew McCullough (O’Reilly). For a look at how to pair Git and GitHub with R and RStudio, check out the online resource Happy Git and GitHub for the useR by Jenny Bryan et al. At this time Git and other version control systems are relatively uncommon in data analytics workflows, but they are growing in popularity due in part to the growing demand for reproducibility.

Ethics

From recording and collecting it to analyzing and modeling it, data is surrounded by ethical concerns. In Chapter 3 you learned about statistical bias: especially in a machine learning context, it’s possible that a model could begin to discriminate against groups of people in unjust or illegal ways. If data is being collected about individuals, consideration should be given to those individuals’ privacy and consent.

Ethics hasn’t always been a priority in data analytics and data science. Fortunately, the tide appears to be turning here, and will only continue with sustained community support. For a brief guide on how to incorporate ethical standards into working with data, check out Ethics and Data Science by Mike Loukides et al. (O’Reilly).

Go Forth and Data How You Please

I’m often asked which of these tools one should focus on given employer demand and trending popularity. My answer: take some time to find out what you like, and let those interests shape your learning path rather than trying to tailor it toward the “next big thing” in analytics tools. These skills are all valuable. More important than any one analytics tool is the ability to contextualize and pair those tools, which requires exposure to a broad set of applications. But you can’t become an expert in everything. The best learning strategy will resemble a “T” shape: a wide exposure to various data tools, with relatively deeper knowledge in a handful of them.

Parting Words

Take a moment to look back at everything you’ve accomplished with this book: you should be proud. But don’t linger: there’s so much more to learn, and it won’t take long in your learning journey to realize that this book covered only the tip of the iceberg. Here’s your end-of-chapter and end-of-book exercise: get out there, keep learning, and continue advancing into analytics.
