In this text, you have learned the foundational programming skills necessary for entering the data science field. The ability to write code to work with data empowers you to explore and communicate information in transparent, reusable, and collaborative ways. As many data scientists will attest, the most time-consuming part of a project is organizing and exploring the data—something that you are now more than capable of doing. These skills on their own are quite valuable for gaining insight from quantitative information, but there is always more to learn. If you are eager to expand your skills, there are a few areas that serve as obvious next steps in data science.
The term statistical learning encompasses the statistical and computational techniques used to transform data into information. This book has laid the groundwork for performing these techniques in R, but has not explored specific functions or packages. The aims of statistical learning can typically be reduced to two categories: assessing relationships between variables and making predictions for unobserved values.
The programming skills covered in this book allow you to make comparisons across groups using summary statistics and visualization. However, the book does not discuss statistical assessments for measuring the size or significance of these variations. A multitude of statistical techniques are available in R for assessing the strength of relationships between variables. These techniques address questions such as "Are salaries consistent across genders?" and "Is investment in education associated with lower healthcare costs for a city?" While this text has taught you to perform exploratory data analysis, it did not describe methods for measuring the strength of association between variables. To draw conclusions about causality, and to control for complex relationships in your data, you will need to understand the statistical methods available for answering these questions. Here are a few resources that may help you in this area:
R for Everyone [1] introduces statistical modeling and evaluation in R, including linear and non-linear methods.
[1] Lander, J. P. (2017). R for everyone: Advanced analytics and graphics (2nd ed.). Boston, MA: Addison-Wesley.
An Introduction to Statistical Learning [2] is a more general introduction to statistical learning problems, which includes an implementation in R (though the focus is more conceptual than programming oriented).
[2] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). Springer. http://www-bcf.usc.edu/~gareth/ISL/
OpenIntro Stats [3] is an open source [4] set of texts that focus on the basics of probability and statistics.
[3] Diez, D. M., Barr, C. D., & Cetinkaya-Rundel, M. (2012). OpenIntro statistics. CreateSpace. https://www.openintro.org/stat/textbook.php
[4] OpenIntro Statistics: https://www.openintro.org
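As a minimal sketch of what assessing a relationship between variables can look like in R, the following uses functions from the `stats` package that ships with R. The choice of the built-in `mtcars` data set and of its `mpg`, `am`, and `wt` columns is an assumption made purely for illustration; the resources above explain when such tests are appropriate and how to interpret them.

```r
# Assumption: the built-in `mtcars` data set stands in for your own data

# Compare mean fuel efficiency (mpg) between automatic (am = 0) and
# manual (am = 1) cars using a two-sample t-test
group_test <- t.test(mpg ~ am, data = mtcars)
group_test$p.value # a small value means a difference this large would be
                   # unlikely if both groups truly had the same mean

# Fit a linear model describing how fuel efficiency varies with weight (wt)
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)$coefficients # estimates, standard errors, and p-values
```

Neither result says anything about causality on its own: a significant difference between groups may still be explained by other variables that the model does not include.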
The other major domain of data science is making predictions for unobserved values. This includes questions like "Which students are most likely to pass a course?" and "How is a congressperson likely to vote on a piece of legislation?" Broadly speaking, statistical methods are better suited for assessing relationships, while machine learning techniques are optimized for making predictions. These techniques involve the application of specific algorithms to make predictions based on patterns identified in data (for a wonderful visual introduction to machine learning, see this online interactive tutorial [5]). While a vast amount of domain knowledge is needed to properly select and interpret machine learning algorithms, many of them can be implemented in R in a single line of code using external packages.
[5] A Visual Introduction to Machine Learning: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
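To illustrate the "single line of code" point, here is a hedged sketch that trains a decision tree with the `rpart` package. The package choice, the built-in `iris` data set, and the 100-row training split are all assumptions made for this example, not recommendations from the text. The model fit itself is one line; the rest is the setup needed to check predictions on rows the model has not seen.

```r
library(rpart) # decision trees; distributed with R as a recommended package

# Assumption: the built-in `iris` data set stands in for your own data
set.seed(1) # make the random split reproducible
train_rows <- sample(nrow(iris), size = 100)
train <- iris[train_rows, ]
test <- iris[-train_rows, ]

# The model itself is a single line: predict Species from all other columns
tree <- rpart(Species ~ ., data = train)

# Predict classes for held-out rows and compute the share predicted correctly
predicted <- predict(tree, newdata = test, type = "class")
accuracy <- mean(predicted == test$Species)
accuracy
```

Fitting the model is the easy part. Deciding whether a decision tree suits the problem, and whether accuracy on a held-out sample is the right measure of success, is where the domain knowledge mentioned above comes in.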
R is an excellent language for programmatically working with data (if it wasn’t, we wouldn’t have written a book about it!). Depending on which skills you want to expand, and what techniques your team is using, it may be worth investing in learning another programming language. Luckily, after learning one language, it’s much easier to learn another—you have already practiced the skills of installing software, reading documentation, debugging code, and writing programs. To advance your data science skills, you could invest in learning the following languages:
Python is another popular language for doing data science. Like R, it is open source and has a large community of people contributing to its statistical, machine learning, and visualization packages. Because R and Python largely enable you to solve the same problems in data science, the motivations to learn Python would include collaboration (to work with people who only use Python), curiosity (about how a similar language solves the same problems), and analysis (if a specific sophisticated analysis is only available in a Python package). A great book for learning to program for data science in Python is the Python Data Science Handbook [6].
[6] VanderPlas, J. (2016). Python data science handbook: Essential tools for working with data. O’Reilly Media, Inc.
The power of data science to change the world around us, for better and for worse, has become evident over the past decade. Data science has helped move forward research in a variety of socially impactful domains such as public health and education. At the same time, it has been used to develop systems that systematically disenfranchise groups of people (both intentionally and unintentionally). Algorithms that appear to be unbiased have had profoundly negative impacts. For example, an analysis [9] by ProPublica revealed the racist nature of a piece of software that predicts criminal activity (you can see all of the code on GitHub [10]).
[9] Angwin, J. L. (2016, May 23). Machine bias. ProPublica. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
[10] Machine Bias Analysis, Complete Code: https://github.com/propublica/compas-analysis
Such consequences of unchecked assumptions in data science can be difficult to detect and have outsized effects on people, so tread carefully as you move forward with your newly acquired skills. Remember: you are responsible for the impact of the programs that you write. The analytical and programming skills covered in this text empower you to identify and communicate about the injustices in the world. As a data scientist, you have a moral responsibility to do no harm with your skills (or better yet, to work to undo harms that have occurred in the past and are occurring today). As you begin to work in data science, you must always consider how people will be differentially impacted by your work. Think carefully about who is represented in—and excluded from—your data, what assumptions are built into your analysis, and how any decisions made using your data could differentially benefit different communities—particularly those communities that are often overlooked.
Thank you for reading our book! We hope that it provided inspiration and guidance in your pursuit of data science, and that you use these skills for good.