Moving Forward

In this text, you have learned the foundational programming skills necessary for entering the data science field. The ability to *write code to work with data* empowers you to explore and communicate information in transparent, reusable, and collaborative ways. As many data scientists will attest, the most time-consuming part of a project is organizing and exploring the data—something that you are now more than capable of doing. These skills on their own are quite valuable for gaining insight from quantitative information, but there is always more to learn. If you are eager to expand your skills, there are a few areas that serve as obvious next steps in data science.

The term **statistical learning** encompasses the statistical and computational techniques used to transform data into information. This book has laid the groundwork for performing these techniques in `R`, but has not explored specific functions or packages. The aims of statistical learning can typically be reduced to two categories: assessing relationships between variables and making predictions for unobserved values.

The programming skills covered in this book allow you to make comparisons across groups using summary statistics and visualization. However, the book does not discuss statistical assessments for measuring the size or significance of these variations. A multitude of statistical techniques are available in `R` for assessing the strength of relationships between variables. These techniques address questions such as *Are salaries consistent across genders?* and *Is investment in education associated with lower healthcare costs for a city?* While this text has taught you to perform exploratory data analysis, it did not describe methods for measuring the strength of association between variables. To draw conclusions about **causality**, and to control for complex relationships in your data, you will need to understand the **statistical methods** available for answering these questions. Here are a few resources that may help you in this area:

- *R for Everyone*^{1} introduces statistical modeling and evaluation in `R`, including linear and non-linear methods.
- *An Introduction to Statistical Learning*^{2} is a more general introduction to statistical learning problems, which includes an implementation in `R` (though the focus is more conceptual than programming oriented).
- *OpenIntro Stats*^{3} is an open source^{4} set of texts that focus on the basics of probability and statistics.

^{1}Lander, J. P. (2017). *R for everyone: Advanced analytics and graphics* (2nd ed.). Boston, MA: Addison-Wesley.

^{2}James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). *An introduction to statistical learning* (Vol. 112). Springer. http://www-bcf.usc.edu/~gareth/ISL/

^{3}Diez, D. M., Barr, C. D., & Cetinkaya-Rundel, M. (2012). OpenIntro statistics. CreateSpace. https://www.openintro.org/stat/textbook.php

^{4}**OpenIntro Statistics**: https://www.openintro.org
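To give a flavor of what assessing a relationship can look like in practice, here is a minimal sketch using two functions from base `R`: `t.test()` to compare two groups and `lm()` to fit a linear model. The `mtcars` data set (which ships with `R`) and the variables chosen are assumptions made purely for illustration; they are not examples from this book.

```r
# Illustrative sketch only: `mtcars` and the chosen variables are
# assumptions for this example, not data discussed in the text.

# Compare groups: does mean fuel efficiency (mpg) differ between
# automatic (am = 0) and manual (am = 1) transmissions?
group_test <- t.test(mpg ~ am, data = mtcars)
group_p_value <- group_test$p.value  # p-value for the difference in means

# Measure association: fit a linear model of mpg on weight (wt),
# where wt is recorded in units of 1,000 lbs
model <- lm(mpg ~ wt, data = mtcars)
wt_row <- summary(model)$coefficients["wt", ]

wt_estimate <- wt_row[["Estimate"]]  # estimated change in mpg per 1,000 lbs
wt_p_value <- wt_row[["Pr(>|t|)"]]   # p-value for the weight coefficient

print(group_p_value)
print(wt_estimate)
print(wt_p_value)
```

A small p-value here suggests an association in this particular sample. On its own it does not establish causality, which is why the resources above are worth studying before drawing conclusions.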

The other major domain of data science is making predictions for unobserved values. This includes questions like *Which students are most likely to pass a course?* and *How is a congressperson likely to vote on a piece of legislation?* Broadly speaking, statistical methods are better suited for *assessing relationships*, while **machine learning** techniques are optimized for *making predictions*. These techniques involve the application of specific algorithms to make predictions based on patterns identified in data (for a wonderful visual introduction to machine learning, see this online interactive tutorial^{5}). While a vast amount of domain knowledge is needed to properly select and interpret machine learning algorithms, many of them can be implemented in `R` in a *single line of code* using external packages.

^{5}**A Visual Introduction to Machine Learning**: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
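As a rough sketch of the "single line of code" idea, the example below fits a classification tree with the `rpart` package and uses it to predict values the model has not seen. The choice of `rpart`, the built-in `iris` data set, and the 100/50 train/test split are all assumptions made for illustration; the text does not name a specific package or data set.

```r
# Assumption: the `rpart` package is installed (it is a "recommended"
# package bundled with most R installations)
library(rpart)

set.seed(1)  # make the random train/test split reproducible

# Hold out 50 of the 150 flowers so predictions are made on unseen rows
train_rows <- sample(nrow(iris), size = 100)
train <- iris[train_rows, ]
test <- iris[-train_rows, ]

# The "single line": fit a classification tree that predicts Species
# from all other columns
fit <- rpart(Species ~ ., data = train, method = "class")

# Predict the species of the held-out flowers
predicted <- predict(fit, newdata = test, type = "class")

# Proportion of held-out flowers classified correctly
accuracy <- mean(predicted == test$Species)
print(accuracy)
```

Fitting the model really is one line, but deciding which algorithm suits the problem, how to split the data, and how to interpret the accuracy is where the domain knowledge mentioned above comes in.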

`R` is an excellent language for programmatically working with data (if it wasn’t, we wouldn’t have written a book about it!). Depending on which skills you want to expand, and what techniques your team is using, it may be worth investing in learning another programming language. Luckily, after learning one language, it’s much easier to learn another: you have already practiced the skills of installing software, reading documentation, debugging code, and writing programs. To advance your data science skills, you could invest in learning the following languages:

- **Python** is another popular language for doing data science. Like `R`, it is open source, and has a large community of people contributing to its statistical, machine learning, and visualization packages. Because `R` and Python largely enable you to solve the same problems in data science, the motivations to learn Python would include collaboration (to work with people who *only* use Python), curiosity (about how a similar language solves the same problems), and analysis (if a specific sophisticated analysis is only available in a Python package). A great book for learning to program for data science in Python is the *Python Data Science Handbook*.^{6}
- **Web development technologies** including *HTML*, *CSS*, and *JavaScript* represent a complementary skill set for data scientists. If you are passionate about building visual interfaces on the web for interacting with data, you will likely become frustrated by the limitations of the Shiny framework. Building interactive websites from scratch requires a notable time investment, but it gives you complete control over the style and behavior of your webpages. If you are seriously interested in building custom visualizations, look into using the **d3.js**^{7} JavaScript library, which you can also read about in *Visual storytelling with D3*.^{8}

^{6}VanderPlas, J. (2016). *Python data science handbook: Essential tools for working with data*. O’Reilly Media, Inc.

^{7}**d3.js**: https://d3js.org

^{8}King, R. S. (2014). *Visual storytelling with D3: An introduction to data visualization in JavaScript*. Addison-Wesley.

The power of data science to change the world around us, for better and for worse, has become evident over the past decade. Data science has helped move forward research in a variety of socially impactful domains such as public health and education. At the same time, it has been used to develop systems that systematically disenfranchise groups of people (both intentionally and unintentionally). Algorithms that *appear* to be unbiased have had profoundly negative impacts. For example, an analysis^{9} by ProPublica revealed the racist nature of a piece of risk-assessment software that predicts the likelihood of future criminal activity (you can see all of the code on GitHub^{10}).

^{9}Angwin, J. L. (2016, May 23). Machine bias. ProPublica. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

^{10}**Machine Bias Analysis, Complete Code**: https://github.com/propublica/compas-analysis

Such consequences of unchecked assumptions in data science can be difficult to detect and have outsized effects on people, so tread carefully as you move forward with your newly acquired skills. Remember: *you* are responsible for the impact of the programs that you write. The analytical and programming skills covered in this text empower you to identify and communicate about the injustices in the world. As a data scientist, you have a **moral responsibility** to *do no harm* with your skills (or better yet, to work to *undo* harms that have occurred in the past and are occurring today). As you begin to work in data science, you must *always* consider how people will be differentially impacted by your work. Think carefully about *who* is represented in—and excluded from—your data, what *assumptions* are built into your analysis, and how any decisions made using your data could differentially benefit different communities—particularly those communities that are often overlooked.

Thank you for reading our book! We hope that it provided inspiration and guidance in your pursuit of data science, and that you use these skills for good.
