Preface

In 2013, I didn’t even know the term “data science” existed. I was a Master of Public Health (MPH) student in epidemiology at the time and was already captivated by statistical methods beyond the t-test, ANOVA, and linear regression I knew from my psychology and neuroscience undergraduate background. It was also in the fall of 2013 that I attended my first Software Carpentry workshop and taught my first recitation section as a teaching assistant for my MPH program’s Quantitative Methods course (essentially a combination of a first-semester epidemiology and biostatistics course). I’ve been learning and teaching ever since.

I’ve come a long way since taking my first Introduction to Data Science course, which was taught by Rachel Schutt, PhD; Kayur Patel, PhD; and Jared Lander. They opened my eyes to what was possible. Things that were inconceivable (to me) were actually common practices, and anything I could think of was possible (although I now know that “possible” doesn’t mean “performs well”). The technical details of data science—the coding aspects—were taught by Jared in R. Jared’s friends and colleagues know how much of an aficionado he is of the R language.

At the time, I had been meaning to learn R, but the Python/R language war never breached my consciousness. On the one hand, I saw Python as just a programming language; on the other hand, I had no idea Python had an analytics stack (I’ve come a long way since then). When I learned about the SciPy stack and Pandas, I saw it as a bridge between what I knew how to do in Python from my undergraduate and high school days and what I had learned in my epidemiology studies and through my newly acquired data science knowledge. As I became more proficient in R, I saw the similarities to Python. I also realized that a lot of the data cleaning tasks (and programming in general) involve thinking about how to get what you need—the rest is more or less syntax. It’s important to try to imagine what the steps are and not get bogged down by the programming details. I’ve always been comfortable bouncing around the languages and never gave too much thought to which language was “better.” Having said that, this book is geared toward a newcomer to the Python data analytics world.

This book encapsulates all the people I’ve met, events I’ve attended, and skills I’ve learned over the past few years. One of the more important things I’ve learned (outside of knowing what things are called so Google can take me to the relevant StackOverflow page) is that reading the documentation is essential. As someone who has worked on collaborative lessons and written Python and R libraries, I can assure you that a lot of time and effort go into writing documentation. That’s why I constantly refer to the relevant documentation page throughout this book. Some functions have so many parameters used for varying use cases that it’s impractical to go through each of them. If that were the focus of this book, it might as well be titled Loading Data Into Python. But, as you practice working with data and become more comfortable with the various data structures, you’ll eventually be able to make “educated guesses” about what the output of something will be, even though you’ve never written that particular line of code before. I hope this book gives you a solid foundation to explore on your own and be a self-guided learner.

I met a lot of people and learned a lot from them during the time I was putting this book together. A lot of what I learned dealt with best practices: writing vectorized statements instead of loops, formally testing code, organizing project folder structures, and so on. I also learned a lot about teaching from actually teaching. Teaching really is the best way to learn material. Many of the things I’ve learned in the past few years came to me while I was trying to figure them out in order to teach others. Once you have a basic foundation of knowledge, learning the next bit of information is relatively easy. Repeat the process enough times, and you’ll be surprised how much you actually know. That includes knowing which terms to use in a Google search and how to interpret the StackOverflow answers. Even the very best of us search for answers to our questions. Whether this is your first language or your fourth, I hope this book gives you a solid foundation to build on and learn from, as well as a bridge to other analytics languages.

Breakdown of the Book

This book is organized into five parts plus a set of appendixes.

Part I

Part I aims to be an introduction to Pandas using a realistic data set.

Chapter 1: Starts by using Pandas to load a data set and begin looking at various rows and columns of the data. Here you will get a general sense of the syntax of Python and Pandas. The chapter ends with a series of motivating examples that illustrate what Pandas can do.

Chapter 2: Dives deeper into what the Pandas DataFrame and Series objects are. This chapter also covers boolean subsetting, dropping values, and different ways to import and export data.

Chapter 3: Covers plotting methods using matplotlib, seaborn, and Pandas to create plots for exploratory data analysis.

Part II

Part II focuses on what happens after you load data and need to combine data sets together. It also introduces “tidy data”: a way of structuring data, together with a series of data manipulations aimed at “cleaning” data.

Chapter 4: Focuses on combining data sets, either by concatenating them together or by merging disparate data.

Chapter 5: Covers missing data: how missing values arise, how they can be filled in, and how to work with them, especially what happens when certain calculations are performed on missing values.

Chapter 6: Discusses Hadley Wickham’s “Tidy Data” paper, which deals with reshaping and cleaning common data problems.

Part III

Part III covers the topics needed to clean and munge data.

Chapter 7: Deals with data types and how to convert between different types within DataFrame columns.

Chapter 8: Introduces string manipulation, which is frequently needed as part of the data cleaning task because data are often encoded as text.

Chapter 9: Focuses on applying functions over data, an important skill that encompasses many programming topics. Understanding how apply works will pave the way for more parallel and distributed coding when your data manipulations need to scale.

Chapter 10: Describes groupby operations. These powerful concepts, like apply, are often needed when your data processing has to scale. They are also a great way to efficiently aggregate, transform, or filter your data.

Chapter 11: Explores Pandas’s powerful date and time capabilities.

Part IV

With the data all cleaned and ready, the next step is to fit some models. Models can be used for exploratory purposes, not just for prediction, clustering, and inference. The goal of Part IV is not to teach statistics (there are plenty of books in that realm), but rather to show you how these models are fit and how they interface with Pandas. Part IV can be used as a bridge to fitting models in other languages.

Chapter 12: Linear models are among the simplest models to fit. This chapter covers fitting these models using the statsmodels and sklearn libraries.

Chapter 13: Generalized linear models, as the name suggests, are linear models specified in a more general sense. They allow us to fit models with different response variables, such as binary data or count data. This chapter also covers survival models.

Chapter 14: Now that we have a core set of models we can fit, the next step is to perform model diagnostics to compare multiple models and pick the “best” one.

Chapter 15: Regularization is a technique used when the models we are fitting are too complex or overfit our data.

Chapter 16: Clustering is a technique we use when we don’t know the actual answer within our data, but we need a method to cluster or group “similar” data points together.

Part V

The book concludes with a few points about the larger Python ecosystem, and additional references.

Chapter 17: Quickly summarizes the computation stack in Python, and starts down the path to code performance and scaling.

Chapter 18: Provides some links and references on learning beyond the book.

Appendixes

The appendixes can be thought of as a primer to Python programming. While they are not a complete introduction to Python, the various appendixes do supplement some of the topics throughout the book.

Appendixes A–G: These appendixes cover the tasks related to running Python code: installing Python, executing your scripts from the command line, and organizing your code. They also cover creating Python environments and installing libraries.

Appendixes H–T: These appendixes cover general programming concepts that are relevant to Python and Pandas. They are supplemental references to the main part of the book.

How to Read This Book

Whether you are a newcomer to Python or a fluent Python programmer, this book is meant to be read from the beginning. Educators, or people who plan to use the book for teaching, may also find the order of the chapters to be suitable for a workshop or class.

Newcomers

Absolute newcomers are encouraged to first look through Appendixes A–F, as they explain how to install Python and get it working. After taking these steps, readers will be ready to jump into the main body of the book. The earlier chapters make references to the relevant appendixes as needed. The concept map and objectives found at the beginning of the earlier chapters help organize and prepare the reader for what will be covered in the chapter, as well as point to the relevant appendixes to be read before continuing.

Fluent Python Programmers

Fluent Python programmers may find the first two chapters to be sufficient to get started and grasp the syntax of Pandas; they can then use the rest of the book as a reference. The objectives at the beginning of the earlier chapters point out which topics are covered in the chapter. The chapter on “tidy data” in Part II, and the chapters in Part III, will be particularly helpful in data manipulation.

Instructors

Instructors who want to use the book as a teaching reference may teach each chapter in the order presented. It should take approximately 45 minutes to 1 hour to teach each chapter. I have sought to structure the book so that chapters do not reference future chapters, so as to minimize the cognitive overload for students—but feel free to shuffle the chapters as needed.

Setup

Everyone will have a different setup, so the most up-to-date instructions for setting up an environment to code along with the book can be found in the accompanying GitHub repository:

https://github.com/chendaniely/pandas_for_everyone

Otherwise, see Appendix A for information on how to install Python on your computer.

Getting the Data

The easiest way to get all the data to code along with the book is to download the repository using the following URL:

https://github.com/chendaniely/pandas_for_everyone/archive/master.zip

This will download everything in the repository, as well as provide a folder in which you can put your Python scripts or notebooks. You can also copy the data folder from the repository and put it in a folder of your choosing. The instructions on the GitHub repository will be updated as necessary to facilitate downloading the data for the book.
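
If you prefer to work from the command line, the same download step can be sketched roughly as follows. This is only a suggestion, not part of the book’s official setup; it assumes wget and unzip are available on your system, and that the archive unpacks into a folder named pandas_for_everyone-master (the usual naming for a GitHub branch archive).

   $ # download the repository archive (same URL as above)
   $ wget https://github.com/chendaniely/pandas_for_everyone/archive/master.zip
   $ # unpack it; the contents should end up in pandas_for_everyone-master/
   $ unzip master.zip
   $ cd pandas_for_everyone-master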

Setting up Python

Appendixes F and G cover environments and installing packages, respectively. The following are the commands used to build the book; they should be sufficient to get you started.

   $ conda create -n book python=3.6
   $ source activate book
   $ conda install pandas xlwt openpyxl feather-format seaborn numpy \
       ipython jupyter statsmodels scikit-learn regex \
       wget odo numba
   $ conda install -c conda-forge pweave
   $ pip install lifelines
   $ pip install pandas-datareader
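
As a quick, optional sanity check (my suggestion, not part of the official setup), you can activate the environment and confirm that the core libraries import cleanly:

   $ source activate book
   $ # confirm that Pandas is installed and importable
   $ python -c "import pandas as pd; print(pd.__version__)"
   $ # confirm a few of the other libraries used later in the book
   $ python -c "import statsmodels, sklearn, seaborn, matplotlib"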

Feedback, Please!

Thank you for taking the time to go through this book. If you find any problems, issues, or mistakes within the book, please send me feedback! GitHub issues may be the best place to provide this information, but you can also email me at [email protected]. Just be sure to use the [PFE] tag at the beginning of the subject line so that your email does not get lost among the various listserv emails I receive. If there are topics that you feel should be covered in the book, please let me know. I will try my best to put up a notebook in the GitHub repository and to get it incorporated into a later printing or edition of the book.

Words of encouragement are appreciated.
