Preface

Welcome to Mastering pandas. This book will teach you how to use pandas effectively. pandas is one of the most popular Python packages for data analysis today. The first half of the book starts with the rationale for performing data analysis. It then introduces Python, and pandas in particular, taking you through the installation steps, what pandas is and what it can be used for, the pandas data structures, and how to select, merge, and group data. It then covers handling missing data and time series data, as well as plotting for data visualization.

The second half of this book shows you how to use pandas to perform inferential statistics using the classical and Bayesian approaches. This is followed by a chapter on the pandas architecture, and the book rounds off with a whirlwind tour of machine learning, which introduces the scikit-learn library. The aim of this book is to immerse you in pandas through illustrative examples on real-world datasets.

What this book covers

Chapter 1, Introduction to pandas and Data Analysis, explains the motivation for doing data analysis, introduces the Python language and the pandas library, and discusses how they can be used for data analysis. It also describes the benefits of using pandas for data analysis.

Chapter 2, Installation of pandas and the Supporting Software, gives a detailed description of how to install pandas. It provides installation instructions for multiple operating system platforms: Unix, Mac OS X, and Windows. It also describes how to install supporting software, such as NumPy and IPython.

Chapter 3, The pandas Data Structures, introduces the data structures that form the bedrock of the pandas library. The numpy.ndarray data structure is introduced and discussed first, as it forms the basis for pandas.Series and pandas.DataFrame, the foundational data structures used in pandas. This chapter may be the most important one in the book, as knowledge of these data structures is essential for doing data analysis with pandas.

Chapter 4, Operations in pandas, Part I – Indexing and Selecting, focuses on how to access and select data from the pandas data structures. It discusses the various ways of selecting data via Basic, Label, Integer, and Mixed Indexing. It explains more advanced indexing concepts such as MultiIndex, Boolean indexing, and operations on Index types.

Chapter 5, Operations in pandas, Part II – Grouping, Merging, and Reshaping of Data, tackles the problem of rearranging data in pandas' data structures. The various functions in pandas that enable the user to rearrange data are examined by utilizing them on real-world datasets. This chapter examines the different ways in which data can be rearranged: by aggregation/grouping, merging, concatenating, and reshaping.

Chapter 6, Missing Data, Time Series, and Plotting using Matplotlib, discusses topics that are necessary for the pre-processing of data that is to be used as input for data analysis, prediction, and visualization. These topics include how to handle missing values in the input data, how to handle time series data, and how to use the matplotlib library to plot data for visualization purposes.

Chapter 7, A Tour of Statistics – The Classical Approach, takes you on a brief tour of classical statistics and shows how pandas can be used together with Python's statistical packages to conduct statistical analyses. Various statistical topics are addressed, including statistical inference, measures of central tendency, hypothesis testing, Z- and T-tests, analysis of variance, confidence intervals, and correlation and regression.

Chapter 8, A Brief Tour of Bayesian Statistics, discusses an alternative approach to performing statistical analysis, known as Bayesian analysis. This chapter introduces Bayesian statistics and discusses the underlying mathematical framework. It examines the various probability distributions used in Bayesian analysis and shows how to generate and visualize them using matplotlib and scipy.stats. It also introduces the PyMC library for performing Monte Carlo simulations, and provides a real-world example of conducting Bayesian inference using online data.

Chapter 9, The pandas Library Architecture, provides a fairly detailed description of the code underlying pandas. It breaks down how the pandas library code is organized and describes the various modules that make up pandas. It also has a section that shows you how to improve the performance of Python and pandas using extensions.

Chapter 10, R and pandas Compared, focuses on comparing pandas with R, the statistics package on which much of pandas' functionality is based. This chapter compares R data types with their pandas equivalents and shows how various operations compare in the two. The operations compared include slicing, selection, arithmetic operations, aggregation, group-by, matching, split-apply-combine, and melting.

Chapter 11, Brief Tour of Machine Learning, takes you on a whirlwind tour of machine learning, with a focus on using the pandas library as a tool to preprocess input data for machine learning programs. It also introduces the scikit-learn library, which is the most widely used machine learning toolkit in Python. Various machine learning techniques and algorithms are introduced by applying them to a well-known machine learning classification problem: which passengers survived the sinking of the Titanic?
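To give a flavor of the operations named in the chapter summaries above, here is a minimal sketch that builds a DataFrame, selects rows with Boolean indexing, fills a missing value, and aggregates by group. The dataset and the names df, warm, filled, and means are hypothetical and invented for this illustration; they are not taken from the book's own examples, which use real-world datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Lima", "Lima"],
    "temp": [3.0, np.nan, 19.0, 21.0],
})

# Boolean indexing (Chapter 4): keep rows where temp is above 10.
# Comparisons against NaN evaluate to False, so the missing row is dropped.
warm = df[df["temp"] > 10]

# Missing data (Chapter 6): fill the gap with the column mean,
# which pandas computes over the non-missing values only.
filled = df["temp"].fillna(df["temp"].mean())

# Grouping and aggregation (Chapter 5): mean temperature per city.
means = df.groupby("city")["temp"].mean()

print(warm)
print(filled)
print(means)
```

Filling with the column mean is only one of several possible strategies; it is used here because it keeps the sketch short.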
