Perform advanced data manipulation tasks using pandas and become an expert data analyst.

Key Features

  • Manipulate and analyze your data expertly using the power of pandas
  • Work with missing data and time series data and become a true pandas expert
  • Includes expert tips and techniques on making your data analysis tasks easier

Book Description

pandas is a popular Python library used by data scientists and analysts worldwide to manipulate and analyze their data. This book presents useful data manipulation techniques in pandas to perform complex data analysis in various domains.

This edition updates our highly successful previous edition with new features, examples, and updated code, and serves as an in-depth guide to getting the most out of pandas for data analysis. Designed for both intermediate users and seasoned practitioners, it teaches advanced data manipulation techniques, such as multi-indexing, modifying data structures, and sampling your data, which allow for powerful analysis and help you gain accurate insights. Using an example-based approach, you will apply pandas to different domains, such as Bayesian statistics, predictive analytics, and time series analysis. You will also learn how to prepare powerful, interactive business reports in pandas using the Jupyter notebook.

By the end of this book, you will know how to perform efficient data analysis on complex data using pandas, and you will be on your way to becoming an expert data analyst or data scientist.

What you will learn

  • Speed up your data analysis by importing data into pandas
  • Keep relevant data points by selecting subsets of your data
  • Create a high-quality dataset by cleaning data and fixing missing values
  • Compute actionable analytics with grouping and aggregation in pandas
  • Master time series data analysis in pandas
  • Make powerful reports in pandas using Jupyter notebooks

Who this book is for

This book is for data scientists, analysts, and Python developers who wish to explore advanced data analysis and scientific computing techniques using pandas. A fundamental understanding of Python programming and familiarity with basic data analysis concepts are all you need to get started with this book.

Table of Contents

  1. Title Page
  2. Copyright and Credits
    1. Mastering pandas Second Edition
  3. About Packt
    1. Why subscribe?
  4. Contributors
    1. About the author
    2. About the reviewer
    3. Packt is searching for authors like you
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Download the color images
      3. Conventions used
    4. Get in touch
      1. Reviews
  6. Section 1: Overview of Data Analysis and pandas
  7. Introduction to pandas and Data Analysis
    1. Motivation for data analysis
      1. We live in a big data world
      2. The four V's of big data
        1. Volume of big data
        2. Velocity of big data
        3. Variety of big data
        4. Veracity of big data
      3. So much data, so little time for analysis
      4. The move towards real-time analytics
    2. Data analytics pipeline
    3. How Python and pandas fit into the data analytics pipeline
    4. What is pandas?
    5. Where does pandas fit in the pipeline?
    6. Benefits of using pandas
    7. History of pandas
    8. Usage pattern and adoption of pandas
    9. pandas on the technology adoption curve
    10. Popular applications of pandas
    11. Summary
    12. References
  8. Installation of pandas and Supporting Software
    1. Selecting a version of Python to use
    2. Standalone Python installation
      1. Linux
        1. Installing Python from a compressed tarball
      2. Windows
        1. Core Python installation
        2. Installing third-party Python and packages 
      3. macOS/X
        1. Installation using a package manager
    3. Installation of Python and pandas using Anaconda
      1. What is Anaconda?
      2. Why Anaconda?
      3. Installing Anaconda
        1. Windows Installation
        2. macOS Installation
        3. Linux Installation
        4. Cloud installation
      4. Other numeric and analytics-focused Python distributions
    4. Dependency packages for pandas
    5. Review of items installed with Anaconda
      1. JupyterLab
      2. GlueViz
      3. Walk-through of Jupyter Notebook and Spyder
        1. Jupyter Notebook
        2. Spyder
    6. Cross tooling – combining pandas awesomeness with R, Julia, H2O.ai, and Azure ML Studio
      1. pandas with R
      2. pandas with Azure ML Studio
      3. pandas with Julia
      4. pandas with H2O
    7. Command line tricks for pandas
    8. Options and settings for pandas
    9. Summary
    10. Further reading
  9. Section 2: Data Structures and I/O in pandas
  10. Using NumPy and Data Structures with pandas
    1. NumPy ndarrays
      1. NumPy array creation
        1. Array of ones and zeros
        2. Array based on a numerical range
        3. Random and empty arrays
        4. Arrays based on existing arrays
      2. NumPy data types
      3. NumPy indexing and slicing
        1. Array slicing
        2. Array masking
        3. Complex indexing
      4. Copies and views
      5. Operations
        1. Basic operators
        2. Mathematical operators
        3. Statistical operators
        4. Logical operators
      6. Broadcasting
      7. Array shape manipulation
        1. Reshaping
        2. Transposing
        3. Ravel
        4. Adding a new axis
      8. Basic linear algebra operations
      9. Array sorting
    2. Implementing neural networks with NumPy
    3. Practical applications of multidimensional arrays
      1. Selecting only one channel
      2. Selecting the region of interest of an image
      3. Multiple channel selection and suppressing other channels
    4. Data structures in pandas
      1. Series
        1. Series creation
          1. Using an ndarray
          2. Using a Python dictionary
          3. Using a scalar value
        2. Operations on Series
          1. Assignment
          2. Slicing
        3. Other operations
      2. DataFrames
        1. DataFrame creation
          1. Using a dictionary of Series
          2. Using a dictionary of ndarrays/lists
          3. Using a structured array
          4. Using a list of dictionaries
          5. Using a dictionary of tuples for multilevel indexing
          6. Using a Series
        2. Operations on pandas DataFrames
          1. Column selection
          2. Adding a new column
          3. Deleting columns
        3. Alignment of DataFrames
        4. Other mathematical operations
      3. Panels
        1. Using a 3D NumPy array with axis labels
        2. Using a Python dictionary of DataFrame objects
        3. Using the DataFrame.to_panel method
        4. Other operations
    5. Summary
    6. References
  11. I/Os of Different Data Formats with pandas
    1. Data sources and pandas methods
    2. CSV and TXT
      1. Reading CSV and TXT files
        1. Reading a CSV file
        2. Specifying column names for a dataset
        3. Reading from a string of data
        4. Skipping certain rows
        5. Row index
        6. Reading a text file
        7. Subsetting while reading
        8. Reading thousand format numbers as numbers
        9. Indexing and multi-indexing
        10. Reading large files in chunks
      2. Handling delimiter characters in column data
      3. Writing to a CSV
    3. Excel
    4. URL and S3
    5. HTML
      1. Writing to an HTML file
    6. JSON
      1. Writing a JSON to a file
      2. Reading a JSON
      3. Writing JSON to a DataFrame
      4. Subsetting a JSON
      5. Looping over JSON keys
    7. Reading HDF formats
    8. Reading feather files
    9. Reading parquet files
    10. Reading a SQL file
    11. Reading a SAS/Stata file
    12. Reading from Google BigQuery
    13. Reading from a clipboard
    14. Managing sparse data
    15. Writing JSON objects to a file
    16. Serialization/deserialization
    17. Writing to exotic file types
      1. to_pickle()
      2. to_parquet()
      3. to_hdf()
      4. to_sql()
      5. to_feather()
      6. to_html()
      7. to_msgpack()
      8. to_latex()
      9. to_stata()
      10. to_clipboard()
    18. GeoPandas
      1. What is geospatial data?
      2. Installation and dependencies
      3. Working with GeoPandas
      4. GeoDataFrames
    19. Open source APIs – Quandl
      1. read_sql_query
    20. pandas plotting
      1. Andrews curves
      2. Parallel plot
      3. Radviz plots
      4. Scatter matrix plot
      5. Lag plot
      6. Bootstrap plot
    21. pandas-datareader
      1. Yahoo Finance
      2. World Bank
    22. Summary
  12. Section 3: Mastering Different Data Operations in pandas
  13. Indexing and Selecting in pandas
    1. Basic indexing
      1. Accessing attributes using the dot operator
      2. Range slicing
    2. Labels, integer, and mixed indexing
      1. Label-oriented indexing
      2. Integer-oriented indexing
      3. The .iat and .at operators
      4. Mixed indexing with the .ix operator
    3. Multi-indexing
      1. Swapping and re-ordering levels
      2. Cross-sections
    4. Boolean indexing
      1. The isin and any/all methods
      2. Using the where() method
    5. Operations on indexes
    6. Summary
  14. Grouping, Merging, and Reshaping Data in pandas
    1. Grouping data
      1. The groupby operation
      2. Using groupby with a MultiIndex
      3. Using the aggregate method
      4. Applying multiple functions
      5. The transform() method
      6. Filtering
    2. Merging and joining
      1. The concat function
      2. Using append
        1. Appending a single row to a DataFrame
      3. SQL-like merging/joining of DataFrame objects
      4. The join function
    3. Pivots and reshaping data
      1. Stacking and unstacking
        1. The stack() function
        2. The unstack() function
    4. Other methods for reshaping DataFrames
      1. Using the melt function
      2. The pandas.get_dummies() function
      3. pivot table
      4. Transpose in pandas
      5. Squeeze
      6. nsmallest and nlargest
    5. Summary
  15. Special Data Operations in pandas
    1. Writing and applying one-liner custom functions
      1. lambda and apply
    2. Handling missing values
      1. Sources of missing values
        1. Data extraction 
        2. Data collection 
        3. Data missing at random 
        4. Data not missing at random 
      2. Different types of missing values
      3. Miscellaneous analysis of missing values
      4. Strategies for handling missing values
        1. Deletion 
        2. Imputation
        3. Interpolation 
        4. KNN 
    3. A survey of methods on series
      1. The items() method
      2. The keys() method
      3. The pop() method
      4. The apply() method
      5. The map() method
      6. The drop() method
      7. The equals() method
      8. The sample() method
      9. The ravel() function
      10. The value_counts() function
      11. The interpolate() function
      12. The align() function
    4. pandas string methods
      1. upper(), lower(), capitalize(), title(), and swapcase()
      2. contains(), find(), and replace()
      3. strip() and split()
      4. startswith() and endswith()
      5. The is...() functions
    5. Binary operations on DataFrames and series
    6. Binning values
    7. Using mathematical methods on DataFrames
      1. The abs() function
      2. corr() and cov()
      3. cummax(), cummin(), cumsum(), and cumprod()
      4. The describe() function
      5. The diff() function
      6. The rank() function
      7. The quantile() function
      8. The round() function
      9. The pct_change() function
      10. min(), max(), median(), mean(), and mode()
      11. all() and any()
      12. The clip() function
      13. The count() function
    8. Summary
  16. Time Series and Plotting Using Matplotlib
    1. Handling time series data
      1. Reading in time series data
      2. Assigning date indexes and subsetting in time series data
      3. Plotting the time series data
      4. Resampling and rolling of the time series data
      5. Separating timestamp components
      6. DateOffset and TimeDelta objects
      7. Time series-related instance methods
        1. Shifting/lagging
        2. Frequency conversion
        3. Resampling of data
      8. Aliases for time series frequencies
      9. Time series concepts and datatypes
        1. Period and PeriodIndex
          1. PeriodIndex
      10. Conversion between time series datatypes
    2. A summary of time series-related objects
      1. Interconversions between strings and timestamps
      2. Data-processing techniques for time series data
        1. Data transformation
    3. Plotting using matplotlib
    4. Summary
  17. Section 4: Going a Step Beyond with pandas
  18. Making Powerful Reports In Jupyter Using pandas
    1. pandas styling
      1. In-built styling options
      2. User-defined styling options
    2. Navigating Jupyter Notebook
      1. Exploring the menu bar of Jupyter Notebook
        1. Edit mode and command mode
      2. Mouse navigation
      3. Jupyter Notebook Dashboard
      4. Ipywidgets
      5. Interactive visualizations
      6. Writing mathematical equations in Jupyter Notebook
      7. Formatting text in Jupyter Notebook
        1. Headers
        2. Bold and italics
        3. Alignment
        4. Font color
        5. Bulleted lists
        6. Tables
        7. Tables
        8. HTML
        9. Citation
      8. Miscellaneous operations in Jupyter Notebook
        1. Loading an image
        2. Hyperlinks
        3. Writing to a Python file
        4. Running a Python file
        5. Loading a Python file
        6. Internal Links
      9. Sharing Jupyter Notebook reports
        1. Using NbViewer
        2. Using the browser
        3. Using Jupyter Hub
    3. Summary
  19. A Tour of Statistics with pandas and NumPy
    1. Descriptive statistics versus inferential statistics
    2. Measures of central tendency and variability
      1. Measures of central tendency
        1. The mean
        2. The median
        3. The mode
        4. Computing the measures of central tendency of a dataset in Python
      2. Measures of variability, dispersion, or spread
        1. Range
        2. Quartile
        3. Deviation and variance
    3. Hypothesis testing – the null and alternative hypotheses
      1. The null and alternative hypotheses
        1. The alpha and p-values
        2. Type I and Type II errors
      2. Statistical hypothesis tests
        1. Background
        2. The z-test
        3. The t-test
          1. Types of t-tests
        4. A t-test example
        5. chi-square test
      3. ANOVA test
      4. Confidence intervals
        1. An illustrative example
      5. Correlation and linear regression
        1. Correlation
        2. Linear regression
        3. An illustrative example
    4. Summary
  20. A Brief Tour of Bayesian Statistics and Maximum Likelihood Estimates
    1. Introduction to Bayesian statistics
    2. The mathematical framework for Bayesian statistics
      1. Bayes' theory and odds
      2. Applications of Bayesian statistics
    3. Probability distributions
      1. Fitting a distribution
        1. Discrete probability distributions
        2. Discrete uniform distribution
          1. The Bernoulli distribution
          2. The binomial distribution
          3. The Poisson distribution
          4. The geometric distribution
          5. The negative binomial distribution
        3. Continuous probability distributions
          1. The continuous uniform distribution
          2. The exponential distribution
          3. The normal distribution
    4. Bayesian statistics versus frequentist statistics
      1. What is probability?
      2. How the model is defined
      3. Confidence (frequentist) versus credible (Bayesian) intervals
    5. Conducting Bayesian statistical analysis
    6. Monte Carlo estimation of the likelihood function and PyMC
      1. Bayesian analysis example – switchpoint detection
      2. Maximum likelihood estimate
        1. MLE calculation examples
          1. Uniform distribution
          2. Poisson distribution
    7. References
    8. Summary
  21. Data Case Studies Using pandas
    1. End-to-end exploratory data analysis
      1. Data overview
      2. Feature selection
      3. Feature extraction
      4. Data aggregation
    2. Web scraping with Python
      1. Web scraping using pandas
      2. Web scraping using BeautifulSoup
    3. Data validation
      1. Data overview
      2. Structured databases versus unstructured databases
      3. Validating data types
      4. Validating dimensions
      5. Validating individual entries
        1. Using pandas indexing
        2. Using loops
    4. Summary
  22. The pandas Library Architecture
    1. Understanding the pandas file hierarchy
      1. Description of pandas modules and files
        1. pandas/core
        2. pandas/io
        3. pandas/tools
        4. pandas/util
        5. pandas/tests
        6. pandas/compat
        7. pandas/computation
        8. pandas/plotting
        9. pandas/tseries
    2. Improving performance using Python extensions
    3. Summary
  23. pandas Compared with Other Tools
    1. Comparison with R
      1. Data types in R
      2. R lists
      3. R DataFrames
    2. Slicing and selection
      1. Comparing R-matrix and NumPy array
      2. Comparing R lists and pandas series
        1. Specifying a column name in R
        2. Specifying a column name in pandas
      3. R DataFrames versus pandas DataFrames
        1. Multi-column selection in R
        2. Multi-column selection in pandas
      4. Arithmetic operations on columns
      5. Aggregation and GroupBy
        1. Aggregation in R
        2. The pandas GroupBy operator
      6. Comparing matching operators in R and pandas
        1. R %in% operator
        2. The pandas isin() function
      7. Logical subsetting
        1. Logical subsetting in R
        2. Logical subsetting in pandas
      8. Split-apply-combine
        1. Implementation in R
        2. Implementation in pandas
      9. Reshaping using melt
        1. R melt function
        2. The pandas melt function
      10. Categorical data
        1. R example using cut()
        2. The pandas solution
    3. Comparison with SQL
      1. SELECT
        1. SQL
        2. pandas
      2. Where
        1. SQL
        2. pandas
        3. SQL
        4. pandas
        5. SQL
        6. pandas
      3. group by
        1. SQL
        2. pandas
        3. SQL
        4. pandas
        5. SQL
        6. pandas
      4. update
        1. SQL
        2. pandas
      5. delete
        1. SQL
        2. pandas
      6. JOIN
        1. SQL
        2. pandas
        3. SQL
        4. pandas
        5. SQL
        6. pandas
    4. Comparison with SAS
    5. Summary
  24. A Brief Tour of Machine Learning
    1. The role of pandas in machine learning
    2. Installation of scikit-learn
      1. Installing via Anaconda
      2. Installing on Unix (Linux/macOS)
      3. Installing on Windows
    3. Introduction to machine learning
      1. Supervised versus unsupervised learning
      2. Illustration using document classification
        1. Supervised learning
        2. Unsupervised learning
      3. How machine learning systems learn
    4. Application of machine learning – Kaggle Titanic competition
      1. The Titanic: Machine Learning from Disaster problem
      2. The problem of overfitting
    5. Data analysis and preprocessing using pandas
      1. Examining the data
      2. Handling missing values
    6. A naive approach to the Titanic problem
    7. The scikit-learn ML/classifier interface
    8. Supervised learning algorithms
      1. Constructing a model using Patsy for scikit-learn
      2. General boilerplate code explanation
      3. Logistic regression
      4. Support vector machine
      5. Decision trees
      6. Random forest
    9. Unsupervised learning algorithms
      1. Dimensionality reduction
      2. K-means clustering
      3. XGBoost case study
      4. Entropy
    10. Summary
  25. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think