
Gain hands-on experience with Python programming and industry-standard machine learning techniques using pandas, scikit-learn, and XGBoost

Key Features

  • Think critically about data and use it to form and test a hypothesis
  • Choose an appropriate machine learning model and train it on your data
  • Communicate data-driven insights with confidence and clarity

Book Description

If data is the new oil, then machine learning is the drill. As companies gain access to ever-increasing quantities of raw data, the ability to deliver state-of-the-art predictive models that support business decision-making becomes more and more valuable.

In this book, you'll work on an end-to-end project based on a realistic dataset, split into bite-sized practical exercises. This case-study approach simulates the working conditions you'll experience in real-world data science projects.

You'll learn how to use key Python packages, including pandas, Matplotlib, and scikit-learn, and master the process of data exploration and processing, before moving on to fitting, evaluating, and tuning algorithms such as regularized logistic regression and random forest.
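As a taste of that workflow, here is a minimal sketch of loading data with pandas and fitting a regularized logistic regression with scikit-learn. The file name and column names are hypothetical placeholders, not the book's actual case study data.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Load and inspect the data with pandas
# ('case_study.csv' and the 'default' column are hypothetical)
df = pd.read_csv('case_study.csv')
print(df.describe())

# Separate the features from the binary response variable
X = df.drop(columns=['default'])
y = df['default']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit a regularized logistic regression (C controls the L2 penalty strength)
model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on the held-out test set with the ROC AUC metric
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```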

Now in its second edition, this book will take you through the end-to-end process of exploring data and delivering machine learning models. Updated for 2021, this edition includes brand-new content on XGBoost, SHAP values, algorithmic fairness, and the ethical concerns of deploying a model in the real world.
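For a flavor of that new material, the following sketch trains an XGBoost classifier and explains its predictions with SHAP values. The synthetic data is a stand-in for the book's case study, and the hyperparameter values are illustrative assumptions.

```python
import pandas as pd
import shap
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the case study data
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(10)])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a gradient boosted ensemble of decision trees
model = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)

# SHAP values attribute each prediction to the individual features
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Summary plot: overall feature importance and direction of effect
shap.summary_plot(shap_values, X_test)
```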

By the end of this data science book, you'll have the skills, understanding, and confidence to build your own machine learning models and gain insights from real data.

What you will learn

  • Load, explore, and process data using the pandas Python package
  • Use Matplotlib to create compelling data visualizations
  • Implement predictive machine learning models with scikit-learn
  • Use lasso and ridge regression to reduce model overfitting (see the sketch after this list)
  • Evaluate random forest and logistic regression model performance
  • Deliver business insights by presenting clear, convincing conclusions
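To illustrate the regularization and visualization points above, here is a minimal sketch comparing L1 (lasso-style) and L2 (ridge-style) penalties for logistic regression, plotting the fitted coefficients with Matplotlib. The synthetic data and the C value are assumptions for demonstration only.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic classification data with only a few informative features
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# L1 drives many coefficients exactly to zero (implicit feature selection);
# L2 shrinks all coefficients smoothly toward zero
lasso = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
ridge = LogisticRegression(penalty='l2', C=0.1)
lasso.fit(X, y)
ridge.fit(X, y)

# Visualize the two sets of fitted coefficients with Matplotlib
plt.plot(lasso.coef_.ravel(), 'o', label='L1 (lasso)')
plt.plot(ridge.coef_.ravel(), 'x', label='L2 (ridge)')
plt.xlabel('Feature index')
plt.ylabel('Coefficient value')
plt.legend()
plt.show()
```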

Who this book is for

Data Science Projects with Python – Second Edition is for anyone who wants to get started with data science and machine learning. If you're keen to advance your career by using data analysis and predictive modeling to generate business insights, then this book is the perfect place to begin. To quickly grasp the concepts covered, it is recommended that you have basic experience programming in Python or a similar language, and a general interest in statistics.

Table of Contents

  1. Data Science Projects with Python
  2. Second Edition
  3. Preface
    1. About the Book
    2. About the Author
    3. Objectives
    4. Audience
    5. Approach
    6. About the Chapters
    7. Hardware Requirements
    8. Software Requirements
    9. Installation and Setup
    10. Code Bundle
    11. Anaconda and Setting up Your Environment
    12. Conventions
    13. Code Presentation
    14. Get in Touch
    15. Please Leave a Review
  4. 1. Data Exploration and Cleaning
    1. Introduction
    2. Python and the Anaconda Package Management System
    3. Indexing and the Slice Operator
    4. Exercise 1.01: Examining Anaconda and Getting Familiar with Python
    5. Different Types of Data Science Problems
    6. Loading the Case Study Data with Jupyter and pandas
    7. Exercise 1.02: Loading the Case Study Data in a Jupyter Notebook
    8. Getting Familiar with Data and Performing Data Cleaning
    9. The Business Problem
    10. Data Exploration Steps
    11. Exercise 1.03: Verifying Basic Data Integrity
    12. Boolean Masks
    13. Exercise 1.04: Continuing Verification of Data Integrity
    14. Exercise 1.05: Exploring and Cleaning the Data
    15. Data Quality Assurance and Exploration
    16. Exercise 1.06: Exploring the Credit Limit and Demographic Features
    17. Deep Dive: Categorical Features
    18. Exercise 1.07: Implementing OHE for a Categorical Feature
    19. Exploring the Financial History Features in the Dataset
    20. Activity 1.01: Exploring the Remaining Financial Features in the Dataset
    21. Summary
  5. 2. Introduction to Scikit-Learn and Model Evaluation
    1. Introduction
    2. Exploring the Response Variable and Concluding the Initial Exploration
    3. Introduction to Scikit-Learn
    4. Generating Synthetic Data
    5. Data for Linear Regression
    6. Exercise 2.01: Linear Regression in Scikit-Learn
    7. Model Performance Metrics for Binary Classification
    8. Splitting the Data: Training and Test Sets
    9. Classification Accuracy
    10. True Positive Rate, False Positive Rate, and Confusion Matrix
    11. Exercise 2.02: Calculating the True and False Positive and Negative Rates and Confusion Matrix in Python
    12. Discovering Predicted Probabilities: How Does Logistic Regression Make Predictions?
    13. Exercise 2.03: Obtaining Predicted Probabilities from a Trained Logistic Regression Model
    14. The Receiver Operating Characteristic (ROC) Curve
    15. Precision
    16. Activity 2.01: Performing Logistic Regression with a New Feature and Creating a Precision-Recall Curve
    17. Summary
  6. 3. Details of Logistic Regression and Feature Exploration
    1. Introduction
    2. Examining the Relationships Between Features and the Response Variable
    3. Pearson Correlation
    4. Mathematics of Linear Correlation
    5. F-test
    6. Exercise 3.01: F-test and Univariate Feature Selection
    7. Finer Points of the F-test: Equivalence to the t-test for Two Classes and Cautions
    8. Hypotheses and Next Steps
    9. Exercise 3.02: Visualizing the Relationship Between the Features and Response Variable
    10. Univariate Feature Selection: What it Does and Doesn't Do
    11. Understanding Logistic Regression and the Sigmoid Function Using Function Syntax in Python
    12. Exercise 3.03: Plotting the Sigmoid Function
    13. Scope of Functions
    14. Why Is Logistic Regression Considered a Linear Model?
    15. Exercise 3.04: Examining the Appropriateness of Features for Logistic Regression
    16. From Logistic Regression Coefficients to Predictions Using Sigmoid
    17. Exercise 3.05: Linear Decision Boundary of Logistic Regression
    18. Activity 3.01: Fitting a Logistic Regression Model and Directly Using the Coefficients
    19. Summary
  7. 4. The Bias-Variance Trade-Off
    1. Introduction
    2. Estimating the Coefficients and Intercepts of Logistic Regression
    3. Gradient Descent to Find Optimal Parameter Values
    4. Exercise 4.01: Using Gradient Descent to Minimize a Cost Function
    5. Assumptions of Logistic Regression
    6. The Motivation for Regularization: The Bias-Variance Trade-Off
    7. Exercise 4.02: Generating and Modeling Synthetic Classification Data
    8. Lasso (L1) and Ridge (L2) Regularization
    9. Cross-Validation: Choosing the Regularization Parameter
    10. Exercise 4.03: Reducing Overfitting on the Synthetic Data Classification Problem
    11. Options for Logistic Regression in Scikit-Learn
    12. Scaling Data, Pipelines, and Interaction Features in Scikit-Learn
    13. Activity 4.01: Cross-Validation and Feature Engineering with the Case Study Data
    14. Summary
  8. 5. Decision Trees and Random Forests
    1. Introduction
    2. Decision Trees
    3. The Terminology of Decision Trees and Connections to Machine Learning
    4. Exercise 5.01: A Decision Tree in Scikit-Learn
    5. Training Decision Trees: Node Impurity
    6. Features Used for the First Splits: Connections to Univariate Feature Selection and Interactions
    7. Training Decision Trees: A Greedy Algorithm
    8. Training Decision Trees: Different Stopping Criteria and Other Options
    9. Using Decision Trees: Advantages and Predicted Probabilities
    10. A More Convenient Approach to Cross-Validation
    11. Exercise 5.02: Finding Optimal Hyperparameters for a Decision Tree
    12. Random Forests: Ensembles of Decision Trees
    13. Random Forest: Predictions and Interpretability
    14. Exercise 5.03: Fitting a Random Forest
    15. Checkerboard Graph
    16. Activity 5.01: Cross-Validation Grid Search with Random Forest
    17. Summary
  9. 6. Gradient Boosting, XGBoost, and SHAP Values
    1. Introduction
    2. Gradient Boosting and XGBoost
    3. What Is Boosting?
    4. Gradient Boosting and XGBoost
    5. XGBoost Hyperparameters
    6. Early Stopping
    7. Tuning the Learning Rate
    8. Other Important Hyperparameters in XGBoost
    9. Exercise 6.01: Randomized Grid Search for Tuning XGBoost Hyperparameters
    10. Another Way of Growing Trees: XGBoost's grow_policy
    11. Explaining Model Predictions with SHAP Values
    12. Exercise 6.02: Plotting SHAP Interactions, Feature Importance, and Reconstructing Predicted Probabilities from SHAP Values
    13. Missing Data
    14. Saving Python Variables to a File
    15. Activity 6.01: Modeling the Case Study Data with XGBoost and Explaining the Model with SHAP
    16. Summary
  10. 7. Test Set Analysis, Financial Insights, and Delivery to the Client
    1. Introduction
    2. Review of Modeling Results
    3. Feature Engineering
    4. Ensembling Multiple Models
    5. Different Modeling Techniques
    6. Balancing Classes
    7. Model Performance on the Test Set
    8. Distribution of Predicted Probability and Decile Chart
    9. Exercise 7.01: Equal-Interval Chart
    10. Calibration of Predicted Probabilities
    11. Financial Analysis
    12. Financial Conversation with the Client
    13. Exercise 7.02: Characterizing Costs and Savings
    14. Activity 7.01: Deriving Financial Insights
    15. Final Thoughts on Delivering a Predictive Model to the Client
    16. Model Monitoring
    17. Ethics in Predictive Modeling
    18. Summary
  11. Appendix
    1. 1. Data Exploration and Cleaning
    2. Activity 1.01: Exploring the Remaining Financial Features in the Dataset
    3. 2. Introduction to Scikit-Learn and Model Evaluation
    4. Activity 2.01: Performing Logistic Regression with a New Feature and Creating a Precision-Recall Curve
    5. 3. Details of Logistic Regression and Feature Exploration
    6. Activity 3.01: Fitting a Logistic Regression Model and Directly Using the Coefficients
    7. 4. The Bias-Variance Trade-Off
    8. Activity 4.01: Cross-Validation and Feature Engineering with the Case Study Data
    9. 5. Decision Trees and Random Forests
    10. Activity 5.01: Cross-Validation Grid Search with Random Forest
    11. 6. Gradient Boosting, XGBoost, and SHAP Values
    12. Activity 6.01: Modeling the Case Study Data with XGBoost and Explaining the Model with SHAP 
    13. 7. Test Set Analysis, Financial Insights, and Delivery to the Client
    14. Activity 7.01: Deriving Financial Insights
    15. Hey!