Data Science Projects with Python

Second Edition

Copyright © 2021 Packt Publishing

All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Author: Stephen Klosterman

Reviewers: Ashish Jain and Deepti Miyan Gupta

Managing Editor: Mahesh Dhyani

Acquisitions Editors: Sneha Shinde and Anindya Sil

Production Editor: Shantanu Zagade

Editorial Board: Megan Carlisle, Mahesh Dhyani, Heather Gopsill, Manasa Kumar, Alex Mazonowicz, Monesh Mirpuri, Bridget Neale, Abhishek Rane, Brendan Rodrigues, Ankita Thakur, Nitesh Thakur, and Jonathan Wray

First published: April 2019

Second edition: July 2021

Production reference: 1280721

ISBN: 978-1-80056-448-0

Published by Packt Publishing Ltd.

Livery Place, 35 Livery Street

Birmingham B3 2PB, UK

Table of Contents

Preface

1. Data Exploration and Cleaning

Introduction

Python and the Anaconda Package Management System

Indexing and the Slice Operator

Exercise 1.01: Examining Anaconda and Getting Familiar with Python

Different Types of Data Science Problems

Loading the Case Study Data with Jupyter and pandas

Exercise 1.02: Loading the Case Study Data in a Jupyter Notebook

Getting Familiar with Data and Performing Data Cleaning

The Business Problem

Data Exploration Steps

Exercise 1.03: Verifying Basic Data Integrity

Boolean Masks

Exercise 1.04: Continuing Verification of Data Integrity

Exercise 1.05: Exploring and Cleaning the Data

Data Quality Assurance and Exploration

Exercise 1.06: Exploring the Credit Limit and Demographic Features

Deep Dive: Categorical Features

Exercise 1.07: Implementing OHE for a Categorical Feature

Exploring the Financial History Features in the Dataset

Activity 1.01: Exploring the Remaining Financial Features in the Dataset

Summary

2. Introduction to Scikit-Learn and Model Evaluation

Introduction

Exploring the Response Variable and Concluding the Initial Exploration

Introduction to Scikit-Learn

Generating Synthetic Data

Data for Linear Regression

Exercise 2.01: Linear Regression in Scikit-Learn

Model Performance Metrics for Binary Classification

Splitting the Data: Training and Test Sets

Classification Accuracy

True Positive Rate, False Positive Rate, and Confusion Matrix

Exercise 2.02: Calculating the True and False Positive and Negative Rates and Confusion Matrix in Python

Discovering Predicted Probabilities: How Does Logistic Regression Make Predictions?

Exercise 2.03: Obtaining Predicted Probabilities from a Trained Logistic Regression Model

The Receiver Operating Characteristic (ROC) Curve

Precision

Activity 2.01: Performing Logistic Regression with a New Feature and Creating a Precision-Recall Curve

Summary

3. Details of Logistic Regression and Feature Exploration

Introduction

Examining the Relationships Between Features and the Response Variable

Pearson Correlation

Mathematics of Linear Correlation

F-test

Exercise 3.01: F-test and Univariate Feature Selection

Finer Points of the F-test: Equivalence to the t-test for Two Classes and Cautions

Hypotheses and Next Steps

Exercise 3.02: Visualizing the Relationship Between the Features and Response Variable

Univariate Feature Selection: What It Does and Doesn't Do

Understanding Logistic Regression and the Sigmoid Function Using Function Syntax in Python

Exercise 3.03: Plotting the Sigmoid Function

Scope of Functions

Why Is Logistic Regression Considered a Linear Model?

Exercise 3.04: Examining the Appropriateness of Features for Logistic Regression

From Logistic Regression Coefficients to Predictions Using Sigmoid

Exercise 3.05: Linear Decision Boundary of Logistic Regression

Activity 3.01: Fitting a Logistic Regression Model and Directly Using the Coefficients

Summary

4. The Bias-Variance Trade-Off

Introduction

Estimating the Coefficients and Intercepts of Logistic Regression

Gradient Descent to Find Optimal Parameter Values

Exercise 4.01: Using Gradient Descent to Minimize a Cost Function

Assumptions of Logistic Regression

The Motivation for Regularization: The Bias-Variance Trade-Off

Exercise 4.02: Generating and Modeling Synthetic Classification Data

Lasso (L1) and Ridge (L2) Regularization

Cross-Validation: Choosing the Regularization Parameter

Exercise 4.03: Reducing Overfitting on the Synthetic Data Classification Problem

Options for Logistic Regression in Scikit-Learn

Scaling Data, Pipelines, and Interaction Features in Scikit-Learn

Activity 4.01: Cross-Validation and Feature Engineering with the Case Study Data

Summary

5. Decision Trees and Random Forests

Introduction

Decision Trees

The Terminology of Decision Trees and Connections to Machine Learning

Exercise 5.01: A Decision Tree in Scikit-Learn

Training Decision Trees: Node Impurity

Features Used for the First Splits: Connections to Univariate Feature Selection and Interactions

Training Decision Trees: A Greedy Algorithm

Training Decision Trees: Different Stopping Criteria and Other Options

Using Decision Trees: Advantages and Predicted Probabilities

A More Convenient Approach to Cross-Validation

Exercise 5.02: Finding Optimal Hyperparameters for a Decision Tree

Random Forests: Ensembles of Decision Trees

Random Forest: Predictions and Interpretability

Exercise 5.03: Fitting a Random Forest

Checkerboard Graph

Activity 5.01: Cross-Validation Grid Search with Random Forest

Summary

6. Gradient Boosting, XGBoost, and SHAP Values

Introduction

Gradient Boosting and XGBoost

What Is Boosting?

Gradient Boosting and XGBoost

XGBoost Hyperparameters

Early Stopping

Tuning the Learning Rate

Other Important Hyperparameters in XGBoost

Exercise 6.01: Randomized Grid Search for Tuning XGBoost Hyperparameters

Another Way of Growing Trees: XGBoost's grow_policy

Explaining Model Predictions with SHAP Values

Exercise 6.02: Plotting SHAP Interactions, Feature Importance, and Reconstructing Predicted Probabilities from SHAP Values

Missing Data

Saving Python Variables to a File

Activity 6.01: Modeling the Case Study Data with XGBoost and Explaining the Model with SHAP

Summary

7. Test Set Analysis, Financial Insights, and Delivery to the Client

Introduction

Review of Modeling Results

Feature Engineering

Ensembling Multiple Models

Different Modeling Techniques

Balancing Classes

Model Performance on the Test Set

Distribution of Predicted Probability and Decile Chart

Exercise 7.01: Equal-Interval Chart

Calibration of Predicted Probabilities

Financial Analysis

Financial Conversation with the Client

Exercise 7.02: Characterizing Costs and Savings

Activity 7.01: Deriving Financial Insights

Final Thoughts on Delivering a Predictive Model to the Client

Model Monitoring

Ethics in Predictive Modeling

Summary

Appendix
