0%

Book Description

A beginner's guide to simplifying Extract, Transform, Load (ETL) processes with the help of hands-on tips, tricks, and best practices, in a fun and interactive way

Key Features

  • Explore data wrangling with the help of real-world examples and business use cases
  • Study various ways to extract the most value from your data in minimal time
  • Boost your knowledge with bonus topics, such as random data generation and data integrity checks

Book Description

While a huge amount of data is readily available to us, it is not useful in its raw form. For data to be meaningful, it must be curated and refined.

If you're a beginner, then The Data Wrangling Workshop will help to break down the process for you. You'll start with the basics and build your knowledge, progressing from the core aspects behind data wrangling, to using the most popular tools and techniques.

This book starts by showing you how to work with data structures using Python. Through examples and activities, you'll understand why you should stay away from traditional methods of data cleaning used in other languages and take advantage of the specialized pre-built routines in Python. Later, you'll learn how to use the same Python backend to extract and transform data from an array of sources, including the internet, large database vaults, and Excel financial tables. To help you prepare for more challenging scenarios, the book teaches you how to handle missing or incorrect data, and reformat it based on the requirements from your downstream analytics tool.

By the end of this book, you will have developed a solid understanding of how to perform data wrangling with Python, and learned several techniques and best practices to extract, clean, transform, and format your data efficiently, from a diverse array of sources.

What you will learn

  • Get to grips with the fundamentals of data wrangling
  • Understand how to model data with random data generation and data integrity checks
  • Discover how to examine data with descriptive statistics and plotting techniques
  • Explore how to search and retrieve information with regular expressions
  • Delve into commonly-used Python data science libraries
  • Become well-versed with how to handle and compensate for missing data

Who this book is for

The Data Wrangling Workshop is designed for developers, data analysts, and business analysts who are looking to pursue a career as a full-fledged data scientist or analytics expert. Although this book is for beginners who want to start data wrangling, prior working knowledge of the Python programming language is necessary to easily grasp the concepts covered here. It will also help to have a rudimentary knowledge of relational databases and SQL.

Table of Contents

  1. The Data Wrangling Workshop
  2. Second Edition
  3. Preface
    1. About the Book
      1. Audience
      2. About the Chapters
      3. Conventions
      4. Code Presentation
      5. Setting up Your Environment
      6. Installing Python
        1. Installing Python on Windows
        2. Installing Python on Linux
        3. Installing Python on MacOS
      7. Installing Libraries
      8. Project Jupyter
      9. Accessing the Code Files
  4. 1. Introduction to Data Wrangling with Python
    1. Introduction
    2. Importance of Data Wrangling
    3. Python for Data Wrangling
    4. Lists, Sets, Strings, Tuples, and Dictionaries
      1. Lists
    5. List Functions
      1. Exercise 1.01: Accessing the List Members
      2. Exercise 1.02: Generating and Iterating through a List
      3. Exercise 1.03: Iterating over a List and Checking Membership
      4. Exercise 1.04: Sorting a List
      5. Exercise 1.05: Generating a Random List
      6. Activity 1.01: Handling Lists
      7. Sets
      8. Introduction to Sets
      9. Union and Intersection of Sets
      10. Creating Null Sets
      11. Dictionary
      12. Exercise 1.06: Accessing and Setting Values in a Dictionary
      13. Exercise 1.07: Iterating over a Dictionary
      14. Exercise 1.08: Revisiting the Unique Valued List Problem
      15. Exercise 1.09: Deleting a Value from Dict
      16. Exercise 1.10: Dictionary Comprehension
      17. Tuples
      18. Creating a Tuple with Different Cardinalities
      19. Unpacking a Tuple
      20. Exercise 1.11: Handling Tuples
      21. Strings
      22. Exercise 1.12: Accessing Strings
      23. Exercise 1.13: String Slices
      24. String Functions
      25. Exercise 1.14: Splitting and Joining a String
      26. Activity 1.02: Analyzing a Multiline String and Generating the Unique Word Count
    6. Summary
  5. 2. Advanced Operations on Built-In Data Structures
    1. Introduction
    2. Advanced Data Structures
      1. Iterator
      2. Exercise 2.01: Introducing to the Iterator
      3. Stacks
      4. Exercise 2.02: Implementing a Stack in Python
      5. Exercise 2.03: Implementing a Stack Using User-Defined Methods
      6. Lambda Expressions
      7. Exercise 2.04: Implementing a Lambda Expression
      8. Exercise 2.05: Lambda Expression for Sorting
      9. Exercise 2.06: Multi-Element Membership Checking
      10. Queue
      11. Exercise 2.07: Implementing a Queue in Python
      12. Activity 2.01: Permutation, Iterator, Lambda, and List
    3. Basic File Operations in Python
      1. Exercise 2.08: File Operations
      2. File Handling
      3. Exercise 2.09: Opening and Closing a File
      4. The with Statement
      5. Opening a File Using the with Statement
      6. Exercise 2.10: Reading a File Line by Line
      7. Exercise 2.11: Writing to a File
      8. Activity 2.02: Designing Your Own CSV Parser
    4. Summary
  6. 3. Introduction to NumPy, Pandas, and Matplotlib
    1. Introduction
    2. NumPy Arrays
      1. NumPy Arrays and Features
      2. Exercise 3.01: Creating a NumPy Array (from a List)
      3. Exercise 3.02: Adding Two NumPy Arrays
      4. Exercise 3.03: Mathematical Operations on NumPy Arrays
    3. Advanced Mathematical Operations
      1. Exercise 3.04: Advanced Mathematical Operations on NumPy Arrays
      2. Exercise 3.05: Generating Arrays Using arange and linspace Methods
      3. Exercise 3.06: Creating Multi-Dimensional Arrays
      4. Exercise 3.07: The Dimension, Shape, Size, and Data Type of Two-dimensional Arrays
      5. Exercise 3.08: Zeros, Ones, Random, Identity Matrices, and Vectors
      6. Exercise 3.09: Reshaping, Ravel, Min, Max, and Sorting
      7. Exercise 3.10: Indexing and Slicing
      8. Conditional SubSetting
      9. Exercise 3.11: Array Operations
      10. Stacking Arrays
      11. Pandas DataFrames
      12. Exercise 3.12: Creating a Pandas Series
      13. Exercise 3.13: Pandas Series and Data Handling
      14. Exercise 3.14: Creating Pandas DataFrames
      15. Exercise 3.15: Viewing a DataFrame Partially
      16. Indexing and Slicing Columns
      17. Indexing and Slicing Rows
      18. Exercise 3.16: Creating and Deleting a New Column or Row
    4. Statistics and Visualization with NumPy and Pandas
      1. Refresher on Basic Descriptive Statistics
      2. Exercise 3.17: Introduction to Matplotlib through a Scatter Plot
    5. The Definition of Statistical Measures – Central Tendency and Spread
      1. Random Variables and Probability Distribution
      2. What is a Probability Distribution?
      3. Discrete Distributions
      4. Continuous Distributions
    6. Data Wrangling in Statistics and Visualization
      1. Using NumPy and Pandas to Calculate Basic Descriptive Statistics
      2. Random Number Generation Using NumPy
      3. Exercise 3.18: Generating Random Numbers from a Uniform Distribution
      4. Exercise 3.19: Generating Random Numbers from a Binomial Distribution and Bar Plot
      5. Exercise 3.20: Generating Random Numbers from a Normal Distribution and Histograms
      6. Exercise 3.21: Calculating Descriptive Statistics from a DataFrame
      7. Exercise 3.22: Built-in Plotting Utilities
      8. Activity 3.01: Generating Statistics from a CSV File
    7. Summary
  7. 4. A Deep Dive into Data Wrangling with Python
    1. Introduction
    2. Subsetting, Filtering, and Grouping
      1. Exercise 4.01: Examining the Superstore Sales Data in an Excel File
      2. Subsetting the DataFrame
      3. An Example Use Case – Determining Statistics on Sales and Profit
      4. Exercise 4.02: The unique Function
      5. Conditional Selection and Boolean Filtering
      6. Exercise 4.03: Setting and Resetting the Index
      7. The GroupBy Method
      8. Exercise 4.04: The GroupBy Method
    3. Detecting Outliers and Handling Missing Values
      1. Missing Values in Pandas
      2. Exercise 4.05: Filling in the Missing Values Using the fillna Method
      3. The dropna Method
      4. Exercise 4.06: Dropping Missing Values with dropna
      5. Outlier Detection Using a Simple Statistical Test
    4. Concatenating, Merging, and Joining
      1. Exercise 4.07: Concatenation in Datasets
      2. Merging by a Common Key
      3. Exercise 4.08: Merging by a Common Key
      4. The join Method
      5. Exercise 4.09: The join Method
    5. Useful Methods of Pandas
      1. Randomized Sampling
      2. Exercise 4.10: Randomized Sampling
      3. The value_counts Method
      4. Pivot Table Functionality
      5. Exercise 4.11: Sorting by Column Values – the sort_values Method
      6. Exercise 4.12: Flexibility of User-Defined Functions with the apply Method
      7. Activity 4.01: Working with the Adult Income Dataset (UCI)
    6. Summary
  8. 5. Getting Comfortable with Different Kinds of Data Sources
    1. Introduction
    2. Reading Data from Different Sources
      1. Data Files Provided with This Chapter
      2. Libraries to Install for This Chapter
      3. Reading Data Using Pandas
      4. Exercise 5.01: Working with Headers When Reading Data from a CSV File
      5. Exercise 5.02: Reading from a CSV File Where Delimiters Are Not Commas
      6. Exercise 5.03: Bypassing and Renaming the Headers of a CSV File
      7. Exercise 5.04: Skipping Initial Rows and Footers When Reading a CSV File
      8. Reading Only the First N Rows
      9. Exercise 5.05: Combining skiprows and nrows to Read Data in Small Chunks
      10. Setting the skip_blank_lines Option
      11. Reading CSV Data from a Zip File
      12. Reading from an Excel File Using sheet_name and Handling a Distinct sheet_name
      13. Exercise 5.06: Reading a General Delimited Text File
      14. Reading HTML Tables Directly from a URL
      15. Exercise 5.07: Further Wrangling to Get the Desired Data
      16. Reading from a JSON file
      17. Exercise 5.08: Reading from a JSON File
      18. Reading a PDF File
      19. Exercise 5.09: Reading Tabular Data from a PDF File
    3. Introduction to Beautiful Soup 4 and Web Page Parsing
      1. Structure of HTML
      2. Exercise 5.10: Reading an HTML File and Extracting Its Contents Using Beautiful Soup
      3. Exercise 5.11: DataFrames and BeautifulSoup
      4. Exercise 5.12: Exporting a DataFrame as an Excel File
      5. Exercise 5.13: Stacking URLs from a Document Using bs4
      6. Activity 5.01: Reading Tabular Data from a Web Page and Creating DataFrames
    4. Summary
  9. 6. Learning the Hidden Secrets of Data Wrangling
    1. Introduction
    2. Advanced List Comprehension and the zip Function
      1. Introduction to Generator Expressions
      2. Exercise 6.01: Generator Expressions
      3. Exercise 6.02: Single-Line Generator Expression
      4. Exercise 6.03: Extracting a List with Single Words
      5. Exercise 6.04: The zip Function
      6. Exercise 6.05: Handling Messy Data
    3. Data Formatting
      1. The % operator
      2. Using the format Function
      3. Exercise 6.06: Data Representation Using {}
    4. Identifying and Cleaning Outliers
      1. Exercise 6.07: Outliers in Numerical Data
      2. Z-score
      3. Exercise 6.08: The Z-Score Value to Remove Outliers
    5. Levenshtein Distance
      1. Additional Software Required for This Section
      2. Exercise 6.09: Fuzzy String Matching
      3. Activity 6.01: Handling Outliers and Missing Data
    6. Summary
  10. 7. Advanced Web Scraping and Data Gathering
    1. Introduction
    2. The Requests and BeautifulSoup Libraries
      1. Exercise 7.01: Using the Requests Library to Get a Response from the Wikipedia Home Page
      2. Exercise 7.02: Checking the Status of the Web Request
      3. Checking the Encoding of a Web Page
      4. Exercise 7.03: Decoding the Contents of a Response and Checking Its Length
      5. Exercise 7.04: Extracting Readable Text from a BeautifulSoup Object
      6. Extracting Text from a Section
      7. Extracting Important Historical Events that Happened on Today's Date
      8. Exercise 7.05: Using Advanced BS4 Techniques to Extract Relevant Text
      9. Exercise 7.06: Creating a Compact Function to Extract the On this day Text from the Wikipedia Home Page
    3. Reading Data from XML
      1. Exercise 7.07: Creating an XML File and Reading XML Element Objects
      2. Exercise 7.08: Finding Various Elements of Data within a Tree (Element)
      3. Reading from a Local XML File into an ElementTree Object
      4. Exercise 7.09: Traversing the Tree, Finding the Root, and Exploring All the Child Nodes and Their Tags and Attributes
      5. Exercise 7.10: Using the text Method to Extract Meaningful Data
      6. Extracting and Printing the GDP/Per Capita Information Using a Loop
      7. Finding All the Neighboring Countries for Each Country and Printing Them
      8. Exercise 7.11: A Simple Demo of Using XML Data Obtained by Web Scraping
    4. Reading Data from an API
      1. Defining the Base URL (or API Endpoint)
      2. Exercise 7.12: Defining and Testing a Function to Pull Country Data from an API
      3. Using the Built-In JSON Library to Read and Examine Data
      4. Printing All the Data Elements
      5. Using a Function that Extracts a DataFrame Containing Key Information
      6. Exercise 7.13: Testing the Function by Building a Small Database of Country Information
    5. Fundamentals of Regular Expressions (RegEx)
      1. RegEx in the Context of Web Scraping
      2. Exercise 7.14: Using the match Method to Check Whether a Pattern Matches a String/Sequence
      3. Using the compile Method to Create a RegEx Program
      4. Exercise 7.15: Compiling Programs to Match Objects
      5. Exercise 7.16: Using Additional Parameters in the match Method to Check for Positional Matching
      6. Finding the Number of Words in a List That End with "ing"
      7. The search Method in RegEx
      8. Exercise 7.17: The search Method in RegEx
      9. Exercise 7.18: Using the span Method of the Match Object to Locate the Position of the Matched Pattern
      10. Exercise 7.19: Examples of Single-Character Pattern Matching with search
      11. Exercise 7.20: Handling Pattern Matching at the Start or End of a String
      12. Exercise 7.21: Pattern Matching with Multiple Characters
      13. Exercise 7.22: Greedy versus Non-Greedy Matching
      14. Exercise 7.23: Controlling Repetitions to Match in a Text
      15. Sets of Matching Characters
      16. Exercise 7.24: Sets of Matching Characters
      17. Exercise 7.25: The Use of OR in RegEx Using the OR Operator
      18. The findall Method
      19. Activity 7.01: Extracting the Top 100 e-books from Gutenberg
      20. Activity 7.02: Building Your Own Movie Database by Reading an API
    6. Summary
  11. 8. RDBMS and SQL
    1. Introduction
    2. Refresher of RDBMS and SQL
      1. How Is an RDBMS Structured?
      2. SQL
      3. Using an RDBMS (MySQL/PostgreSQL/SQLite)
      4. Exercise 8.01: Connecting to a Database in SQLite
      5. DDL and DML Commands in SQLite
      6. Exercise 8.02: Using DDL and DML Commands in SQLite
      7. Reading Data from a Database in SQLite
      8. Exercise 8.03: Sorting Values That Are Present in the Database
      9. The ALTER Command
      10. Exercise 8.04: Altering the Structure of a Table and Updating the New Fields
      11. The GROUP BY clause
      12. Exercise 8.05: Grouping Values in Tables
    3. Relation Mapping in Databases
      1. Adding Rows in the comments Table
    4. Joins
    5. Retrieving Specific Columns from a JOIN Query
      1. Deleting Rows from Tables
      2. Exercise 8.06: Deleting Rows from Tables
      3. Updating Specific Values in a Table
      4. Exercise 8.07: RDBMS and DataFrames
      5. Activity 8.01: Retrieving Data Accurately from Databases
    6. Summary
  12. 9. Applications in Business Use Cases and Conclusion of the Course
    1. Introduction
    2. Applying Your Knowledge to a Data Wrangling Task
      1. Activity 9.01: Data Wrangling Task – Fixing UN Data
      2. Activity 9.02: Data Wrangling Task – Cleaning GDP Data
      3. Activity 9.03: Data Wrangling Task – Merging UN Data and GDP Data
      4. Activity 9.04: Data Wrangling Task – Connecting the New Data to the Database
    3. An Extension to Data Wrangling
      1. Additional Skills Required to Become a Data Scientist
      2. Basic Familiarity with Big Data and Cloud Technologies
      3. What Goes with Data Wrangling?
      4. Tips and Tricks for Mastering Machine Learning
    4. Summary
  13. Appendix
    1. 1. Introduction to Data Wrangling with Python
      1. Activity 1.01: Handling Lists
      2. Activity 1.02: Analyzing a Multiline String and Generating the Unique Word Count
    2. 2. Advanced Operations on Built-In Data Structures
      1. Activity 2.01: Permutation, Iterator, Lambda, and List
      2. Activity 2.02: Designing Your Own CSV Parser
    3. 3. Introduction to NumPy, Pandas, and Matplotlib
      1. Activity 3.01: Generating Statistics from a CSV File
    4. 4. A Deep Dive into Data Wrangling with Python
      1. Activity 4.01: Working with the Adult Income Dataset (UCI)
    5. 5. Getting Comfortable with Different Kinds of Data Sources
      1. Activity 5.01: Reading Tabular Data from a Web Page and Creating DataFrames
    6. 6. Learning the Hidden Secrets of Data Wrangling
      1. Activity 6.01: Handling Outliers and Missing Data
    7. 7. Advanced Web Scraping and Data Gathering
      1. Activity 7.01: Extracting the Top 100 e-books from Gutenberg
      2. Activity 7.02: Building Your Own Movie Database by Reading an API
    8. 8. RDBMS and SQL
      1. Activity 8.01: Retrieving Data Accurately from Databases
    9. 9. Applications in Business Use Cases and Conclusion of the Course
      1. Activity 9.01: Data Wrangling Task – Fixing UN Data
      2. Activity 9.02: Data Wrangling Task – Cleaning GDP Data
      3. Activity 9.03: Data Wrangling Task – Merging UN Data and GDP Data
      4. Activity 9.04: Data Wrangling Task – Connecting the New Data to the Database
3.145.52.189