There are awesome discoveries to be made and valuable stories to be told in datasets, and this book will help you uncover them. Whether you already work with data or just want to understand its possibilities, the techniques and advice in this practical book will show you how to clean, evaluate, and analyze data more effectively to generate meaningful insights and compelling visualizations.

Through foundational concepts and worked examples, author Susan McGregor provides the tools you need to evaluate and analyze all kinds of data and communicate your findings effectively. Her methodical, jargon-free approach helps practitioners at every level harness the power of data.

  • Use Python 3.8+ to read, write, and transform data from a variety of sources
  • Understand and use programming basics in Python to wrangle data at scale
  • Organize, document, and structure your code using best practices
  • Complete exercises either on your own machine or on the web
  • Collect data from structured data files, web pages, and APIs
  • Perform basic statistical analysis to make meaning from datasets
  • Visualize and present data in clear and compelling ways

Table of Contents

  1. Preface
    1. Who should read this book?
    2. Who shouldn’t read this book?
    3. What to expect from this volume
    4. Conventions Used in This Book
    5. Using Code Examples
    6. O’Reilly Online Learning
    7. How to Contact Us
  2. 1. Introduction to Data Wrangling and Data Quality
    1. What is Data Wrangling?
    2. What is Data Quality?
    3. Data integrity
    4. Data “fit”
    5. Why Python?
    6. Versatility
    7. Accessibility
    8. Readability
    9. Community
    10. Python Alternatives
    11. Getting started with Python
    12. Writing and “Running” Python
    13. Working with Python on your own device
    14. Getting started with the command line
    15. Installing Python, Jupyter Notebook and a Code Editor
    16. Running Jupyter Notebook
    17. Working with Python online
    18. Hello World!
    19. Documenting, saving and versioning your work
    20. Documenting
    21. Saving
    22. Versioning
    23. Conclusion
  3. 2. Introduction to Python
    1. The Programming “Parts of Speech”
    2. Nouns ≈ Variables
    3. Verbs ≈ Functions
    4. Cooking with Custom Functions
    5. Libraries: Borrowing Custom Functions from Other Coders
    6. Taking Control: “Loops” and “Conditionals”
    7. In the Loop
    8. One Condition…
    9. Understanding errors
    10. Syntax Snafus
    11. Runtime Runaround
    12. Logic Loss
    13. Hitting the Road with Citi Bike Data
    14. Seeking Scale
    15. Conclusion
  4. 3. Understanding Data Quality
    1. Assessing data fit
    2. Validity
    3. Reliability
    4. Representativeness
    5. Assessing data integrity
    6. Characteristics of Data Integrity
    7. Improving Data Quality
    8. Data cleaning
    9. Data Augmentation
  5. 4. Working with File-Based Data in Python
    1. Structured, Semi-Structured and Unstructured data
    2. Working with Structured Data
    3. Table-Type Data - Take it to Delimit
    4. Wrangling Table-Type Data with Python
    5. Real-World Data Wrangling: Understanding Unemployment
    6. Finally, Fixed-Width
    7. Working with Semi-Structured Data
    8. Feed-Type Data - Web-Based Live Updates
    9. Wrangling Feed-Type Data with Python
    10. Working with Unstructured Data
    11. Image-Based Text - Accessing Data in PDFs
    12. Wrangling PDFs with Python
    13. Accessing PDF Tables with Tabula
    14. Conclusion
  6. 5. Working with Web-Based Data
    1. Accessing Online XML and JSON
    2. Introducing APIs
    3. Basic APIs: A Search Engine Example
    4. Specialized APIs - Adding Basic Authentication
    5. Reading API documentation
    6. Protecting Your API Key When Using Python
    7. Creating Your “Credentials” File
    8. Using Your Credentials in a Separate Script
    9. Getting Started with .gitignore
    10. Specialized APIs - Working With OAuth
    11. Applying for a Twitter Developer Account
    12. Creating Your Twitter “App” and Credentials
    13. Encoding your API Key and Secret
    14. Requesting an Access Token and Data from the Twitter API
    15. API Ethics
    16. Web Scraping: The Data Source of Last Resort
    17. Carefully Scraping the MTA
    18. Using Browser Inspection Tools
    19. Starting the Soup
    20. Conclusion
  7. 6. Assessing Data Quality
    1. The Pandemic and the PPP
    2. Assessing Data Integrity
    3. Is it Timely?
    4. Is it Complete?
    5. Is it Well-Annotated?
    6. Is it High-Volume?
    7. Is it Historical?
    8. Is it Consistent?
    9. Is it Multivariate?
    10. Is it Atomic?
    11. Is it Clear?
    12. Is it Dimensionally Structured?
    13. Is it of Known Pedigree?
    14. Assessing Data Fit
    15. Validity
    16. Reliability
    17. Representativeness
    18. Conclusion
  8. 7. Cleaning, Transforming and Augmenting Data
    1. Selecting a Subset of Citi Bike Data
    2. A Simple Split
    3. Regular Expressions: Super-charged String-matching
    4. Making a Date
    5. De-crufting Data Files
    6. Decrypting Excel Dates
    7. Generating True CSVs from Fixed-Width Data
    8. Correcting for Spelling Inconsistencies
    9. The Circuitous Path to “Simple” Solutions
    10. Gotchas That Will Get Ya!
    11. Augmenting Your Data
    12. Conclusion
  9. 8. Structuring and Refactoring Your Code
    1. Revisiting Custom Functions
    2. Will You Use It More Than Once?
    3. Is It Ugly And Confusing?
    4. Do You Just Really Hate The Default Functionality?
    5. Understanding Scope
    6. Defining the Parameters for Function “Ingredients”
    7. What Are Your Options?
    8. Return Values
    9. Climbing the “Stack”
    10. Refactoring For Fun and Profit
    11. A Function for Identifying Weekdays
    12. Metadata Without the Mess
    13. Documenting Your Custom Scripts and Functions with pydoc
    14. The Case for Command-Line Arguments
    15. Where Scripts and Notebooks Diverge
    16. Conclusion