Table of Contents for Bad Data Handbook
by Q. Ethan McCallum
About the Authors
Preface
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
1. Setting the Pace: What Is Bad Data?
2. Is It Just Me, or Does This Data Smell Funny?
Understand the Data Structure
Field Validation
Value Validation
Physical Interpretation of Simple Statistics
Visualization
Keyword PPC Example
Search Referral Example
Recommendation Analysis
Time Series Data
Conclusion
3. Data Intended for Human Consumption, Not Machine Consumption
The Data
The Problem: Data Formatted for Human Consumption
The Arrangement of Data
Data Spread Across Multiple Files
The Solution: Writing Code
Reading Data from an Awkward Format
Reading Data Spread Across Several Files
Postscript
Other Formats
Summary
4. Bad Data Lurking in Plain Text
Which Plain Text Encoding?
Guessing Text Encoding
Normalizing Text
Problem: Application-Specific Characters Leaking into Plain Text
Text Processing with Python
Exercises
5. (Re)Organizing the Web’s Data
Can You Get That?
General Workflow Example
robots.txt
Identifying the Data Organization Pattern
Store Offline Version for Parsing
Scrape the Information Off the Page
The Real Difficulties
Download the Raw Content If Possible
Forms, Dialog Boxes, and New Windows
Flash
The Dark Side
Conclusion
6. Detecting Liars and the Confused in Contradictory Online Reviews
Weotta
Getting Reviews
Sentiment Classification
Polarized Language
Corpus Creation
Training a Classifier
Validating the Classifier
Designing with Data
Lessons Learned
Summary
Resources
7. Will the Bad Data Please Stand Up?
Example 1: Defect Reduction in Manufacturing
Example 2: Who’s Calling?
Example 3: When “Typical” Does Not Mean “Average”
Lessons Learned
Will This Be on the Test?
8. Blood, Sweat, and Urine
A Very Nerdy Body Swap Comedy
How Chemists Make Up Numbers
All Your Database Are Belong to Us
Check, Please
Live Fast, Die Young, and Leave a Good-Looking Corpse Code Repository
Rehab for Chemists (and Other Spreadsheet Abusers)
tl;dr
9. When Data and Reality Don’t Match
Whose Ticker Is It Anyway?
Splits, Dividends, and Rescaling
Bad Reality
Conclusion
10. Subtle Sources of Bias and Error
Imputation Bias: General Issues
Reporting Errors: General Issues
Other Sources of Bias
Topcoding/Bottomcoding
Seam Bias
Proxy Reporting
Sample Selection
Conclusions
References
11. Don’t Let the Perfect Be the Enemy of the Good: Is Bad Data Really Bad?
But First, Let’s Reflect on Graduate School …
Moving On to the Professional World
Moving into Government Work
Government Data Is Very Real
Service Call Data as an Applied Example
Moving Forward
Lessons Learned and Looking Ahead
12. When Databases Attack: A Guide for When to Stick to Files
History
Building My Toolset
The Roadblock: My Datastore
Consider Files as Your Datastore
Files Are Simple!
Files Work with Everything
Files Can Contain Any Data Type
Data Corruption Is Local
They Have Great Tooling
There’s No Install Tax
File Concepts
Encoding
Text Files
Binary Data
Memory-Mapped Files
File Formats
Delimiters
A Web Framework Backed by Files
Motivation
Implementation
Reflections
13. Crouching Table, Hidden Network
A Relational Cost Allocations Model
The Delicate Sound of a Combinatorial Explosion…
The Hidden Network Emerges
Storing the Graph
Navigating the Graph with Gremlin
Finding Value in Network Properties
Think in Terms of Multiple Data Models and Use the Right Tool for the Job
Acknowledgments
14. Myths of Cloud Computing
Introduction to the Cloud
What Is “The Cloud”?
The Cloud and Big Data
Introducing Fred
At First Everything Is Great
They Put 100% of Their Infrastructure in the Cloud
As Things Grow, They Scale Easily at First
Then Things Start Having Trouble
They Need to Improve Performance
Higher IO Becomes Critical
A Major Regional Outage Causes Massive Downtime
Higher IO Comes with a Cost
Data Sizes Increase
Geo Redundancy Becomes a Priority
Horizontal Scale Isn’t as Easy as They Hoped
Costs Increase Dramatically
Fred’s Follies
Myth 1: Cloud Is a Great Solution for All Infrastructure Components
How This Myth Relates to Fred’s Story
Myth 2: Cloud Will Save Us Money
How This Myth Relates to Fred’s Story
Myth 3: Cloud IO Performance Can Be Improved to Acceptable Levels Through Software RAID
How This Myth Relates to Fred’s Story
Myth 4: Cloud Computing Makes Horizontal Scaling Easy
How This Myth Relates to Fred’s Story
Conclusion and Recommendations
15. The Dark Side of Data Science
Avoid These Pitfalls
Know Nothing About Thy Data
Be Inconsistent in Cleaning and Organizing the Data
Assume Data Is Correct and Complete
Spillover of Time-Bound Data
Thou Shalt Provide Your Data Scientists with a Single Tool for All Tasks
Using a Production Environment for Ad-Hoc Analysis
The Ideal Data Science Environment
Thou Shalt Analyze for Analysis’ Sake Only
Thou Shalt Compartmentalize Learnings
Thou Shalt Expect Omnipotence from Data Scientists
Where Do Data Scientists Live Within the Organization?
Final Thoughts
16. How to Feed and Care for Your Machine-Learning Experts
Define the Problem
Fake It Before You Make It
Create a Training Set
Pick the Features
Encode the Data
Split Into Training, Test, and Solution Sets
Describe the Problem
Respond to Questions
Integrate the Solutions
Conclusion
17. Data Traceability
Why?
Personal Experience
Snapshotting
Saving the Source
Weighting Sources
Backing Out Data
Separating Phases (and Keeping them Pure)
Identifying the Root Cause
Finding Areas for Improvement
Immutability: Borrowing an Idea from Functional Programming
An Example
Crawlers
Change
Clustering
Popularity
Conclusion
18. Social Media: Erasable Ink?
Social Media: Whose Data Is This Anyway?
Control
Commercial Resyndication
Expectations Around Communication and Expression
Technical Implications of New End User Expectations
What Does the Industry Do?
Validation API
Update Notification API
What Should End Users Do?
How Do We Work Together?
19. Data Quality Analysis Demystified: Knowing When Your Data Is Good Enough
Framework Introduction: The Four Cs of Data Quality Analysis
Complete
Coherent
Correct
aCcountable
Conclusion
Index
About the Author
Colophon
Copyright