Harry J. Foxwell

Creating Good Data

A Guide to Dataset Structure and Data Representation

1st ed.

Harry J. Foxwell

Fairfax, VA, USA

ISBN 978-1-4842-6102-6

e-ISBN 978-1-4842-6103-3

https://doi.org/10.1007/978-1-4842-6103-3

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Distributed to the book trade worldwide by Springer Science+Business Media New York, 1 NY Plaza, New York NY 10004. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail [email protected], or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

To Eileen, for her endless love and support.

Introduction

Extracting actionable knowledge from data is a major ongoing challenge of modern IT in corporations, governments, and academia. Creating effectively usable datasets requires an understanding of data quality issues, of data types, and of the related analytics that can properly be applied to them. There are numerous data analytics resources – books, articles, blogs, and even commercial software – describing how to clean up and transform data after it has been collected, yet there is little practical guidance on how to avoid or minimize the typical “data cleaning” tasks beforehand. Such guidance and best practices are needed to eliminate or reduce lengthy dataset preparation.

Data analysts are often simply presented with poorly designed datasets for exploration and study, leading to difficulties in interpretation and to delays in producing usable results. In fact, some analysts report spending up to 80% of their time just getting data ready to be explored so that it can be effectively interpreted. Much data analytics training, and many published resources, focus on how to clean and transform datasets before serious analyses can even begin. Inappropriate or confusing representations, poor unit-of-measurement choices, coding errors, missing values, outliers, and similar problems can be avoided through good data item selection, good dataset design and collection, and an understanding of how data types determine the kinds of analyses that can be performed.

Why not create good data from the start, keeping in mind how it will be used, rather than fixing it after it is collected?

Creating Good Data discusses the principles and best practices of dataset creation and covers basic data types and their related appropriate statistics and visualizations. Following these guidelines results in more effective analyses and presentations of your research data. A key focus of this book is why certain data types and structures are chosen for representing concepts and measurements, in contrast to the usual discussions of how to analyze a specific data type once it has been selected.

Acknowledgments

I have benefited greatly from valuable encouragement and support for this work from numerous colleagues at George Mason University. Dr. James Baldo, Director of the Data Analytics Engineering program, provided helpful early advice and focus suggestions. And special thanks to Ms. Vidhyasri Ganapathi, Teaching Assistant for several of my data analytics courses, for identifying students’ challenges in learning and practicing data science and for confirming their need for this guidance in preparing good datasets.

Table of Contents

Chapter 1: The Need for Good Data

Who This Book Is For

Assumptions

The Importance of Getting Data Right

What Exactly Is “Data” and Where Does It Come From?

What Is “Good” Data?

Where “Bad” Data Comes From

Preventive Action

Summary

Chapter References

Chapter 2: Basic Data Types and When to Use Them

Four Analytic Data Types

Nominal/Categorical Data

Ordinal Data

Ratio Data

Interval Data

Other Data Types

Summary

Chapter References

Chapter 3: Representing Quantitative Data

Units of Measurement

Magnitudes and Quantities

Time Data

Money and Currency Data

Transformations and Indexing

Measurement Standards

Other Quantitative Measurement Issues

Summary

Chapter References

Chapter 4: Planning Your Data Collection and Analysis

Describing, Comparing, and Predicting

Example: Choosing a Data Type

Plan for Visualizing Your Data and Analysis

Data Analysis Tools

Summary

Chapter References

Chapter 5: Good Datasets

Sharing Data

Dataset Dictionaries/Metadata

Good Metadata

What’s in a Name?

Dataset Formats

Keep It Simple

Is Your Data Ready?

Summary

Chapter References

Chapter 6: Good Data Collection

What Is Bias?

Major Types of Bias

Sampling Bias

More Data Collection Problems

Recognizing and Reducing Bias

Understanding Outliers

The Consequences of Bias

Summary

Chapter References

Chapter 7: Dataset Examples and Use Cases

The Titanic Survivor Dataset

The IBM Employee Attrition Dataset

The Internet Movie Database (IMDb)

US Hurricane Data

UFO Sighting Data

Lessons Learned

Useful Dataset Sources

Summary

Chapter References

Chapter 8: Cleaning Your Data

Data Cleaning Challenges

Assessing Data Quality

Software and Methods for Data Cleaning

General Procedures

Microsoft Excel

R Project

Python

Operating System Utilities

AI/ML-Based Software

Summary

Chapter References

Chapter 9: Good Data Analytics

What Is Good Analytics?

Big Data Analytics

Data for Good

Summary

Chapter References

Appendix A: Recommended Reading

Books

Websites

Oldies but Goodies

Index

About the Author

Harry J. Foxwell teaches graduate data analytics courses at George Mason University’s Department of Information Sciences and Technology. He draws on his decades of prior experience as a Principal System Engineer for Oracle and for other major IT companies to help his students understand the concepts, tools, and practices of big data projects. He is a coauthor of several books on operating systems administration and is a designer of the data analytics curricula for his university courses. He is also a US Army combat veteran, having served in Vietnam as a Platoon Sergeant in the 1st Infantry Division. He lives in Fairfax, Virginia, with his wife Eileen and two bothersome cats. Find out more about him at https://cs.gmu.edu/~hfoxwell/.

About the Technical Reviewer

Thomas Plunkett has extensive experience with big data and data analytics. He has taught university courses on related technical topics.
