Harry J. Foxwell

Creating Good Data

A Guide to Dataset Structure and Data Representation

1st ed.
Harry J. Foxwell
Fairfax, VA, USA
ISBN 978-1-4842-6102-6 e-ISBN 978-1-4842-6103-3
© Harry J. Foxwell 2020
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Distributed to the book trade worldwide by Springer Science+Business Media New York, 1 NY Plaza, New York NY 10004. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail [email protected], or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

To Eileen, for her endless love and support.

Introduction

Extracting actionable knowledge from data is a major ongoing challenge of modern IT in corporations, governments, and academia. Creating effectively usable datasets requires an understanding of data quality issues and of data types and the related analytics which can properly be applied. There are numerous data analytics resources – books, articles, blogs, and even commercial software – describing how to clean up and transform data after it has been collected, yet there is little practical guidance on how to avoid or minimize the typical “data cleaning” tasks beforehand. Such guidance and best practices are needed to eliminate or reduce lengthy dataset preparation.

Data analysts are often handed poorly designed datasets to explore and study, which leads to difficulties in interpretation and to delays in producing usable results. In fact, some analysts report spending up to 80% of their time just getting data ready to be explored so that it can be effectively interpreted. Much data analytics training, and many published resources, focus on how to clean and transform datasets before serious analyses can even begin. Problems such as inappropriate or confusing representations and unit-of-measurement choices, coding errors, missing values, and outliers can be avoided through good data item selection, good dataset design and collection, and an understanding of how data types determine the kinds of analyses that can be performed.

Why not create good data from the start, keeping in mind how it will be used, rather than fixing it after it is collected?

Creating Good Data discusses the principles and best practices of dataset creation and covers basic data types and their related appropriate statistics and visualizations. Following these guidelines results in more effective analyses and presentations of your research data. A key focus of this book is why certain data types and structures are chosen for representing concepts and measurements, in contrast to the usual discussions of how to analyze a specific data type once it has been selected.

Acknowledgments

I have benefited greatly from valuable encouragement and support for this work from numerous colleagues at George Mason University. Dr. James Baldo, Director of the Data Analytics Engineering program, provided helpful early advice and focus suggestions. And special thanks to Ms. Vidhyasri Ganapathi, Teaching Assistant for several of my data analytics courses, for identifying students’ challenges in learning and practicing data science and for confirming their need for this guidance in preparing good datasets.

Table of Contents
Books
Index
About the Author
Harry J. Foxwell teaches graduate data analytics courses at George Mason University’s Department of Information Sciences and Technology. He draws on his decades of prior experience as a Principal System Engineer for Oracle and for other major IT companies to help his students understand the concepts, tools, and practices of big data projects. He is a coauthor of several books on operating systems administration and is a designer of the data analytics curricula for his university courses. He is also a US Army combat veteran, having served in Vietnam as a Platoon Sergeant in the 1st Infantry Division. He lives in Fairfax, Virginia, with his wife Eileen and two bothersome cats. Find out more about him at https://cs.gmu.edu/~hfoxwell/.

 
About the Technical Reviewer
Thomas Plunkett has extensive experience with big data and data analytics. He has taught university courses on related technical topics.

 