To Eileen, for her endless love and support.
Extracting actionable knowledge from data is a major ongoing challenge of modern IT in corporations, governments, and academia. Creating effectively usable datasets requires an understanding of data quality issues and of data types and the related analytics which can properly be applied. There are numerous data analytics resources – books, articles, blogs, and even commercial software – describing how to clean up and transform data after it has been collected, yet there is little practical guidance on how to avoid or minimize the typical “data cleaning” tasks beforehand. Such guidance and best practices are needed to eliminate or reduce lengthy dataset preparation.
Data analysts are often simply presented with poorly designed datasets for exploration and study, leading to difficulties in interpretation and to delays in producing usable results. In fact, some analysts report spending up to 80% of their time just getting data ready to be explored so that it can be effectively interpreted. Moreover, much data analytics training and many published resources focus on how to clean and transform datasets before serious analyses can even begin. Inappropriate or confusing representations and unit-of-measurement choices, coding errors, missing values, outliers, and other problems can be avoided by selecting data items carefully, by designing and collecting datasets well, and by understanding how data types determine the kinds of analyses that can be performed.
Why not create good data from the start, keeping in mind how it will be used, rather than fixing it after it is collected?
Creating Good Data discusses the principles and best practices of dataset creation and covers basic data types and their related appropriate statistics and visualizations. Following these guidelines results in more effective analyses and presentations of your research data. A key focus of this book is why certain data types and structures are chosen for representing concepts and measurements, in contrast to the usual discussions of how to analyze a specific data type once it has been selected.
I have benefited greatly from valuable encouragement and support for this work from numerous colleagues at George Mason University. Dr. James Baldo, Director of the Data Analytics Engineering program, provided helpful early advice and focus suggestions. And special thanks to Ms. Vidhyasri Ganapathi, Teaching Assistant for several of my data analytics courses, for identifying students’ challenges in learning and practicing data science and for confirming their need for this guidance in preparing good datasets.
teaches graduate data analytics courses at George Mason University’s Department of Information Sciences and Technology. He draws on his decades of prior experience as a Principal System Engineer for Oracle and for other major IT companies to help his students understand the concepts, tools, and practices of big data projects. He is a coauthor of several books on operating systems administration and is a designer of the data analytics curricula for his university courses. He is also a US Army combat veteran, having served in Vietnam as a Platoon Sergeant in the 1st Infantry Division. He lives in Fairfax, Virginia, with his wife Eileen and two bothersome cats. Find out more about him at https://cs.gmu.edu/~hfoxwell/ .
has extensive experience with big data and data analytics. He has taught university courses on related technical topics.