Chapter 4. Exploratory Data Analysis

Exploratory Data Analysis (EDA) performed in commercial settings is generally commissioned as part of a larger piece of work that is organized and executed along the lines of a feasibility assessment. The aim of this feasibility assessment, and therefore the focus of what we can term an extended EDA, is to answer a broad set of questions about whether the data under examination is fit for purpose and thus worthy of further investment.

Under this general remit, the data investigations are expected to cover several aspects of feasibility, including the practical aspects of using the data in production, such as its timeliness, quality, complexity, and coverage, as well as its appropriateness for the intended hypotheses to be tested. While some of these aspects are potentially less fun from a data science perspective, these data-quality-led investigations are no less important than purely statistical insights. This is especially true when the datasets in question are very large and complex, and when the investment needed to prepare the data for data science work might be significant. To illustrate this point, and to bring the topic to life, we present methods for performing an EDA of the vast and complex Global Knowledge Graph (GKG) data feeds made available by the Global Database of Events, Language and Tone (GDELT) project.

In this chapter, we will create and interpret an EDA while covering the following topics:

  • Understanding the problems and design goals for planning and structuring an Extended Exploratory Data Analysis
  • What data profiling is, with examples, and how a general framework for data quality can be formed around the technique for continuous data quality monitoring
  • How to construct a general mask-based data profiler around this technique
  • How to store the exploratory metrics to a standard schema, to facilitate the study of data drift in the metrics over time, with examples
  • How to use Apache Zeppelin notebooks for quick EDA work, and for plotting charts and graphs
  • How to extract and study the GCAM sentiments in GDELT, both as time series and as spatio-temporal datasets
  • How to extend Apache Zeppelin to generate custom charts using the plot.ly library

The problem, principles and planning

In this section, we will explore why an EDA might be required and discuss the important considerations for creating one.

Understanding the EDA problem

A difficult question that precedes an EDA project is: "Can you give me an estimate and breakdown of your proposed EDA costs, please?"

How we answer this question ultimately shapes our EDA strategy and tactics. In days gone by, the answer typically started like this: "Basically, you pay by the column...". This rule of thumb is based on the premise that there is an iterable unit of data exploration work, and that these units of work drive the estimate of effort and thus the rough price of performing the EDA.

What's interesting about this idea is that the units of work are quoted in terms of the data structures to investigate rather than the functions that need writing. The reason for this is simple: data processing pipelines of functions are assumed to already exist, rather than being new work, and so the quotation offered is really the implied cost of configuring our standard data processing pipelines to handle the new input data structures.

This thinking brings us to the main EDA problem: exploration seems hard to pin down in terms of planning tasks and estimating timings. The recommended approach is to treat explorations as configuration-driven tasks. This helps us to structure and estimate the work more effectively, and it frames the effort so that configuration, rather than the writing of a lot of ad hoc, throw-away code, is the central challenge.

The process of configuring data exploration also drives us to consider the processing templates we might need, which we would configure according to the form of the data we explore. For instance, we would need a standard exploration pipeline for structured data, for text, for graph-shaped data, for images, for sound, for time series, and for spatial data. Once we have these templates, we simply need to map our input data to them and configure our ingestion filters to deliver a focused lens over the data, as sketched below.
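
To illustrate, here is a minimal sketch of such a configuration-driven template. The names ExplorationConfig and exploreStructured are hypothetical, not part of Spark or GDELT, and the details are assumptions about how such a template might look:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // A configuration describing how one input dataset maps onto a standard template.
    case class ExplorationConfig(
      name: String,               // logical name of the dataset
      path: String,               // location of the staged files, for example on HDFS
      delimiter: String = "\t",   // GDELT GKG feeds are tab-delimited
      columns: Seq[String] = Nil  // optional column names when the feed has no header row
    )

    // A standard exploration pipeline for delimited, structured data: the same code
    // runs for every input, and only the configuration changes between datasets.
    def exploreStructured(spark: SparkSession, conf: ExplorationConfig): DataFrame = {
      val raw = spark.read
        .option("delimiter", conf.delimiter)
        .option("header", conf.columns.isEmpty) // use a header row only when no columns are supplied
        .csv(conf.path)
      if (conf.columns.nonEmpty) raw.toDF(conf.columns: _*) else raw
    }

Equivalent templates for text, graph, time series, or spatial data would share this shape, differing only in the reader used and the exploratory features they emit.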

Design principles

Modernizing these ideas for Apache Spark-based EDA processing means that we need to design our configurable EDA functions and code with some general principles in mind:

  • Easily reusable functions/features: We need to define our functions to operate on general data structures so that they produce good exploratory features and can be configured for new datasets with minimal effort
  • Minimize intermediate data structures: We need to avoid proliferating intermediate schemas, which keeps intermediate configuration to a minimum, and where possible create reusable data structures
  • Data driven configuration: Where possible, we need to have configurations that can be generated from metadata to reduce the manual boilerplate work
  • Templated visualizations: General reusable visualizations driven from common input schemas and metadata

Lastly, although it is not a strict principle per se, we need to construct exploratory tools that are flexible enough to discover data structures rather than depending on rigid, pre-defined configurations. This helps when things go wrong, allowing us to reverse engineer the file content, the encodings, or potential errors in the file definitions when we come across them.
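
As a small illustration of this flexibility (a sketch, assuming the data has been staged as tab-delimited text under a hypothetical path), we can ask Spark to discover the structure rather than asserting it up front:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("eda-discovery").getOrCreate()

    // Infer the column types from the data itself instead of imposing a rigid schema,
    // so that surprises in the file content surface early.
    val discovered = spark.read
      .option("delimiter", "\t")
      .option("inferSchema", "true")
      .csv("/data/staging/gkg/*.csv")   // hypothetical staging path

    discovered.printSchema()            // reverse engineer what is really in the files
    println(s"columns: ${discovered.columns.length}, rows: ${discovered.count()}")

Comparing the inferred schema against the published file definitions is often the quickest way to spot mismatched delimiters, shifted columns, or encoding problems.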

General plan of exploration

The early stages of all EDA work are invariably based on the simple goal of establishing whether the data is of good quality. If we focus on this goal, we can lay down a general, widely applicable set of getting-started tasks.

These tasks create the general shape of a proposed EDA project plan, which is as follows:

  • Prepare source tools, source our input datasets, review the documentation, and so on. Review the security of the data where necessary.
  • Obtain, decrypt, and stage the data in HDFS; collect non-functional requirements (NFRs) for planning.
  • Run code point level frequency reports on the file content (see the note and sketch that follow this list).
  • Run a population check on the amount of missing data in the files' fields (a sketch of this check follows the list).
  • Run a low-grain format profiler to check the high-cardinality fields in the files.
  • Run a high-grain format profiler check on the format-controlled fields in the files.
  • Run referential integrity checks, where appropriate.
  • Run in-dictionary checks, to verify external dimensions.
  • Run basic numeric and statistical explorations of numeric data.
  • Run more visualization-based explorations of key data of interest.
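
To make the population check concrete, the following is a minimal sketch, assuming the data has already been loaded into a Spark DataFrame; the function name populationReport is illustrative rather than a library API. It counts null or empty values for every field:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, count, when}

    // Produce a single-row report of missing (null or empty) values per column.
    def populationReport(df: DataFrame): DataFrame = {
      val checks = df.columns.map { c =>
        // count() ignores nulls, so only rows matching the condition contribute
        count(when(col(c).isNull || col(c) === "", c)).alias(c)
      }
      df.select(checks: _*)
    }

Comparing each count against the total row count gives a population percentage per field, exactly the kind of metric we will later store to a standard schema for drift monitoring.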

Note

In character encoding terminology, a code point or code position is any of the numerical values that make up the code space. Many code points represent single characters, but they can also have other meanings, such as for formatting.
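
As a worked example, the following sketch (assuming the staged files are readable as text under a hypothetical path) produces a code point level frequency report by decomposing every raw line into its Unicode code points and counting them; unexpected control characters or encodings show up immediately at the extremes of this report:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("codepoint-report").getOrCreate()
    import spark.implicits._

    // Read the raw files as text and count the frequency of every code point;
    // this quickly exposes unexpected encodings or stray control characters.
    val codePointFrequencies = spark.read
      .textFile("/data/staging/gkg/*.csv")              // hypothetical staging path
      .flatMap(line => line.codePoints().toArray.toSeq) // decompose each line into Unicode code points
      .toDF("codePoint")
      .groupBy("codePoint")
      .count()
      .orderBy($"count".desc)

    codePointFrequencies.show(20)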
