Exploratory Data Analysis (EDA) performed in commercial settings is generally commissioned as part of a larger piece of work that is organized and executed along the lines of a feasibility assessment. The aim of this feasibility assessment, and thus the focus of what we can term an extended EDA, is to answer a broad set of questions about whether the data examined is fit for purpose and thus worthy of further investment.
Under this general remit, the data investigations are expected to cover several aspects of feasibility. These include the practical aspects of using the data in production, such as its timeliness, quality, complexity, and coverage, as well as its suitability for the intended hypothesis to be tested. While some of these aspects may be less exciting from a data science perspective, such data-quality-led investigations are no less important than purely statistical insights. This is especially true when the datasets in question are very large and complex, and when the investment needed to prepare the data for the data science might be significant. To illustrate this point, and to bring the topic to life, we present methods for doing an EDA of the vast and complex Global Knowledge Graph (GKG) data feeds, made available by the Global Database of Events, Language and Tone (GDELT) project.
In this chapter, we will create and interpret an EDA while covering the following topics:
The plot.ly library

In this section, we will explore why an EDA might be required and discuss the important considerations for creating one.
A difficult question that precedes an EDA project is: Can you give me an estimate and breakdown of your proposed EDA costs, please?
How we answer this question ultimately shapes our EDA strategy and tactics. In days gone by, the answer to this question typically started like this: Basically you pay by the column.... This rule of thumb is based on the premise that there is an iterable unit of data exploration work, and these units of work drive the estimate of effort and thus the rough price of performing the EDA.
What's interesting about this idea is that the units of work are quoted in terms of the data structures to investigate, rather than the functions that need writing. The reason for this is simple: the data processing pipelines of functions are assumed to exist already, rather than being new work, so the quotation offered is actually the implied cost of configuring the new input data structures into our standard data processing pipelines for exploring data.
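The pay-by-the-column rule of thumb can be sketched as a simple effort model. The rates, field names, and function below are illustrative assumptions only, not figures from the text:

```python
# A minimal sketch of the "pay by the column" estimate: a fixed setup
# overhead plus one iterable unit of exploration work per column.
# days_per_column and overhead_days are hypothetical placeholder rates.

def estimate_eda_effort(columns, days_per_column=0.5, overhead_days=2.0):
    """Return a rough EDA effort estimate in days."""
    return overhead_days + len(columns) * days_per_column

columns = ["event_id", "date", "actor", "tone", "location"]
print(estimate_eda_effort(columns))  # 2.0 + 5 * 0.5 = 4.5
```

The point is that the estimate scales with the data structures to configure, not with code to be written from scratch.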
This thinking brings us to the main EDA problem: exploration seems hard to pin down when planning tasks and estimating timings. The recommended approach is to treat explorations as configuration-driven tasks. This helps us to structure and estimate the work more effectively, and it shapes the thinking around the effort so that configuration is the central challenge, rather than the writing of a lot of ad hoc throwaway code.
The process of configuring data exploration also drives us to consider the processing templates we might need, configured according to the form of the data we explore. For instance, we would need a standard exploration pipeline for structured data, for text data, for graph-shaped data, for image data, for sound data, for time series data, and for spatial data. Once we have these templates, we simply need to map our input data to them and configure our ingestion filters to deliver a focused lens over the data.
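The template idea can be sketched as a registry mapping each form of data to a standard exploration pipeline; new inputs are then configured against it rather than coded from scratch. All names here are illustrative assumptions, and only two of the forms are stubbed out:

```python
# Illustrative sketch: one standard exploration template per form of data.
# explore_structured and explore_text are hypothetical stand-ins for
# fuller pipelines (graph, image, sound, time series, spatial would follow).

def explore_structured(rows):
    """Tabular template: per-column (null_count, total_count)."""
    counts = {}
    for row in rows:
        for col, val in row.items():
            nulls, total = counts.get(col, (0, 0))
            counts[col] = (nulls + (val is None), total + 1)
    return counts

def explore_text(docs):
    """Text template: token count per document."""
    return {i: len(doc.split()) for i, doc in enumerate(docs)}

TEMPLATES = {
    "structured": explore_structured,
    "text": explore_text,
}

def run_exploration(form, data):
    """Map the input data onto the matching template."""
    return TEMPLATES[form](data)

rows = [{"tone": 2.5, "actor": None}, {"tone": None, "actor": "GOV"}]
print(run_exploration("structured", rows))
# {'tone': (1, 2), 'actor': (1, 2)}
```

Adding a new data source then amounts to configuration: pick a template, point the ingestion filter at the input, and run.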
Modernizing these ideas for Apache Spark based EDA processing means that we need to design our configurable EDA functions and code with some general principles in mind:
Lastly, although it is not a strict principle per se, we need to construct exploratory tools that are flexible enough to discover data structures rather than depend on rigid pre-defined configurations. This helps when things go wrong, by helping us to reverse engineer the file content, the encodings, or the potential errors in the file definitions when we come across them.
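As a small illustration of discovering structure rather than assuming it, Python's standard-library csv.Sniffer can infer the delimiter of a file from a sample, which is exactly the kind of reverse engineering that helps when a file's stated definition turns out to be wrong. The sample data below is made up for illustration:

```python
# A sketch of structure discovery: infer the delimiter from a sample
# instead of trusting a rigid pre-defined configuration.
import csv
import io

# Hypothetical sample: a tab-separated feed whose spec claimed commas.
sample = "id\tdate\ttone\n1\t20240101\t2.5\n2\t20240102\t-1.0\n"

dialect = csv.Sniffer().sniff(sample)        # inspects the sample text
rows = list(csv.reader(io.StringIO(sample), dialect))

print(repr(dialect.delimiter))  # '\t'
print(rows[0])                  # ['id', 'date', 'tone']
```

The same discover-first mindset applies to character encodings and column counts: probe a sample, compare it against the supplied file definition, and flag any mismatch before building the full pipeline.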
The early stages of all EDA work are invariably based on the simple goal of establishing whether the data is of good quality. If we focus here, we can lay down a widely applicable set of getting-started tasks.
These tasks create the general shape of a proposed EDA project plan, which is as follows: