Training models at scale

In an earlier section of this chapter, we listed and studied what industry experts agree are the most common phases of any predictive analytics project.

To recall, they are as follows:

  • Defining the data source
  • Profiling and preparation of the data source
  • Determining the question(s) that you want to ask your data
  • Choosing an algorithm to train on the data source
  • Application of a predictive model

In a predictive analytics project that uses big data, those same phases are present, but they may vary slightly and require some supplementary effort.

Pain by phase

In the initial phase of a project, once you've chosen a source for your data (determined the data source), the data must be obtained. Some industry experts describe this as the acquisition and recording of data. In a predictive project that involves a more common data source, accessing the data might be as straightforward as opening a file on your local disk; with a big data source, it's a bit more difficult. For example, suppose your project sources data from a combination of devices (multiple servers and many mobile devices, that is, Internet of Things data).

This activity-generated data might include a combination of website tracking information, application logs, and sensor data, among other machine-generated content, perfect for your analysis. You can see how accessing this information as a single data source for your project would take considerable effort (and expertise!).
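
To make this concrete, here is a minimal Python sketch of pulling such scattered, machine-generated feeds into one time-ordered working set. The file layouts, field names, and loader functions are illustrative assumptions, not part of any particular platform or project:

    import csv
    import json
    from datetime import datetime, timezone

    def load_server_log(path):
        """Read a hypothetical server access log stored as CSV (timestamp,url,status)."""
        records = []
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                ts = datetime.fromisoformat(row["timestamp"])
                if ts.tzinfo is None:          # assume UTC when the log omits a time zone
                    ts = ts.replace(tzinfo=timezone.utc)
                records.append({"ts": ts, "source": "server",
                                "payload": {"url": row["url"], "status": int(row["status"])}})
        return records

    def load_device_events(path):
        """Read hypothetical mobile/IoT events stored as JSON lines (epoch_ms, device_id, reading)."""
        records = []
        with open(path) as f:
            for line in f:
                event = json.loads(line)
                records.append({"ts": datetime.fromtimestamp(event["epoch_ms"] / 1000, tz=timezone.utc),
                                "source": "device",
                                "payload": {"device_id": event["device_id"], "reading": event["reading"]}})
        return records

    def build_single_source(*feeds):
        """Combine every feed into one time-ordered list, the project's single data source."""
        combined = []
        for loader, path in feeds:
            combined.extend(loader(path))
        combined.sort(key=lambda r: r["ts"])
        return combined

    # Example usage (file names are placeholders):
    # unified = build_single_source((load_server_log, "access_log.csv"),
    #                               (load_device_events, "device_events.jsonl"))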

In the profiling and preparation phase, data is extracted, cleaned, and annotated. Typically, any analytics project will require this pre-processing of the data: setting context, identifying operational definitions and statistical types, and so on. This step is critical because it is the phase where we establish an understanding of the data's challenges so that later surprises can be minimized. This phase usually involves time spent querying and re-querying the data, creating visualizations to validate findings, and then updating the data to address areas of concern. Big data complicates these activities because there is more data to process, the formats may be inconsistent, and the content may be changing rapidly.
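
As a rough illustration of this profiling work, the following Python sketch (assuming pandas and an already-loaded DataFrame with hypothetical columns) builds a quick per-column summary that helps surface data challenges early:

    import pandas as pd

    def profile(df: pd.DataFrame) -> pd.DataFrame:
        """Return a one-row-per-column summary used to spot data challenges early."""
        summary = pd.DataFrame({
            "dtype": df.dtypes.astype(str),
            "non_null": df.notna().sum(),
            "missing_pct": (df.isna().mean() * 100).round(2),
            "unique": df.nunique(dropna=True),
        })
        # Add basic statistics for the numeric columns only.
        numeric = df.select_dtypes("number")
        summary["min"] = numeric.min()
        summary["max"] = numeric.max()
        return summary

    # Example usage (file name is a placeholder):
    # df = pd.read_csv("prepared_extract.csv")
    # print(profile(df))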

In the phase where question determination takes place, data integration, aggregation, and the representation of the data must be considered so that the proper questions to ask of the data can be identified. This phase may be divided into three steps: preparation, integration, and determination (of questions). The prep step involves assembling the data, identifying unique keys, aggregating and de-duplicating, scrubbing as required, manipulating formats, and perhaps mapping values. The integration step involves merging data, testing, and reconciliation. Finally, the project questions are established. Once again, big data's volumes, varieties, and velocities can slow this phase down considerably.
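
The following Python sketch, using pandas and entirely hypothetical table and column names, shows the flavor of the prep and integration steps described above (key normalization, de-duplication, format manipulation, value mapping, merging, and a simple reconciliation check):

    import pandas as pd

    # Hypothetical value mapping discovered during profiling.
    REGION_MAP = {"N. America": "NA", "North America": "NA", "EMEA": "EMEA"}

    def prepare(customers: pd.DataFrame) -> pd.DataFrame:
        """Prep step: key identification, de-duplication, scrubbing, format and value mapping."""
        out = customers.copy()
        out["customer_id"] = out["customer_id"].astype(str).str.strip()   # normalize the key
        out = out.drop_duplicates(subset="customer_id")                   # de-duplicate on the key
        out["signup_date"] = pd.to_datetime(out["signup_date"], errors="coerce")  # format manipulation
        out["region"] = out["region"].map(REGION_MAP).fillna("OTHER")     # mapping of values
        return out

    def integrate(customers: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
        """Integration step: merge the prepared tables and reconcile the result."""
        merged = customers.merge(orders, on="customer_id", how="left", validate="one_to_many")
        unmatched = merged["order_id"].isna().sum()
        print(f"customers without orders: {unmatched}")                   # simple reconciliation test
        return merged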

Choosing an algorithm and applying a predictive model are the phases where analysis, modeling, and interpretation of the data take place. Considering the volumes, varieties, and velocities of a big data source, selecting the appropriate algorithm to train on the data can be much more involved. For example, predictive modeling generally works best at the lowest possible level of granularity, yet the sheer volume of a big data source may have forced extensive aggregation in the previous phase, potentially burying anomalies and variations that exist within the data.
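
A tiny, made-up example illustrates the point about granularity. When hourly readings are aggregated to daily averages, an obvious spike disappears into the mean:

    import pandas as pd

    # Made-up hourly sensor readings; one hour on day two contains an obvious spike.
    hourly = pd.DataFrame({
        "day": ["2023-06-01"] * 4 + ["2023-06-02"] * 4,
        "reading": [10, 11, 9, 10, 10, 95, 9, 10],   # the 95 is the anomaly
    })

    daily = hourly.groupby("day", as_index=False)["reading"].mean()
    print(daily)
    # Day two averages to 31.0: the 95 spike is blended away and is far less
    # visible than at the hourly grain, which is why granularity matters.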

Specific challenges

Let's take a few moments to address some very specific challenges brought on by big data. Among the top topics are the following:

  • Heterogeneity
  • Scale
  • Location
  • Timeliness
  • Privacy
  • Collaborations
  • Reproducibility

Heterogeneity

When we speak of variety, we usually need to consider the heterogeneity of data types, representations, and semantic interpretations. Efforts to correctly review and understand these variations in big data sources can be time consuming and complex. Interestingly, data may appear homogeneous (more uniform) at a larger scale yet heterogeneous (less uniform) at a smaller scale. This means that the level at which you approach a big data source can produce very different results!
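
As a small, hypothetical illustration of representational heterogeneity, suppose three feeds report the same temperature field in three different forms. Harmonizing them into one unit is simple in miniature, but multiplied across thousands of fields and formats the effort becomes significant:

    import pandas as pd

    # The same logical "temperature" field, reported three different ways by three feeds.
    raw = pd.DataFrame({
        "source": ["feed_a", "feed_b", "feed_c"],
        "temperature": ["72F", "22.2", "295.4K"],   # Fahrenheit string, bare Celsius, Kelvin string
    })

    def to_celsius(value: str) -> float:
        """Harmonize heterogeneous representations into a single unit (Celsius)."""
        value = value.strip().upper()
        if value.endswith("F"):
            return round((float(value[:-1]) - 32) * 5 / 9, 1)
        if value.endswith("K"):
            return round(float(value[:-1]) - 273.15, 1)
        return float(value)

    raw["temp_c"] = raw["temperature"].apply(to_celsius)
    print(raw)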

Scale

We've already touched on the idea of scale. Typically, scale refers to the sheer size of the data source, but it can also refer to the data's complexity.

Location

Typically, you'll find that when you decide to use a big data source, it is not located all in one place but is spread throughout electronic space. This means that any process (manual or automated) will have to consolidate the data, physically or virtually, before it can be properly used in a project.
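
For illustration only, a consolidation step might look like the following Python sketch, which builds a virtual view over files scattered across several (placeholder) locations rather than copying them into one place first:

    import csv
    import glob

    def virtual_consolidation(patterns):
        """Yield rows from files scattered across several locations without copying them first.
        The glob patterns are placeholders for wherever the data actually lives."""
        for pattern in patterns:
            for path in glob.glob(pattern):
                with open(path, newline="") as f:
                    for row in csv.DictReader(f):
                        row["_origin"] = path      # remember where each record came from
                        yield row

    # Example usage (paths are hypothetical):
    # for record in virtual_consolidation(["/mnt/site_a/*.csv", "/mnt/site_b/*.csv"]):
    #     process(record)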

Timeliness

The bigger the data, the more time it will take to analyze. However, it is not just this time that is meant when one speaks of velocity in the context of big data. Rather, there is the challenge of the acquisition rate of the data. In other words, with data piling up or being updated continuously within the data source, when (or how often) is the correct snapshot established? In addition, scanning the entire data source to find a suitable sample pertinent to a particular predictive analytics objective is obviously impractical.
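
One standard technique for the sampling problem is reservoir sampling, which draws a fixed-size uniform sample from a stream without ever holding or scanning the whole source. The sketch below (Algorithm R) is a general illustration, not a method prescribed by any particular platform:

    import random

    def reservoir_sample(stream, k, seed=None):
        """Algorithm R: keep a uniform random sample of k items from a stream of unknown,
        possibly unbounded, length without holding the whole stream in memory."""
        rng = random.Random(seed)
        sample = []
        for i, item in enumerate(stream):
            if i < k:
                sample.append(item)
            else:
                j = rng.randint(0, i)   # replace an existing element with decreasing probability
                if j < k:
                    sample[j] = item
        return sample

    # Example usage: sample 5 values from a stream of a million without storing them all.
    # print(reservoir_sample(range(1_000_000), k=5, seed=42))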

Privacy

Data privacy should be a consideration when using any data source, and it is one that increases in complexity in the context of big data. The best-known example is electronic health records, which are governed by strict laws.

Suppose, for example, that you have a requirement to pre-process a big data source that is over a terabyte in size in order to hide both a user's identity and their location information.
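
One way to approach such a requirement, sketched here purely as an illustration with hypothetical column and file names, is to stream the source in chunks and pseudonymize or coarsen the sensitive fields as each chunk passes through; note that salted hashing is pseudonymization rather than full anonymization:

    import hashlib
    import pandas as pd

    SALT = "replace-with-a-secret-salt"   # assumption: a project-level secret, kept out of the data

    def pseudonymize(value: str) -> str:
        """One-way hash of an identifier; this hides identity but is not full anonymization."""
        return hashlib.sha256((SALT + str(value)).encode()).hexdigest()[:16]

    def scrub_file(src: str, dst: str, chunk_rows: int = 1_000_000) -> None:
        """Stream a large CSV in chunks so a terabyte-scale file never has to fit in memory."""
        first = True
        for chunk in pd.read_csv(src, chunksize=chunk_rows):
            chunk["user_id"] = chunk["user_id"].map(pseudonymize)   # hide identity
            chunk["latitude"] = chunk["latitude"].round(1)          # coarsen location
            chunk["longitude"] = chunk["longitude"].round(1)
            chunk.to_csv(dst, mode="w" if first else "a", header=first, index=False)
            first = False

    # Example usage (file names and columns are placeholders):
    # scrub_file("raw_events.csv", "scrubbed_events.csv")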

Collaborations

One might think that, in this day and age, analytics and predictive models are entirely computational (especially when you hear the term machine learning). However, no matter how advanced a predictive algorithm or model claims to be, there remain many patterns in data that humans can readily detect but that computer algorithms, no matter how complex their logic, have a hard time finding.

Note

There is a growing movement in analytics, which may be considered a sub-field of visual analytics, that utilizes subject matter expert (SME) input, at least with respect to the modeling and analysis phase of a predictive project.

Including a subject matter expert in a predictive analytics project might not be a huge issue but, with a big data source, it often takes multiple experts from different domains to really understand what is going on with the data and to share their respective explorations of results and advice.

These multiple experts may be separated in space and time and be difficult to assemble in one location at one time, again causing additional time and effort to be spent.

Reproducibility

Believe it or not, most predictive analytics projects are repeated for a variety of reasons. For example, if the results are questioned for any reason, or if the data is suspect, all of the phases of the project may be repeated. Reproducing a big data analytics project, however, is seldom feasible. In most cases, all that can be done is to find the bad data in the big data resource and flag it as such.
