Linking together datasets

When building each analytical dataset, ask yourself how you can link it to other analytical datasets. Which fields are natural bridges to related key datasets? Then, intentionally build the analytical dataset in a way that makes it easy to connect to others. This is typically done by creating a new identifier key or using an existing one. The identifier key would be the same in both datasets.

If this sounds familiar, it is the same concept as in relational database design. It is also closely related to star schema design in Online Analytical Processing (OLAP) data warehousing. The goal, however, is somewhat different.

With relational database design, the goal is to minimize or even eliminate data duplication by normalizing the data into multiple linked tables. Normalizing separates the identifier from its description into different tables, so that the description is only stored once.

With star schema design, the goal is to make drill down, drill through, and predefined metric calculations easy and fast. Datasets are stored and linked along dimensions such as time, category, fiscal year, and company divisions.

However, with LAD design, the goal is to minimize joins while still allowing easy creation of hybrid datasets that were not previously conceived. The aim is to minimize data transformation work for the IoT analyst who is building training sets for ML modeling. The tradeoff is in data size, the initial ETL complexity when developing the analytical datasets, and the duplication of data.
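To make the contrast concrete, here is a minimal sketch in plain Python. The device and reading records, and all field names, are hypothetical and chosen only for illustration. It shows the same data in a normalized layout, where the description is stored once, and in a LAD-style layout, where the description is duplicated into each row so that no join is needed.

```python
# Normalized layout: the description is stored once, in its own table.
# (All records and field names here are hypothetical.)
devices = {"D1": {"model": "TempSensor-A", "site": "Warehouse 7"}}
readings_normalized = [
    {"device_id": "D1", "temp_c": 21.5},
    {"device_id": "D1", "temp_c": 22.1},
]

# Building a training set from the normalized layout needs a lookup
# (a join) for every row.
training_from_normalized = [
    {**row, **devices[row["device_id"]]} for row in readings_normalized
]

# LAD-style layout: the descriptive fields are duplicated into every row,
# so the analyst can use the dataset as is. device_id is kept as a link key.
readings_lad = [
    {"device_id": "D1", "temp_c": 21.5, "model": "TempSensor-A", "site": "Warehouse 7"},
    {"device_id": "D1", "temp_c": 22.1, "model": "TempSensor-A", "site": "Warehouse 7"},
]
```

Both layouts carry the same information. The LAD-style rows are larger, but they are ready for modeling without further transformation.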

The following diagram shows a simple example:

Linked Analytic Dataset design

The cost of storing large datasets has dropped dramatically with big data storage systems, such as HDFS or S3. The cost of missing an IoT analytics business opportunity because your data scientists are tied up with data munging can be very high. The extra storage is a worthwhile tradeoff and can greatly accelerate the iteration time for new ML model development.

Follow these steps to identify and build links between analytical datasets:

  1. Identify fields that can create a bridge to other analytical datasets: These datasets may already exist or may be under consideration for creation. In our GPS data example, the combination of latitude and longitude identifies a location. Certain locations, such as rest stops, distribution centers, and fueling stations, have useful and possibly predictive data tied to them.
  2. If necessary, combine multiple fields to create a single field that identifies the linkage: Big data systems handle single field joins much better than multiple field joins. A single field is also much simpler for the data scientist to use, which makes mistakes in combining datasets less likely. Following our example, you can combine a slightly rounded latitude and longitude value into a single identifier and store it in a separate field in your dataset. The rounding adjusts for extremely precise GPS values that are at the same location but in different areas of, for example, the parking lot.
  3. Repeat for all linkable fields in the dataset: The GPS grid identifier is another candidate.
  4. Add the dataset and its links to a master diagram to use as reference.
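
The second step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed implementation: the function name `make_location_key`, the three-decimal rounding (roughly 110 meters of latitude), the key format, and the sample records are all hypothetical.

```python
def make_location_key(lat, lon, decimals=3):
    """Combine rounded latitude and longitude into one string key."""
    # Adding 0.0 turns a rounded -0.0 into 0.0 so keys stay consistent.
    lat_r = round(lat, decimals) + 0.0
    lon_r = round(lon, decimals) + 0.0
    return f"{lat_r:.{decimals}f}_{lon_r:.{decimals}f}"


# Hypothetical GPS readings taken in different parts of the same parking lot.
gps_readings = [
    {"truck_id": "T1", "lat": 41.87912, "lon": -87.63581},
    {"truck_id": "T2", "lat": 41.87893, "lon": -87.63614},
]

# Hypothetical location dataset, keyed by the same identifier.
locations = {
    make_location_key(41.879, -87.636): {"location_type": "fueling station"},
}

# Store the key as its own field, then link on that single field.
for reading in gps_readings:
    reading["location_key"] = make_location_key(reading["lat"], reading["lon"])
    match = locations.get(reading["location_key"], {})
    reading["location_type"] = match.get("location_type")
```

Both readings receive the same `location_key` and therefore link to the same location record. Simple rounding can still split two nearby points that fall on either side of a rounding boundary, so the precision should be chosen with the size of the locations in mind.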

The following diagram shows how our simple GPS example can link to other analytical datasets:

Simple LAD example