Effective data segmentation

Simply by virtue of collecting some types of data (system performance metrics, log files, and so on) from underlying servers/hosts, there is likely already a natural segmentation of the data by server/host. Let's look at a sample measurement from Metricbeat:

{
  "_index": "metricbeat-6.0.0-2018.01.01",
  "_type": "doc",
  "_id": "ZQtas2ABB_sNnq-vMrgR",
  "_score": 1,
  "_source": {
    "@timestamp": "2018-01-01T20:10:19.227Z",
    "system": {
      "memory": {
        "swap": {
          "used": {
            "bytes": 0,
            "pct": 0
          },
          "free": 0,
          "total": 0
        },
        "total": 15464677376,
        "used": {
          "bytes": 9050693632,
          "pct": 0.5852
        },
        "free": 6158319616,
        "actual": {
          "free": 6413983744,
          "used": {
            "pct": 0.5852,
            "bytes": 9050693632
          }
        }
      }
    },
    "metricset": {
      "rtt": 214,
      "module": "system",
      "name": "memory"
    },
    "beat": {
      "name": "demo",
      "hostname": "demo",
      "version": "6.0.0"
    }
  }
}

In the document, we can see the beat.name field (and/or beat.hostname, both of which have the value demo in this example) inside the nested beat object. This is the name of the system from which the data originates. By default, data from all instances of Metricbeat is collated into a single, daily index of documents with a name similar to metricbeat-6.0.0-2018.01.01, where the date is the particular day on which the data was recorded. Time-based index names are a common practice in the Elastic Stack for this kind of time series data, primarily because they make it easy to manage historical data according to a retention policy (dropping data older than X days is accomplished by deleting the appropriate indices).
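For example, with the default daily indices, retiring a day's worth of data is a single call to the delete index API (shown here in Kibana Dev Tools console syntax, using the index name from the sample document above):

DELETE metricbeat-6.0.0-2018.01.01

In practice, tooling such as Elasticsearch Curator is often used to automate this kind of age-based index deletion rather than issuing the calls by hand.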

Perhaps some of the hosts in our environment support one application (for example, online purchases), while other hosts support a different application (for example, invoice processing). With all hosts reporting their Metricbeat data into a single index, if we want to orient our reporting and analysis around one or both of these applications, it is obviously inappropriate to base the analysis solely on the index. And, as we've seen in previous chapters when setting up ML jobs, the job configurations are very index-centric: you must specify the index of the data to be analyzed in the first step of the configuration.

Although this may seem like a conundrum (we want our analysis to be application-centric, but our data is not), we have a few options:

  • Modify the base query of the ML job so that it filters the data to only the hosts associated with the application of interest
  • Modify the data on ingest to insert additional contextual information into each document, which is later used to filter the query made by the ML job

Both require customizing the query that the ML job makes against the raw data. However, the first option usually requires a more complex query, and the second option requires an interstitial step of data enrichment using something like Logstash. Let's briefly discuss each.
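To make the first option concrete, here is a minimal sketch of such a filtered query, assuming the hypothetical host names web-01 and web-02 belong to the online purchases application. This would replace the default match_all query in the ML job's datafeed configuration:

{
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "beat.name": [ "web-01", "web-02" ]
        }
      }
    }
  }
}

For the second option, assume the ingest pipeline (Logstash, for instance) has been configured to add a hypothetical top-level application field to each document. The datafeed query then collapses to a simple term filter:

{
  "query": {
    "term": {
      "application": "online-purchases"
    }
  }
}

The trade-off between the two is where the host-to-application mapping lives: the first keeps it inside every ML job configuration (and each job must be edited as hosts come and go), while the second pays an enrichment cost once at ingest time and keeps the job queries trivial.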
