ML job throughput considerations

ML is awesome, and is no doubt very fast and scalable, but there will still be a practical upper bound on the events per second that any ML job can process, depending on a couple of different factors:

  • The speed at which data can be delivered to the ML algorithms (that is, query performance)
  • The speed at which the ML algorithms can chew through the data, given the desired analysis

For the latter, much of the performance is based upon the following:

  • The function(s) chosen for the analysis; for example, count is faster than lat_long
  • The chosen bucket_span (longer bucket spans are faster than shorter ones because fewer buckets are analyzed per unit of time, so the per-bucket overhead of writing results and so on compounds less)

However, if you have a defined analysis setup and can't really change it for other reasons, then there's not much you can do unless you get creative and split the data up into multiple jobs. This is because each ML job is (at least for now) tied to a single CPU for the analysis part (the C++ process called autodetect), so splitting the data into a few separate ML jobs to take advantage of multiple CPUs might be an option. But, before that, let's focus on the former, the query's performance, as there are a variety of possibilities here:

  • Avoid doing a cross-cluster search to limit data transmission across the network.
  • Tweak datafeed parameters to optimize performance.
  • Use Elasticsearch query aggregations to distill the data down to a smaller, summarized set before it is handed to the ML algorithms.

The first one is sort of obvious. You're only going to improve performance if you move the analysis closer to the raw data.

The second one may take some experimentation. There are datafeed parameters, such as scroll_size, that control how much data is retrieved with each scroll request. The default is 1,000 and, for decently sized clusters, this can often be safely increased to 10,000. Run some tests at different scroll sizes and see how they affect query and cluster performance.
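
To make that concrete, here is a minimal sketch of what such an experiment might look like using the datafeed update endpoint; the host, credentials, and datafeed ID are placeholders, and 10,000 is just a value to test, not a recommendation for every cluster:

    # A rough sketch only: host, credentials, and the datafeed ID are
    # placeholders, and the 10,000 value is a test setting, not a
    # universal recommendation.
    import requests

    ES_URL = "http://localhost:9200"        # assumed local cluster
    DATAFEED_ID = "datafeed-my_ml_job"      # hypothetical datafeed ID
    AUTH = ("elastic", "changeme")          # placeholder credentials

    # A datafeed must be stopped before its configuration can be updated.
    requests.post(f"{ES_URL}/_ml/datafeeds/{DATAFEED_ID}/_stop", auth=AUTH)

    # Raise scroll_size from the default of 1,000 to 10,000 for testing.
    resp = requests.post(
        f"{ES_URL}/_ml/datafeeds/{DATAFEED_ID}/_update",
        json={"scroll_size": 10000},
        auth=AUTH,
    )
    resp.raise_for_status()

    # Restart the datafeed, then compare query and cluster metrics against
    # the previous setting before deciding what to keep.
    requests.post(f"{ES_URL}/_ml/datafeeds/{DATAFEED_ID}/_start", auth=AUTH)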

The last one should make the biggest impact on performance, in my opinion, but it is admittedly a little tricky and error-prone to get the ES aggregation correct so that it works properly with ML, though it's not so bad. See the documentation at https://www.elastic.co/guide/en/elastic-stack-overview/current/ml-configuring-aggregation.html for more information. The downside of using aggregations with ML, in general, is that you lose access to the other fields in the data that might have made good influencers.
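
As a rough illustration (the index pattern, field names, job ID, and interval below are all made up, and details such as fixed_interval vary by Elasticsearch version), a datafeed that pre-aggregates the data before it ever reaches the job might look something like this:

    # A rough sketch only: the index pattern, field names, and job ID are
    # placeholders, and syntax details depend on your Elasticsearch version.
    import requests

    ES_URL = "http://localhost:9200"        # assumed local cluster
    AUTH = ("elastic", "changeme")          # placeholder credentials

    datafeed_body = {
        "job_id": "traffic_job",            # hypothetical job, created beforehand
        "indices": ["web-traffic-*"],       # hypothetical index pattern
        "aggregations": {
            "buckets": {
                "date_histogram": {
                    "field": "timestamp",
                    # Keep this at or below the job's bucket_span, ideally
                    # an even divisor of it.
                    "fixed_interval": "300s"
                },
                "aggregations": {
                    # The time field must come back as a max sub-aggregation
                    # named after the field itself.
                    "timestamp": {"max": {"field": "timestamp"}},
                    # The pre-summarized metric the job will analyze.
                    "bytes_sum": {"sum": {"field": "bytes"}}
                }
            }
        }
    }

    resp = requests.put(
        f"{ES_URL}/_ml/datafeeds/datafeed-traffic_job",
        json=datafeed_body,
        auth=AUTH,
    )
    resp.raise_for_status()

The corresponding job would typically set summary_count_field_name to doc_count in its analysis_config and use a detector such as sum(bytes_sum); as noted above, only fields that survive the aggregation remain available as influencers.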

All in all, there are a few things to consider when optimizing the ML job's performance.
