Sizing ML deployments

People often ask how to appropriately size a cluster that will run Elastic ML. Beyond the obvious "it depends" answer, it is useful to have an empirical approach to the process. As seen on the Elastic blog Sizing for Machine Learning with Elasticsearch (https://www.elastic.co/blog/sizing-machine-learning-with-elasticsearch), there is a key recommendation: use dedicated nodes for ML so that ML jobs don't interfere with the other tasks of the cluster's data nodes (indexing, searching, and so on). To scope how many dedicated nodes are necessary, follow this approach:

  • If there are no representative jobs created yet, use generic rules of thumb based on the overall cluster size from the blog. These rules of thumb are as follows:
    • Recommend 1 dedicated ML node (2 for high availability) for clusters with fewer than 10 data nodes
    • Use at least 2 dedicated ML nodes for clusters of up to 25 data nodes
    • Add one more ML node for every additional 25 data nodes

Example: 60 data nodes should have 3 (aggressive) or 4 (conservative) ML nodes
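The rule of thumb above is easy to script. Here is a minimal bash sketch; the function name and the aggressive/conservative rounding modes are our own framing of the blog's guidance, not part of it:

```shell
#!/bin/bash
# Recommend a dedicated ML node count from the data node count.
# "aggressive" rounds down, "conservative" rounds up (default: aggressive).
ml_nodes() {
  local data_nodes=$1 mode=${2:-aggressive}
  if (( data_nodes < 10 )); then
    echo 1                                  # 2 for high availability
  elif (( data_nodes <= 25 )); then
    echo 2
  else
    local extra=$(( data_nodes - 25 ))
    if [ "$mode" = conservative ]; then
      echo $(( 2 + (extra + 24) / 25 ))     # round up
    else
      echo $(( 2 + extra / 25 ))            # round down
    fi
  fi
}

ml_nodes 60               # -> 3
ml_nodes 60 conservative  # -> 4
```

Running it for the example above reproduces the 3 (aggressive) versus 4 (conservative) recommendation for 60 data nodes.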

If representative jobs already exist, you can measure their memory footprint empirically. The following script prints each job's ID and its model size in bytes (taken from the job's model snapshot stats) as CSV, ready to paste into a sizing spreadsheet:

#!/bin/bash
HOST='localhost'
PORT=9200
#CURL_AUTH="-u elastic:changeme"

# Collect the IDs of all configured anomaly detection jobs
list=$(curl $CURL_AUTH -s "http://$HOST:$PORT/_xpack/ml/anomaly_detectors?pretty" | jq --raw-output '.jobs[].job_id')

# For each job, emit "job_id,model_bytes" from its first listed model snapshot
while read -r JOB_ID; do
   curl $CURL_AUTH -s "http://$HOST:$PORT/_xpack/ml/anomaly_detectors/${JOB_ID}/model_snapshots?pretty" | \
     jq --raw-output '.model_snapshots[0] | "\(.job_id),\(.model_size_stats.model_bytes)"'
done <<< "$list"
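To turn those per-job figures into an overall estimate, a quick awk pass can total the model memory. This is a sketch that assumes the job_id,model_bytes CSV lines are piped in; the function name and the sample sizes are illustrative:

```shell
#!/bin/bash
# Sum the model_bytes column of job_id,model_bytes CSV lines and report MB.
total_model_mb() {
  awk -F, '{ bytes += $2 } END { printf "%.1f\n", bytes / 1048576 }'
}

# Example with made-up model sizes (1 MB and 2 MB):
printf 'job_a,1048576\njob_b,2097152\n' | total_model_mb   # -> 3.0
```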

There are several assumptions being made in the sizing spreadsheet:

  • CPU and query load are not going to be a limiting factor
  • Future jobs will be similar to existing jobs, so extrapolation is a legitimate approach
  • This approach does not exclude test jobs that may be configured but unused
  • The user will set xpack.ml.max_open_jobs to the appropriate number after the sizing
  • The user sets xpack.ml.max_machine_memory_percent appropriately (the default is 30%)
  • The user will not try to run the analysis over lots of historical data on all jobs at the same time, as this is the most intensive part of ML's operation

Here are some tips:

  • ML jobs run outside of the JVM, so if there is a dedicated ML node, the JVM isn't used for much and you can reduce the heap size (from the typical size for a data node) to gain more room for ML jobs 
  • You can increase xpack.ml.max_machine_memory_percent on larger RAM machines, especially if the JVM heap has been reduced
  • See other settings in the docs at https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-settings.html
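Putting those tips together, the ML-related settings might look like this in elasticsearch.yml on a dedicated ML node. This is only a sketch: the values are illustrative, not recommendations:

```yaml
# elasticsearch.yml on a dedicated ML node (illustrative values)
node.ml: true            # this node can run ML jobs
node.data: false         # dedicated ML node: holds no shard data
node.master: false
node.ingest: false
xpack.ml.max_open_jobs: 20               # cap concurrently open jobs after sizing
xpack.ml.max_machine_memory_percent: 40  # raised from the 30% default on a large-RAM machine
```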
