Elasticsearch is often deployed as a cluster of nodes. A variety of APIs let you manage and monitor the cluster itself, rather than interact with the data stored within the cluster.
As with most functionality in Elasticsearch, there is an overarching design goal that tasks should be performed through an API rather than by modifying static configuration files. This becomes especially important as your cluster scales. Even with a provisioning system (such as Puppet, Chef, and Ansible), a single HTTP API call is often simpler than pushing new configurations to hundreds of physical machines.
To that end, this chapter presents the various APIs that allow you to dynamically tweak, tune, and configure your cluster. It also covers a host of APIs that provide statistics about the cluster itself so you can monitor for health and performance.
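As a quick taste of this API-first approach, here is a minimal sketch (assuming a node listening on localhost:9200) that changes a cluster-wide setting with a single HTTP call instead of a configuration push to every machine:

# Transiently disable shard allocation across the whole cluster
# (reverts when the cluster restarts). One API call, zero config files.
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
    "transient" : {
        "cluster.routing.allocation.enable" : "none"
    }
}'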
At the very beginning of the book (“Installing Marvel”), we encouraged you to install Marvel, a management and monitoring tool for Elasticsearch, because it would enable interactive code samples throughout the book.
If you didn’t install Marvel then, we encourage you to install it now. This chapter introduces a large number of APIs that emit an even larger number of statistics. These stats track everything from heap memory usage and garbage collection counts to open file descriptors. These statistics are invaluable for debugging a misbehaving cluster.
The problem is that these APIs provide a single data point: the statistic right now. Often you’ll want to see historical data too, so you can plot a trend. Knowing memory usage at this instant is helpful, but knowing memory usage over time is much more useful.
Furthermore, the output of these APIs can get truly hairy as your cluster grows. Once you have a dozen nodes, let alone a hundred, reading through stacks of JSON becomes very tedious.
Marvel periodically polls these APIs and stores the data back in Elasticsearch. This allows Marvel to query and aggregate the metrics, and then provide interactive graphs in your browser. There are no proprietary statistics that Marvel exposes; it uses the same stats APIs that are accessible to you. But it does greatly simplify the collection and graphing of those statistics.
Marvel is free to use in development, so you should definitely try it out!
An Elasticsearch cluster may consist of a single node with a single index. Or it may have a hundred data nodes, three dedicated masters, a few dozen client nodes—all operating on a thousand indices (and tens of thousands of shards).
No matter the scale of the cluster, you’ll want a quick way to assess the status
of your cluster. The Cluster Health
API fills that role. You can think of it
as a 10,000-foot view of your cluster. It can reassure you that everything
is all right, or alert you to a problem somewhere in your cluster.
Let’s execute a cluster-health
API and see what the response looks like:
GET _cluster/health
Like other APIs in Elasticsearch, cluster-health
will return a JSON response.
This makes it convenient to parse for automation and alerting. The response
contains some critical information about your cluster:
{
   "cluster_name": "elasticsearch_zach",
   "status": "green",
   "timed_out": false,
   "number_of_nodes": 1,
   "number_of_data_nodes": 1,
   "active_primary_shards": 10,
   "active_shards": 10,
   "relocating_shards": 0,
   "initializing_shards": 0,
   "unassigned_shards": 0
}
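Because the response is plain JSON, wiring it into an alert takes only a few lines. A minimal sketch using curl plus jq (jq is our assumption here, not something Elasticsearch provides):

#!/bin/bash
# Fetch cluster health and complain if the status is anything but green.
status=$(curl -s 'localhost:9200/_cluster/health' | jq -r '.status')
if [ "$status" != "green" ]; then
    echo "Cluster status is $status -- time to investigate" >&2
fi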
The most important piece of information in the response is the status
field.
The status may be one of three values:
green
All primary and replica shards are allocated. Your cluster is 100% operational.
yellow
All primary shards are allocated, but at least one replica is missing.
No data is missing, so search results will still be complete. However, your
high availability is compromised to some degree. If more shards disappear, you
might lose data. Think of yellow
as a warning that should prompt investigation.
red
At least one primary shard (and all of its replicas) is missing. This means that you are missing data: searches will return partial results, and indexing into that shard will return an exception.
The green/yellow/red status is a great way to glance at your cluster and understand what’s going on. The rest of the metrics give you a general summary of your cluster:
number_of_nodes
and number_of_data_nodes
are fairly self-descriptive.
active_primary_shards
indicates the number of primary shards in your cluster. This
is an aggregate total across all indices.
active_shards
is an aggregate total of all shards across all indices, which
includes replica shards.
relocating_shards
shows the number of shards that are currently moving from
one node to another node. This number is often zero, but can increase when
Elasticsearch decides a cluster is not properly balanced, a new node is added,
or a node is taken down, for example.
initializing_shards is a count of shards that are being freshly created. For example, when you first create an index, the shards will all briefly reside in the initializing state. This is typically a transient event, and shards shouldn’t linger in initializing too long. You may also see initializing shards when a node is first restarted: as shards are loaded from disk, they start as initializing.
unassigned_shards are shards that exist in the cluster state, but cannot be found in the cluster itself. A common source of unassigned shards is unassigned replicas. For example, an index with five shards and one replica will have five unassigned replicas in a single-node cluster. Unassigned shards will also be present if your cluster is red (since primaries are missing).
Imagine something goes wrong one day, and you notice that your cluster health looks like this:
{
   "cluster_name": "elasticsearch_zach",
   "status": "red",
   "timed_out": false,
   "number_of_nodes": 8,
   "number_of_data_nodes": 8,
   "active_primary_shards": 90,
   "active_shards": 180,
   "relocating_shards": 0,
   "initializing_shards": 0,
   "unassigned_shards": 20
}
OK, so what can we deduce from this health status? Well, our cluster is red, which means we are missing data (primary + replicas). We know our cluster has 10 nodes, but see only 8 data nodes listed in the health. Two of our nodes have gone missing. We also see that there are 20 unassigned shards.
That’s about all the information we can glean. The nature of those missing shards is still a mystery. Are we missing 20 indices with 1 primary shard each? Or 1 index with 20 primary shards? Or 10 indices with 1 primary + 1 replica? Which index?
To answer these questions, we need to ask cluster-health for a little more information by using the level parameter:

GET _cluster/health?level=indices
This parameter will make the cluster-health
API add a list of indices in our
cluster and details about each of those indices (status, number of shards,
unassigned shards, and so forth):
{
   "cluster_name": "elasticsearch_zach",
   "status": "red",
   "timed_out": false,
   "number_of_nodes": 8,
   "number_of_data_nodes": 8,
   "active_primary_shards": 90,
   "active_shards": 180,
   "relocating_shards": 0,
   "initializing_shards": 0,
   "unassigned_shards": 20,
   "indices": {
      "v1": {
         "status": "green",
         "number_of_shards": 10,
         "number_of_replicas": 1,
         "active_primary_shards": 10,
         "active_shards": 20,
         "relocating_shards": 0,
         "initializing_shards": 0,
         "unassigned_shards": 0
      },
      "v2": {
         "status": "red",
         "number_of_shards": 10,
         "number_of_replicas": 1,
         "active_primary_shards": 0,
         "active_shards": 0,
         "relocating_shards": 0,
         "initializing_shards": 0,
         "unassigned_shards": 20
      },
      "v3": {
         "status": "green",
         "number_of_shards": 10,
         "number_of_replicas": 1,
         "active_primary_shards": 10,
         "active_shards": 20,
         "relocating_shards": 0,
         "initializing_shards": 0,
         "unassigned_shards": 0
      },
      ....
   }
}
We can now see that the v2 index is the one that has made the cluster red, and that all 20 missing shards are from this index. Once we ask for the indices output, it becomes immediately clear: the v2 index has 10 primary shards and one replica, and all 20 shards are missing. Presumably these 20 shards were on the two nodes that are missing from our cluster.
The level parameter accepts one more option:

GET _cluster/health?level=shards
The shards option will provide a very verbose output, which lists the status and location of every shard inside every index. This output is sometimes useful, but because of its verbosity it can be difficult to work with. Once you know the index that is having problems, other APIs that we discuss in this chapter will tend to be more helpful.
The cluster-health
API has another neat trick that is useful when building
unit and integration tests, or automated scripts that work with Elasticsearch.
You can specify a wait_for_status parameter, which causes the call to return only after that status is reached. For example:

GET _cluster/health?wait_for_status=green
This call will block (not return control to your program) until the cluster health has turned green, meaning all primary and replica shards have been allocated. This is important for automated scripts and tests.
If you create an index, Elasticsearch must broadcast the change in cluster state to all nodes. Those nodes must initialize those new shards, and then respond to the master that the shards are Started. This process is fast, but because of network latency it may take 10–20ms.
If you have an automated script that (a) creates an index and then (b) immediately attempts to index a document, this operation may fail, because the index has not been fully initialized yet. The time between (a) and (b) will likely be less than 1ms—not nearly enough time to account for network latency.
Rather than sleeping, just have your script/test call cluster-health with a wait_for_status parameter. As soon as the index is fully created, the cluster health will change to green, the call will return control to your script, and you may begin indexing.
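Putting that together, a setup script might look like this sketch (the index name test_index is our own invention; on a single-node cluster you would wait for yellow instead, since replicas can never be assigned):

#!/bin/bash
# (a) Create the index
curl -s -XPUT 'localhost:9200/test_index'

# Block until all primary and replica shards are allocated
curl -s 'localhost:9200/_cluster/health?wait_for_status=green'

# (b) Safe to index now; the shards are started
curl -s -XPUT 'localhost:9200/test_index/doc/1' -d '{"field": "value"}'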
Valid options are green, yellow, and red. The call will return when the requested status (or one “higher”) is reached. For example, if you request yellow, a status change to yellow or green will unblock the call.
Cluster-health
is at one end of the spectrum—a very high-level overview of
everything in your cluster. The node-stats
API is at the other end. It provides
a bewildering array of statistics about each node in your cluster.
Node-stats provides so many stats that, until you are accustomed to the output, you may be unsure which metrics are most important to keep an eye on. We’ll highlight the most important metrics to monitor, but we encourage you to log all the metrics provided (or use Marvel), because you never know when you’ll need one stat or another.
The node-stats
API can be executed with the following:
GET _nodes/stats
Starting at the top of the output, we see the cluster name and our first node:
{
   "cluster_name": "elasticsearch_zach",
   "nodes": {
      "UNr6ZMf5Qk-YCPA_L18BOQ": {
         "timestamp": 1408474151742,
         "name": "Zach",
         "transport_address": "inet[zacharys-air/192.168.1.131:9300]",
         "host": "zacharys-air",
         "ip": [
            "inet[zacharys-air/192.168.1.131:9300]",
            "NONE"
         ],
...
The nodes are listed in a hash, with the key being the UUID of the node. Some information about the node’s network properties is displayed (such as transport address and host). These values are useful for debugging discovery problems, where nodes won’t join the cluster. Often you’ll see that the port being used is wrong, or the node is binding to the wrong IP address/interface.
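As your cluster grows, the full node-stats output quickly becomes unwieldy. The API accepts filters for both the nodes and the stats sections returned, which keeps the response manageable; a sketch of that usage:

# Only the jvm and indices sections, for all nodes
curl 'localhost:9200/_nodes/stats/jvm,indices'

# Full stats for a single node, selected by name
curl 'localhost:9200/_nodes/Zach/stats'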
The indices
section lists aggregate statistics for all the indices that reside
on this particular node:
"indices"
:
{
"docs"
:
{
"count"
:
6163666
,
"deleted"
:
0
},
"store"
:
{
"size_in_bytes"
:
2301398179
,
"throttle_time_in_millis"
:
122850
},
The returned statistics are grouped into the following sections:
docs
shows how many documents reside on
this node, as well as the number of deleted docs that haven’t been purged
from segments yet.
The store
portion indicates how much physical storage is consumed by the node.
This metric includes both primary and replica shards. If the throttle time is
large, it may be an indicator that your disk throttling is set too low
(discussed in “Segments and Merging”).
"indexing"
:
{
"index_total"
:
803441
,
"index_time_in_millis"
:
367654
,
"index_current"
:
99
,
"delete_total"
:
0
,
"delete_time_in_millis"
:
0
,
"delete_current"
:
0
},
"get"
:
{
"total"
:
6
,
"time_in_millis"
:
2
,
"exists_total"
:
5
,
"exists_time_in_millis"
:
2
,
"missing_total"
:
1
,
"missing_time_in_millis"
:
0
,
"current"
:
0
},
"search"
:
{
"open_contexts"
:
0
,
"query_total"
:
123
,
"query_time_in_millis"
:
531
,
"query_current"
:
0
,
"fetch_total"
:
3
,
"fetch_time_in_millis"
:
55
,
"fetch_current"
:
0
},
"merges"
:
{
"current"
:
0
,
"current_docs"
:
0
,
"current_size_in_bytes"
:
0
,
"total"
:
1128
,
"total_time_in_millis"
:
21338523
,
"total_docs"
:
7241313
,
"total_size_in_bytes"
:
5724869463
},
indexing
shows the number of docs that have been indexed. This value is a monotonically
increasing counter; it doesn’t decrease when docs are deleted. Also note that it
is incremented anytime an index operation happens internally, which includes
things like updates.
Also listed are times for indexing, the number of docs currently being indexed, and similar statistics for deletes.
get shows statistics about get-by-ID operations. This includes GET and HEAD requests for a single document.
search
describes the number of active searches (open_contexts
), number of
queries total, and the amount of time spent on queries since the node was
started. The ratio between query_time_in_millis / query_total
can be used as a
rough indicator for how efficient your queries are. The larger the ratio,
the more time each query is taking, and you should consider tuning or optimization.
The fetch statistics detail the second half of the query process (the fetch in query-then-fetch). If more time is spent in fetch than query, this can be an indicator of slow disks or very large documents being fetched, or potentially search requests with pagination that is too large (for example, size: 10000).
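The query_time_in_millis / query_total ratio mentioned earlier is easy to compute on the fly. A sketch using curl and jq (jq is an assumed convenience, and we skip nodes that have not yet served a query to avoid dividing by zero):

# Average time (ms) spent per query on each node -- a rough efficiency gauge.
curl -s 'localhost:9200/_nodes/stats/indices' |
  jq '.nodes[] | select(.indices.search.query_total > 0)
      | {name: .name,
         avg_query_ms: (.indices.search.query_time_in_millis
                        / .indices.search.query_total)}'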
merges
contains information about Lucene segment merges. It will tell you
the number of merges that are currently active, the number of docs involved, the cumulative
size of segments being merged, and the amount of time spent on merges in total.
Merge statistics can be important if your cluster is write heavy. Merging consumes a large amount of disk I/O and CPU resources. If your index is write heavy and you see large merge numbers, be sure to read “Indexing Performance Tips”.
Note: updates and deletes will contribute to large merge numbers too, since they cause segment fragmentation that needs to be merged out eventually.
"filter_cache"
:
{
"memory_size_in_bytes"
:
48
,
"evictions"
:
0
},
"id_cache"
:
{
"memory_size_in_bytes"
:
0
},
"fielddata"
:
{
"memory_size_in_bytes"
:
0
,
"evictions"
:
0
},
"segments"
:
{
"count"
:
319
,
"memory_in_bytes"
:
65812120
},
...
filter_cache
indicates the amount of memory used by the cached filter bitsets,
and the number of times a filter has been evicted. A large number of evictions
could indicate that you need to increase the filter cache size, or that
your filters are not caching well (for example, they are churning heavily because of high cardinality,
such as caching now
date expressions).
However, evictions are a difficult metric to evaluate. Filters are cached on a per-segment basis, and evicting a filter from a small segment is much less expensive than evicting a filter from a large segment. It’s possible that you have many evictions, but they all occur on small segments, which means they have little impact on query performance.
Use the eviction metric as a rough guideline. If you see a large number, investigate your filters to make sure they are caching well. Filters that constantly evict, even on small segments, will be much less effective than properly cached filters.
id_cache shows the memory usage by parent/child mappings. When you use parent/child relationships, the id_cache keeps an in-memory join table of those relationships. This statistic will show you how much memory is being used. There is little you can do to affect this memory usage, since it has a fairly linear relationship with the number of parent/child docs. It is heap-resident, however, so it’s a good idea to keep an eye on it.
fielddata displays the memory used by fielddata, which is used for aggregations, sorting, and more. There is also an eviction count. Unlike filter_cache, the eviction count here is useful: it should be zero or very close to it. Since fielddata is not a cache, any eviction is costly and should be avoided. If you see evictions here, you need to reevaluate your memory situation, fielddata limits, queries, or all three.
segments
will tell you the number of Lucene segments this node currently serves.
This can be an important number. Most indices should have around 50–150 segments,
even if they are terabytes in size with billions of documents. Large numbers
of segments can indicate a problem with merging (for example, merging is not keeping up
with segment creation). Note that this statistic is the aggregate total of all
indices on the node, so keep that in mind.
The memory
statistic gives you an idea of the amount of memory being used by the
Lucene segments themselves. This includes low-level data structures such as
posting lists, dictionaries, and bloom filters. A very large number of segments
will increase the amount of overhead lost to these data structures, and the memory
usage can be a handy metric to gauge that overhead.
The OS and Process sections are fairly self-explanatory and won’t be covered in great detail. They list basic resource statistics such as CPU and load. The OS section reports these for the entire operating system, while the Process section shows just what the Elasticsearch JVM process is using.
These are obviously useful metrics, but are often being measured elsewhere in your monitoring stack. Some stats include the following:
CPU
Load
Memory usage
Swap usage
Open file descriptors
The jvm
section contains some critical information about the JVM process that
is running Elasticsearch. Most important, it contains garbage collection details,
which have a large impact on the stability of your Elasticsearch cluster.
Because garbage collection is so critical to Elasticsearch, you should become intimately
familiar with this section of the node-stats
API:
"jvm"
:
{
"timestamp"
:
1408556438203
,
"uptime_in_millis"
:
14457
,
"mem"
:
{
"heap_used_in_bytes"
:
457252160
,
"heap_used_percent"
:
44
,
"heap_committed_in_bytes"
:
1038876672
,
"heap_max_in_bytes"
:
1038876672
,
"non_heap_used_in_bytes"
:
38680680
,
"non_heap_committed_in_bytes"
:
38993920
,
The jvm
section first lists some general stats about heap memory usage. You
can see how much of the heap is being used, how much is committed (actually allocated
to the process), and the max size the heap is allowed to grow to. Ideally,
heap_committed_in_bytes
should be identical to heap_max_in_bytes
. If the
committed size is smaller, the JVM will have to resize the heap eventually—and this is a very expensive process. If your numbers are not identical, see
“Heap: Sizing and Swapping” for how to configure it correctly.
The heap_used_percent metric is a useful number to keep an eye on. Elasticsearch is configured to initiate GCs when the heap reaches 75% full. If your node is consistently >= 75%, your node is experiencing memory pressure. This is a warning sign that slow GCs may be in your near future.
If the heap usage is consistently >= 85%, you are in trouble. Heaps over 90–95% are at risk of horrible performance, with long 10–30s GCs at best and out-of-memory (OOM) exceptions at worst.
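A quick way to spot nodes under memory pressure is to compare heap_used_percent against that 75% threshold across the cluster. A sketch, assuming jq is available:

# List any node whose heap usage has reached the 75% GC threshold.
curl -s 'localhost:9200/_nodes/stats/jvm' |
  jq -r '.nodes[] | select(.jvm.mem.heap_used_percent >= 75)
         | "\(.name): heap at \(.jvm.mem.heap_used_percent)%"'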
"pools"
:
{
"young"
:
{
"used_in_bytes"
:
138467752
,
"max_in_bytes"
:
279183360
,
"peak_used_in_bytes"
:
279183360
,
"peak_max_in_bytes"
:
279183360
},
"survivor"
:
{
"used_in_bytes"
:
34865152
,
"max_in_bytes"
:
34865152
,
"peak_used_in_bytes"
:
34865152
,
"peak_max_in_bytes"
:
34865152
},
"old"
:
{
"used_in_bytes"
:
283919256
,
"max_in_bytes"
:
724828160
,
"peak_used_in_bytes"
:
283919256
,
"peak_max_in_bytes"
:
724828160
}
}
},
The young
, survivor
, and old
sections will give you a breakdown of memory
usage of each generation in the GC. These stats are handy for keeping an eye on
relative sizes, but are often not overly important when debugging problems.
"gc"
:
{
"collectors"
:
{
"young"
:
{
"collection_count"
:
13
,
"collection_time_in_millis"
:
923
},
"old"
:
{
"collection_count"
:
0
,
"collection_time_in_millis"
:
0
}
}
}
The gc section shows the garbage collection counts and cumulative time for both young and old generations. You can safely ignore the young generation counts for the most part: this number will usually be large. That is perfectly normal.
In contrast, the old generation collection count should remain small, and
have a small collection_time_in_millis
. These are cumulative counts, so it is
hard to give an exact number when you should start worrying (for example, a node with a
one-year uptime will have a large count even if it is healthy). This is one of the
reasons that tools such as Marvel are so helpful. GC counts over time are the
important consideration.
Time spent GC’ing is also important. For example, a certain amount of garbage is generated while indexing documents. This is normal and causes a GC every now and then. These GCs are almost always fast and have little effect on the node: young generation takes a millisecond or two, and old generation takes a few hundred milliseconds. This is much different from 10-second GCs.
Our best advice is to collect collection counts and duration periodically (or use Marvel) and keep an eye out for frequent GCs. You can also enable slow-GC logging, discussed in “Logging”.
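If you are not running Marvel, even a crude sampler is better than nothing. A sketch that appends old-generation GC counts and cumulative times to a file every minute (jq assumed), so you can diff successive samples:

#!/bin/bash
# Sample old-gen GC stats per node every 60 seconds; plot or diff later.
while true; do
  ts=$(date +%s)
  curl -s 'localhost:9200/_nodes/stats/jvm' |
    jq -r --arg ts "$ts" '.nodes[] | [$ts, .name,
        .jvm.gc.collectors.old.collection_count,
        .jvm.gc.collectors.old.collection_time_in_millis] | @tsv'
  sleep 60
done >> gc_samples.tsv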
Elasticsearch maintains threadpools internally. These threadpools cooperate to get work done, passing work between each other as necessary. In general, you don’t need to configure or tune the threadpools, but it is sometimes useful to see their stats so you can gain insight into how your cluster is behaving.
There are about a dozen threadpools, but they all share the same format:
"index"
:
{
"threads"
:
1
,
"queue"
:
0
,
"active"
:
0
,
"rejected"
:
0
,
"largest"
:
1
,
"completed"
:
1
}
Each threadpool lists the number of threads that are configured (threads
),
how many of those threads are actively processing some work (active
), and how
many work units are sitting in a queue (queue
).
If the queue fills up to its limit, new work units will begin to be rejected, and
you will see that reflected in the rejected
statistic. This is often a sign
that your cluster is starting to bottleneck on some resources, since a full
queue means your node/cluster is processing at maximum speed but unable to keep
up with the influx of work.
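Rejections are therefore worth alerting on. A sketch that surfaces any threadpool with a nonzero rejected count, on any node (jq assumed):

# Print every threadpool, on every node, that has rejected work units.
curl -s 'localhost:9200/_nodes/stats/thread_pool' |
  jq -r '.nodes[] | .name as $node | .thread_pool | to_entries[]
         | select(.value.rejected > 0)
         | "\($node) \(.key): \(.value.rejected) rejected"'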
There are a dozen threadpools. Most you can safely ignore, but a few are good to keep an eye on:
indexing
Threadpool for normal indexing requests
bulk
Bulk requests, which are distinct from the nonbulk indexing requests
get
Get-by-ID operations
search
All search and query requests
merging
Threadpool dedicated to managing Lucene merges
Continuing down the node-stats
API, you’ll see a bunch of statistics about your
filesystem: free space, data directory paths, disk I/O stats, and more. If you are
not monitoring free disk space, you can get those stats here. The disk I/O stats
are also handy, but often more specialized command-line tools (iostat
, for example)
are more useful.
Obviously, Elasticsearch has a difficult time functioning if you run out of disk space—so make sure you don’t.
There are also two sections on network statistics:
"transport"
:
{
"server_open"
:
13
,
"rx_count"
:
11696
,
"rx_size_in_bytes"
:
1525774
,
"tx_count"
:
10282
,
"tx_size_in_bytes"
:
1440101928
},
"http"
:
{
"current_open"
:
4
,
"total_opened"
:
23
},
transport
shows some basic stats about the transport address. This
relates to inter-node communication (often on port 9300) and any transport client
or node client connections. Don’t worry if you see many connections here;
Elasticsearch maintains a large number of connections between nodes.
http
represents stats about the HTTP port (often 9200). If you see a very
large total_opened
number that is constantly increasing, that is a sure sign
that one of your HTTP clients is not using keep-alive connections. Persistent,
keep-alive connections are important for performance, since building up and tearing
down sockets is expensive (and wastes file descriptors). Make sure your clients
are configured appropriately.
Finally, we come to the last section: stats about the fielddata circuit breaker (introduced in “Circuit Breaker”):
"fielddata_breaker"
:
{
"maximum_size_in_bytes"
:
623326003
,
"maximum_size"
:
"594.4mb"
,
"estimated_size_in_bytes"
:
0
,
"estimated_size"
:
"0b"
,
"overhead"
:
1.03
,
"tripped"
:
0
}
Here, you can determine the maximum circuit-breaker size (that is, the size at which the breaker will trip if a query attempts to use more memory). This section will also let you know the number of times the circuit breaker has been tripped, and the currently configured overhead. The overhead is used to pad estimates, because some queries are more difficult to estimate than others.
The main thing to watch is the tripped
metric. If this number is large or
consistently increasing, it’s a sign that your queries may need to be optimized
or that you may need to obtain more memory (either per box or by adding more
nodes).
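A sketch of such a check, pulling the tripped count for every node (jq assumed; the fielddata_breaker key is taken from the output above):

# Report the circuit-breaker trip count per node; alert if it grows.
curl -s 'localhost:9200/_nodes/stats' |
  jq -r '.nodes[] | "\(.name): breaker tripped \(.fielddata_breaker.tripped) times"'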
The cluster-stats
API provides similar output to the node-stats
. There
is one crucial difference: Node Stats shows you statistics per node, while
cluster-stats
shows you the sum total of all nodes in a single metric.
This provides some useful stats to glance at. You can see, for example, that your entire cluster is using 50% of the available heap or that the filter cache is not evicting heavily. Its main use is to provide a quick summary that is more extensive than cluster-health, but less detailed than node-stats. It is also useful for very large clusters, whose node-stats output becomes difficult to read.
The API may be invoked as follows:

GET _cluster/stats
So far, we have been looking at node-centric statistics: How much memory does this node have? How much CPU is being used? How many searches is this node servicing?
Sometimes it is useful to look at statistics from an index-centric perspective: How many search requests is this index receiving? How much time is spent fetching docs in that index?
To do this, select the index (or indices) that you are interested in and
execute an Index stats
API:
GET my_index/_stats

GET my_index,another_index/_stats

GET _all/_stats
The first command returns stats for my_index. Stats for multiple indices can be requested by separating their names with a comma, and stats for all indices can be requested using the special _all index name.
The stats returned will be familiar from the node-stats output: search, fetch, get, index, bulk, segment counts, and so forth.
Index-centric stats can be useful for identifying or verifying hot indices inside your cluster, or trying to determine why some indices are faster/slower than others.
In practice, however, node-centric statistics tend to be more useful. Entire nodes tend to bottleneck, not individual indices. And because indices are usually spread across multiple nodes, index-centric statistics are usually not very helpful because they aggregate data from different physical machines operating in different environments.
Index-centric stats are a useful tool to keep in your repertoire, but are not usually the first tool to reach for.
There are certain tasks that only the master can perform, such as creating a new index or moving shards around the cluster. Since a cluster can have only one master, only one node can ever process cluster-level metadata changes. For 99.9999% of the time, this is never a problem. The queue of metadata changes remains essentially zero.
In some rare clusters, metadata changes occur faster than the master can process them. This leads to a buildup of pending actions that are queued.
The pending-tasks API will show you what (if any) cluster-level metadata changes are pending in the queue:

GET _cluster/pending_tasks
Usually, the response will look like this:
{
   "tasks": []
}
This means there are no pending tasks. If you have one of the rare clusters that bottlenecks on the master node, your pending task list may look like this:
{
   "tasks": [
      {
         "insert_order": 101,
         "priority": "URGENT",
         "source": "create-index [foo_9], cause [api]",
         "time_in_queue_millis": 86,
         "time_in_queue": "86ms"
      },
      {
         "insert_order": 46,
         "priority": "HIGH",
         "source": "shard-started ([foo_2][1], node[tMTocMvQQgGCkj7QDHl3OA], [P],
         s[INITIALIZING]), reason [after recovery from gateway]",
         "time_in_queue_millis": 842,
         "time_in_queue": "842ms"
      },
      {
         "insert_order": 45,
         "priority": "HIGH",
         "source": "shard-started ([foo_2][0], node[tMTocMvQQgGCkj7QDHl3OA], [P],
         s[INITIALIZING]), reason [after recovery from gateway]",
         "time_in_queue_millis": 858,
         "time_in_queue": "858ms"
      }
   ]
}
You can see that tasks are assigned a priority (URGENT is processed before HIGH, for example), the order in which each was inserted, how long the action has been queued, and what the action is trying to perform. In the preceding list, there is a create-index action and two shard-started actions pending.
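If you suspect your master is falling behind, a simple poll of the queue depth can confirm it. A sketch (jq assumed):

#!/bin/bash
# Print the pending-task queue depth once a second;
# a queue that stays above zero suggests a struggling master.
while true; do
  count=$(curl -s 'localhost:9200/_cluster/pending_tasks' | jq '.tasks | length')
  echo "$(date '+%H:%M:%S') pending tasks: $count"
  sleep 1
done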
If you work from the command line often, the cat APIs will be helpful to you. Named after the Linux cat command, these APIs are designed to work like *nix command-line tools.
They provide statistics that are identical to all the previously discussed APIs (health, node-stats, and so forth), but present the output in tabular form instead of JSON. This is very convenient when you are a system administrator and just want to glance over your cluster or find nodes with high memory usage.
Executing a plain GET
against the cat
endpoint will show you all available
APIs:
GET /_cat

=^.^=
/_cat/allocation
/_cat/shards
/_cat/shards/{index}
/_cat/master
/_cat/nodes
/_cat/indices
/_cat/indices/{index}
/_cat/segments
/_cat/segments/{index}
/_cat/count
/_cat/count/{index}
/_cat/recovery
/_cat/recovery/{index}
/_cat/health
/_cat/pending_tasks
/_cat/aliases
/_cat/aliases/{alias}
/_cat/thread_pool
/_cat/plugins
/_cat/fielddata
/_cat/fielddata/{fields}
Many of these APIs should look familiar to you (and yes, that’s a cat at the top :) ). Let’s take a look at the Cat Health API:
GET /_cat/health

1408723713 12:08:33 elasticsearch_zach yellow 1 1 114 114 0 0 114
The first thing you’ll notice is that the response is plain text in tabular form, not JSON. The second thing you’ll notice is that there are no column headers enabled by default. This is designed to emulate *nix tools, since it is assumed that once you become familiar with the output, you no longer want to see the headers.
To enable headers, add the ?v
parameter:
GET /_cat/health?v

epoch    time   cluster status node.total node.data shards pri relo init unassign
1408[..] 12[..] el[..]  yellow 1          1         114    114 0    0    114
Ah, much better. We now see the timestamp, cluster name, status, the number of
nodes in the cluster, and more—all the same information as the cluster-health
API.
Let’s look at node-stats
in the cat
API:
GET /_cat/nodes?v

host         ip            heap.percent ram.percent load node.role master name
zacharys-air 192.168.1.131 45           72          1.85 d         *      Zach
We see some stats about the nodes in our cluster, but the output is basic compared
to the full node-stats
output. You can
include many additional metrics, but rather than consulting the documentation, let’s just ask the cat
API what is available.
You can do this by adding ?help
to any API:
GET /_cat/nodes?help

id           | id,nodeId          | unique node id
pid          | p                  | process id
host         | h                  | host name
ip           | i                  | ip address
port         | po                 | bound transport port
version      | v                  | es version
build        | b                  | es build hash
jdk          | j                  | jdk version
disk.avail   | d,disk,diskAvail   | available disk space
heap.percent | hp,heapPercent     | used heap ratio
heap.max     | hm,heapMax         | max configured heap
ram.percent  | rp,ramPercent      | used machine memory ratio
ram.max      | rm,ramMax          | total machine memory
load         | l                  | most recent load avg
uptime       | u                  | node uptime
node.role    | r,role,dc,nodeRole | d:data node, c:client node
master       | m                  | m:master-eligible, *:current master
...
(Note that the output has been truncated for brevity).
The first column shows the full name, the second column shows the short name,
and the third column offers a brief description about the parameter. Now that
we know some column names, we can ask for those explicitly by using the ?h
parameter:
GET /_cat/nodes?v&h=ip,port,heapPercent,heapMax

ip            port heapPercent heapMax
192.168.1.131 9300 53          990.7mb
Because the cat API tries to behave like *nix utilities, you can pipe the output to other tools such as sort, grep, or awk. For example, we can find the largest index in our cluster by using the following:
% curl 'localhost:9200/_cat/indices?bytes=b' | sort -rnk8

yellow test_names         5 1 3476004 0 376324705 376324705
yellow .marvel-2014.08.19 1 1  263878 0 160777194 160777194
yellow .marvel-2014.08.15 1 1  234482 0 143020770 143020770
yellow .marvel-2014.08.09 1 1  222532 0 138177271 138177271
yellow .marvel-2014.08.18 1 1  225921 0 138116185 138116185
yellow .marvel-2014.07.26 1 1  173423 0 132031505 132031505
yellow .marvel-2014.08.21 1 1  219857 0 128414798 128414798
yellow .marvel-2014.07.27 1 1   75202 0  56320862  56320862
yellow wavelet            5 1    5979 0  54815185  54815185
yellow .marvel-2014.07.28 1 1   57483 0  43006141  43006141
yellow .marvel-2014.07.21 1 1   31134 0  27558507  27558507
yellow .marvel-2014.08.01 1 1   41100 0  27000476  27000476
yellow kibana-int         5 1       2 0     17791     17791
yellow t                  5 1       7 0     15280     15280
yellow website            5 1      12 0     12631     12631
yellow agg_analysis       5 1       5 0      5804      5804
yellow v2                 5 1       2 0      5410      5410
yellow v1                 5 1       2 0      5367      5367
yellow bank               1 1      16 0      4303      4303
yellow v                  5 1       1 0      2954      2954
yellow p                  5 1       2 0      2939      2939
yellow b0001_072320141238 5 1       1 0      2923      2923
yellow ipaddr             5 1       1 0      2917      2917
yellow v2a                5 1       1 0      2895      2895
yellow movies             5 1       1 0      2738      2738
yellow cars               5 1       0 0      1249      1249
yellow wavelet2           5 1       0 0       615       615
By adding ?bytes=b
, we disable the human-readable formatting on numbers and
force them to be listed as bytes. This output is then piped into sort
so that
our indices are ranked according to size (the eighth column).
Unfortunately, you’ll notice that the Marvel indices are clogging up the results,
and we don’t really care about those indices right now. Let’s pipe the output
through grep
and remove anything mentioning Marvel:
% curl 'localhost:9200/_cat/indices?bytes=b' | sort -rnk8 | grep -v marvel

yellow test_names         5 1 3476004 0 376324705 376324705
yellow wavelet            5 1    5979 0  54815185  54815185
yellow kibana-int         5 1       2 0     17791     17791
yellow t                  5 1       7 0     15280     15280
yellow website            5 1      12 0     12631     12631
yellow agg_analysis       5 1       5 0      5804      5804
yellow v2                 5 1       2 0      5410      5410
yellow v1                 5 1       2 0      5367      5367
yellow bank               1 1      16 0      4303      4303
yellow v                  5 1       1 0      2954      2954
yellow p                  5 1       2 0      2939      2939
yellow b0001_072320141238 5 1       1 0      2923      2923
yellow ipaddr             5 1       1 0      2917      2917
yellow v2a                5 1       1 0      2895      2895
yellow movies             5 1       1 0      2738      2738
yellow cars               5 1       0 0      1249      1249
yellow wavelet2           5 1       0 0       615       615
Voila! After piping through grep
(with -v
to invert the matches), we get
a sorted list of indices without Marvel cluttering it up.
This is just a simple example of the flexibility of cat
at the command line.
Once you get used to using cat
, you’ll see it like any other *nix tool and start
going crazy with piping, sorting, and grepping. If you are a system admin and spend
any time SSH’d into boxes, definitely spend some time getting familiar
with the cat
API.