A Guide on Monitoring Elasticsearch Performance

Elasticsearch is a leading search engine that works well with different data types, including numerical, textual, structured, and unstructured data. It’s a core component of the Elastic Stack (aka the ELK Stack), an all-in-one solution for data ingestion, transformation, storage, processing, and visualization.

Elasticsearch lies at the heart of many IT infrastructures. Tracking its performance and health metrics in real time helps ensure that it runs smoothly and stays free of bottlenecks. In this article, we will discuss Elasticsearch’s architecture, the reasons to monitor it, and some of the key metrics to watch.

What is Elasticsearch?

Elasticsearch is an open-source, distributed, highly scalable, fault-tolerant search and analytics engine. It is built on top of Apache Lucene, an open-source Java library with powerful search and indexing features.

Elasticsearch has a RESTful interface with API clients for many programming languages, including Java, Python, .NET, PHP, Ruby, C++, Rust, and more. REST APIs are available for data search, aggregation, ingestion, and management.

The Elasticsearch Query DSL (domain-specific language) is a query language based on JSON. It offers numerous search features, including phrase matching, wildcards, regex, and geo queries. Elasticsearch SQL additionally lets you run SQL-like queries directly against the data stored in Elasticsearch indices.
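Because Query DSL request bodies are plain JSON, they can be assembled as ordinary dictionaries before being sent to the search API. The following is a minimal sketch of three such bodies (phrase match, wildcard, and fuzzy match); the field names title and username and the search terms are hypothetical and only serve as illustration.

```python
import json


def phrase_query(field, phrase):
    """Body for a match_phrase query: the terms must appear in this order."""
    return {"query": {"match_phrase": {field: phrase}}}


def wildcard_query(field, pattern):
    """Body for a wildcard query: * matches any sequence, ? a single character."""
    return {"query": {"wildcard": {field: {"value": pattern}}}}


def fuzzy_match_query(field, text, fuzziness="AUTO"):
    """Body for a match query that tolerates small misspellings."""
    return {"query": {"match": {field: {"query": text, "fuzziness": fuzziness}}}}


# Hypothetical field names and search terms, for illustration only.
for body in (
    phrase_query("title", "quick brown fox"),
    wildcard_query("username", "ali*"),
    fuzzy_match_query("title", "elasticsaerch"),
):
    print(json.dumps(body))
```

Each printed body could be sent as the JSON payload of a request to an index’s _search endpoint.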

Logstash, another component of the ELK Stack, is a data processing pipeline used to automate the ingestion and transformation of multi-source data at scale. It enables Elasticsearch to aggregate and ingest data from multiple sources.

Elasticsearch stores data as JSON documents. This data is indexed to allow fast search and retrieval. An index in the Elasticsearch world plays the same role as that of a table in the relational database world.

Just like tables contain rows and columns, an index contains documents with fields. Older Elasticsearch releases also let you subdivide an index into mapping types; for example, an index named users could contain three types: partners, vendors, and employees. Mapping types were deprecated in the 6.x releases and removed in later versions, so today each of these categories would typically live in its own index (note that index names must be lowercase).

Each index contains the documents belonging to it. For example, partners may contain JSON documents for all your partners. Organizing data this way keeps each index focused and makes the data easy to query. To retrieve the document for a partner named Alice, a user can send an HTTP request using the following structure:

http://server-ip-and-port/partners/_search?q=Alice
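A request URL of this shape can also be built programmatically. The sketch below assumes a local node at http://localhost:9200 (the default port) and a hypothetical index named partners; it only constructs an index-level search URL with a q parameter and does not contact a server.

```python
from urllib.parse import quote, urlencode


def uri_search_url(base_url, index, query):
    """Build <base>/<index>/_search?q=<query> with proper URL encoding."""
    path = f"{base_url.rstrip('/')}/{quote(index, safe='')}/_search"
    return f"{path}?{urlencode({'q': query})}"


# Host and index name are placeholders, not values from a real deployment.
print(uri_search_url("http://localhost:9200", "partners", "Alice"))
```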

A shard is the basic building block of an Elasticsearch cluster. Each index is divided into one or more shards. You can imagine a shard as a tiny, self-contained search engine responsible for indexing and processing queries for a subset of data stored in the cluster.

Elasticsearch uses an inverted index data structure that allows you to search for words inside JSON documents. An inverted index maintains a list of all unique words and, for each word, the documents that contain it.
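To make the idea concrete, here is a toy inverted index. It is a simplified sketch, not how Lucene implements it: real analyzers also handle stemming, stop words, term positions, and scoring. The sample documents are invented.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each unique lowercase word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, word):
    """Return the sorted IDs of documents containing the word."""
    return sorted(index.get(word.lower(), set()))


docs = {1: "Alice is a partner", 2: "Bob is a vendor", 3: "Alice met Bob"}
index = build_inverted_index(docs)
print(search(index, "alice"))  # [1, 3]
```

Looking up a word is a single dictionary access, which is why full-text search stays fast even as the number of documents grows.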

Elasticsearch use cases

Elasticsearch provides efficient indexing and searching capabilities that can meet various business needs.

Feature-rich search engines

Elasticsearch’s powerful search features enable users to build intuitive search engines for multiple use cases. Whether you want to implement a search bar for your online store, or an internal document store for your employees, Elasticsearch is the way to go.

Elasticsearch’s inverted indices enable fast full-text searches across millions of documents. You can build aggregations based on terms, date ranges, and more to achieve faceted navigation. The type-ahead suggester displays similar results to the user as they type. Fuzzy searching allows for misspellings while searching.

Log analysis

The ELK Stack is often used to aggregate, transform, search, and analyze logs. Logstash loads and transforms logs from multiple sources. Elasticsearch indexes logs and allows you to analyze them. With the ELK Stack, you can look for anomalies, filter for errors, match patterns, and perform system-wide debugging from a central place.

Kibana, the visualization component of the stack, creates graphs so users can analyze trends visually. Kibana also provides the ability to create triggers that execute automated workflows and to set up contextualized alerts to help you resolve issues quickly.

Real-time metric analysis

Elasticsearch is a top choice for performing real-time analysis of application and infrastructure performance. You can aggregate and index various types of metric data in a central location and track them in real time using Elasticsearch’s fast querying capabilities.

For example, you might set up a Logstash pipeline to fetch CPU usage data from different application servers and store it in Elasticsearch. You can also create a customized dashboard on Kibana that fetches these statistics and displays them as graphs and charts.

Automated crawling

The Elastic Web Crawler is an indexing tool that automates the indexing of your website content. It periodically crawls your website, identifies new content, and indexes it in Elasticsearch. Any changes you make to your website are automatically propagated to Elasticsearch.

This eliminates the need to ingest content manually. It also enables a better search experience by making new content searchable soon after it is crawled.

Business analytics

Whether you want to monitor website activity, perform sentiment analysis, track your business KPIs, or analyze financial data, the ELK Stack will be a great choice. The stack enables you to aggregate, ingest, and process data from different sources, including social media feeds, enterprise applications, and marketing tools.

Use Elasticsearch and Kibana to generate actionable insights and create fact-based reports for your team. On-demand forecasting allows you to apply machine learning to historical data and predict future trends.

Why is it important to monitor Elasticsearch?

Monitoring the health and performance of Elasticsearch is important for the following reasons:

Keep your searches fast

Nobody likes a slow search bar. Users expect to see search results appear instantly, sometimes even before they finish typing. Elasticsearch can help deliver this experience – but only if it’s functioning properly.

Monitoring an Elasticsearch instance enables you to track its health and performance. For example, you can track request-response metrics to ensure that Elasticsearch responds to requests at an acceptable rate and with minimum latency.

Predict malfunctions and avoid downtime

Programming errors, misconfigurations, or scalability issues can cause malfunctions or bottlenecks. For instance, poorly structured queries may lead to slow operations that decrease the overall throughput of the instance. Or inadequate resources may cause spikes in CPU usage during peak hours. Periodic monitoring can help detect and debug such issues.

Deliver maximum performance

As Elasticsearch often acts as the backbone of IT infrastructures, monitoring performance metrics equips you with insights to optimize the performance of Elasticsearch as well as the larger system. For example, if a decline in response rate coincides with an increase in slow operation logs, you can conclude that some operations are taking too long to execute.

Keep Elasticsearch secure

Monitoring Elasticsearch’s audit logs helps detect security events, such as authentication failures, refused connections, and insufficient permissions. You can also specify your criteria for logging events in the audit log. Periodic monitoring ensures you don’t overlook security-critical events and protects Elasticsearch from unauthorized access.

Stay up to date with the latest changes

Metrics like the number of indices and nodes, document counts, and search rate allow you to monitor the cluster state in real-time. This way, you can detect anomalies or fluctuations in key performance metrics and take remedial action.

Key Elasticsearch performance metrics

Elasticsearch exposes several metrics that can be used to track the performance of its key areas and elements.

Health metrics

The cluster health API provides a basic overview of the current health of the Elasticsearch cluster. It can be accessed via:

curl -XGET 'http://[server-ip-and-port]/_cluster/health?pretty';

A sample response is as follows:

{
  "cluster_name" : "test",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 10,
  "number_of_data_nodes" : 10,
  "active_primary_shards" : 10,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks" : 2,
  "number_of_in_flight_fetch": 2,
  "task_max_waiting_in_queue_millis": 10
}

The possible values for the status field are green, yellow, and red. Green indicates that all shards are assigned. Yellow means that all primary shards are assigned, but some replica shards remain unassigned. Red indicates that one or more primary shards are unassigned, making some data unavailable.
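A monitoring script might translate the status field into a human-readable message along these lines. This is a minimal sketch that works on an already parsed response; the function name and the sample dictionary are illustrative.

```python
STATUS_MEANING = {
    "green": "all primary and replica shards are assigned",
    "yellow": "all primary shards are assigned, but some replicas are not",
    "red": "one or more primary shards are unassigned",
}


def describe_health(health):
    """Summarize a parsed cluster health response in one line."""
    status = health.get("status")
    if status not in STATUS_MEANING:
        raise ValueError(f"unexpected status: {status!r}")
    name = health.get("cluster_name", "unknown")
    return f"{name}: {status} ({STATUS_MEANING[status]})"


# Illustrative sample, not output from a real cluster.
print(describe_health({"cluster_name": "test", "status": "yellow"}))
```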

The number_of_nodes field indicates the total number of nodes in the cluster, whereas number_of_data_nodes represents the number of nodes that hold data. The different _shards fields indicate the number of active, relocating, initializing, and unassigned shards.

If the value of relocating_shards is greater than zero, the cluster is moving data shards to restore balance. This typically occurs when a node is added or removed or when a failed node is restarted.

The number_of_pending_tasks field counts the cluster-level changes that have not yet been executed, and task_max_waiting_in_queue_millis shows how long the oldest of them has been waiting. The number_of_in_flight_fetch metric represents the number of unfinished fetches.

Correlating different health metrics allows administrators to gauge how a cluster is performing. For example, if the relocating_shards metric is regularly above zero even though no nodes are being added or removed, it may indicate that specific nodes are repeatedly failing. Or if the number_of_pending_tasks field stays above zero several hours after cluster initialization, something is likely wrong within the cluster.
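Checks like these can be automated over a series of health responses collected at regular intervals. The sketch below is one possible heuristic, not an official rule: it flags a metric only when it stays above zero in every sample, and the nodes_changed flag is a hypothetical input the caller would have to supply.

```python
def health_warnings(samples, nodes_changed=False):
    """Inspect parsed cluster health responses (oldest first) for persistent issues."""
    warnings = []
    if not samples:
        return warnings
    # Relocation is expected while nodes are being added or removed.
    if not nodes_changed and all(s.get("relocating_shards", 0) > 0 for s in samples):
        warnings.append("shards keep relocating although no nodes were added or removed")
    # Pending tasks should drain; a queue that never empties deserves a look.
    if all(s.get("number_of_pending_tasks", 0) > 0 for s in samples):
        warnings.append("pending tasks never drained")
    return warnings


# Invented samples for illustration.
samples = [
    {"relocating_shards": 2, "number_of_pending_tasks": 3},
    {"relocating_shards": 1, "number_of_pending_tasks": 0},
]
print(health_warnings(samples))
```

How many samples to collect, and over what period, depends on the cluster; the thresholds here are deliberately simplistic.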

CPU usage and memory metrics

The Elasticsearch Service console presents several metrics related to CPU and memory. For example:

CPU usage: This metric represents the percentage of available CPU resources used by the Elasticsearch cluster. Ideally, there shouldn’t be any drastic fluctuations in this metric.
CPU credits: CPU credits allow additional CPU resources to be assigned to a cluster to boost performance temporarily. This metric indicates a cluster’s remaining CPU credits, represented as seconds of CPU time.
Memory pressure per node: This metric indicates the share of the JVM heap in use over time. According to the Elasticsearch documentation, memory pressure should remain below 75% for heaps sized 8 GB or more. For heaps less than 8 GB, the threshold is 85%. If memory pressure consistently exceeds these thresholds, you may need to either increase the node’s memory or reduce memory usage.
Garbage collection overhead per node: This metric tracks the overhead incurred by the JVM’s garbage collection as it reclaims memory. A sudden or sustained spike in this metric may point to memory pressure or a memory leak in the system.
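The memory pressure thresholds mentioned above can be expressed as a small check. This is a sketch under two assumptions: the 8 GB boundary is treated as 8 GiB, and the caller already knows the heap size and the current memory pressure percentage.

```python
GIB = 1024 ** 3  # assumption: the "8 GB" boundary is interpreted as 8 GiB


def pressure_threshold_percent(heap_size_bytes):
    """Return the memory pressure threshold for a heap of the given size."""
    return 75 if heap_size_bytes >= 8 * GIB else 85


def exceeds_threshold(pressure_percent, heap_size_bytes):
    """True if memory pressure has reached the threshold for this heap size."""
    return pressure_percent >= pressure_threshold_percent(heap_size_bytes)


print(exceeds_threshold(80, 16 * GIB))  # True: large heap, threshold is 75%
print(exceeds_threshold(80, 4 * GIB))   # False: small heap, threshold is 85%
```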

The hot threads API can help identify the busy threads responsible for high CPU usage. To retrieve hot threads for all the cluster’s nodes, use the following:

curl -XGET 'http://[server-ip-and-port]/_nodes/hot_threads';

To retrieve hot threads for a specific node, use the following:

curl -XGET 'http://[server-ip-and-port]/_nodes/[node_id]/hot_threads';

Unlike most other Elasticsearch APIs, the hot threads API doesn’t return JSON. Instead, it returns formatted text that includes information about the node and the percentage of CPU usage by the hot threads.
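Since the response is plain text, a script should read it as a string rather than parse it as JSON. The following sketch assumes an unsecured node reachable at a placeholder address; a secured cluster would additionally need credentials and TLS settings, which are left out here.

```python
from urllib.request import urlopen


def hot_threads_url(base_url, node_id=None):
    """Build the hot threads URL for all nodes or for one specific node."""
    base = base_url.rstrip("/")
    if node_id:
        return f"{base}/_nodes/{node_id}/hot_threads"
    return f"{base}/_nodes/hot_threads"


def fetch_hot_threads(base_url, node_id=None, timeout=10):
    """Return the plain-text report; requires a reachable cluster."""
    with urlopen(hot_threads_url(base_url, node_id), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Placeholder address; only the URL is built here, no request is sent.
print(hot_threads_url("http://localhost:9200"))
```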

Node metrics

Tracking node metrics is crucial to ensure overall optimal performance of an Elasticsearch cluster. The nodes stats API returns several node statistics related to the operating system, file stores, JVM, and more. It can be invoked as:

curl -XGET 'http://[server-ip-and-port]/_nodes/stats';

Some of the metrics included in the response are:

os -> cpu: This property is nested inside the os property. It contains CPU usage metrics, like percent and load_average.
os -> mem: This property is also nested inside the os property. It contains memory usage metrics, like total_in_bytes and free_in_bytes.
thread_pool: This property contains metrics for each of the node’s thread pools, e.g., threads, queue, active, and rejected.
follow_stats: This object contains multiple shard-level metrics related to follower indices. Note that it is reported by the cross-cluster replication stats API rather than by the nodes stats API.
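Once the nodes stats response has been parsed, the fields above can be condensed into a per-node summary. The sketch below assumes the response keeps per-node data under a top-level nodes object and that the memory figures sit under the key mem inside os; the sample data is invented.

```python
def summarize_node(node):
    """Extract a few headline figures from one node's stats."""
    os_stats = node.get("os", {})
    mem = os_stats.get("mem", {})
    total = mem.get("total_in_bytes", 0)
    free = mem.get("free_in_bytes", 0)
    # Sum rejected tasks across all thread pools (search, write, and so on).
    rejected = sum(p.get("rejected", 0) for p in node.get("thread_pool", {}).values())
    return {
        "name": node.get("name"),
        "cpu_percent": os_stats.get("cpu", {}).get("percent"),
        "mem_used_percent": round(100 * (total - free) / total, 1) if total else None,
        "rejected_tasks": rejected,
    }


def summarize_nodes(stats):
    """Summarize every node in a parsed nodes stats response."""
    return {nid: summarize_node(n) for nid, n in stats.get("nodes", {}).items()}


# Invented sample shaped like a trimmed nodes stats response.
sample = {"nodes": {"abc": {
    "name": "node-1",
    "os": {"cpu": {"percent": 42}, "mem": {"total_in_bytes": 1000, "free_in_bytes": 250}},
    "thread_pool": {"search": {"rejected": 3}, "write": {"rejected": 0}},
}}}
print(summarize_nodes(sample))
```

A steadily growing rejected count is usually worth alerting on, since it means a thread pool queue is full and requests are being turned away.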

How to monitor Elasticsearch using monitoring tools

In the following sections, we’ll discuss how to track key Elasticsearch metrics using monitoring tools.

Using the Elastic Stack to aggregate and visualize monitoring data

Metricbeat is a lightweight data shipper that is a part of the Elastic Stack. With Metricbeat, you can collect metric data from production Elasticsearch clusters and load it to an Elasticsearch cluster dedicated to monitoring. The loaded data can then be visualized using Kibana. Here are the steps:

Install a Metricbeat instance on each cluster node.
Enable the elasticsearch module inside Metricbeat for each node.
Set the appropriate settings for the elasticsearch module for all nodes. You can tweak the module’s scope, username, password, fetching period, and SSL settings.
Specify the location of the dedicated monitoring server in the metricbeat.yml configuration file.
Start the Metricbeat instance on each node to collect and report metrics to the dedicated monitoring server.
Configure Kibana to load and display data from the monitoring server.
Visit the Kibana UI from the browser (the default URL is http://localhost:5601/) and select Stack Monitoring from the main page.
On the Stack Monitoring page, you will see an overview of the cluster, along with information related to nodes, indices, and instances.

Using the Site24x7 plugin to monitor Elasticsearch

Site24x7’s Elasticsearch plugin can also be used to monitor Elasticsearch in real-time. It offers visibility into key metrics related to sharding, JVM, cluster status, and memory and CPU usage. You can install the plugin using these steps:

Download the latest version of the Site24x7 Linux agent on the target server.
Download the relevant Elasticsearch plugin from the GitHub repository. Available options are: elasticsearch.py, elasticsearchcluster.py, and elasticsearchnodes.py.
Set appropriate values for HOST, USERNAME, PORT, and PASSWORD in a configuration file. A sample configuration file is available here.
Depending on the plugin you chose, create a folder with the name 'elasticsearch', 'elasticsearchcluster', or 'elasticsearchnodes' inside the directory '/opt/site24x7/monagent/plugins/'. Move the plugin file inside the new folder.
If you are using the elasticsearch.py plugin, create an empty JSON file named 'counter.json' inside the directory '/opt/site24x7/monagent/plugins/elasticsearch'.
Within five minutes, the Site24x7 agent should automatically execute the plugin and start reporting data to the Site24x7 data center.
To view performance data from the Site24x7 web client, log in to Site24x7, go to Server -> Plugin Integrations, and select the plugin monitor.
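The folder-creation steps above could be scripted roughly as follows. This is a sketch, not part of the Site24x7 tooling: the install_plugin helper is hypothetical, it interprets "empty JSON file" as a zero-byte file, and the demo runs in a temporary directory instead of the real plugins directory.

```python
import shutil
import tempfile
from pathlib import Path


def install_plugin(plugin_file, plugins_root="/opt/site24x7/monagent/plugins"):
    """Move a downloaded plugin into a folder named after it; return that folder."""
    plugin_file = Path(plugin_file)
    target_dir = Path(plugins_root) / plugin_file.stem
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(plugin_file), str(target_dir / plugin_file.name))
    # Only the elasticsearch.py plugin needs the counter file.
    if plugin_file.stem == "elasticsearch":
        (target_dir / "counter.json").touch()  # assumption: a zero-byte file
    return target_dir


# Demo in a throwaway directory with a placeholder plugin file.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "elasticsearch.py"
    src.write_text("# placeholder for the downloaded plugin\n")
    target = install_plugin(src, plugins_root=Path(tmp) / "plugins")
    print(sorted(p.name for p in target.iterdir()))
```

On a real server the script would need write access to the plugins directory, and the configuration values from the earlier step would still have to be set separately.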

Conclusion

Elasticsearch is a leading search and analytics engine that can aggregate data from diverse sources. Powerful search features, flexible deployment options, extensive security controls, and out-of-the-box scalability make it an ideal fit for many business use cases. In this article, we aimed to share a comprehensive guide to monitoring an Elasticsearch instance.
