A definitive guide to monitoring Apache Kafka

Apache Kafka is a distributed event streaming platform that excels in fault tolerance, reliable delivery, high availability, and performance. Whether you want to perform real-time analysis of big data, build a scalable pub-sub infrastructure, or aggregate statistics from heterogeneous sources, Apache Kafka should be your platform of choice.

In this article, we will share a comprehensive guide to Apache Kafka monitoring. We will discuss its resilient architecture, share why it’s important to monitor it, and explore some of its most important metrics.

Why is Kafka considered resilient to broker failures?

Kafka achieves resilience and fault tolerance through data partitioning and partition replication. A partition is the smallest storage unit in the Kafka world. Data partitioning splits a topic's data into smaller units, which are stored as partitions spread across brokers; combined with replication, this ensures there isn't a single point of failure.

Partition replication goes further by copying topic-partition data to replica brokers that may reside on different servers. Replication is controlled via a configurable parameter known as the replication factor, whose possible values range from 1 to the number of brokers in the cluster. It can also be set on a per-topic basis. If the replication factor is not specified, it defaults to one, which means the data is not replicated at all.

Each partition in Kafka has one leader and zero or more followers. The leader and follower partitions are spread across brokers, which are in turn spread across nodes. A broker holding a leader partition is known as the leader broker for that partition, whereas a broker holding a follower partition is known as a follower broker.

New event messages are always written to the leader broker, and it's the leader's responsibility to propagate new data to all its followers. Conversely, read requests can be served by the leader or the followers. If the leader goes down, any of the followers can be promoted to take its place. But who is responsible for monitoring the liveness of the leader? This is where the controller broker enters the scene.

The controller broker

Every Kafka cluster has a designated controller broker responsible for monitoring broker liveness and electing leaders. To be considered alive, all brokers must keep active connections with the controller. Follower brokers are additionally required to replicate data from the leader at an acceptable pace (determined via the replica.lag.time.max.ms parameter).

The controller maintains a set of follower brokers that are in sync with the leader. This set is known as the in-sync replicas (ISR). If a follower broker goes down, the controller detects the loss of connection and removes it from the ISR. If a follower falls too far behind the leader, i.e., its replication lag exceeds the configured replica.lag.time.max.ms value, it is also removed from the ISR.
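The ISR bookkeeping described above can be sketched in a few lines. This is an illustrative model, not Kafka's actual implementation: the function name, the data structures, and the timestamps are all hypothetical, and only the default value of replica.lag.time.max.ms is taken from Kafka's configuration reference.

```python
REPLICA_LAG_TIME_MAX_MS = 30_000  # default value of replica.lag.time.max.ms

def compute_isr(leader_id, followers, now_ms):
    """Return the in-sync replica set: the leader plus every follower
    whose last caught-up time is within replica.lag.time.max.ms."""
    isr = {leader_id}
    for broker_id, last_caught_up_ms in followers.items():
        if now_ms - last_caught_up_ms <= REPLICA_LAG_TIME_MAX_MS:
            isr.add(broker_id)
    return isr

# Follower 3 last caught up 45 s ago, so it falls out of the ISR.
now = 1_000_000
followers = {2: now - 5_000, 3: now - 45_000}
print(sorted(compute_isr(1, followers, now)))  # [1, 2]
```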

When a leader broker goes down, the controller elevates a broker from its ISR to become the new leader. If the dead leader rejoins the cluster, the controller may reelect it as the leader, and demote its successor back to a follower.

This refined leader-follower mechanism allows Kafka to ensure high levels of resilience. To guarantee reliable delivery, Kafka only delivers messages to consumers once they have been marked as committed, and a message is only marked as committed after all replicas in the ISR have appended it to their logs.
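The commit rule can be pictured as a "high watermark": any offset below the minimum log-end offset across the ISR has been appended by every in-sync replica, and only those offsets are visible to consumers. The sketch below is a simplified model of that idea, with hypothetical broker names and offsets.

```python
def high_watermark(isr_log_end_offsets):
    """Offsets below the minimum log-end offset across the ISR are
    committed: every in-sync replica has appended them."""
    return min(isr_log_end_offsets.values())

def is_committed(offset, isr_log_end_offsets):
    return offset < high_watermark(isr_log_end_offsets)

# The leader has 10 records but one follower has only 7,
# so only offsets 0-6 are committed and deliverable.
offsets = {"broker-1": 10, "broker-2": 7, "broker-3": 9}
print(high_watermark(offsets))   # 7
print(is_committed(6, offsets))  # True
print(is_committed(7, offsets))  # False
```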

Why is it important to monitor Kafka?

Even though Kafka is largely a self-correcting, fault-tolerant system, it’s still important to monitor its health and performance. Here are a few reasons why:

Predict malfunctions and avoid downtime

Malfunctions can arise due to misconfigurations, logical errors, or scalability issues. For example, if you specify too small a time-out for the connection between the controller and other brokers, the cluster may never reach a sustainable state. Or if you configure too large a request rate quota, your CPU and memory usage may spike under high load.

An example of a logical error is a producer writing all its data to the same partition even when order preservation isn’t required. Malfunctions or bottlenecks like these can be hard to detect without periodic monitoring.
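The keyed-partitioning behavior behind that logical error can be sketched as follows. Note that this is only an approximation: Kafka's default partitioner uses murmur2 hashing, so the partition numbers produced here won't match a real cluster, but the failure mode is the same, since every record with the same key lands on the same partition.

```python
import hashlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Approximation of keyed partitioning; Kafka's default partitioner
    uses murmur2 hashing, so partition numbers here won't match Kafka's."""
    digest = int.from_bytes(hashlib.md5(key).digest()[:4], "big")
    return digest % num_partitions

# A constant key funnels all traffic to a single partition,
# leaving the other five partitions idle.
parts = {choose_partition(b"same-key", 6) for _ in range(100)}
print(len(parts))  # 1
```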

Monitoring will help you track the cluster's operational performance at any point in time. For example, if you notice that the ISR for several topics is constantly changing, you can surmise that the cluster is not performing as expected.

Optimize your configurations

Kafka has a highly configurable ecosystem. To extract maximum performance from a Kafka cluster, you need to choose the correct configuration settings based on business needs and operational requirements.

A great way to optimize configurations is by tweaking settings and monitoring changes in cluster performance. For instance, you may specify different compression algorithms for your topics and monitor how they impact performance and throughput. Or you may toggle the idle connection time-out to measure its relation to CPU usage.

Monitoring a system’s behavior with different parameters will enable you to identify the ideal configuration set for the cluster. Once you have the desired configuration set, you can scale your resources accordingly.

Instant troubleshooting

Troubleshooting a layered, multi-component system like Kafka can be complicated without logs and monitoring tools. You can quickly establish context about the issue and predict the root cause by correlating logged events with key metrics from monitoring tools.

For example, if a decline in response rate coincides with a rise in consumer time-outs, you can deduce that something is wrong in the broker layer of the system. Or if a sudden rise in a producer's buffer memory coincides with an exponential rise in storage usage, you can infer that your consumers are not reading as fast as expected.

Ensure maximum performance of the larger system

Kafka acts as the backbone of several multi-component architectures. It helps applications communicate with each other, facilitates large-scale message processing in real time, and allows you to build hybrid messaging databases.

In such systems, an issue in the Kafka cluster can lead to performance degradation of the entire ecosystem. Therefore, it’s imperative to regularly monitor Kafka’s health, performance, and scalability metrics in relation to the larger system.

Keep your cluster secure

Periodic monitoring lets you detect any security incidents, misconfigurations, or vulnerabilities. For example, you may receive alerts when security-critical operations are performed or when an important security control is disabled.

Moreover, monitoring lets you ensure that security best practices are being followed across the board, e.g. authentication, encryption at rest and in transit, and safe storage of cryptographic keys.

Key metrics to monitor Kafka

Apache Kafka exposes metrics via JMX that offer insights about key components of the cluster. Jolokia is a JMX-HTTP bridge that can be used to retrieve these metrics for aggregation and processing. To enable Jolokia-based retrieval, download the Jolokia JVM agent, obtain the process ID of the Kafka broker, and attach the agent to the running broker process.

Broker metrics

Since everything in the cluster flows through brokers, they are arguably the most important Kafka element to monitor. Metrics related to brokers are available on a per-cluster, per-topic, and per-broker basis. The following table contains a list of the most important broker metrics, along with the MBeans that expose them.

Metric MBean Name Description
Incoming message rate kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=([-.\w]+) Measures the incoming message rate per topic. Excluding the topic from the MBean name returns the incoming message rate across all topics.
Under-replicated partitions kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions The value of this metric is increased when a broker goes down. In a healthy system, the value of this metric should be zero.
Active controller count kafka.controller:type=KafkaController,name=ActiveControllerCount Indicates whether this broker is the active controller (1) or not (0). Summed across the cluster, the value should be exactly 1.
Produce request rate kafka.server:type=BrokerTopicMetrics,name=TotalProduceRequestsPerSec,topic=([-.\w]+) The rate of requests originated by producers to the broker, per topic. Excluding the topic name returns overall produce rate.
Fetch request rate kafka.server:type=BrokerTopicMetrics,name=TotalFetchRequestsPerSec,topic=([-.\w]+) The rate of fetch requests from the consumers or followers of the broker. Excluding the topic name returns overall fetch rate.
Incoming byte rate from other brokers kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesInPerSec,topic=([-.\w]+) The rate at which replication data is being received by this broker, from other brokers, on a per-topic basis. Excluding the topic name returns overall incoming byte rate.
Offline partitions count kafka.controller:type=KafkaController,name=OfflinePartitionsCount A count of partitions that don’t have an active leader. As offline partitions can’t be used for reading or writing, this metric should always have a value of zero.
Total partitions count kafka.controller:type=KafkaController,name=GlobalPartitionCount This metric indicates the total number of partitions, across all topics.
Total time in milliseconds kafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce|FetchConsumer} Total time taken to process Produce and FetchConsumer requests.
Total queue time in milliseconds kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request={Produce|FetchConsumer} Total time the Produce and FetchConsumer requests spent on queue.
Requests waiting in the produce purgatory kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce Requests held in the produce purgatory; i.e., requests that haven’t yet been satisfied but have also not errored out.
Requests waiting in the fetch purgatory kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Fetch Requests held in the fetch purgatory; i.e., requests that haven’t yet been satisfied but have also not errored out.

A sample HTTP request to retrieve broker metrics via Jolokia is as follows (put your desired MBean and attribute name after /read/):

curl http://localhost:8778/jolokia/read/kafka.server:type=BrokerTopicMetrics,name=TotalFetchRequestsPerSec,topic={topic}

Substitute an actual topic name for {topic}; the ([-.\w]+) pattern shown in the table describes the MBean naming scheme and is not meant to be passed literally.
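Collected values can then be checked mechanically against the broker-level invariants from the table: zero under-replicated partitions, zero offline partitions, and exactly one active controller across the cluster. The snapshot dict below is hypothetical, standing in for values fetched via Jolokia.

```python
def broker_health_issues(metrics: dict) -> list[str]:
    """Flag violations of the broker invariants described above."""
    issues = []
    if metrics.get("UnderReplicatedPartitions", 0) != 0:
        issues.append("under-replicated partitions present (a broker may be down)")
    if metrics.get("OfflinePartitionsCount", 0) != 0:
        issues.append("offline partitions have no active leader")
    if metrics.get("ActiveControllerCount", 0) != 1:
        issues.append("cluster does not have exactly one active controller")
    return issues

# Hypothetical snapshot of cluster-wide values:
snapshot = {"UnderReplicatedPartitions": 3,
            "OfflinePartitionsCount": 0,
            "ActiveControllerCount": 1}
print(broker_health_issues(snapshot))
```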

Producer metrics

Metric MBean Name Description
Average batch size (batch-size-avg) kafka.producer:type=producer-metrics,client-id=([-.\w]+) The average number of bytes the producer has sent, per partition, per request.
Average compression rate (compression-rate-avg) kafka.producer:type=producer-metrics,client-id=([-.\w]+) The average compression rate of the record batches.
Available bytes in the buffer (buffer-available-bytes) kafka.producer:type=producer-metrics,client-id=([-.\w]+) Amount of buffer memory available to the producer that hasn’t yet been allocated.
Threads waiting for buffer memory (waiting-threads) kafka.producer:type=producer-metrics,client-id=([-.\w]+) The count of threads currently waiting for buffer memory allocation to post their records. In a healthy cluster, this value shouldn’t be too high.
Record error rate (record-error-rate) kafka.producer:type=producer-metrics,client-id=([-.\w]+) The average number of errors encountered per second. This number should be as close to zero as possible.
Total record errors (record-error-total) kafka.producer:type=producer-metrics,client-id=([-.\w]+) The total number of requests that resulted in an error.
Total record retries (record-retry-total) kafka.producer:type=producer-metrics,client-id=([-.\w]+) The total number of requests that were retried.
Average record size (record-size-avg) kafka.producer:type=producer-metrics,client-id=([-.\w]+) Average size of an inserted record.
Byte rate (byte-rate) kafka.producer:type=producer-topic-metrics,client-id="{client-id}",topic="{topic}" The average number of bytes sent per second, per topic.
Total bytes (byte-total) kafka.producer:type=producer-topic-metrics,client-id="{client-id}",topic="{topic}" Total number of bytes sent for a topic.

A sample HTTP request to retrieve producer metrics via Jolokia is as follows (put your desired MBean and attribute name after /read/):

curl http://localhost:8778/jolokia/read/kafka.producer:type=producer-topic-metrics,client-id="{client-id}",topic="{topic}"
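A simple derived check on these producer metrics is the error and retry ratio per record sent (record-send-total is another standard producer metric). The numbers below are hypothetical; in a healthy producer both ratios stay close to zero.

```python
def producer_ratios(record_send_total, record_error_total, record_retry_total):
    """Return (error_ratio, retry_ratio) per record sent; both should
    stay close to zero in a healthy producer."""
    if record_send_total == 0:
        return 0.0, 0.0
    return (record_error_total / record_send_total,
            record_retry_total / record_send_total)

# Hypothetical sample: 10,000 sends, 5 errors, 120 retries.
err, retry = producer_ratios(10_000, 5, 120)
print(err, retry)  # 0.0005 0.012
```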

Consumer metrics

Consumer metrics let you keep tabs on the data consumption side of things. It’s crucial to ensure that all consumers are actively receiving and processing event messages. A bottleneck or malfunction in the consumption layer can degrade performance, and eventually lead to memory exhaustion. Focus on the following key consumer metrics:

Metric MBean Name Description
Max poll interval (time-between-poll-max) kafka.consumer:type=consumer-metrics,client-id=([-.\w]+) The maximum delay between two consecutive calls to the poll() function.
Average poll interval (time-between-poll-avg) kafka.consumer:type=consumer-metrics,client-id=([-.\w]+) The average delay between two consecutive calls to the poll() function.
Seconds since last poll (last-poll-seconds-ago) kafka.consumer:type=consumer-metrics,client-id=([-.\w]+) The number of seconds since the last call to the poll() function.
Total commits (commit-total) kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) The total number of commit calls generated by the consumer group.
Number of commits (commit-rate) kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) Number of commits made by a consumer group, per second.
Assigned partitions (assigned-partitions) kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) The total number of partitions currently assigned to this consumer group.
Heartbeat rate (heartbeat-rate) kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) The number of heartbeats generated by this consumer group, per second.
Total group joins (join-total) kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) The total number of times a consumer has joined the group.
Max sync time (sync-time-max) kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) The maximum time taken for a group sync.
Average fetch latency (fetch-latency-avg) kafka.consumer:type=consumer-fetch-manager-metrics,client-id="{client-id}" The average latency of fetch requests by this consumer.
Average fetch size (fetch-size-avg) kafka.consumer:type=consumer-fetch-manager-metrics,client-id="{client-id}" The average size of data fetched by this consumer.
Record consumption rate (records-consumed-rate) kafka.consumer:type=consumer-fetch-manager-metrics,client-id="{client-id}" The average rate at which this consumer has processed records, per second.

A sample HTTP request to retrieve consumer metrics via Jolokia is as follows (put your desired MBean and attribute name after /read/):

curl http://localhost:8778/jolokia/read/kafka.consumer:type=consumer-metrics,client-id={client-id}
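The poll-related metrics above can feed a liveness check: a consumer that doesn't call poll() within max.poll.interval.ms is considered failed and ejected from its group. The sketch below flags consumers approaching that fate using the last-poll-seconds-ago metric; the client-ids are hypothetical and the threshold is Kafka's default max.poll.interval.ms.

```python
MAX_POLL_INTERVAL_MS = 300_000  # default max.poll.interval.ms (5 minutes)

def stale_consumers(last_poll_seconds_ago: dict) -> list[str]:
    """Return client-ids whose last poll is older than max.poll.interval.ms,
    given the last-poll-seconds-ago metric per consumer."""
    limit_s = MAX_POLL_INTERVAL_MS / 1000
    return [cid for cid, secs in last_poll_seconds_ago.items() if secs > limit_s]

# app-2 hasn't polled in 7 minutes and will be dropped from the group.
print(stale_consumers({"app-1": 12, "app-2": 420}))  # ['app-2']
```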

Streaming metrics

Kafka Streams is a client library for building applications that consume event data from Kafka, process it, and write the results back to brokers in real time. Monitoring streams applications allows you to detect and resolve any bottlenecks across the event streaming layer of the cluster.

Metric MBean Name Description
State (state) kafka.streams:type=stream-metrics,client-id=([-.\w]+) The state of the Kafka streams client.
Failed stream threads (failed-stream-threads) kafka.streams:type=stream-metrics,client-id=([-.\w]+) The number of threads of the streams client that failed to execute. Ideally, this metric should have a value of zero.
Average commit latency (commit-latency-avg) kafka.streams:type=stream-thread-metrics,thread-id=([-.\w]+) The average time it takes to commit a message, in milliseconds. In a healthy cluster, this value shouldn’t drastically change over time.
Maximum processing time (process-latency-max) kafka.streams:type=stream-thread-metrics,thread-id=([-.\w]+) The maximum time spent on processing, in milliseconds.
Average number of commits (commit-rate) kafka.streams:type=stream-thread-metrics,thread-id=([-.\w]+) The per-second commit rate of event messages.
Poll rate (poll-rate) kafka.streams:type=stream-thread-metrics,thread-id=([-.\w]+) The average number of consumer poll calls per second.
Task creation rate (task-created-rate) kafka.streams:type=stream-thread-metrics,thread-id=([-.\w]+) The average rate at which tasks are being created, per second.
Average record lateness per task (record-lateness-avg) kafka.streams:type=stream-task-metrics,thread-id=([-.\w]+),task-id=([-.\w]+) The average lateness of records for the specified stream task.
Total dropped records per task (dropped-records-total) kafka.streams:type=stream-task-metrics,thread-id=([-.\w]+),task-id=([-.\w]+) Total number of dropped records for the specified task.

A sample HTTP request to fetch streaming metrics via Jolokia is as follows (put your desired MBean and attribute name after /read/):

curl http://localhost:8778/jolokia/read/kafka.streams:type=stream-task-metrics,thread-id={thread-id},task-id={task-id}
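Two of the streams metrics above act as hard red flags: failed stream threads and dropped records should both be zero. A minimal check over fetched values might look like the following sketch (function name and inputs are hypothetical):

```python
def streams_issues(failed_stream_threads: int, dropped_records_total: int) -> list[str]:
    """Flag the two streams-level red flags described in the table."""
    issues = []
    if failed_stream_threads > 0:
        issues.append(f"{failed_stream_threads} stream thread(s) failed")
    if dropped_records_total > 0:
        issues.append(f"{dropped_records_total} record(s) dropped")
    return issues

print(streams_issues(0, 0))   # []  (healthy)
print(streams_issues(1, 42))  # two issues flagged
```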

Connect metrics

The Kafka Connect component is responsible for integrating Kafka with the rest of the world. It allows data to flow into Kafka from multiple sources and from Kafka toward multiple destinations. Kafka connectors are available for various technologies, including MySQL, Oracle, SQL Server, Elasticsearch, Amazon S3, and Hadoop. Focus on the following Connect metrics to ensure smooth data flow in and out of Kafka.

Metric MBean Name Description
Total connector count (connector-count) kafka.connect:type=connect-worker-metrics The total number of connectors running inside this connect worker process.
Startup failure percentage (connector-startup-failure-percentage) kafka.connect:type=connect-worker-metrics The average percentage of failure encountered by connectors while starting within this connect worker process.
Total task count (task-count) kafka.connect:type=connect-worker-metrics The total number of tasks executed within this connect worker.
Task startup failure (task-startup-failure-total) kafka.connect:type=connect-worker-metrics The total number of failed attempts to create a task within this worker process.
Total successful tasks (task-startup-success-total) kafka.connect:type=connect-worker-metrics The total number of tasks that were successfully started within this worker process.
Paused connector tasks (connector-paused-task-count) kafka.connect:type=connect-worker-metrics,connector="{connector}" The total number of paused tasks created by the connector within this worker process.
Running connector tasks (connector-running-task-count) kafka.connect:type=connect-worker-metrics,connector="{connector}" The total number of running tasks created by the connector within this worker process.
Unassigned connector tasks (connector-unassigned-task-count) kafka.connect:type=connect-worker-metrics,connector="{connector}" The total number of unassigned tasks created by the connector within this worker process.
Average batch size (batch-size-avg) kafka.connect:type=connector-task-metrics,connector="{connector}",task="{task}" The average size of the batches processed by this connector.
Maximum batch size (batch-size-max) kafka.connect:type=connector-task-metrics,connector="{connector}",task="{task}" The maximum batch size processed by this connector.
Task pause ratio (pause-ratio) kafka.connect:type=connector-task-metrics,connector="{connector}",task="{task}" The fraction of time this connector task has spent in the paused state.
Status (status) kafka.connect:type=connector-task-metrics,connector="{connector}",task="{task}" The current status of this connector task.

A sample HTTP request to fetch connect metrics via Jolokia is as follows (put your desired MBean and attribute name after /read/):

curl http://localhost:8778/jolokia/read/kafka.connect:type=connector-task-metrics,connector="{connector}",task="{task}"
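The per-connector task counts above lend themselves to a simple alerting rule: a connector with zero running tasks is moving no data, and unassigned tasks indicate rebalancing or worker problems. The sketch below applies that rule to a hypothetical snapshot of fetched values (connector names and thresholds are illustrative).

```python
def connector_alerts(task_counts: dict) -> list[str]:
    """Flag connectors with no running tasks or with unassigned tasks,
    given per-connector running/paused/unassigned task counts."""
    alerts = []
    for name, counts in task_counts.items():
        if counts.get("running", 0) == 0:
            alerts.append(f"{name}: no running tasks")
        if counts.get("unassigned", 0) > 0:
            alerts.append(f"{name}: has unassigned tasks")
    return alerts

# Hypothetical snapshot: the S3 sink is stalled.
snapshot = {"jdbc-source": {"running": 3, "paused": 0, "unassigned": 0},
            "s3-sink": {"running": 0, "paused": 1, "unassigned": 2}}
print(connector_alerts(snapshot))
```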

Conclusion

Apache Kafka is a powerful tool for building real-time, event-driven data streaming and processing applications. Monitoring a Kafka cluster is vital to maintaining its health, achieving high availability, quickly troubleshooting issues, and increasing throughput. This article aimed to share a comprehensive guide to monitoring an Apache Kafka cluster; we hope it has served its purpose!
