Ensure reliable stream and batch data processing
Apache Flink is a powerful, distributed stream-processing framework designed for real-time and batch data workloads. It excels at high-throughput, low-latency data processing and supports event-time semantics, fault tolerance, and stateful computations at scale. From financial transactions to log analytics and IoT pipelines, Flink is built to deliver reliable, near-instant data insights.
Monitoring Flink is crucial to keep streaming applications responsive, ensure jobs execute efficiently, and maintain overall cluster stability. By tracking key metrics like job states, task performance, TaskManager resources, and JobManager configuration, organizations can detect bottlenecks early, optimize execution, and keep continuous data flows processing without interruption.
Monitor the health and performance of your Flink clusters
Track overall job and task performance with metrics like total jobs, running tasks, finished tasks, failed tasks, and canceled tasks to ensure stable execution.
Monitor TaskManager efficiency by analyzing memory usage, available task slots, total process memory size, and bind-host configuration to keep workloads evenly distributed.
Gain visibility into JobManager performance by monitoring execution failover strategy, RPC address/port, memory (heap, off-heap, metaspace, overhead), and process size.
Set alerts for abnormal states such as task failures, high memory usage, or slow scheduling to detect problems before they affect critical data pipelines.
Visualize Flink metrics in custom dashboards alongside Hadoop, Kafka, and other ecosystem components for full-stack observability.
Automate recovery actions to handle failed tasks or unresponsive managers, minimizing downtime and improving data reliability.
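The job, task-slot, and alerting signals above can all be pulled from Flink's REST API, which the JobManager serves (by default on port 8081) via endpoints such as /jobs/overview and /taskmanagers. The sketch below is a minimal illustration, not a production collector: the base URL and the FAILED-state alert rule are assumptions you would adapt to your own cluster and policies.

```python
import json
from urllib.request import urlopen

# Base URL of the JobManager REST API -- an assumption; adjust for your cluster.
FLINK_API = "http://localhost:8081"

def fetch(path):
    """GET a JSON document from the Flink REST API."""
    with urlopen(FLINK_API + path) as resp:
        return json.load(resp)

def summarize_jobs(overview):
    """Count jobs by state (RUNNING, FINISHED, FAILED, CANCELED, ...)
    from a /jobs/overview response."""
    counts = {}
    for job in overview.get("jobs", []):
        counts[job["state"]] = counts.get(job["state"], 0) + 1
    return counts

def slot_usage(taskmanagers):
    """Return (total, free) task slots from a /taskmanagers response."""
    tms = taskmanagers.get("taskmanagers", [])
    total = sum(tm["slotsNumber"] for tm in tms)
    free = sum(tm["freeSlots"] for tm in tms)
    return total, free

def should_alert(job_counts):
    """Example alert rule (an assumption): fire when any job has FAILED."""
    return job_counts.get("FAILED", 0) > 0

# Example wiring against a live cluster (requires a reachable JobManager):
#   jobs = summarize_jobs(fetch("/jobs/overview"))
#   total, free = slot_usage(fetch("/taskmanagers"))
#   if should_alert(jobs):
#       ...  # page the on-call, trigger a restart, etc.
```

In a real deployment this polling loop would run on a schedule and feed a dashboard or alerting pipeline; the same endpoints also expose memory and configuration details (e.g. /jobmanager/config) for the JobManager checks described above.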