Apache Flink monitoring

Ensure continuous data processing by monitoring Apache Flink jobs, tasks, and cluster resources to keep streaming applications reliable and efficient.

Ensure reliable stream and batch data processing

Apache Flink is a powerful, distributed stream-processing framework designed for real-time and batch data workloads. It excels at high-throughput, low-latency data processing and supports event-time semantics, fault tolerance, and stateful computations at scale. From financial transactions to log analytics and IoT pipelines, Flink is built to deliver reliable, near-instant data insights.

Monitoring Flink is crucial to keep streaming applications responsive, ensure jobs are executed efficiently, and maintain overall cluster stability. By tracking key metrics like job states, task performance, TaskManager resources, and JobManager configurations, organizations can detect bottlenecks early, optimize execution, and ensure uninterrupted processing of continuous data flows.

Monitor the health and performance of your Flink clusters

Track Flink job and task health

Track overall job and task performance with metrics like total jobs, running tasks, finished tasks, failed tasks, and canceled tasks to ensure stable execution.

Monitor TaskManager resources

Monitor TaskManager efficiency by analyzing memory usage, task slots, process size, and bind hosts for optimal workload distribution.

Analyze JobManager performance

Gain visibility into JobManager performance by monitoring execution failover strategy, RPC address/port, memory (heap, off-heap, metaspace, overhead), and process size.

Set alerts

Set alerts for abnormal states such as task failures, high memory usage, or slow scheduling to detect problems before they affect critical data pipelines.

Flink custom dashboards

Visualize Flink metrics in custom dashboards, along with Hadoop, Kafka, and ecosystem components, for full-stack observability.

Flink IT automation

Automate recovery actions to handle failed tasks or unresponsive managers, minimizing downtime and improving data reliability.

Get started with Site24x7's Apache Flink monitoring

Performance Metrics

General

Metric name | Description
Total Jobs | The total number of jobs being managed by the Flink server
Total Tasks Running | The total number of tasks currently in the running state
Total Tasks Canceling | The total number of tasks currently in the canceling state
Total Tasks Canceled | The total number of tasks that have been canceled
Total Tasks | The total number of tasks, including all states (running, canceling, canceled, etc.)
Total Tasks Created | The total number of tasks created by the Flink server
Total Tasks Scheduled | The total number of tasks currently in the scheduled state
Total Tasks Deploying | The total number of tasks currently in the deploying state
Total Tasks Reconciling | The total number of tasks currently in the reconciling state
Total Tasks Finished | The total number of tasks that have been successfully completed
Total Tasks Initializing | The total number of tasks currently in the initializing state
Total Tasks Failed | The total number of tasks that have failed
Blob Server Port | The port used for the blob server
TaskManager Memory Process Size | The memory size allocated to the TaskManager process
TaskManager Bind Host | The IP address or hostname that the TaskManager binds to
JobManager Execution Failover Strategy | The failover strategy used for job execution in the JobManager
JobManager RPC Address | The address of the JobManager RPC server
JobManager Memory Off-Heap Size | The size of off-heap memory allocated to the JobManager
JobManager Memory JVM Overhead Min | The minimum JVM overhead memory allocated to the JobManager
JobManager Memory Process Size | The total memory size allocated to the JobManager process
Web Temporary Directory | The temporary directory used for the Flink web dashboard
JobManager RPC Port | The port used for the JobManager RPC server
Query Server Port | The port used by the query server
REST Bind Address | The IP address or hostname to which the REST API binds
JobManager Bind Host | The IP address or hostname to which the JobManager binds
Default Parallelism | The default parallelism level used for Flink jobs
TaskManager Number of Task Slots | The number of task slots available per TaskManager
REST Address | The IP address or hostname of the REST API
JobManager Memory JVM Metaspace Size | The size of JVM metaspace memory allocated to the JobManager
JobManager Memory Heap Size | The amount of heap memory allocated to the JobManager
JobManager Memory JVM Overhead Max | The maximum JVM overhead memory allocated to the JobManager
JVM Version | The version of the JVM used by the Flink server
JVM Architecture | The architecture of the JVM
Refresh Interval | The refresh interval (in milliseconds) for monitoring metrics
Timezone Name | The name of the timezone in which the Flink server is running
Timezone Offset | The timezone offset from UTC
Flink Version | The version of Apache Flink currently running on the server
Flink Revision | The revision of the Flink source code used in the current deployment
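
To illustrate how the task totals in the table above could be derived, here is a minimal sketch that sums per-job task counts into cluster-wide totals. It assumes the counts come from a payload shaped like the response of Flink's REST endpoint /jobs/overview, where each job carries a "tasks" object keyed by state. Whether the Site24x7 plugin collects its data this way is an assumption; the function name aggregate_task_counts and the sample payload are hypothetical and written by hand for illustration.

```python
# Task states assumed to appear in each job's "tasks" object
# (modeled on Flink's /jobs/overview REST response).
TASK_STATES = (
    "created", "scheduled", "deploying", "initializing", "running",
    "finished", "canceling", "canceled", "failed", "reconciling",
)


def aggregate_task_counts(jobs_overview):
    """Sum per-job task counts into cluster-wide totals.

    Hypothetical helper: not taken from the Site24x7 plugin.
    """
    jobs = jobs_overview.get("jobs", [])
    totals = {"jobs": len(jobs), "total": 0}
    totals.update({state: 0 for state in TASK_STATES})
    for job in jobs:
        tasks = job.get("tasks", {})
        for key in ("total",) + TASK_STATES:
            # Missing states count as zero.
            totals[key] += int(tasks.get(key, 0))
    return totals


# Hand-written sample payload, not real cluster output.
sample = {
    "jobs": [
        {"jid": "job-a", "state": "RUNNING",
         "tasks": {"total": 4, "running": 3, "finished": 1}},
        {"jid": "job-b", "state": "FAILED",
         "tasks": {"total": 2, "failed": 1, "canceled": 1}},
    ]
}

print(aggregate_task_counts(sample))
```

A nonzero "failed" total, or a "running" total that drops unexpectedly, is the kind of signal that the alerting thresholds described earlier could be attached to.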

Setup

Quick installation

If you're using Linux servers, use the Apache Flink plugin installer that checks the prerequisites and installs the plugin with a bash script. You don't need to manually set up the plugin if you're using the installer.

Execute the command below in the terminal to run the installer and follow the instructions displayed on-screen:
wget https://raw.githubusercontent.com/site24x7/plugins/master/apache_flink/installer/Site24x7ApacheFlinkPluginInstaller.sh && sudo bash Site24x7ApacheFlinkPluginInstaller.sh

Standard installation

If you're not using Linux servers or want to install the plugin manually, follow the steps below.

Prerequisites

  • Download and install the Site24x7 server monitoring agent (Linux | Windows) in the network or on the specific host on which the Apache Flink instance is running.
  • Ensure Python 3 or later is installed on your server.

Plugin installation

  • Create a folder named apache_flink.
  • Download the apache_flink.py and the apache_flink.cfg files from our GitHub repository and place them in the apache_flink folder.
    wget https://raw.githubusercontent.com/site24x7/plugins/master/apache_flink/apache_flink.py && sed -i "1s|^.*|#! $(which python3)|" apache_flink.py
    wget https://raw.githubusercontent.com/site24x7/plugins/master/apache_flink/apache_flink.cfg
  • To check if the plugin is working, execute the command below with appropriate arguments, and check for a valid JSON output with applicable metrics and their corresponding values.
    python3 apache_flink.py --host localhost --port 8081
  • Add the applicable configurations in the apache_flink.cfg file.
    [Flink]
    host="localhost"
    port="8081"
  • Follow the steps in this article to learn how to run the Python script on Windows Server. You don't need to do this for Linux.
  • Move the apache_flink folder to the Site24x7 server monitoring plugins directory:
    For Linux: /opt/site24x7/monagent/plugins/
    For Windows: C:\Program Files (x86)\Site24x7\WinAgent\monitoring\plugins\
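
The configuration step above stores the host and port as quoted strings. As a rough sketch of how such a file could be read, the snippet below parses the [Flink] section with Python's standard configparser module, strips the surrounding quotes, and builds a base URL for the Flink REST interface. The helper names load_flink_config and rest_base_url are hypothetical, and the plain http scheme is an assumption; this is not necessarily how the agent or the plugin consumes the file.

```python
import configparser


def load_flink_config(cfg_text):
    """Return (host, port) from the [Flink] section of a cfg file's text.

    Hypothetical helper: configparser keeps the quotes around values,
    so they are stripped here before use.
    """
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    section = parser["Flink"]
    host = section.get("host", fallback="localhost").strip('"')
    port = int(section.get("port", fallback="8081").strip('"'))
    return host, port


def rest_base_url(host, port):
    # Assumes the REST interface is served over plain HTTP.
    return f"http://{host}:{port}"


# Same content as the apache_flink.cfg example above.
cfg_text = '[Flink]\nhost="localhost"\nport="8081"\n'

host, port = load_flink_config(cfg_text)
print(rest_base_url(host, port))  # prints http://localhost:8081
```

If Flink runs on another machine or a non-default port, only the two values in apache_flink.cfg need to change.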

The agent will automatically execute the plugin within five minutes and display performance data in Site24x7.
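
If no data appears, it can help to repeat the manual check from the installation steps and confirm that the plugin prints valid JSON. The sketch below shows one way to script that check. The function name parse_plugin_output is hypothetical, and the sample string is hand-written using metric names from the table on this page; the plugin's real output may contain additional keys.

```python
import json


def parse_plugin_output(stdout_text):
    """Parse the plugin's standard output and return it as a dict.

    Raises ValueError if the text is not a non-empty JSON object.
    """
    try:
        data = json.loads(stdout_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"plugin output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not data:
        raise ValueError("plugin output must be a non-empty JSON object")
    return data


# Hand-written sample, not captured from a real run.
sample_output = '{"Total Jobs": 2, "Total Tasks Running": 3}'

metrics = parse_plugin_output(sample_output)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

In practice, the text passed to the function would be whatever python3 apache_flink.py --host localhost --port 8081 writes to standard output.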

To view the plugin monitor and associated performance charts:

  • Log in to Site24x7.
  • Navigate to Plugins and click the required monitor.
