Databricks Jobs Monitoring

Site24x7's Databricks Jobs monitoring gives you full visibility into the execution health of your Databricks jobs, from high-level run outcomes and task success rates to cluster-level performance and duration breakdowns. Detect failures, track slowdowns, and investigate task reattempts without leaving your monitoring console.

Use Site24x7 Databricks Jobs monitoring to:

  • Track job run outcomes and failure rates across all your Databricks workspaces in one place.
  • Identify slow jobs using run duration trends and top-slowest-run analysis.
  • Drill into task-level execution details to pinpoint exactly where a job is failing or stalling.
  • Correlate run trigger types with performance patterns to optimize scheduling and resource allocation.

To access the Databricks Jobs monitor:

  1. Log in to your Site24x7 account.
  2. In the left navigation pane, go to Databricks > Databricks Workspace Monitor.
  3. Click the Databricks Jobs tab to view all discovered Databricks Job monitors.
  4. Click any monitor in the list to navigate to its detailed monitor page.

Summary page

The Databricks Jobs monitor summary page gives you an at-a-glance view of job execution health. The following metrics are displayed on the summary page:

Metric | Description | Unit
Number of Tasks | Total number of tasks defined within the job. | Count
Number of Job Clusters | Number of job clusters used by this job during the selected period. | Count
Number of Runs | Total runs of the job within the selected time window. | Count
Runs Passed | Number of runs that were completed successfully. | Count
Runs Failed | Number of runs that ended in a failed state. | Count
Tasks Passed | Number of individual tasks that completed successfully across all runs. | Count
Tasks Failed | Number of individual tasks that failed across all runs. | Count
Tasks Pending | Number of tasks currently in a pending or queued state. | Count
Tasks Running | Number of tasks currently in an active running state. | Count
Total Task Executions | Total count of all task executions including passed, failed, pending, and running tasks. | Count
Run Success (%) | Percentage of total runs that were completed successfully. | Percent
Run Failure (%) | Percentage of total runs that ended in a failed state. | Percent
Task Success (%) | Percentage of total task executions that were completed successfully. | Percent
Task Failure (%) | Percentage of total task executions that failed. | Percent
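
The percentage metrics follow from the count metrics. The Python sketch below shows one way to reproduce them, assuming that Number of Runs and Total Task Executions are the denominators, as the descriptions suggest. The counts and the percentage helper are hypothetical and only illustrate the arithmetic; they are not part of Site24x7.

```python
def percentage(part: int, total: int) -> float:
    """Return part/total as a percentage, or 0.0 when total is zero."""
    if total == 0:
        return 0.0
    return round(100.0 * part / total, 2)


# Hypothetical counts for one job over the selected time window.
number_of_runs, runs_passed, runs_failed = 20, 18, 2
tasks_passed, tasks_failed, tasks_pending, tasks_running = 51, 6, 2, 1

# Total Task Executions includes passed, failed, pending, and running tasks.
total_task_executions = tasks_passed + tasks_failed + tasks_pending + tasks_running

run_success_pct = percentage(runs_passed, number_of_runs)
run_failure_pct = percentage(runs_failed, number_of_runs)
task_success_pct = percentage(tasks_passed, total_task_executions)
task_failure_pct = percentage(tasks_failed, total_task_executions)

print(run_success_pct, run_failure_pct, task_success_pct, task_failure_pct)
```

Because pending and running tasks count toward Total Task Executions under this assumption, Task Success (%) and Task Failure (%) may not add up to 100 while tasks are still in progress.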

Job Runs Details

Job run details are available in this tab and also as log events in Site24x7 AppLogs. Each run is stored as a searchable log entry under the log type Databricks Job Runs, allowing you to query, filter, and analyze run history alongside your other application logs.

To access Job Runs Details via AppLogs, go to Apps > AppLogs > Search and use the query:

logtype="Databricks Job Runs"

You can filter results using the following General Fields in the left panel: Workspace, Run Name, Life Cycle State, Result State, and Trigger. The following columns are available in the results table:

Column | Description | Unit
Monitor | The name of the Databricks Job monitor in Site24x7 that this run belongs to. | Text
Time | The timestamp at which Site24x7 collected this run record. | Text
Workspace | The Databricks workspace in which the job run was executed. | Text
Job Id | The unique identifier of the Databricks job. | Text
Run Id | The unique identifier of this specific job run. | Text
Run Name | The display name of the job run as configured in Databricks. | Text
Life Cycle State | The current or final lifecycle state of the run (for example, TERMINATED, RUNNING, PENDING). | Text
Result State | The outcome of the run once it reached a terminal state (for example, SUCCESS, FAILED, CANCELLED, TIMEDOUT). Blank if the run has not yet completed. | Text
State Message | An optional message providing additional context about the run's current or final state, such as an error description on failure. | Text
Start Time | The date and time at which the run began execution. | Timestamp
End Time | The date and time at which the run completed or was terminated. | Timestamp
Run Duration | The total elapsed time from start to end for this run. | Seconds
Trigger | The mechanism that initiated the run (for example, ONE_TIME for manual runs, RUN_JOB_TASK for runs triggered by another job task, or a schedule-based trigger). | Text
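
If you work with run records outside the console, the columns above are enough to isolate unsuccessful runs and check how long each one ran. The Python sketch below assumes the records are already available as dictionaries. The key names, sample values, and helper functions are hypothetical and do not represent an official export schema.

```python
from datetime import datetime

# Hypothetical run records; keys loosely mirror the columns above.
runs = [
    {"run_id": "101", "run_name": "nightly_etl", "result_state": "SUCCESS",
     "start_time": "2024-05-01T02:00:00", "end_time": "2024-05-01T02:12:30"},
    {"run_id": "102", "run_name": "nightly_etl", "result_state": "FAILED",
     "start_time": "2024-05-02T02:00:00", "end_time": "2024-05-02T02:03:10"},
    {"run_id": "103", "run_name": "adhoc_report", "result_state": "TIMEDOUT",
     "start_time": "2024-05-02T09:00:00", "end_time": "2024-05-02T10:00:00"},
    # Result State is blank because this run has not completed yet.
    {"run_id": "104", "run_name": "adhoc_report", "result_state": "",
     "start_time": "2024-05-02T11:00:00", "end_time": ""},
]

UNSUCCESSFUL = {"FAILED", "CANCELLED", "TIMEDOUT"}


def run_duration_seconds(run: dict) -> int:
    """Elapsed seconds between Start Time and End Time."""
    start = datetime.fromisoformat(run["start_time"])
    end = datetime.fromisoformat(run["end_time"])
    return int((end - start).total_seconds())


def unsuccessful_runs(records: list) -> list:
    """Return (run_id, result_state, duration) for runs that did not succeed."""
    return [
        (r["run_id"], r["result_state"], run_duration_seconds(r))
        for r in records
        if r["result_state"] in UNSUCCESSFUL
    ]


problem_runs = unsuccessful_runs(runs)
for run_id, state, seconds in problem_runs:
    print(f"run {run_id}: {state} after {seconds}s")
```

Runs with a blank Result State are skipped, since they have not reached a terminal state and have no End Time to measure against.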

Task Runs Details

Task run details are available in this tab and also as log events in Site24x7 AppLogs. Use this view to investigate which tasks within a run succeeded, failed, were blocked, or are still running, and to identify reattempts and cluster assignments at the task level.

The following columns are available in the Task Runs Details table:

Column | Description | Unit
Monitor | The name of the Databricks Job monitor in Site24x7 that this task run belongs to. | Text
Time | The timestamp at which Site24x7 collected this task run record. | Text
Workspace | The Databricks workspace in which the task was executed. | Text
Job Id | The unique identifier of the parent Databricks job. | Text
Run Id | The unique identifier of the parent job run that this task belongs to. | Text
Task Run Id | The unique identifier of this individual task execution within the parent run. | Text
Task Key | The key name of the task as defined in the job configuration (for example, Run_1, Test, Start, Stop). | Text
Life Cycle State | The current or final lifecycle state of the task (for example, TERMINATED, RUNNING, BLOCKED, PENDING). A BLOCKED state indicates the task is waiting on an upstream dependency to complete. | Text
Result State | The outcome of the task once it reached a terminal state (for example, SUCCESS, FAILED, CANCELLED). Blank if the task has not yet completed. | Text
Start Time | The date and time at which this task began execution. | Timestamp
End Time | The date and time at which this task completed or was terminated. | Timestamp
Cluster Id | The identifier of the cluster on which this task was executed. Useful for correlating task failures with specific cluster behavior. | Text
Attempt Number | The attempt sequence number for this task execution. A value of 0 indicates the first attempt. Values greater than 0 indicate the task was retried after a previous failure. | Text
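
Since an Attempt Number greater than 0 marks a retry, counting those records per Task Key is a quick way to spot flaky tasks. The Python sketch below illustrates the idea on hypothetical task run records. The key names, sample values, and the retries_by_task helper are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical task run records; keys loosely mirror the columns above.
task_runs = [
    {"run_id": "201", "task_key": "Start", "attempt_number": 0, "result_state": "SUCCESS"},
    {"run_id": "201", "task_key": "Run_1", "attempt_number": 0, "result_state": "FAILED"},
    {"run_id": "201", "task_key": "Run_1", "attempt_number": 1, "result_state": "SUCCESS"},
    {"run_id": "202", "task_key": "Run_1", "attempt_number": 0, "result_state": "FAILED"},
    {"run_id": "202", "task_key": "Run_1", "attempt_number": 1, "result_state": "FAILED"},
    {"run_id": "202", "task_key": "Run_1", "attempt_number": 2, "result_state": "SUCCESS"},
    {"run_id": "202", "task_key": "Stop", "attempt_number": 0, "result_state": "SUCCESS"},
]


def retries_by_task(records: list) -> dict:
    """Count retry executions (Attempt Number > 0) for each Task Key."""
    counts = defaultdict(int)
    for rec in records:
        if rec["attempt_number"] > 0:
            counts[rec["task_key"]] += 1
    return dict(counts)


retry_counts = retries_by_task(task_runs)
print(retry_counts)
```

In this sample, both runs eventually succeed, yet Run_1 needed three retries across them, which is the kind of hidden instability that final run outcomes alone do not reveal.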

Runs Analysis tab

The Runs Analysis tab provides a comprehensive view of job run behavior over time. Use this tab to identify failure patterns, understand run duration trends, and investigate cancelled or timed-out runs.

The following charts and views are available:

View | Description | Unit
Run Lifecycle States | Distribution of runs across all lifecycle states (for example, pending, running, terminated) at the time of collection. | Text
Run Results Over Time | Time-series chart showing how run outcomes (succeeded, failed, cancelled, timed out) have trended across the selected period. | Chart
Failed Runs | List or chart of runs that ended in a failed state, with details to help identify recurring failure patterns. | Count
Cancelled/Timed Out Runs | View of runs that were cancelled manually or terminated due to timeout, to help identify scheduling or resource issues. | Count
Avg Run Duration Trend | Time-series chart showing how the average run duration has changed over time, useful for detecting performance regressions. | Seconds
Top Slowest Runs | Ranked list of the slowest individual runs by total duration within the selected period. | Text
Runs by Trigger Type | Breakdown of runs by how they were triggered (for example, scheduled, manual, file arrival). | Count
Runs by Type | Distribution of runs by job type (for example, notebook, Python script, JAR, dbt). | Count

Run Duration Analysis

The Run Duration Analysis table at the bottom of the tab breaks down duration by run name, helping you pinpoint the most time-consuming runs and their duration components:

Column | Description | Unit
Run Name | The display name of the individual job run. | Text
Count | Number of times this run was executed within the selected period. | Count
Avg Setup Duration | Average time spent in the setup phase (for example, cluster provisioning) before execution begins. | Seconds
Avg Execution Duration | Average time spent actively executing the run's tasks, excluding setup and cleanup. | Seconds
Avg Cleanup Duration | Average time spent in the cleanup phase after execution completes (for example, cluster termination). | Seconds
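
One way to read this table is to add the three components into an approximate average total and check how much of it goes to setup. The Python sketch below does this on hypothetical rows. It assumes the total is simply the sum of setup, execution, and cleanup, which the table itself does not state; the key names and the with_totals helper are also assumptions.

```python
# Hypothetical rows copied from a Run Duration Analysis table (values in seconds).
duration_rows = [
    {"run_name": "adhoc_report", "avg_setup": 30.0, "avg_execution": 90.0, "avg_cleanup": 30.0},
    {"run_name": "nightly_etl", "avg_setup": 240.0, "avg_execution": 600.0, "avg_cleanup": 60.0},
]


def with_totals(rows: list) -> list:
    """Add an approximate average total and the setup share, slowest first."""
    result = []
    for row in rows:
        # Assumption: total duration is the sum of the three phases.
        total = row["avg_setup"] + row["avg_execution"] + row["avg_cleanup"]
        setup_share = round(100.0 * row["avg_setup"] / total, 1) if total else 0.0
        result.append({
            "run_name": row["run_name"],
            "avg_total": total,
            "setup_share_pct": setup_share,
        })
    return sorted(result, key=lambda r: r["avg_total"], reverse=True)


ranked = with_totals(duration_rows)
for entry in ranked:
    print(entry["run_name"], entry["avg_total"], entry["setup_share_pct"])
```

A high setup share points toward cluster provisioning rather than the job's own logic as the place to look for time savings.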

Tasks Analysis tab

The Tasks Analysis tab provides task-level visibility into job execution. Use this tab to identify which specific tasks are failing, which clusters they run on, and whether reattempts are masking underlying reliability issues.

The following charts and views are available:

View | Description | Unit
Task Lifecycle States | Distribution of tasks across all lifecycle states at the time of collection. | -
Task Results Over Time | Time-series chart showing how task outcomes (succeeded, failed, skipped) have trended across the selected period. | -
Failed Tasks | List or chart of tasks that failed, with details to identify patterns across runs or cluster types. | Count
Tasks by Cluster | Breakdown of task executions by the cluster on which they ran, useful for identifying cluster-level performance or reliability issues. | Count
Task Reattempts | View of tasks that were retried after an initial failure, helping identify flaky tasks or resource contention issues. | Count
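
The Failed Tasks and Tasks by Cluster views can be combined into a per-cluster failure rate when you want to compare clusters directly. The Python sketch below shows the calculation on hypothetical task records. The key names, cluster identifiers, and the failure_rate_by_cluster helper are assumptions, and only tasks with a non-blank Result State are counted.

```python
from collections import Counter

# Hypothetical task run records with cluster assignments.
cluster_task_runs = [
    {"task_key": "Run_1", "cluster_id": "cluster-a", "result_state": "SUCCESS"},
    {"task_key": "Test", "cluster_id": "cluster-a", "result_state": "FAILED"},
    {"task_key": "Stop", "cluster_id": "cluster-a", "result_state": "FAILED"},
    {"task_key": "Run_1", "cluster_id": "cluster-b", "result_state": "SUCCESS"},
    {"task_key": "Test", "cluster_id": "cluster-b", "result_state": "SUCCESS"},
    # Blank Result State: this task has not completed and is ignored below.
    {"task_key": "Stop", "cluster_id": "cluster-b", "result_state": ""},
]


def failure_rate_by_cluster(records: list) -> dict:
    """Return the percentage of completed tasks that failed, per cluster."""
    completed = Counter()
    failed = Counter()
    for rec in records:
        if not rec["result_state"]:
            continue  # skip tasks that have not reached a terminal state
        completed[rec["cluster_id"]] += 1
        if rec["result_state"] == "FAILED":
            failed[rec["cluster_id"]] += 1
    return {
        cluster: round(100.0 * failed[cluster] / count, 1)
        for cluster, count in completed.items()
    }


cluster_failure_rates = failure_rate_by_cluster(cluster_task_runs)
print(cluster_failure_rates)
```

A cluster whose failure rate stands well above the others is a reasonable starting point when investigating cluster-level reliability issues.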

Task Duration Analysis

The Task Duration Analysis table at the bottom of the tab breaks down duration by task name:

Column | Description | Unit
Task Name | The display name of the individual task within the job. | Text
Count | Number of times this task was executed within the selected period. | Count
Avg Setup Duration | Average time spent in the setup phase before the task begins execution. | Seconds
Avg Execution Duration | Average time spent actively executing the task, excluding setup and cleanup. | Seconds
Avg Cleanup Duration | Average time spent in the cleanup phase after the task completes. | Seconds

Setting up alerts: Threshold configuration

To set thresholds for all Databricks Job monitors:

  1. In Site24x7, click Admin in the left navigation pane.
  2. Select Configuration Profiles from the left pane, then select Threshold and Availability from the drop-down menu.
  3. Click Add Threshold Profile in the top-right corner.
  4. For Monitor Type, select Databricks Jobs. You can now set threshold values for the metrics listed above.

IT Automation

Site24x7 offers IT Automation tools that automatically resolve performance degradation issues. These tools respond to events as they occur rather than waiting for manual intervention, automating repetitive tasks and remediating threshold breaches. For setup steps, see How to configure IT Automation for a monitor.

Configuration Rules

With Site24x7's Configuration Rules, you can automate configuration settings across your Databricks Job monitors and create custom rules to track configuration changes continuously. For setup steps, see How to add Configuration Rules.
