Cron jobs are to data centers what nails are to a house: you probably don't think about them very often, but they are all over the place, and they play a critical role in holding everything together. If they don't do their job properly, everything falls apart.
Unlike nails, though, cron jobs can be fickle. A number of things can go wrong with them, from failure to start, to timeouts, to performance issues, and beyond. When a cron job that your operating system or apps depend on doesn't work properly, it can result in significant consequences for your operations and your customers.
That's why monitoring cron jobs is a critical requirement for any IT environment that includes Unix-like systems that use cron.
Cron is a utility for Linux and other Unix-like operating systems that automatically performs tasks according to schedules configured by IT admins. A widely used tool available since the 1970s, cron is installed by default on most Unix-like operating systems available today, including those used to host workloads in a data center.
A task that is configured to run via cron is known as a cron job. Usually, cron jobs are shell scripts (often Bash) that perform maintenance tasks, such as rotating log files, clearing out temp directories, or updating data caches.
Cron jobs are typically configured in a file known as a crontab. The system-wide crontab is located at /etc/crontab and can be edited only by the root user (or a user with sudo privileges). Individual users can also add their own crontabs.
Most operating systems come with some /etc/crontab entries by default, and it's common for IT admins to add custom entries to help automate maintenance tasks. As a result, even a midsize data center could have thousands of cron jobs spread across its servers, all quietly doing their part to keep the data center running smoothly.
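To make the structure of a crontab entry concrete, here is a minimal Python sketch that splits one line of a system-wide crontab into its parts. It assumes the common /etc/crontab layout of five schedule fields, a user name, and a command. The `CronEntry` class and `parse_system_crontab_line` function are hypothetical names introduced for illustration, and the sketch deliberately ignores shortcuts such as `@daily` and per-user crontabs, which omit the user field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CronEntry:
    minute: str
    hour: str
    day_of_month: str
    month: str
    day_of_week: str
    user: str      # present in /etc/crontab, absent in per-user crontabs
    command: str


def parse_system_crontab_line(line: str) -> Optional[CronEntry]:
    """Parse one /etc/crontab-style line.

    Returns None for blank lines, comments, and anything that does not
    have all seven fields (for example, environment settings).
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    # Split into at most seven parts so the command keeps its own spaces.
    fields = stripped.split(None, 6)
    if len(fields) < 7:
        return None
    return CronEntry(*fields)
```

An inventory built this way across servers is one possible starting point for knowing which jobs exist before deciding what to monitor.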
The tasks that cron jobs handle might feel mundane, but they're critical to keeping operating systems and applications performing efficiently.
If a cron job that rotates log files or clears out temporary directories doesn't run as scheduled, the files could take up so much space that your server runs out of disk capacity. Eventually, a server that runs out of space will fail.
Likewise, failure to run a cron job that updates data caches could harm the performance of an application that depends on the cache, because the cache data will be out of date when the app accesses it.
To avoid problems like these, you need to monitor cron jobs to make sure they execute properly. You can't ensure healthy operations for your operating system or applications without keeping track of cron operations.
By default, you typically won't receive any alerts about cron failures. Operating systems usually don't send any kind of notification about issues with cron beyond making an entry in your syslog file. In effect, cron fails silently, leaving IT admins with little insight into potential issues unless they hear about performance problems from end users.
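One way to make a silent failure visible is to run each job through a small wrapper that records whether it started, how long it took, and how it exited. The sketch below is an illustration under stated assumptions, not a description of how cron or any monitoring product works: `JobResult` and `run_job` are hypothetical names, and the timeout value is whatever you consider reasonable for the job.

```python
import subprocess
import sys
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobResult:
    command: List[str]
    exit_code: Optional[int]  # None if the job never started or timed out
    duration_seconds: float
    timed_out: bool
    error_output: str

    @property
    def ok(self) -> bool:
        return self.exit_code == 0 and not self.timed_out


def run_job(command: List[str], timeout_seconds: float) -> JobResult:
    """Run a command and capture the facts a monitor would care about."""
    start = time.monotonic()
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        # The job started but did not finish within the allowed time.
        return JobResult(command, None, time.monotonic() - start, True, "timed out")
    except OSError as exc:
        # The job could not be started at all (missing file, no permission).
        return JobResult(command, None, time.monotonic() - start, False, str(exc))
    return JobResult(
        command, completed.returncode, time.monotonic() - start, False, completed.stderr
    )


if __name__ == "__main__":
    # Demonstration with a child Python process that exits with a nonzero code.
    result = run_job([sys.executable, "-c", "raise SystemExit(3)"], timeout_seconds=30)
    print(result.ok, result.exit_code, round(result.duration_seconds, 2))
```

The returned record could then be written to a log or forwarded to whatever alerting channel you use.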
Complicating matters further is that cron is a local utility, meaning each server in your data center will have its own crontab (or crontabs, if there are user crontabs in addition to a root crontab). Each server will also log data about cron operations locally.
To check on your cron jobs manually, you would have to log in to each server separately and look in its syslog files. Alternatively, you could find a way to aggregate the syslog files from across your data center, then identify which cron entries in the aggregated logs correspond to which servers. Either way, you're in for a messy, inefficient experience if you have multiple servers and try to monitor cron manually.
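To illustrate what sorting aggregated logs by server involves, the following Python sketch groups cron command entries by host name. It assumes the traditional syslog line layout in which cron logs each command as `(user) CMD (command)` after a timestamp, host name, and process tag. Log formats vary between distributions and syslog configurations, so the regular expression here is an assumption to adapt, and `group_cron_commands_by_host` is a hypothetical helper.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed layout: "Jan 12 03:00:01 web01 CRON[12345]: (root) CMD (/path/to/job)"
CRON_LINE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?:CROND|CRON|crond|cron)\[(?P<pid>\d+)\]: "
    r"\((?P<user>[^)]+)\) CMD \((?P<command>.*)\)$"
)


def group_cron_commands_by_host(
    lines: Iterable[str],
) -> Dict[str, List[Tuple[str, str, str]]]:
    """Return {host: [(timestamp, user, command), ...]} for cron CMD lines."""
    by_host = defaultdict(list)
    for line in lines:
        match = CRON_LINE.match(line.strip())
        if match:  # non-cron lines are skipped
            by_host[match["host"]].append(
                (match["timestamp"], match["user"], match["command"])
            )
    return dict(by_host)
```

Note that these log lines record that a command was launched, not whether it succeeded, which is part of why log inspection alone is a weak form of monitoring.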
There are a variety of factors to track for each cron job in your servers' crontab files. Helpful metrics to monitor include:
In addition to these metrics, you should also track the health of cron as a whole by measuring data points such as:
How many cron jobs run in a given period? Tracking the total number of running cron jobs helps you gauge the effectiveness of cron in general.
Cron failure and response trends: Tracking the total number of cron job failures or slowdowns in response over time provides visibility into the health of overall cron operations.
How many resources does cron consume in a given period? Knowing the collective load that cron jobs place on your server will help you avoid undercutting server performance. You may also decide to reschedule some cron jobs if you find that they are placing a heavy load on the server at inopportune times, such as times of day when user traffic is high.
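As a rough illustration of these fleet-level data points, the sketch below summarizes a batch of job run records for one period: total runs, failures, failure rate, average duration, and total CPU time. The `RunRecord` fields and the `summarize_runs` function are hypothetical. How you collect the records (for example, from a wrapper script or a monitoring agent) is left open.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class RunRecord:
    job_name: str
    succeeded: bool
    duration_seconds: float
    cpu_seconds: float


def summarize_runs(records: Iterable[RunRecord]) -> Dict[str, float]:
    """Aggregate run records for one period into overall cron health figures."""
    records = list(records)
    total = len(records)
    failed = sum(1 for record in records if not record.succeeded)
    return {
        "total_runs": total,
        "failed_runs": failed,
        # Guard against division by zero when nothing ran in the period.
        "failure_rate": failed / total if total else 0.0,
        "mean_duration_seconds": (
            sum(record.duration_seconds for record in records) / total if total else 0.0
        ),
        "total_cpu_seconds": sum(record.cpu_seconds for record in records),
    }
```

Comparing these summaries across successive periods is one simple way to spot the failure and slowdown trends described above.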
Monitoring cron jobs and overall cron performance is only half the battle. Equally important is having an efficient, automated solution in place to notify IT admins instantly when something goes wrong with any cron job on any server in your data center.
Cron itself doesn't go out of its way to tell you about issues and it doesn't have any native notification features beyond writing to log files, so you'll need to implement a separate utility for cron alerts.
Site24x7 solves this problem by enabling you to configure cron alerts that will be sent by SMS, email, and/or your favorite real-time collaboration app, like Slack or HipChat. Incident management platforms like PagerDuty are supported, too.
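Whatever tool delivers the notification, the underlying check is often a comparison between when a job last reported success and when it was expected to. The Python sketch below shows that idea in generic form. It does not use or represent the Site24x7 API: `find_overdue_jobs`, `alert_on_overdue`, the grace period, and the `notify` callback are all hypothetical and stand in for whichever alert channel you use.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, List


def find_overdue_jobs(
    last_success: Dict[str, datetime],
    expected_interval: Dict[str, timedelta],
    now: datetime,
    grace: timedelta = timedelta(minutes=5),
) -> List[str]:
    """Return names of jobs that have not succeeded within interval + grace."""
    overdue = []
    for job, interval in expected_interval.items():
        last = last_success.get(job)
        # A job that has never reported success is treated as overdue.
        if last is None or now - last > interval + grace:
            overdue.append(job)
    return sorted(overdue)


def alert_on_overdue(overdue: List[str], notify: Callable[[str], None]) -> None:
    """Send one message per overdue job through the supplied callback."""
    for job in overdue:
        notify(f"cron job {job!r} has not reported success within its expected window")
```

The grace period is a design choice: it avoids alerting on jobs that are merely a little late, at the cost of a slightly slower alert for jobs that have truly failed.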
Cron is a powerful tool, and it plays a critical role in most modern data centers. However, one of its weaknesses is that it fails silently and offers little in the way of automated monitoring or alerting.
If you don't invest in a solution that lets you monitor cron automatically across your entire data center, you are left with a critical visibility gap. You can close that gap with a monitoring tool that offers easy-to-deploy cron monitoring along with advanced alerting and deep insights into cron performance.