Choosing sensible thresholds for alerting and investigation is one of the trickier parts of modern infrastructure monitoring. Static thresholds cannot adapt to a system's dynamic behavior, so they never learn what is normal and what is not. The result is too many false alarms and too many missed ones. At Site24x7, we're on the verge of a new era of self-learning monitoring solutions based on artificial intelligence and machine learning. With the launch of our AI-powered anomaly and outlier detection, you can largely do away with static thresholds and receive accurate alerts even before the actual issue surfaces in your stack. The Anomaly Dashboard lets you distinguish normal from abnormal trends in your metrics, forecast potential outages, and fix issues before they start affecting your customers.
Using Anomaly data for DevOps productivity
Anomaly detection is a set of robust techniques and predictive machine learning models applied to find unusual behaviors and states in systems and their observable signals. Site24x7 uses an AI-based anomaly framework to detect unusual spikes or aberrations in your critical monitored metrics, such as response time, CPU usage, and memory usage. DevOps technicians can use the anomaly data to improve operational efficiency. Here are a few ways operations teams can use the anomaly data for faster incident resolution:
- It can detect abnormal metric values in order to find otherwise undetected issues in your stack. For example, it can flag unusually high memory usage on a server, which may eventually leave that server busy or idle.
- It can sense drastic changes in an important metric or process, so you can examine the situation to find potential issues.
- It can reduce the need to set or recalibrate thresholds across a variety of different monitors.
- It can narrow the search space when you are trying to diagnose a problem in your stack.
Here's how we detect an anomaly in Site24x7
Our AI-based anomaly detection method accounts for:
- Seasonality, where the metric pattern structure keeps recurring more or less the same way, week after week.
- Trend, where the overall metric pattern direction (rise or fall) repeats itself uniformly for a prolonged period.
- Robustness, where detection is not thrown off by insignificant performance spikes in the metric.
Detection of anomalies involves these three steps:
- Mathematical modeling: The anomaly engine uses a univariate algorithm for location-based monitors. It uses the Robust Principal Component Analysis (RPCA) algorithm to gain a considerable level of robustness in reporting anomalies. The anomaly engine spots anomalous behavior in the data recorded for a single metric of interest, usually the response time attribute for all Internet Services Monitors. For agent-based monitors like server and application monitors, where multiple attributes are to be tracked, the anomaly engine works on such combinations using a matrix sketching algorithm.
- Anomaly event generation: This stage performs all the heavyweight data crunching. For the univariate algorithm, the anomaly engine trains on the hourly 95th percentile metric values from the corresponding day in each of the last four weeks. The anomaly engine collects metrics every 15 minutes from data collection agents and compares your KPIs against the trained data. In contrast, the multivariate algorithm trains on the last four weeks' hourly 95th percentile values from correlated attributes. We use the 95th percentile value because it snips off the highest five percent of values, which removes stray spikes from the training data. Learn how the anomaly event generation happens.
- Domain scoring: Scoring is a method used to measure anomaly severity. The anomaly engine assigns unique scores to events. Based on factors like domain scores, dependencies, and the increased gravity of detected anomalies, anomalies are labeled as confirmed, likely, or info. Read more about domain scoring here.
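To make the training step above more concrete, here is a minimal Python sketch of building an hourly baseline from 95th percentile values. It is an illustration under stated assumptions, not Site24x7's actual implementation: the function names (`p95`, `hourly_baseline`, `is_anomalous`), the nearest-rank percentile definition, the averaging of the four weekly values, and the `tolerance` factor are all hypothetical choices made for the example.

```python
from statistics import mean


def p95(values):
    """Nearest-rank 95th percentile (an assumed definition):
    the smallest sample with at least 95% of samples at or below it."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def hourly_baseline(weeks):
    """Build a per-hour baseline from several weeks of data.

    weeks: list of dicts {hour: [samples]}, one dict per week,
    all for the same weekday. Returns {hour: mean of weekly p95 values}.
    Averaging the weekly values is an assumption for illustration.
    """
    hours = set().union(*(week.keys() for week in weeks))
    return {
        hour: mean(p95(week[hour]) for week in weeks if hour in week)
        for hour in sorted(hours)
    }


def is_anomalous(value, baseline, tolerance=1.5):
    """Flag a value that exceeds the baseline by a hypothetical factor."""
    return value > baseline * tolerance


# Four weeks of response times (ms) for the 09:00 hour; each week
# contains one spike that the 95th percentile discards.
normal_hour = [100] * 19 + [900]
weeks = [{9: normal_hour} for _ in range(4)]
baseline = hourly_baseline(weeks)

print(baseline[9])                     # spike is excluded from the baseline
print(is_anomalous(180, baseline[9]))  # well above the baseline
print(is_anomalous(120, baseline[9]))  # within tolerance
```

With 20 samples per hour, the single 900 ms spike falls in the top five percent and does not influence the baseline, which is the effect the 95th percentile is chosen for.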
Anomaly reporting with graphs and flexible alerting
By accessing the time series graph in the Anomaly Dashboard and the events timeline graph in the monitor's summary page, you can closely inspect the metric in question by verifying the monitor’s evaluation window. Read this article to learn how to interpret the anomaly data. You can also export the anomaly dashboard as CSV or PDF and share it with users.
Alerting is an integral part of anomaly detection. Sporadic anomaly alerts from a single monitor are usually grouped and delivered to users by email at the end of the hour. To set up anomaly email alerts in Site24x7, log in to the web client and go to Admin > Users & Alerts > Add Users > Advanced Settings, then select Email as your preferred alerting mode.
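The hourly grouping described above can be pictured with a short Python sketch. This is only an illustration of the idea: the event tuple layout and the `group_by_hour` helper are hypothetical and do not reflect how Site24x7 stores or delivers alerts.

```python
from collections import defaultdict
from datetime import datetime


def group_by_hour(events):
    """Group anomaly events per monitor and per clock hour.

    events: iterable of (monitor_name, timestamp, message) tuples
    (a hypothetical layout). Returns {(monitor_name, hour_start): [messages]}.
    """
    digests = defaultdict(list)
    for monitor, timestamp, message in events:
        hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
        digests[(monitor, hour_start)].append(message)
    return dict(digests)


events = [
    ("web-01", datetime(2024, 1, 8, 9, 15), "response time spike"),
    ("web-01", datetime(2024, 1, 8, 9, 45), "response time spike"),
    ("web-01", datetime(2024, 1, 8, 10, 15), "CPU usage spike"),
]
digests = group_by_hour(events)

# One digest per monitor per hour, ready to send as a single email.
for (monitor, hour_start), messages in digests.items():
    print(monitor, hour_start.isoformat(), len(messages))
```

Both 09:00 events end up in one digest, so the user would receive a single message for that hour rather than one per anomaly.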
Get started now!
If you already have a paid Site24x7 monitor set up, you can head straight to the Anomaly Dashboard and start detecting anomalous metrics. Anomaly detection is just the beginning; expect to see more AIOps features soon. Feel free to leave your feedback in the comments section below. If you have any questions, please get in touch with us at email@example.com.