
Monitor Status on Performance Threshold breach

Option to configure "Down" severity for performance threshold breach has been dropped. Performance threshold breaches will be reported as Trouble Status only.

Kindly share your feedback regarding this.
Replies (10)

I think that a little more intelligence should be added to the monitors, specifically when alerting Trouble events.
For example, if I have a threshold that alerts me if CPU usage exceeds 85%, I would like the monitor to evaluate if it is a peak event for the moment or if it is a prolonged situation before emitting the trouble alert.

If the server's average utilization for the last 15 days has been 10% and it has had a couple of peak events of 87% during that same period, it should not send the trouble alert, because this could be considered one of those peak events. But if the issue continues for a longer period of time than other peak events, then it should alert, along with its analysis.

What do you think?

Dennis

Dennis,

The server agent doesn't send an alert immediately when a value exceeds the configured threshold.

If the current poll exceeds the threshold, the system compares the last five polls. It triggers a trouble alert only if it finds a threshold violation in any two of the last five polls, and only if the average of the last five polls, including the current (fifth) poll, also violates the threshold.
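
As a rough illustration of the rule described above, here is a minimal Python sketch. It is one reading of the description, not the actual agent code: the function name `should_alert`, the `min_breaches` and `window_size` parameters, and the assumption that a "violation" means a value above the threshold are all made up for the example.

```python
from statistics import mean


def should_alert(polls, threshold, min_breaches=2, window_size=5):
    """Decide whether to raise a trouble alert (hypothetical helper).

    polls: poll values, oldest first; the last item is the current poll.
    Assumes a violation means a value strictly above the threshold.
    """
    if len(polls) < window_size:
        return False  # not enough history to evaluate the rule
    window = polls[-window_size:]
    # The current poll itself must violate the threshold.
    if window[-1] <= threshold:
        return False
    # At least `min_breaches` polls in the window must violate it.
    breaches = sum(1 for value in window if value > threshold)
    # The window average must violate it as well.
    return breaches >= min_breaches and mean(window) > threshold
```

Under this reading, a single spike such as `[10, 12, 11, 9, 87]` against an 85% threshold would not alert, while a sustained run such as `[90, 88, 86, 91, 87]` would.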

However, we will consider your suggestion and add it to our roadmap.

Regards,
Elango.

Following on from this, we should be able to specify how long the CPU or memory should be above the threshold before alerting. We should be able to set the CPU threshold to 95% but only be alerted if usage stays at or above the threshold for our specified period of time, giving us full granular control of our alerting.

Robert,

Yes, we understand the importance of this and will let you configure the threshold period to be checked before alerting.

Keep giving your feedback.

Raghavan

Hi Robert,

Some additional information about this requirement:

We are working on an anomaly detection platform that will compare performance values against the metric's seasonal average over the last 4 weeks. This will be performed automatically, without any configuration on your side. The feature is currently in the testing phase and is enabled for our testing accounts only. Once it is stable, we will open it up to a small set of customers.

Hopefully that should help in proper alerting for thresholds.

Raghavan

Though that sounds good, we need to be able to specify when an alert is triggered rather than have the system make that decision for us. We are a SaaS provider with bespoke services and processes that can be very demanding at times and idle at other times, and we need to be able to adjust the alerting accordingly. Our platform is always evolving, and the monitoring may need to change to accommodate this and stop false positive alerting.

The Platform is doing what it's supposed to be doing and alerting when the thresholds are violated so that is not in question but without control over when it alerts, Site24x7 can become pretty useless to us in all honesty.  The only thing worse than no alerting is too much alerting as real incidents get lost in the sea of false positives.

It would be nice to have a time-based threshold option, either as an individual feature or as part of the new anomaly features.

Something like:

Mem usage is x% for more than x minutes
Disk usage spiked more than x% in last x minutes
Disk usage remains over threshold for more than x minutes (this will avoid normal spikes from creating alerts)
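
To make the "spiked more than x% in last x minutes" idea concrete, here is a small hypothetical Python sketch. Nothing here is an existing Site24x7 feature: the function name `spiked`, the `(minute, value)` sample format, and the choice to measure the rise from the lowest value in the window are assumptions for illustration only.

```python
def spiked(samples, delta, minutes):
    """Return True if usage rose by more than `delta` percentage points
    within the last `minutes` minutes (hypothetical rule).

    samples: list of (minute_timestamp, usage_percent), oldest first.
    """
    if not samples:
        return False
    now, latest = samples[-1]
    # Keep only the samples that fall inside the look-back window.
    recent = [value for ts, value in samples if now - ts <= minutes]
    # Compare the latest value against the lowest value in the window.
    return latest - min(recent) > delta
```

For example, with samples `[(0, 40), (5, 42), (10, 70)]`, a 20-point rule over 10 minutes would fire, because usage rose from 40% to 70% inside the window.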

Make the criteria a field that is set by customers and maybe a grayed number beside it that would be a suggestion based on performance over time as Site24x7 self learns the product it is monitoring, thus allowing customers to make better decisions when setting their thresholds.

I was thinking about this. Will site thresholds be allowed to have this feature?

For example: site response time is over X (ms) for more than C (time).

Hello framirez,

Sorry for the delay in implementing this.  We are currently working on this feature to support the following enhancements:
  • Trouble alarm only after 'x' consecutive threshold breaches at the attribute level
  • Trouble alarm only after the threshold has been breached continuously for a time period of 'x' at the attribute level
This will be available soon. We will keep you updated once it is released.

Raghavan

Hi Robert/framirez,

We have introduced a 'Strategy' option in our threshold configuration that gives you more control over Trouble alerts and helps you avoid false positives.

Apart from configuring the thresholds for the attributes, you can specify the number of polls or the time interval over which the threshold must be continuously breached, or over which the average value of the attribute is evaluated. The available strategies are:

Poll count: When the set threshold value is breached continuously for the configured "Poll count", the monitor status changes to trouble.

Poll Avg: When the average of the attribute values, over the number of polls configured, exceeds the threshold value, the monitor status changes to trouble.

Time Range: When the set threshold value is breached continuously, for all the polls, during the time duration configured, the monitor status changes to trouble.

Avg Time: When the average of the attribute values, over the time duration configured, exceeds the threshold value, the monitor status changes to trouble.
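
The four strategies can be sketched in a few lines of Python to show how they differ. This is only an illustration under stated assumptions, not the product's implementation: the function `is_trouble`, the strategy strings, the `amount` and `poll_interval` parameters, and the mapping of a time duration to a number of fixed-interval polls are all hypothetical, and a "breach" is assumed to mean a value above the threshold.

```python
from statistics import mean

POLL_STRATEGIES = {"poll_count", "poll_avg"}
TIME_STRATEGIES = {"time_range", "avg_time"}


def is_trouble(values, threshold, strategy, amount, poll_interval=5):
    """Evaluate one strategy against poll values (oldest first).

    amount: number of polls for poll strategies, minutes for time ones.
    poll_interval: assumed fixed minutes between polls.
    """
    if strategy in POLL_STRATEGIES:
        n = max(1, amount)
    elif strategy in TIME_STRATEGIES:
        # Assumption: a duration covers the polls that fit inside it.
        n = max(1, amount // poll_interval)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    if len(values) < n:
        return False  # not enough polls collected yet
    window = values[-n:]
    if strategy in ("poll_count", "time_range"):
        # Every poll in the window must breach the threshold.
        return all(value > threshold for value in window)
    # poll_avg / avg_time: only the window average must breach.
    return mean(window) > threshold
```

With CPU polls of `[40, 96, 97, 80, 99]` and a 90% threshold, `is_trouble(cpu, 90, "poll_count", 3)` is false because the dip to 80% breaks the continuous breach, while `is_trouble(cpu, 90, "poll_avg", 3)` is true because the average of the last three polls is 92%. The averaging strategies tolerate a brief dip; the continuous ones do not.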

We look forward to your feedback on the same. 
