Hi Brian, Tom,
We would like to take this opportunity to point out some of the instances when data collection may not take place, and the ways in which we have handled them and will handle them in future. I'm sure this will be helpful for others as well. (These have already been communicated to Brian by Muralee, our support lead; we are listing them here for the benefit of the community at large.)
1. Configuration errors
A configuration error occurs at the time of adding a monitor. If the resource is unavailable or data collection returns an error when the monitor is added, the monitor is marked as 'Configuration Error'. No data collection happens while the monitor is in this state, so the monitor shows its last poll as the time of discovery or of adding the monitor. One such case was with EC2, which we have fixed, as Ananth said.
2. On-premise Poller data collection
When an on-premise poller is down, no data collection takes place for the monitors assigned to that poller. Those monitors are retained in their last known status (in Brian's case, 'Up') until the poller comes back up and resumes data collection. This is done to avoid false alerts, as the monitored resource could still be up while only the poller is down. However, when a poller is down we do send alerts to the user alert groups configured for the poller. As a first troubleshooting step, check the access permissions for the domains and ports mentioned in this KBase. If access has been given, the poller logs will help us troubleshoot faster.
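To make the poller behaviour above concrete, here is a minimal sketch in Python. It is only an illustration of the rule described, not our actual poller code: the Monitor and Poller classes, the run_poll_cycle function and the collect callable are all hypothetical names made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Monitor:
    name: str
    status: str  # last known status, e.g. "Up" or "Down"


@dataclass
class Poller:
    name: str
    is_up: bool
    alert_group: str  # hypothetical: the alert group configured for the poller
    monitors: list = field(default_factory=list)


def run_poll_cycle(poller, collect):
    """Return a list of (recipient, message) alerts for one poll cycle.

    `collect` is a hypothetical callable that returns the fresh status
    of a monitor; it is never called while the poller is down.
    """
    if not poller.is_up:
        # Poller is down: every monitor keeps its last known status,
        # and only the poller's own alert group is notified.
        return [(poller.alert_group, f"Poller {poller.name} is down")]

    alerts = []
    for monitor in poller.monitors:
        new_status = collect(monitor)
        if new_status != monitor.status:
            alerts.append((monitor.name, f"{monitor.name} changed to {new_status}"))
        monitor.status = new_status
    return alerts
```

With the poller marked as down, a monitor that was 'Up' stays 'Up' and the only alert goes to the poller's alert group; once the poller is back, the next cycle updates the status as usual.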
3. For Domain Expiry monitors & SSL monitors
Generally, when there is an exception, data collection will not happen for that data point. In Brian's case some of the domains reported a 'DOMAIN_EXPIRY_PARSING_ERROR' exception, where Site24x7 did not receive any data regarding the domain. During data collection, Site24x7 uses the registry and registrar details to fetch the domain information. In this case only the registry information was available and the registrar information was not. This resulted in a parsing error, because of which the data did not get stored in our database and the last poll showed as 4 months ago. There was also an SSLHandshakeException for some of the SSL Certificate monitors.
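As a rough illustration of why a parsing error leaves the last poll time stale, here is a small Python sketch. It is a simplification under assumptions, not our data collection engine: parse_domain_info, poll_domain, the store dictionary and the field names are all hypothetical.

```python
from datetime import datetime


class DomainExpiryParsingError(Exception):
    """Stands in for the DOMAIN_EXPIRY_PARSING_ERROR exception in this sketch."""


def parse_domain_info(registry, registrar):
    # Assumption for this sketch: both sources are needed, so a missing
    # registry or registrar record is treated as a parsing error.
    if not registry or not registrar:
        raise DomainExpiryParsingError("DOMAIN_EXPIRY_PARSING_ERROR")
    return {"expiry": registry["expiry"], "registrar": registrar["name"]}


def poll_domain(store, domain, registry, registrar, now):
    """Update `store` only when parsing succeeds; return True on success."""
    try:
        info = parse_domain_info(registry, registrar)
    except DomainExpiryParsingError:
        # Nothing is written, so the earlier last_polled value stays visible.
        return False
    store[domain] = {"last_polled": now, **info}
    return True
```

If the registrar record is missing on every poll, last_polled keeps the timestamp of the last successful poll, which is how a monitor can end up showing a poll from months ago.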
Domain expiry monitors can also get a "Connection reset" error. This means that the target server has reset the connection initiated by the monitoring servers. Some of the possible reasons are:
- High Web traffic
- Target server overload
- Temporary network issues between the monitoring servers and the target server.
We will investigate this issue further, rectify it, and post an update here.
4. Suspended monitors
For any monitor that is suspended no data collection takes place.
A note on using the current status API: suspended_required=false is the default value. Switch the parameter to true only if needed, as the response will then include suspended monitors as well.
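For anyone scripting against the current status API, a small Python sketch of how the parameter could be handled is given below. Only the suspended_required parameter name comes from this post; the BASE_URL value is a placeholder and the current_status_url helper is a hypothetical name, so please substitute the endpoint from the API documentation.

```python
from urllib.parse import urlencode

# Assumption: placeholder base URL, not the real endpoint.
BASE_URL = "https://example.invalid/api/current_status"


def current_status_url(include_suspended=False, base_url=BASE_URL):
    """Build the request URL, adding suspended_required only when needed."""
    if not include_suspended:
        # suspended_required=false is the default, so the parameter is omitted.
        return base_url
    return base_url + "?" + urlencode({"suspended_required": "true"})
```

Calling current_status_url() leaves the parameter out and relies on the default, while current_status_url(include_suspended=True) appends suspended_required=true so that suspended monitors are returned as well.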
Thank you, Brian, for bringing these problems to our notice. This has only made our product better. We are not only taking more preventive measures in our data collection engine so that there is minimal to no disruption to the monitoring system, but also capturing such failures proactively so that end users are not affected.
Site24x7, Product Manager