
Root Cause Analysis for the Monitor

When a server goes down and then comes back up, I like the "Root Cause Analysis for the Monitor" email, but could some work go into it so that the "Down Reason" is more useful? I have noticed that when a server is rebooted, the down reason is "Site24x7WindowsAgent could not establish communication with the Server".

Would it be possible for this root cause option to check the server (including the event log) to see why it was off?

Replies (2)

Re: Root Cause Analysis for the Monitor

Weirdly, I've just asked the very same thing today...

I restarted a server, and Site24x7 captured the server going offline, then coming back again. It then sent me an email with the following RCA: "Site24x7WindowsAgent could not establish communication with the Server".

By restarting the server, I triggered the following event in the System log: Event ID 1074: "The process Explorer.EXE has initiated the restart of computer <servername> on behalf of user <domain\user>"

Could the RCA not be tuned to look for this occurrence, and then report the RCA as "<servername> restarted on request by <domain\user>"?
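To make the request concrete, here is a minimal Python sketch of how an Event ID 1074 message could be turned into that kind of RCA line. It assumes the message follows the wording quoted above; the function name `describe_1074` and the regular expression are hypothetical illustrations, not part of the Site24x7 agent, and user names containing spaces would need a looser pattern.

```python
import re

# Pattern based on the Event ID 1074 wording quoted in this post (assumption).
EVENT_1074 = re.compile(
    r"The process (?P<process>.+?) has initiated the "
    r"(?P<action>restart|shutdown|power off) of computer (?P<computer>\S+) "
    r"on behalf of user (?P<user>\S+)",
    re.IGNORECASE,
)

# Past-tense wording for each action that the pattern can capture.
VERBS = {"restart": "restarted", "shutdown": "shut down", "power off": "powered off"}


def describe_1074(message):
    """Return a readable down reason, or None if the message does not match."""
    match = EVENT_1074.search(message)
    if match is None:
        return None
    verb = VERBS[match.group("action").lower()]
    return (
        f"{match.group('computer')} {verb} on request by {match.group('user')} "
        f"(process: {match.group('process')})"
    )
```

Returning `None` for an unrecognised message lets the caller fall back to the generic down reason instead of reporting a wrong one.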


Re: Re: Root Cause Analysis for the Monitor

Hi,

Thanks for bringing this up.

A server shutdown or restart can happen in a few different ways in your environment. Site24x7 covers the following scenarios and helps you analyze each one.


Scenario 1: Proper Shutdown/Restart of a Server

The Site24x7 Windows agent captures this scenario instantly and notifies the user. Later, when the server is UP again, the agent queries the event logs based on the downtime marked in Site24x7, attaches the results to the RCA, and sends it to the user.


Scenario 2: Forced Shutdown/Restart of a Server

As this is a forced scenario, the Site24x7 Windows agent cannot send a notification about the shutdown/restart. If the downtime of the server goes beyond 2 minutes, we mark the server as DOWN and alert the user.

Later, when the server is UP, the agent collects the RCA report, including the event logs, based on the downtime triggered, and sends it to the user.

 

Scenario 3: Abrupt Stop/Crash of a Server

In this scenario, the Site24x7 Windows agent will not send a notification about the shutdown/restart of the server. If the downtime of the server goes beyond 2 minutes, we mark the server as DOWN and alert the user.

Later, when the server is UP, the agent collects the RCA report based on the downtime triggered in Site24x7 and sends it to the user.
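The 2-minute rule used in Scenarios 2 and 3 can be sketched in Python as follows. This is only an illustration of the threshold check; the names `DOWN_THRESHOLD` and `is_down` are hypothetical, and the actual server-side logic may differ.

```python
from datetime import datetime, timedelta

# Threshold stated above: downtime beyond 2 minutes marks the server DOWN.
DOWN_THRESHOLD = timedelta(minutes=2)


def is_down(last_contact, now, threshold=DOWN_THRESHOLD):
    """Return True once the agent has been silent for longer than the threshold."""
    return now - last_contact > threshold
```

For example, an agent last heard from at 12:00:00 would be treated as DOWN at 12:02:01 but not yet at 12:01:59.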

 

Sample RCA Report - Please read our help documentation on this

 

 

 

50-second timeout for the event log query

After the server is UP, the Windows agent queries the event logs for the period of the Site24x7 downtime of that server and checks for events 6008, 1076, 41, and 1074. Currently, the default timeout for the event log query is 50 seconds. If the server takes longer than that to execute the query, the query may time out and the actual RCA report will not be generated.

Please check which of the above cases applies to you and analyze the issue further. If the problem persists, please share the logs or contact support@site24x7.com so that we can resolve it.
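If you would like to check by hand whether such a query completes within 50 seconds on your server, the following Python sketch runs a comparable query through the built-in Windows `wevtutil` tool. It is an approximation for troubleshooting, not the agent's actual implementation: the function names, the event limit of 20, and the event descriptions in the comments (which follow the commonly documented meanings of these IDs) are assumptions.

```python
import subprocess

# Event IDs listed above, with their commonly documented meanings.
SHUTDOWN_EVENTS = {
    1074: "shutdown or restart initiated by a process on behalf of a user",
    1076: "reason recorded for the previous unexpected shutdown",
    6008: "previous system shutdown was unexpected",
    41: "system rebooted without shutting down cleanly (Kernel-Power)",
}

QUERY_TIMEOUT_SECS = 50  # default timeout mentioned above


def build_xpath(event_ids):
    """Build an XPath filter matching any of the given event IDs."""
    clause = " or ".join(f"EventID={event_id}" for event_id in sorted(event_ids))
    return f"*[System[({clause})]]"


def query_system_log(max_events=20, timeout=QUERY_TIMEOUT_SECS):
    """Return the newest matching System log events as text, or None on timeout."""
    command = [
        "wevtutil", "qe", "System",
        f"/q:{build_xpath(SHUTDOWN_EVENTS)}",
        "/f:text",           # plain-text output
        f"/c:{max_events}",  # limit the number of events returned
        "/rd:true",          # newest events first
    ]
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return None  # comparable to the case where no RCA report is generated
    return result.stdout
```

Calling `query_system_log()` on the affected Windows server and getting `None` back would suggest that the event log is slow enough to hit the timeout.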

 

 

Regards,

Muralikrishnan

 

 
