Analyze root cause of a downtime with Site24x7 RCA report
Site24x7's Root Cause Analysis (RCA) Report offers valuable insights into the various issues in your IT infrastructure. In this post, we will see how to interpret an RCA report and how to put RCA to optimal use in identifying and addressing performance and network issues.
RCA is different from a conventional downtime alert. Conventional downtime alerts contain details like the start time of the downtime and a traceroute if available, which are sufficient for you to know what has happened. However, the critical question of why it happened remains unanswered. You need to know what caused the issue so that you can start working on it immediately.
Root Cause Analysis automatically generates a plethora of information to arrive at a definite conclusion as to what triggered a downtime. RCA intends to determine the root cause of specific downtime or performance issues. In short, RCA aims to answer questions like what went wrong, how it went wrong and why it went wrong.
Interpret your RCA report.
We will take the example of a simple 'website not reachable' scenario and interpret what the report says. A normal RCA report comprises the following details:
- Checks from Primary location and re-checks from the Secondary locations.
- Ping Analysis
- DNS Analysis
- ICMP or TCP based Traceroute Analysis
- MTR Report
1. Monitor Details & Location Details
This section shows you the current status of your website when polled from the Primary and Secondary monitoring locations. It includes the downtime details, the duration, and the location-wise reason for the outage.
First Check - First check is done from the Primary Location
The screenshot in this section shows the exact error returned when the monitoring station tried to connect to your website. It serves as evidence of what happened when the monitoring station tried to reach the remote server. In our example, the site returned a 'Connection Timed Out' error.
2. Ping Analysis and Traceroute Analysis
Status: Server Unreachable due to timeout in Hop 16
Ping analysis can be used to gather valuable insights like the number of packets sent, packet loss, and response time. Traceroute analysis lets you diagnose network issues and helps you identify weak points in your network. Traceroute is a simple tool that shows the actual path to a remote server. The target can be anything from a website that you attempt to connect to, to a remote device on your intranet. All traceroutes, except those for the Ping monitor, use the TCP protocol. The Ping monitor generates an ICMP-based traceroute.
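To make the ping and traceroute figures concrete, here is a minimal Python sketch of how such results could be summarized. The `Hop` type, the function names, and the input format (a round-trip time in milliseconds, or `None` for a timed-out probe) are hypothetical and chosen for illustration; this is not how Site24x7 computes its report.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Hop:
    """One traceroute hop; rtt_ms is None when the probe timed out."""
    number: int
    host: str
    rtt_ms: Optional[float]


def summarize_ping(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize ping probes; a None entry marks a lost packet."""
    sent = len(rtts_ms)
    replies = [r for r in rtts_ms if r is not None]
    return {
        "sent": sent,
        "received": len(replies),
        "loss_pct": 100.0 * (sent - len(replies)) / sent if sent else 0.0,
        "avg_ms": sum(replies) / len(replies) if replies else None,
    }


def first_unreachable_hop(hops: Sequence[Hop]) -> Optional[int]:
    """Return the hop number where the path stops responding.

    Returns None when the final hop replied. A timeout in the middle of
    the path is ignored, because some routers simply do not answer probes.
    """
    if not hops or hops[-1].rtt_ms is not None:
        return None
    # Walk backwards over the trailing run of timed-out hops.
    index = len(hops) - 1
    while index > 0 and hops[index - 1].rtt_ms is None:
        index -= 1
    return hops[index].number


if __name__ == "__main__":
    print(summarize_ping([12.0, None, 14.0, None]))
    # Hypothetical path: hop 13 is silent, hops 16 to 18 never answer.
    path = [Hop(n, f"hop{n}.example", None if n == 13 or n >= 16 else 10.0 + n)
            for n in range(1, 19)]
    print(first_unreachable_hop(path))
```

Only the trailing run of timeouts is treated as the failure point, since a single silent hop followed by responding hops usually means that router does not answer probes, not that the path is broken.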
3. Test: Domain Analysis
Status: Domain resolved properly
This is a complete health check of your Name Servers and Email Servers. It retrieves records like SOA details and Name Server performance.
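As a rough illustration of the "Domain resolved properly" status, the following Python sketch checks whether a hostname resolves to at least one address using only the standard library. The function name and the injectable `resolver` parameter are assumptions made for this example. It does not retrieve SOA records or measure name server performance, which would require a dedicated DNS library.

```python
import socket
from typing import Callable, List


def resolve_addresses(hostname: str,
                      resolver: Callable = socket.getaddrinfo) -> List[str]:
    """Return the sorted unique IP addresses for hostname, or [] on failure."""
    try:
        results = resolver(hostname, None)
    except socket.gaierror:
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in results})


if __name__ == "__main__":
    addresses = resolve_addresses("localhost")
    status = "Domain resolved properly" if addresses else "Domain did not resolve"
    print(status, addresses)
```

Passing the resolver in as a parameter keeps the check testable without network access.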
4. MTR Report
MTR, also known as My Traceroute, combines the functionality of Traceroute and Ping in a single interface. By combining these functions, MTR constantly polls your remote server and lets you see how latency and performance change over time. Since the output in MTR is constantly updated, it lets you collect actual performance trends and averages, and gives you a clear picture of network performance over a period of time.
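The idea of folding repeated probes into per-hop trends can be sketched in Python as follows. The input format (one dictionary per probe cycle, mapping hop number to a round-trip time in milliseconds or `None` for a lost probe) and the function name are hypothetical; real MTR output carries more fields than this.

```python
from typing import Dict, List, Optional


def aggregate_mtr(cycles: List[Dict[int, Optional[float]]]) -> Dict[int, dict]:
    """Fold repeated probe cycles into per-hop loss and latency statistics."""
    collected: Dict[int, dict] = {}
    for cycle in cycles:
        for hop, rtt in cycle.items():
            entry = collected.setdefault(hop, {"sent": 0, "rtts": []})
            entry["sent"] += 1
            if rtt is not None:
                entry["rtts"].append(rtt)

    report: Dict[int, dict] = {}
    for hop in sorted(collected):
        sent = collected[hop]["sent"]
        rtts = collected[hop]["rtts"]
        report[hop] = {
            "sent": sent,
            "loss_pct": 100.0 * (sent - len(rtts)) / sent,
            "avg_ms": sum(rtts) / len(rtts) if rtts else None,
            "best_ms": min(rtts) if rtts else None,
            "worst_ms": max(rtts) if rtts else None,
        }
    return report


if __name__ == "__main__":
    cycles = [
        {1: 1.0, 2: 10.0},
        {1: 3.0, 2: None},
        {1: 2.0, 2: 30.0},
        {1: 2.0, 2: None},
    ]
    for hop, row in aggregate_mtr(cycles).items():
        print(hop, row)
```

A hop that shows steady loss across many cycles is a stronger signal than a single timed-out probe in one traceroute, which is the practical advantage of the MTR view.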
Re-Checks from Secondary location
The same set of tests will be conducted from all your configured secondary locations as well. This is to confirm the actual downtime.
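The confirmation step can be pictured with a small Python sketch. The rule used here (the outage counts as confirmed only when the primary location and every secondary location report a failure) is an assumption made for illustration, as are the function and parameter names; the exact policy Site24x7 applies is not described in this post.

```python
from typing import Dict


def confirm_downtime(primary_down: bool, secondary_down: Dict[str, bool]) -> bool:
    """Confirm an outage only if the primary and all secondary locations failed.

    Assumed rule for illustration. With no secondary locations configured,
    the primary result alone decides.
    """
    return primary_down and all(secondary_down.values())


if __name__ == "__main__":
    # Hypothetical location names and results.
    rechecks = {"secondary-a": True, "secondary-b": True}
    print(confirm_downtime(True, rechecks))
```

Rechecking from other locations guards against reporting an outage that is really a problem local to one monitoring station.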
Finally, the RCA report states the probable reason for the outage based on the above results. The conclusion reached in this case is “Connection to the server got dropped in the Hop 16”, which points to a network problem. Armed with this conclusion, you can immediately get down to fixing the issue.
For the Website, REST API, and REST API Transaction monitor types, Site24x7 generates a Trouble or Down RCA report. RCA reports are not available for the SSL, Website Defacement, and Domain Expiry monitors. We would love to hear from you on how RCA has been helpful in identifying and troubleshooting issues. Share your valuable feedback in the comments.
If you do not want to receive the RCA report by email, navigate to Admin > Configuration Profiles > Notification Profile. You will see the option "Root Cause Analysis Report by email when the monitor is down". Disable this option.