Operations Issue at 12 Oct 2015, 21:52 PDT

13-Oct-2015 08:40 PM UTC

We faced intermittent problems starting 12 Oct 2015, 21:52 PDT / Oct 13, 2015 10:22 AM IST due to a problem with our database server cluster.

Data collection and polling were intermittent during this time for many customers, and you may notice some data points missing from the charts in the web console. The Site24x7 Web Console was also unreachable at times.

Our engineers are working on this problem and we will update here once we have found the root cause.

We sincerely apologize for the inconvenience this is causing.

- Gibu


Replies (1)


by Srinivasa Raghavan

14-Oct-2015 03:01 AM UTC

Please find the RCA for today's outage:

On Oct 12th, at around 21:50 PDT, we were provisioning user space for new users in the system. This caused unexpected slowness in our primary data cluster, degrading the performance of the whole system. To resolve the performance problem, we stopped the user space provisioning to reduce the load on our system. This did not help, and the problem worsened because of the heavy load from backlogs on our servers.

Finally, we killed the master DB server in the cluster and promoted the slave in its place to resolve the performance problem.

Timeline of Events

At 21:50 PDT User space provisioning started

At 21:55 PDT Received alert on the performance problem

We had the following issues during this period:

Our Web Console was inaccessible intermittently,

Monitoring was happening at a slower pace,

Server Monitoring data were not persisted immediately.

At 00:00 PDT Stopped the hung user provisioning thread and restarted all the application servers.

At 00:10 PDT All monitoring resumed at the expected poll frequency. The Site24x7 Web Console still had intermittent accessibility issues.

At 03:00 PDT Promoted the slave DB server to master in our primary data cluster.

Issues with client servers and server agents started to stabilize at this point.

At 05:30 PDT Cleared all the backlogs; the service was completely stable.

We sincerely apologize for the inconvenience this caused.

Please get back to us if you need any further clarification.
