Please find the RCA for today's outage:
On Oct 12th, at around 9:50 PM PDT, we were provisioning user space
for new users in the system. This caused unexpected slowness in our
primary data cluster, which degraded the performance of the whole
system. To resolve the performance problem, we stopped the
provisioning of user space to reduce the load on our system, but this
did not help. The problem worsened because of the heavy load from
backlogs on our servers. We then killed the main DB server in the
cluster and swapped it with the slave to resolve the performance
problem.
9:50 PM PDT User space provisioning started.
9:55 PM PDT Received an alert on the performance.
We had the following issues during this period: our Web Console was
inaccessible, monitoring was happening at a slower pace, and server
monitoring data was not persisted immediately.
12:00 AM PDT Stopped the hung user provisioning thread and restarted
all the application servers.
12:10 AM PDT All monitoring started working properly at the expected
poll frequency. The Site24x7 Web Console had accessibility issues.
3:00 AM PDT Promoted the slave DB server to master in our
primary data cluster.
Issue with client servers and server agents had started
5:30 AM PDT Cleared all the backlogs; the service was completely stable.
We sincerely apologize for the inconvenience this caused.
Please get back to us for any further clarifications.