5 lessons from the October 2021 Facebook outage
On October 4, 2021, Facebook's services began to degrade and then dropped off the internet entirely at 15:39 UTC. It took nearly six hours to restore normal service. With more than 3.5 billion users of one or more products from Facebook, Inc. (now known as Meta Platforms, Inc.) facing a lengthy downtime, conversations flooded the internet about what had caused the outage at the American social networking company. This article outlines the events that led to the outage and aims to help organizations large and small learn from the breakdown.
During regular network maintenance activities, Facebook engineers applied a patch to the network routers in its backbone network, unintentionally shutting them down.
The audit tool that usually catches such mistakes contained a bug, so it failed to stop the change.
Facebook operates its own backbone network, which connects its data centers and carries their traffic to the internet through various entry points. The faulty configuration change to the backbone routers interrupted all internal communications.
This had a cascading effect on the company's internal network: one by one, parts of the network became unhealthy and stopped announcing their presence to the internet, and eventually all of the company's apps and services, including internal access points, went off the grid.
As a result, a facility that responded to DNS queries became unreachable itself. DNS resolution errors skyrocketed, and within minutes every entry point to Facebook's content worldwide was effectively unreachable. Facebook was suddenly off the grid, and the domain was even listed as "available" for sale for a short time.
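The spike in DNS resolution errors described above is something an outside observer can measure. The following is a minimal sketch, not a description of any tool Facebook or Site24x7 uses: the names `check_dns`, `failure_rate`, and `system_resolver` are hypothetical, and the resolver is passed in as a parameter so the logic can be exercised without network access.

```python
import socket
from typing import Callable, Dict, List

# A resolver takes a hostname and returns a list of IP strings.
# It raises socket.gaierror when the name cannot be resolved.
Resolver = Callable[[str], List[str]]


def system_resolver(hostname: str) -> List[str]:
    """Resolve a hostname with the operating system's resolver."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the IP address as a string.
    return sorted({info[4][0] for info in infos})


def check_dns(hostnames: List[str],
              resolver: Resolver = system_resolver) -> Dict[str, List[str]]:
    """Map each hostname to its addresses; an empty list means it failed."""
    results: Dict[str, List[str]] = {}
    for name in hostnames:
        try:
            results[name] = resolver(name)
        except socket.gaierror:
            results[name] = []
    return results


def failure_rate(results: Dict[str, List[str]]) -> float:
    """Fraction of hostnames that did not resolve."""
    if not results:
        return 0.0
    failed = sum(1 for addrs in results.values() if not addrs)
    return failed / len(results)
```

Calling `failure_rate(check_dns(["example.com", "example.org"]))` at regular intervals gives a simple error-rate signal. A real monitor would also query from several locations, because a single vantage point cannot tell a local resolver problem from a global one.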
What was the initial response?
"Our internal tools and systems complicated [our IT teams'] attempts to diagnose quickly and resolve the problem," explained Santosh Janardhan, VP for infrastructure at Facebook Inc., adding, "[Our IT team] identified the root cause as a faulty configuration and ruled out any malicious activity or data breach."
Why and how did this major outage happen?
The Border Gateway Protocol (BGP) is the postal system of the internet: through it, companies such as Facebook announce the routes to their autonomous systems to other networks. In other words, BGP helps networks choose the best way to reach any other network, like a postal service.
The internet is a network of networks, so it is vital for peers to keep announcing themselves to stay in the routing tables that enable users worldwide to reach their servers. Inside Facebook is a vast network that the company calls its backbone network, a long-term investment in its own private network that spans the globe and links its data centers over fiber.
The facilities connect to each other over this backbone network through routers. On October 4, a routine maintenance job on these routers unintentionally took down all the connections in the backbone network. An analogy is a kitchen being cut off from its restaurant, leaving impatient, hungry diners demanding meals.
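To make the effect of a withdrawn announcement concrete, here is a toy model of a routing table. It is a simplified sketch, not an implementation of BGP: the `RoutingTable` class is hypothetical, it handles IPv4 only, and the prefix and AS number come from ranges reserved for documentation rather than from Facebook's network.

```python
import ipaddress
from typing import Dict, Optional


class RoutingTable:
    """Toy routing table mapping announced prefixes to the announcing peer."""

    def __init__(self) -> None:
        self._routes: Dict[ipaddress.IPv4Network, str] = {}

    def announce(self, prefix: str, peer: str) -> None:
        """Record that `peer` announced a route to `prefix`."""
        self._routes[ipaddress.IPv4Network(prefix)] = peer

    def withdraw(self, prefix: str) -> None:
        """Remove the route to `prefix` if it is present."""
        self._routes.pop(ipaddress.IPv4Network(prefix), None)

    def lookup(self, address: str) -> Optional[str]:
        """Return the peer for the most specific matching prefix, or None."""
        ip = ipaddress.IPv4Address(address)
        matches = [net for net in self._routes if ip in net]
        if not matches:
            return None  # no route: the destination is unreachable
        best = max(matches, key=lambda net: net.prefixlen)
        return self._routes[best]


table = RoutingTable()
table.announce("192.0.2.0/24", "AS64500")
before = table.lookup("192.0.2.53")   # route exists
table.withdraw("192.0.2.0/24")
after = table.lookup("192.0.2.53")    # route is gone
```

After the withdrawal, the servers behind the prefix may be running normally, but no other network knows how to reach them. That is the situation the outage put Facebook's DNS servers in.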
What did the recovery look like?
During the repair, engineers found it hard to reach the data centers: remote access was blocked by the network failure, and the internal repair tools were unusable. As a last resort, personnel were sent to the data centers to debug on site and restart the systems. Physical access was designed to be difficult for security reasons, so this took more time. After the IT team fixed the problem, the network was up and running again.
What can we learn from the outage?
Look beyond the obvious to find the root cause
There is an IT joke that when a website goes down, the usual suspect is DNS. However, in this case, the DNS outage was only a symptom: the root cause was the broken BGP peering on the peering routers. Peering is the method two networks use to connect and exchange traffic directly, without the need for a third-party carrier. The configuration changes in Facebook's peering routers broke the healthy network routes, and once they had stayed broken for a while, the routes themselves became dysfunctional and practically non-existent.
Prepare for COVID-19 remote work challenges
In the recovery attempt, remote workers tried to gain access to the peering routers to implement fixes. However, these attempts were often thwarted by the logistical challenge of people being locked out of the data centers where the fixes had to be made. The pandemic-led workforce shortage was a great challenge for Facebook, Inc., which already had sparse resources deployed when the incident happened. With employees locked out by security measures, the shortage of available technical personnel prolonged the time taken to fix the errors and restore network operations.
Aim for decentralized management
The near-trillion-dollar enterprise suffered an outage across all of its apps because of a centralized framework that left it vulnerable. Experts say that decentralized control of app assets, with a clear demarcation between the company's apps and each app running on its own cadence, might have helped avoid a complete outage. Calls for decentralized management have gained favor in the past few years for more reasons than antitrust. This approach ensures that not all eggs are in a single basket: issues can be fixed at isolated levels instead of affecting whole systems at once, which can make troubleshooting unmanageable.
During the downtime, perhaps because of its different architectural roots, Instagram could still complete TCP/TLS connection requests successfully. The other services, WhatsApp, Facebook, and Messenger, were returning a "502 bad gateway" error. Still, because of the way Instagram was connected to the Facebook backbone, the website didn't load for end users. This led IT experts to the opinion that if a company as large as Facebook splits its divisions and manages them individually, it can avoid a complete outage that paralyzes the whole ecosystem.
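The different symptoms described above (a completed TCP/TLS handshake in one case, a 502 in another) point to different failing layers. A probe that records each layer separately can tell them apart. The sketch below covers only the classification step; the function name `classify_outage` and its labels are hypothetical, and the inputs are assumed to come from whatever probe performs the actual DNS lookup, connection, handshake, and HTTP request.

```python
from typing import Optional


def classify_outage(dns_ok: bool, tcp_ok: bool, tls_ok: bool,
                    http_status: Optional[int]) -> str:
    """Name the first layer that failed, checking from the bottom up."""
    if not dns_ok:
        return "dns: name did not resolve"
    if not tcp_ok:
        return "tcp: no connection to the edge"
    if not tls_ok:
        return "tls: handshake failed"
    if http_status is None:
        return "http: no response after handshake"
    if http_status >= 500:
        # The edge is reachable but cannot get an answer from behind it.
        return f"backend: edge answered with HTTP {http_status}"
    return "ok"


# An edge server that is reachable but cut off from its data centers.
symptom = classify_outage(dns_ok=True, tcp_ok=True, tls_ok=True,
                          http_status=502)
```

Checking from the lowest layer upward means the first failure reported is the one closest to the network, which helps separate a symptom (such as a 502) from a cause further down the stack.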
Invest in robust contingency planning
With great power comes great risk. Although Facebook conducted exercises such as its "storm drills," which were developed to prepare its infrastructure to withstand sudden spikes in user requests or power consumption, the company still found itself in a tough spot because of the unprecedented nature of the error. Investing in more risk-based contingency planning benefits organizations during critical situations.
Enable an independent communication system
When all communication channels break down at your data center, an isolated, third-party instant communication system is important. When your data centers are down, Site24x7's StatusIQ, an advisory webpage hosted outside of your premises, saves the day by promptly alerting and informing site visitors. This is like having a ham radio or a walkie-talkie available during a crisis.
A dependable communication channel enables your staff to communicate with each other without glitches and to inform customers about the downtime and recovery plan. StatusIQ operates independent of public clouds, which ensures that it stays up even when your chips are down. Additional support is provided by Site24x7 through its fluid integration with Cliq, the business and team chat app built with rich communication features.
Site24x7's DNS monitoring solution helps you look up the DNS status of your websites from more than 110 global locations. It helps you eliminate potential domain resolution errors on your critical servers, ensuring you stay on top of outages and performance concerns.
Site24x7's Network Configuration Management helps IT administrators efficiently back up network router configurations so they can be restored immediately as necessary.
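The general idea of backing up a configuration before a change and restoring it when the result is unhealthy can be sketched in a few lines. This is an illustration under stated assumptions, not Site24x7's API or Facebook's audit tool: the configuration is modeled as a hypothetical mapping of link names to an enabled flag, and `apply_with_rollback` and `at_least_one_link_up` are made-up names.

```python
import copy
from typing import Callable, Dict

# Hypothetical configuration model: link name -> whether the link is enabled.
Config = Dict[str, bool]


def at_least_one_link_up(config: Config) -> bool:
    """Health check: refuse any state that leaves every link down."""
    return any(config.values())


def apply_with_rollback(current: Config,
                        change: Callable[[Config], Config],
                        is_healthy: Callable[[Config], bool]) -> Config:
    """Apply `change` to a copy and keep it only if the result is healthy."""
    backup = copy.deepcopy(current)              # snapshot before the change
    candidate = change(copy.deepcopy(current))   # never mutate the live config
    if is_healthy(candidate):
        return candidate
    return backup                                # restore the snapshot


live = {"link-a": True, "link-b": True}
# A change that would disable every link is rejected and the backup is kept.
result = apply_with_rollback(live, lambda c: {k: False for k in c},
                             at_least_one_link_up)
```

The health check here is deliberately crude. The point of the sketch is that the check runs on the candidate configuration before it replaces the live one, and that a known-good copy is always available to fall back on.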
Website monitoring is a zero-sum game, as user experience and an organization's brand value are often instantly impacted when IT infrastructures are down or attacked. For comprehensive, advanced end-to-end website monitoring capability, visit Site24x7's website monitoring suite. Site24x7 ensures top availability of all your websites to visitors across the globe, and helps webmasters and IT administrators gain the proactive edge to restore services and thwart attacks to ensure the best user experience.