Web application performance issues usually don't come from just one part of the stack. If you see latency spikes but there’s no server bottleneck, database slowdown, or CDN issue, the problem is often in the routing layer. Border Gateway Protocol (BGP) route flapping is a particularly damaging and hard-to-spot cause of these intermittent slowdowns, and it happens outside the application itself.
This article covers what BGP route flapping is, how it affects web application performance, and what engineering and network teams can do to find and fix it.
BGP keeps the internet connected by letting autonomous systems (ASes) exchange routing information over persistent TCP sessions on port 179. A route flap occurs when a network path keeps switching on and off rapidly, leaving routers constantly scrambling to find a stable route.
Each flap sends BGP UPDATE messages to every neighbor using that route; each neighbor recomputes its best path and notifies its own neighbors in turn. This chain reaction can generate thousands of updates per minute across the internet, consuming processing power and memory on routers far removed from where the problem started.
The main difference between a route flap and a normal route withdrawal is how often it happens. If a link goes down and stays down, routers find a new path and traffic settles. But if a link keeps going up and down every few seconds or minutes, that’s a flap. This constant change stops the network from reaching a stable state.
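The distinction above can be sketched as a small classifier over a prefix's state history. This is a minimal illustration, not a production detector: the five-minute window, the four-transition threshold, and the function names are all assumptions chosen for the example.

```python
from typing import List, Tuple

# Hypothetical thresholds, chosen only for illustration.
FLAP_WINDOW_SECONDS = 300.0   # look back five minutes
FLAP_MIN_TRANSITIONS = 4      # e.g. down, up, down, up

def count_transitions(events: List[Tuple[float, str]]) -> int:
    """Count state changes in a time-ordered list of (timestamp, state)."""
    transitions = 0
    previous_state = None
    for _, state in events:
        if previous_state is not None and state != previous_state:
            transitions += 1
        previous_state = state
    return transitions

def classify(events: List[Tuple[float, str]], now: float,
             window: float = FLAP_WINDOW_SECONDS,
             min_transitions: int = FLAP_MIN_TRANSITIONS) -> str:
    """Return 'flap', 'withdrawal', or 'stable' for one prefix."""
    recent = [(t, s) for t, s in events if now - window <= t <= now]
    if count_transitions(recent) >= min_transitions:
        return "flap"            # the route keeps changing state
    if recent and recent[-1][1] == "down":
        return "withdrawal"      # went down and stayed down
    return "stable"

one_outage = [(0.0, "up"), (100.0, "down")]
bouncing = [(0.0, "up"), (10.0, "down"), (20.0, "up"), (30.0, "down"), (40.0, "up")]
print(classify(one_outage, now=200.0))  # withdrawal
print(classify(bouncing, now=60.0))     # flap
```

The point of the sketch is that frequency, not the withdrawal itself, is what marks a flap: the same "down" event is harmless once and disruptive when it repeats inside the window.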
Knowing the causes helps you tell a flap apart from other routing issues and makes it easier to read monitoring data correctly.
Unstable physical links are the most common cause. An intermittently failing transceiver, a degraded fiber connection, or a faulty SFP module can cause a physical interface to toggle its line protocol up and down. Since BGP sessions run over TCP, an interface reset that interrupts the path between peers can tear down the session (many implementations drop a directly connected external session as soon as the interface goes down), which resets the BGP state machine and withdraws all routes learned from that peer.
Maximum transmission unit (MTU) mismatches are another common but less obvious cause. BGP update messages can be large, especially when sharing full internet routing tables. If a device along the path has a lower MTU than expected and ICMP is blocked (so Path MTU Discovery (PMTUD) doesn’t work), large update packets get dropped without warning. The TCP connection then stalls, keepalives time out, and the session resets. This creates a flap that happens every time the session comes back and tries to send the full routing table again.
Another common cause is when the hold timer expires because keepalive messages are lost. BGP peers usually send keepalives every 60 seconds, with a hold timer set to 180 seconds. If no keepalive or update message is received within the hold timer window, then the session drops. The session doesn't count individual missed keepalives; it simply expires if the timer isn't refreshed within the configured interval. Transient congestion, router CPU overload, or packet loss can all prevent keepalives from arriving in time, causing the session to reset. If the underlying condition is temporary, the session re-establishes quickly—and that recovery cycle is what produces the flap.
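The hold timer behavior described above can be modeled in a few lines. This is a simplified sketch that assumes the 60/180-second values mentioned in the text and treats every KEEPALIVE or UPDATE as one timestamp; the function name and its parameters are hypothetical.

```python
from typing import Iterable, Optional

HOLD_TIME = 180.0  # seconds, the typical value cited above

def session_drop_time(message_arrivals: Iterable[float], end: float,
                      hold_time: float = HOLD_TIME,
                      start: float = 0.0) -> Optional[float]:
    """Return when the hold timer expires, or None if the session
    survives until `end`. Any KEEPALIVE or UPDATE refreshes the timer."""
    deadline = start + hold_time
    for arrival in sorted(message_arrivals):
        if arrival > deadline:
            return deadline              # timer expired before this message
        deadline = arrival + hold_time   # message arrived in time: refresh
    return deadline if deadline < end else None

# Keepalives at 120 s and 180 s are lost, but an UPDATE at 200 s
# refreshes the timer, so the session survives.
print(session_drop_time([60.0, 200.0], end=300.0))   # None

# Nothing arrives between 60 s and 300 s: the timer expires at 240 s.
print(session_drop_time([60.0, 300.0], end=400.0))   # 240.0
```

The model makes the key property visible: the session does not count missed keepalives, it only checks whether anything arrived before the deadline.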
Configuration mistakes and routing policy errors—like filters that block valid announcements, redistribution loops, or wrong next-hop settings—can make routes switch back and forth between being installed and withdrawn as routers keep trying and failing to process them.
In bigger networks, instability with route reflectors and internal BGP (IBGP) can make a single session flap spread. If a route reflector’s session goes down, all its clients lose their IBGP-learned routes at once, causing a wave of route withdrawals and re-advertisements inside the network.
The effects of BGP route flapping on application performance are real and measurable. Application and infrastructure teams can see and track these specific failures.
During a route flap, routers temporarily have inconsistent forwarding tables; some have removed the affected prefix, others haven't yet. Packets destined for that prefix get dropped or forwarded into a black hole, causing connection timeouts, failed API calls, and dropped WebSocket connections.
If flapping is continuous, the network never fully settles and packet loss becomes chronic.
When a preferred route is withdrawn, BGP falls back to the next-best path—which may be longer, pass through a distant exchange, or traverse a congested provider. Round-trip time jumps until the preferred route returns and the network reconverges.
For latency-sensitive applications—real-time APIs, financial transactions, video calls, multiplayer games—even a 50ms spike causes noticeable problems. If flapping is ongoing, latency doesn't just spike and recover; it becomes unpredictable, making performance measurement and SLA compliance nearly impossible.
BGP flapping hits TCP hard. Packet loss from a route flap triggers retransmissions, congestion window reduction, and slow start—slowing the connection well after the route stabilizes.
Long-lived connections (HTTP/2, gRPC) can stay degraded for tens of seconds post-flap. Short-lived connections are more brittle—a flap during a TLS handshake can kill it entirely, forcing the client to restart from scratch.
A single flapping prefix can generate hundreds of BGP UPDATE messages per minute, forcing every receiving router to reprocess paths, update tables, and propagate changes to peers—consuming significant CPU.
When control-plane load spikes, data-plane forwarding can suffer, particularly on platforms where forwarding shares CPU with the control plane, and packets get delayed or dropped across all traffic through that router. As a result, flapping can degrade applications that have nothing to do with the affected route, simply by exhausting shared router resources.
BGP doesn't contain instability. A single flapping route triggers withdrawals and re-advertisements that cascade across every peer using that route, and beyond. The 2019 incident in which a misconfigured ISP leaked more than 20,000 route prefixes, roughly two percent of the internet, was a route leak rather than a flap, but it shows how quickly a routing problem in one network can become a global one, hitting routers and applications far removed from the source.
That's what makes BGP flapping uniquely dangerous at scale: the blast radius extends well beyond your own network.
The core challenge is detection: application monitoring can identify symptoms—latency spikes, timeouts, error rates—but can't attribute them to routing instability without dedicated BGP monitoring. Effective detection requires watching for frequent prefix withdrawals and re-advertisements, unexpected AS path changes, session resets, and UPDATE message spikes.
Route dampening helps limit propagation by suppressing flapping prefixes until they stabilize. This protects downstream peers from UPDATE storms.
The best approach is to combine BGP monitoring with application performance data. If you see a latency spike at the same time as a route change, you can quickly and clearly diagnose the problem. This cuts down the time to resolve issues from hours to minutes and lets you escalate to your upstream provider right away. The earlier a flap is caught, the narrower the blast radius—which is why BGP route flapping detection and mitigation need to work together, not in sequence.
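As a rough illustration of that correlation step, the sketch below matches latency-spike timestamps against route-change timestamps within a tolerance window. The 30-second tolerance, the function name, and the assumption that both feeds are already reduced to epoch-second timestamps are illustrative choices, not a prescribed method.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List

def correlate(latency_spikes: List[float], route_changes: List[float],
              tolerance: float = 30.0) -> Dict[float, List[float]]:
    """Map each spike timestamp to the route-change timestamps that
    fall within +/- `tolerance` seconds of it."""
    changes = sorted(route_changes)
    matches: Dict[float, List[float]] = {}
    for spike in latency_spikes:
        lo = bisect_left(changes, spike - tolerance)
        hi = bisect_right(changes, spike + tolerance)
        matches[spike] = changes[lo:hi]
    return matches

result = correlate(latency_spikes=[100.0, 500.0],
                   route_changes=[90.0, 95.0, 300.0])
print(result)  # {100.0: [90.0, 95.0], 500.0: []}
```

A spike with nearby route changes is a candidate routing incident; a spike with an empty list points back toward the application or infrastructure.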
Start by stabilizing physical links. Since interface flapping is the most common cause of BGP route flapping, checking the health of the physical layer—like transceiver signal levels, CRC error counts, and interface reset logs—can fix most flapping problems before you need to change any BGP settings.
Adjust BGP timers carefully. Lowering keepalive and hold timers helps detect failures faster but also makes the system more sensitive to brief packet loss. Using Bidirectional Forwarding Detection (BFD) for quick failure detection while keeping BGP timers more conservative gives you fast convergence without dropping sessions during short congestion events.
Make sure the MTU is consistent across the whole path. Checking end-to-end MTU between BGP peers—especially in setups with MPLS, GRE tunnels, or different hardware—removes a common cause of silent session resets. If you can’t guarantee full PMTUD, setting TCP MSS clamping on router interfaces is a reliable workaround.
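The arithmetic behind an MSS clamp value is straightforward. The sketch below assumes fixed 20-byte IPv4, 40-byte IPv6, and 20-byte TCP headers with no options, and uses 24 bytes as the overhead of a basic GRE-over-IPv4 tunnel; the function name is hypothetical.

```python
IPV4_HEADER = 20  # bytes, without IP options
IPV6_HEADER = 40  # bytes, without extension headers
TCP_HEADER = 20   # bytes, without TCP options

def clamp_mss(link_mtu: int, encapsulation_overhead: int = 0,
              ipv6: bool = False) -> int:
    """Largest TCP payload that fits in one packet without fragmentation."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    mss = link_mtu - encapsulation_overhead - ip_header - TCP_HEADER
    if mss <= 0:
        raise ValueError("MTU too small for the given overhead")
    return mss

print(clamp_mss(1500))                             # 1460
print(clamp_mss(1500, encapsulation_overhead=24))  # 1436 (basic GRE over IPv4)
print(clamp_mss(1500, ipv6=True))                  # 1440
```

If the session negotiates TCP options such as MD5 authentication or timestamps, the usable payload per segment is smaller still, so a real clamp value may need extra headroom.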
Use route dampening carefully. Turning on BGP dampening with conservative settings (like those in RIPE-229) helps protect your network from outside flapping without suppressing legitimate route changes for too long.
Set up RPKI and prefix filtering. These steps mainly stop route leaks and hijacking, but they also reduce invalid route announcements, lower CPU load on border routers, and limit the impact of flapping.
Monitor the BGP state continuously. Making BGP monitoring a core part of your observability—alongside infrastructure metrics, application monitoring, and logs—is one of the most effective changes you can make. Real-time insight into prefix reachability, session state, and AS path changes lets teams spot BGP route flapping before it causes outages and link network events to application problems without extra manual work.
If your organization runs latency-sensitive or high-availability applications, you need to understand how BGP flapping affects performance. The network path isn’t just a passive channel—it’s a key part of reliability. Making BGP observability a top priority, along with infrastructure metrics and application monitoring, is the best way to close that gap.
Border Gateway Protocol (BGP) route flapping occurs when a network path repeatedly advertises and withdraws the same route in rapid succession. Each state change forces every router using that path to update its forwarding table and notify its neighbors—generating a cascade of BGP UPDATE messages that consumes router CPU and memory well beyond where the problem started. If the instability is continuous, the network never reaches a stable forwarding state, and the effects show up as chronic packet loss, unpredictable latency, and intermittent connection failures in applications that depend on that path.
The most common cause of BGP route flapping is an unstable physical link: a degraded transceiver, a faulty SFP module, or an intermittently failing fiber connection that causes a network interface to toggle up and down repeatedly. Since BGP sessions run over TCP, an interface reset on the peering link can tear down the session and withdraw all routes learned from that peer. Other causes include MTU mismatches that silently drop large BGP update packets, hold timer expiry due to congestion or CPU overload preventing keepalives from being delivered in time, misconfigured routing policies that cause routes to cycle between installed and withdrawn states, and route reflector instability in larger IBGP deployments where a single session failure triggers a wave of withdrawals across all clients simultaneously.
When BGP route flapping occurs, the impact is visible at the application layer before the root cause is identified. Inconsistent forwarding tables during reconvergence cause packet loss, connection timeouts, failed API calls, and dropped persistent connections. Latency increases when BGP falls back to a less preferred path. For latency-sensitive applications—such as real-time APIs, financial transactions, and video calls—even a 50ms spike can affect performance.
BGP route flapping and a BGP route leak are distinct failure modes that produce different symptoms and require different responses. A route flap is an instability problem—a path repeatedly going up and down, causing packet loss and latency spikes as routers constantly reconverge. A route leak is a propagation problem—a network advertising routes to a neighbor that should never have received them, causing traffic to flow through unintended paths. Route flapping is usually caused by physical link instability, timer misconfigurations, or MTU issues. Route leaks are usually caused by missing export filters or incorrect routing policy. Both can cause widespread disruption, but a flap degrades performance while a leak misdirects traffic—sometimes into the hands of an attacker.
Detection of BGP route flapping requires dedicated BGP monitoring; application monitoring alone can identify the symptoms but cannot attribute them to routing instability without visibility into the control plane. Effective detection means watching for frequent prefix withdrawals and re-advertisements for the same prefixes, unexpected AS path changes on stable routes, BGP session resets that correlate with application performance degradation, and spikes in BGP UPDATE message volume from specific peers. The most actionable approach is correlating BGP events with application performance data—if a latency spike or error rate increase aligns with a route change event, the diagnosis becomes clear quickly. Without that correlation, flapping incidents can take hours to attribute correctly.
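One of those signals, a spike in UPDATE volume from a specific peer, can be sketched as a per-minute counter compared against a baseline. The input format, the five-times-baseline factor, and the peer names are assumptions for illustration; a real deployment would take these from its own BGP monitoring feed.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

def updates_per_minute(updates: Iterable[Tuple[float, str]]) -> Counter:
    """Count UPDATE messages per (peer, minute bucket).
    Each input item is (timestamp_seconds, peer_name)."""
    counts: Counter = Counter()
    for timestamp, peer in updates:
        counts[(peer, int(timestamp // 60))] += 1
    return counts

def spiking_peers(updates: Iterable[Tuple[float, str]],
                  baseline_per_minute: Dict[str, float],
                  factor: float = 5.0) -> List[str]:
    """Return peers whose UPDATE count in any minute exceeds
    `factor` times their baseline rate."""
    flagged = set()
    for (peer, _minute), count in updates_per_minute(updates).items():
        if count > factor * baseline_per_minute.get(peer, 1.0):
            flagged.add(peer)
    return sorted(flagged)

# Hypothetical feed: peer-a sends 50 UPDATEs in one minute, peer-b sends one.
feed = [(float(i), "peer-a") for i in range(50)] + [(5.0, "peer-b")]
print(spiking_peers(feed, {"peer-a": 2.0, "peer-b": 2.0}))  # ['peer-a']
```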
BGP route dampening is a mechanism that suppresses flapping prefixes by assigning a penalty each time a route is withdrawn and re-advertised. When the accumulated penalty crosses a configured threshold, the route is suppressed—held down even when it becomes available—until the penalty decays below a reuse threshold. This protects downstream peers and the broader internet from UPDATE storms caused by persistent instability. Dampening is most appropriate as a protective mechanism against external instability rather than as a fix for instability within your own network.
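The penalty mechanism can be modeled as exponential decay with a half-life. The values below (1000 per flap, suppress above 2000, reuse below 750, 15-minute half-life) are illustrative and resemble common vendor defaults, but they are assumptions here, and the class is a simplified sketch rather than any router's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class DampenedRoute:
    # Illustrative parameters; check your platform's documentation.
    penalty_per_flap: float = 1000.0
    suppress_threshold: float = 2000.0
    reuse_threshold: float = 750.0
    half_life: float = 900.0      # seconds
    penalty: float = 0.0
    last_update: float = 0.0
    suppressed: bool = False

    def _decay(self, now: float) -> None:
        """Halve the penalty for every half-life that has elapsed."""
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life)
        self.last_update = now
        if self.suppressed and self.penalty < self.reuse_threshold:
            self.suppressed = False   # penalty decayed enough to reuse

    def record_flap(self, now: float) -> None:
        self._decay(now)
        self.penalty += self.penalty_per_flap
        if self.penalty > self.suppress_threshold:
            self.suppressed = True

    def is_suppressed(self, now: float) -> bool:
        self._decay(now)
        return self.suppressed

route = DampenedRoute()
for t in (0.0, 10.0, 20.0):          # three flaps in twenty seconds
    route.record_flap(t)
print(route.is_suppressed(30.0))     # True: penalty is near 2977
print(route.is_suppressed(1920.0))   # False: decayed below 750
```

With these numbers, three quick flaps hold the route down for roughly half an hour, which shows why aggressive settings can keep a recovered route unreachable long after the underlying problem is fixed.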
BGP uses two timers to manage session health: the keepalive timer (typically set to 60 seconds) and the hold timer (typically set to 180 seconds). If no keepalive or update message is received within the hold timer window, the session drops regardless of how many individual keepalives were missed—the timer simply expires if it isn't refreshed within the configured interval. Lowering these timers means failures are detected faster, but the session becomes more sensitive to transient congestion, increasing the risk of unnecessary session resets that, in turn, cause flapping.
Bidirectional Forwarding Detection (BFD) is a lightweight protocol that detects link failures in milliseconds—faster than BGP's native keepalive mechanism. While BGP's default hold timer takes up to 180 seconds to detect a failed session, BFD can detect a forwarding path failure in under a second and immediately signal BGP to tear down the affected session and reconverge. This matters for BGP stability because it separates the speed of failure detection from BGP timer sensitivity. Without BFD, achieving fast convergence means lowering BGP keepalive and hold timers, which makes sessions more prone to resetting during brief congestion.
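A back-of-the-envelope comparison shows the gap. The sketch assumes a simplified BFD model in which detection time is the transmit interval multiplied by the detect multiplier (the real protocol negotiates these values between peers), and the 300 ms interval and multiplier of 3 are hypothetical settings.

```python
BGP_HOLD_TIME_MS = 180_000  # the typical 180-second hold timer cited above

def bfd_detection_time_ms(tx_interval_ms: int, detect_multiplier: int) -> int:
    """Simplified BFD detection time: the session is declared down after
    `detect_multiplier` consecutive intervals with no packet received."""
    return tx_interval_ms * detect_multiplier

bfd_ms = bfd_detection_time_ms(tx_interval_ms=300, detect_multiplier=3)
print(bfd_ms)                      # 900
print(BGP_HOLD_TIME_MS // bfd_ms)  # 200 (times faster than the hold timer)
```

Because detection speed comes from BFD, the BGP timers can stay at conservative values and do not need to be lowered to achieve sub-second failure detection.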