How SD-WAN Failover Actually Works: Link Aggregation, FEC, and Path Selection
When a retail POS terminal freezes mid-transaction because a WAN link dropped, nobody cares about your architecture diagram. They care about how fast you recover. In this post, we break down exactly what happens inside Hopbox SD-WAN when a link degrades or fails, and why recovery takes seconds rather than minutes (sub-second with aggressive probe settings).
Link Health Monitoring
Failover starts with detection. You cannot switch to a backup path if you do not know the primary is degraded. We run continuous health probes on every WAN interface:
ICMP probes are the baseline — a ping every 500ms to a known upstream target. But ICMP alone is unreliable. Some ISPs deprioritize ICMP, and a link can pass ping checks while dropping TCP traffic.
HTTP probes give a fuller picture. We issue lightweight HTTP HEAD requests to multiple endpoints, measuring round-trip time and success rate. This catches scenarios where an ISP’s gateway responds to ping but is dropping routed traffic upstream.
Jitter and loss thresholds are where it gets interesting. A link does not need to be “down” to be unusable. We track:
- Packet loss over a rolling 10-second window
- Jitter (variation in latency between consecutive probes)
- Latency relative to the link’s established baseline
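These rolling metrics are cheap to keep per link. Here is a minimal sketch of the bookkeeping (class name and window size are illustrative; Hopbox's actual probe accounting is internal):

```python
from collections import deque

class LinkHealthWindow:
    """Rolling loss and jitter over the last N probe results.

    Illustrative sketch, not Hopbox's implementation. Each probe is
    recorded as (success, rtt_ms); failed probes carry no RTT.
    """

    def __init__(self, window_size=20):  # 20 probes at 500ms ≈ a 10s window
        self.probes = deque(maxlen=window_size)

    def record(self, success, rtt_ms=None):
        self.probes.append((success, rtt_ms))

    def loss_rate(self):
        if not self.probes:
            return 0.0
        failures = sum(1 for ok, _ in self.probes if not ok)
        return failures / len(self.probes)

    def jitter_ms(self):
        # Jitter as the mean absolute difference between consecutive RTTs
        rtts = [rtt for ok, rtt in self.probes if ok and rtt is not None]
        if len(rtts) < 2:
            return 0.0
        diffs = [abs(b - a) for a, b in zip(rtts, rtts[1:])]
        return sum(diffs) / len(diffs)
```

Because the deque is bounded, old probe results age out automatically, which is exactly the "rolling 10-second window" behavior described above.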
```
# Example: mwan3 health check configuration on OpenWrt
config interface 'wan1'
	option enabled '1'
	option initial_state 'online'
	option family 'ipv4'
	option track_ip '8.8.8.8 1.1.1.1'
	option track_method 'ping'
	option reliability '1'
	option count '3'
	option size '56'
	option max_ttl '60'
	option check_quality '1'
	option failure_latency '1000'
	option failure_loss '40'
	option recovery_latency '500'
	option recovery_loss '5'
	option timeout '4'
	option interval '5'
	option failure_interval '1'
	option recovery_interval '3'
	option down '5'
	option up '3'
```

The key parameters here: `failure_loss` at 40% and `failure_latency` at 1000ms trigger a link state change. But notice that `failure_interval` drops to 1 second: once we suspect degradation, we probe more aggressively.
Multi-Link Aggregation: Bonding vs Load Balancing
There is a common conflation between link bonding and load balancing. They solve different problems.
Link bonding (e.g., Linux bonding driver in mode 4 / 802.3ad) aggregates multiple physical links into a single logical interface. This requires cooperation from the other end — your ISP or a bonding concentrator. True bonding gives you combined throughput and transparent failover, but it demands infrastructure on both sides.
Policy-based load balancing distributes flows across multiple independent WAN links. Each link maintains its own IP, its own gateway, its own routing table. This is what most SD-WAN deployments actually use, because you rarely control both endpoints.
Hopbox uses policy-based routing with mwan3 on OpenWrt. Each WAN interface gets its own routing table, and traffic is distributed based on rules:
```
# mwan3 policy: primary WAN with failover to secondary
config policy 'balanced'
	list use_member 'wan1_m1_w3'
	list use_member 'wan2_m1_w1'

config member 'wan1_m1_w3'
	option interface 'wan1'
	option metric '1'
	option weight '3'

config member 'wan2_m1_w1'
	option interface 'wan2'
	option metric '1'
	option weight '1'
```

This gives WAN1 three times the traffic share of WAN2. When WAN1 fails health checks, all traffic shifts to WAN2: no session disruption for new connections, and existing connections fail over based on the policy's `last_resort` behavior.
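The 3:1 split is per flow, not per packet. mwan3 does this in netfilter; the hashing math behind weighted per-flow distribution can be sketched like this (link names and the hash choice are illustrative):

```python
import hashlib

# Mirrors the wan1 weight=3 / wan2 weight=1 policy above
LINKS = [("wan1", 3), ("wan2", 1)]

def pick_link(flow_tuple, links=LINKS):
    """Deterministically map a flow 5-tuple to a link, proportional to weight.

    Hashing the 5-tuple keeps every packet of a flow on the same link,
    so a TCP session is never reordered across paths.
    """
    total = sum(weight for _, weight in links)
    digest = hashlib.sha256(repr(flow_tuple).encode()).digest()
    slot = int.from_bytes(digest[:4], "big") % total
    for name, weight in links:
        if slot < weight:
            return name
        slot -= weight
    return links[-1][0]
```

Over many flows the counts converge to the configured 3:1 ratio, while any single flow stays pinned to one WAN.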
Forward Error Correction (FEC)
FEC is borrowed from telecommunications and applied to WAN traffic. The concept: send redundant data alongside your real packets so the receiver can reconstruct lost packets without retransmission.
Consider a link with 5% packet loss. Without FEC, TCP retransmits every lost packet — adding at least one RTT of delay per loss event. With FEC, we send, for example, 5 redundancy packets for every 20 data packets. The receiver can reconstruct any 5 lost packets from that group without waiting for retransmission.
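The simplest instance of this idea is a single XOR parity packet per group, which can recover any one lost packet; recovering up to 5 losses out of a 25-packet group as described needs a proper erasure code such as Reed-Solomon. A toy sketch of the XOR case shows the mechanics:

```python
def xor_parity(packets):
    """Return one parity packet: byte-wise XOR of equal-length data packets."""
    parity = bytearray(len(packets[0]))
    for pkt in packets:
        for i, b in enumerate(pkt):
            parity[i] ^= b
    return bytes(parity)

def recover(received, parity):
    """Reconstruct the single missing packet (the None entry) from the parity.

    XOR-ing the parity with every surviving packet cancels them out,
    leaving exactly the bytes of the lost packet.
    """
    missing = bytearray(parity)
    for pkt in received:
        if pkt is not None:
            for i, b in enumerate(pkt):
                missing[i] ^= b
    return bytes(missing)
```

The sender transmits the group plus its parity; the receiver only invokes `recover` when exactly one packet of the group is missing, with no round trip to the sender.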
The trade-off is bandwidth. FEC adds overhead proportional to the redundancy ratio. On a 10 Mbps link with 20% FEC overhead, you effectively have 8 Mbps of usable throughput — but with near-zero retransmission delay.
```
# Conceptual FEC overhead vs recovery rate
#
# FEC Ratio   Bandwidth Overhead   Max Recoverable Loss
# 5:20        20%                  ~20%
# 3:20        13%                  ~13%
# 1:10        9%                   ~9%
#
# Higher FEC = more overhead but smoother experience on lossy links
```

We dynamically adjust FEC ratios based on measured link quality. A clean fiber link gets minimal FEC. A 4G backup with 3% baseline loss gets aggressive FEC. This is recalculated every few seconds based on the rolling loss metrics from our health probes.
Path Selection Algorithms
When you have two or more WAN links, every packet (or flow) needs a decision: which path?
Static weighted routing is the simplest approach. Assign weights, distribute proportionally. It works but ignores real-time conditions.
Latency-weighted selection routes traffic to the lowest-latency link. This is critical for POS and VoIP traffic where every millisecond matters. We continuously measure per-link latency and update routing weights:
```
# Pseudocode for latency-weighted path selection
for each flow:
    if flow.type == "realtime":    # POS, VoIP
        select link with lowest_latency
    elif flow.type == "bulk":      # firmware updates, backups
        select link with highest_available_bandwidth
    else:
        select link by weighted_round_robin(link_weights)
```

Cost-aware routing adds another dimension. In India, many sites have a primary broadband link (cheap, best-effort) and a 4G/5G backup (metered, expensive). Cost-aware path selection keeps traffic on the broadband link unless quality degrades below the threshold, then spills over to cellular, but only for critical traffic classes.
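Combining the latency rule with the cost-aware spillover rule might look like this in practice (link names, thresholds, and traffic classes are illustrative assumptions, not Hopbox's actual values):

```python
from dataclasses import dataclass

@dataclass
class Link:
    name: str
    latency_ms: float
    loss_rate: float
    metered: bool  # e.g. a 4G/5G backup billed per GB

# Quality gate below which even the cheap broadband link is abandoned
# (assumed thresholds for illustration)
MAX_LOSS, MAX_LATENCY_MS = 0.05, 300.0

def usable(link):
    return link.loss_rate <= MAX_LOSS and link.latency_ms <= MAX_LATENCY_MS

def select_path(flow_class, links):
    """Prefer unmetered links; spill only critical traffic to metered ones."""
    unmetered_ok = [l for l in links if not l.metered and usable(l)]
    if unmetered_ok:
        return min(unmetered_ok, key=lambda l: l.latency_ms)
    if flow_class == "realtime":  # POS, VoIP: allow metered spillover
        candidates = [l for l in links if usable(l)] or links
        return min(candidates, key=lambda l: l.latency_ms)
    # Bulk traffic stays off cellular even when broadband is degraded
    cheap = [l for l in links if not l.metered] or links
    return min(cheap, key=lambda l: l.loss_rate)
```

Note the asymmetry: real-time flows may cross onto the metered link the moment broadband degrades, while bulk transfers ride out the degradation on broadband.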
Failover Timing
The sequence when WAN1 drops:
- T+0ms: Last successful probe response received.
- T+500ms: Next probe sent, no response.
- T+1000ms: Second probe sent, no response. `failure_interval` kicks in; probes now fire every 1 second.
- T+1000–3000ms: Consecutive failures accumulate. After the `down` threshold (e.g., 5 failures), the link state transitions to “offline.”
- T+3000ms: `mwan3` updates routing rules. New flows route via WAN2. Existing tracked connections are re-routed based on policy.
Total detection-to-failover: roughly 3–5 seconds with conservative settings. For aggressive configurations (probe every 200ms, down threshold of 3), we can achieve sub-second failover — but at the cost of false positives on jittery links.
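The detection budget is just arithmetic over the probe settings. A back-of-the-envelope model (ignoring probe timeouts; this is a simplification, not the exact mwan3 state machine):

```python
def detection_window_s(probe_interval_s, down_threshold):
    """Best/worst-case seconds from link failure to the 'offline' transition.

    Best case: the link dies just before a probe fires, so detection takes
    down_threshold probe intervals. Worst case: it dies just after a
    successful probe, adding one extra interval before the first miss.
    """
    best = probe_interval_s * down_threshold
    worst = probe_interval_s * (down_threshold + 1)
    return best, worst

# Conservative settings from above: 1s failure interval, down=5 -> roughly 5-6s
# Aggressive settings: 200ms probes, down=3 -> roughly 0.6-0.8s (sub-second)
```

This also makes the false-positive trade-off concrete: with 200ms probes, three consecutive drops on a jittery link are enough to trigger a failover that a 1-second cadence would have ridden out.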
```
# Verify current mwan3 link status
root@hopbox:~# mwan3 status
Interface status:
 interface wan1 is online and target 8.8.8.8 is online
 interface wan2 is online and target 1.1.1.1 is online

Current ipv4 policies:
balanced:
 wan1_m1_w3 (online)
 wan2_m1_w1 (online)
```

Graceful vs Hard Failover
Not all failovers are link-down events. Sometimes a link degrades slowly — latency creeping up, sporadic loss. Hard failover (link declared dead, all traffic moved) works for outages but is too aggressive for degradation.
Graceful failover gradually shifts traffic. As WAN1’s quality score drops, its routing weight decreases proportionally. Traffic organically moves to WAN2 without a discrete switchover event. This avoids the “flapping” problem where a link oscillates between up and down states, causing repeated failovers.
```
# Link quality score calculation (simplified)
quality_score = (1 - loss_rate) * (1 - jitter_normalized) * (1 / latency_normalized)

# Weight adjustment
effective_weight = configured_weight * quality_score
```

What This Means in Practice
For a retail store running POS over Hopbox SD-WAN:
- Primary link drops completely: Failover to secondary in 3–5 seconds. POS transaction in progress may need retry, but the next transaction succeeds immediately.
- Primary link degrades (5% loss, 200ms latency spike): FEC absorbs the loss, path selection biases toward the secondary link. POS traffic may shift within seconds — no human intervention needed.
- Both links degrade: FEC on both paths, quality-weighted distribution, alerts to NOC dashboard via Prometheus.
The goal is not zero downtime — that requires redundancy at every layer. The goal is that network issues at the WAN edge do not cascade into business-impacting outages. Sub-second awareness, sub-5-second recovery, and zero manual intervention for the store staff.