
Monitoring 900+ Sites with Prometheus and Grafana

When you have a handful of servers, a single Prometheus instance and a couple of Grafana panels will do. When you have 900+ SD-WAN sites spread across India — each with multiple WAN links, a CPE device, and a tunnel overlay — the monitoring architecture needs real thought.

This post walks through how we built our observability stack at Hopbox: the architecture decisions, the metrics that matter, the dashboards our NOC team actually uses, and the scaling challenges we hit along the way.

Each Hopbox site has:

  • 1 SD-WAN appliance (x86, running OpenWrt)
  • 2-4 WAN uplinks (fiber, broadband, 4G backup)
  • WireGuard tunnels to our hub infrastructure
  • Local DNS resolution via PowerDNS Recursor

We need to know, in near-real-time:

  • Is the site reachable?
  • Are all WAN links healthy?
  • What is the latency, jitter, and packet loss on each link?
  • Is bandwidth utilization approaching capacity?
  • Are tunnels up?
  • Is DNS resolution working?

At 900+ sites, that translates to tens of thousands of time series.

Architecture: Federation over Remote Write

We evaluated two approaches:

  1. Remote write — each site pushes metrics to a central Thanos/Cortex/Mimir cluster.
  2. Prometheus federation — regional Prometheus instances scrape local targets, and a global Prometheus federates from them.

We went with federation, for a few reasons:

  • Our sites are grouped into regional clusters anyway (hub-and-spoke topology).
  • Federation lets each regional Prometheus operate independently — if the central instance goes down, regional monitoring continues.
  • We avoid the write-path complexity of remote-write receivers at scale.

The topology looks like this:

[Site CPE] --metrics--> [Regional Prometheus] --federation--> [Global Prometheus] --> [Grafana]
                               |
                               v
                        [Alertmanager]

Each regional Prometheus instance scrapes all CPE devices in its region:

# prometheus.yml (regional instance)
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'hopbox-cpe'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/cpe-*.json'
        refresh_interval: 5m

  - job_name: 'snmp-wan-links'
    metrics_path: /snmp
    params:
      module: [if_mib]
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/snmp-*.json'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: snmp-exporter:9116

rule_files:
  - '/etc/prometheus/rules/*.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
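Each cpe-*.json file follows Prometheus's file-based service discovery format. A minimal sketch of one entry (the address and label values here are illustrative, not real Hopbox inventory):

```json
[
  {
    "targets": ["10.42.17.1:9100"],
    "labels": {
      "site": "MUM-0142",
      "region": "west"
    }
  }
]
```

Labels attached here ride along with every scraped series, which is where the site label used throughout our alerts and dashboards comes from.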

The global instance federates key aggregated metrics:

# prometheus.yml (global instance)
scrape_configs:
  - job_name: 'federation-regional'
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job="hopbox-cpe"}'
        - '{job="snmp-wan-links"}'
        - '{__name__=~"hopbox:.*"}'  # pre-aggregated recording rules
    static_configs:
      - targets:
          - 'prom-north.internal:9090'
          - 'prom-south.internal:9090'
          - 'prom-west.internal:9090'
          - 'prom-east.internal:9090'

We use three primary sources of metrics:

Every Hopbox device runs node_exporter (compiled for x86 OpenWrt). This gives us system-level telemetry:

  • CPU usage, memory, filesystem usage
  • Network interface byte/packet counters
  • System uptime

For WAN link metrics from upstream provider equipment and the CPE’s own interfaces, we use the Prometheus SNMP exporter with the if_mib module:

  • ifHCInOctets / ifHCOutOctets — bandwidth utilization
  • ifOperStatus — link up/down
  • ifSpeed — negotiated link speed

The third source, a custom probe agent, is the most important one: a lightweight Go binary running on each CPE performs active probes:

  • ICMP ping to regional hub (latency, packet loss)
  • UDP jitter measurement (mimicking VoIP traffic patterns)
  • HTTP probe to known endpoint (full reachability test)
  • DNS resolution timing

It exposes metrics like:

# HELP hopbox_wan_latency_ms WAN link latency in milliseconds
# TYPE hopbox_wan_latency_ms gauge
hopbox_wan_latency_ms{site="MUM-0142",link="wan0",isp="fiber"} 4.2
hopbox_wan_latency_ms{site="MUM-0142",link="wan1",isp="broadband"} 12.8
# HELP hopbox_wan_packet_loss_ratio WAN link packet loss ratio (0-1)
# TYPE hopbox_wan_packet_loss_ratio gauge
hopbox_wan_packet_loss_ratio{site="MUM-0142",link="wan0",isp="fiber"} 0.001
hopbox_wan_packet_loss_ratio{site="MUM-0142",link="wan1",isp="broadband"} 0.023
# HELP hopbox_wan_jitter_ms WAN link jitter in milliseconds
# TYPE hopbox_wan_jitter_ms gauge
hopbox_wan_jitter_ms{site="MUM-0142",link="wan0",isp="fiber"} 0.8
hopbox_wan_jitter_ms{site="MUM-0142",link="wan1",isp="broadband"} 3.4
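As an illustration only (the function name metricLine is ours, not the agent's), here is how one of those gauge lines can be rendered in the Prometheus text exposition format from plain Go:

```go
package main

import "fmt"

// metricLine renders one gauge sample in the Prometheus text exposition
// format, with the same label set the probe agent uses. %q quotes the
// label values; %g trims trailing zeros from the sample value.
func metricLine(name, site, link, isp string, value float64) string {
	return fmt.Sprintf("%s{site=%q,link=%q,isp=%q} %g", name, site, link, isp, value)
}

func main() {
	fmt.Println(metricLine("hopbox_wan_latency_ms", "MUM-0142", "wan0", "fiber", 4.2))
}
```

In the real agent these lines are served from an HTTP handler on /metrics (Go's stdlib net/http or the official client_golang library both work), which is what the regional Prometheus scrapes.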

Raw metrics at 30-second intervals across 900 sites create a lot of data. We use recording rules to pre-aggregate at the regional level:

rules/hopbox-aggregation.yml

groups:
  - name: hopbox_wan_aggregation
    interval: 1m
    rules:
      - record: hopbox:wan_latency:avg5m
        expr: avg_over_time(hopbox_wan_latency_ms[5m])
      - record: hopbox:wan_packet_loss:avg5m
        expr: avg_over_time(hopbox_wan_packet_loss_ratio[5m])
      - record: hopbox:sites_up:count
        expr: count(up{job="hopbox-cpe"} == 1)
      - record: hopbox:sites_down:count
        expr: count(up{job="hopbox-cpe"} == 0)
      - record: hopbox:wan_bandwidth_utilization:ratio
        expr: |
          rate(ifHCInOctets{job="snmp-wan-links"}[5m]) * 8
            / on(instance, ifIndex) ifSpeed{job="snmp-wan-links"}
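The unit handling in the last rule is easy to get wrong, so here is the same computation as plain Go (a sketch, not part of our stack): ifHCInOctets rates are octets per second, ifSpeed is bits per second, hence the multiply-by-8.

```go
package main

import "fmt"

// utilization mirrors the hopbox:wan_bandwidth_utilization:ratio rule:
// counter rate in octets/second, times 8 bits per octet, divided by the
// negotiated link speed in bits/second.
func utilization(octetsPerSecond, ifSpeedBits float64) float64 {
	return octetsPerSecond * 8 / ifSpeedBits
}

func main() {
	// 6.25 MB/s of inbound traffic on a 100 Mbit/s link is 50% utilization.
	fmt.Println(utilization(6.25e6, 100e6)) // 0.5
}
```

One caveat: ifSpeed is a 32-bit gauge and saturates at roughly 4.3 Gbit/s, so links faster than that need ifHighSpeed (reported in Mbit/s) instead.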

We keep alerting rules focused and actionable. Here are the critical ones:

rules/hopbox-alerts.yml

groups:
  - name: hopbox_critical
    rules:
      - alert: SiteDown
        expr: up{job="hopbox-cpe"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Site {{ $labels.site }} is unreachable"
          description: "CPE at {{ $labels.instance }} has been down for 5 minutes."
      - alert: AllWANLinksDown
        expr: |
          count by (site) (hopbox_wan_link_up == 0)
            == count by (site) (hopbox_wan_link_up)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "All WAN links down at {{ $labels.site }}"
      - alert: HighPacketLoss
        expr: hopbox:wan_packet_loss:avg5m > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High packet loss on {{ $labels.link }} at {{ $labels.site }}"
          description: "Packet loss is {{ $value | humanizePercentage }} over 5m average."
      - alert: HighLatency
        expr: hopbox:wan_latency:avg5m > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.link }} at {{ $labels.site }}"

Alertmanager routes critical alerts to PagerDuty and our internal Slack channel; warnings go to Slack only.
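That routing can be sketched in alertmanager.yml. The receiver names and integration settings below are placeholders, and the grouping values are illustrative rather than our production settings:

```yaml
route:
  group_by: ['site', 'alertname']
  group_wait: 30s
  group_interval: 5m
  receiver: slack-noc          # default: everything lands in Slack
  routes:
    - matchers:
        - severity = critical
      receiver: pagerduty-noc
      continue: true           # critical alerts page AND post to Slack

receivers:
  - name: pagerduty-noc
    pagerduty_configs:
      - routing_key: '<secret>'
  - name: slack-noc
    slack_configs:
      - api_url: '<secret>'
        channel: '#noc-alerts'
```

The continue: true flag is what lets a critical alert match the PagerDuty route and still fall through to the Slack default.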

We maintain three primary dashboards:

The NOC overview is the big-screen dashboard. It shows:

  • Total sites up/down (stat panels)
  • Sites with active alerts (table)
  • Map of India with site locations, color-coded by health (Geomap panel)
  • Top 10 worst-performing links (bar gauge)

The regional dashboards: one per region. Each shows:

  • All sites in the region as a table with sortable columns (latency, loss, bandwidth)
  • Trend graphs for aggregated regional metrics
  • ISP-level aggregation (average latency by provider)

Per-site dashboards are linked from the NOC overview and regional dashboards. For a single site they show:

  • All WAN links with real-time latency, jitter, packet loss graphs
  • Bandwidth utilization per link (stacked area)
  • Tunnel status
  • CPE system metrics (CPU, memory, uptime)
  • Event annotations (config pushes, firmware updates)

With ~50,000 active time series and a 30-second scrape interval, each regional Prometheus generates roughly:

  • 50K series x 2 samples/min x 60 min x 24 hr = ~144M samples/day per region

We retain 15 days locally on each regional instance (SSD storage, ~40GB per region) and 90 days on the global instance with downsampled data.
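The per-region sample budget above is straightforward to sanity-check (the series count is the one from this post; the rest is arithmetic):

```go
package main

import "fmt"

func main() {
	series := 50_000      // active series per region
	samplesPerMinute := 2 // one sample per series per 30s scrape
	perDay := series * samplesPerMinute * 60 * 24
	fmt.Println(perDay) // samples/day per region
}
```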

For long-term storage, we are evaluating Thanos sidecar to push blocks to S3-compatible object storage.

Along the way we hit five notable scaling challenges:

  1. Scrape timeouts at scale. When a regional instance scrapes 200+ CPEs, some over slow links, scrape timeouts cascade. We increased scrape_timeout to 20s and staggered scrape intervals using hashmod relabeling.

  2. Cardinality explosion. Labels like site, link, isp, and interface multiply quickly. We keep label sets intentionally small and avoid high-cardinality labels like IP addresses in metric labels.

  3. Target discovery. Static config files don’t scale. We built a sidecar that queries the Hopbox API for active sites and writes Prometheus file-based service discovery JSON files, refreshed every 5 minutes.

  4. Dashboard performance. Grafana querying 900+ sites on a global dashboard can be slow. Pre-aggregated recording rules are essential — never query raw metrics on the global dashboard.

  5. Alert fatigue. With 900 sites, a nationwide ISP outage can fire hundreds of alerts simultaneously. We use Alertmanager’s group_by and group_wait aggressively, and route ISP-wide incidents to a separate channel.

We are working on:

  • Thanos for long-term metric storage
  • SLO-based alerting (burn rate alerts instead of threshold alerts)
  • Automated anomaly detection for link quality degradation
  • Exposing select metrics to customers via the Hopbox Cloud dashboard

Monitoring at scale is never “done” — but having a solid Prometheus federation architecture, well-defined metrics, and dashboards that match your operational workflow gets you most of the way there. If you are running a similar distributed infrastructure, we’d love to hear how you approach observability.
