Monitoring 900+ Sites with Prometheus and Grafana
When you have a handful of servers, a single Prometheus instance and a couple of Grafana panels will do. When you have 900+ SD-WAN sites spread across India — each with multiple WAN links, a CPE device, and a tunnel overlay — the monitoring architecture needs real thought.
This post walks through how we built our observability stack at Hopbox: the architecture decisions, the metrics that matter, the dashboards our NOC team actually uses, and the scaling challenges we hit along the way.
The Problem
Each Hopbox site has:
- 1 SD-WAN appliance (x86, running OpenWrt)
- 2-4 WAN uplinks (fiber, broadband, 4G backup)
- WireGuard tunnels to our hub infrastructure
- Local DNS resolution via PowerDNS Recursor
We need to know, in near-real-time:
- Is the site reachable?
- Are all WAN links healthy?
- What is the latency, jitter, and packet loss on each link?
- Is bandwidth utilization approaching capacity?
- Are tunnels up?
- Is DNS resolution working?
At 900+ sites, that translates to tens of thousands of time series.
Architecture: Federation over Remote Write
We evaluated two approaches:
- Remote write — each site pushes metrics to a central Thanos/Cortex/Mimir cluster.
- Prometheus federation — regional Prometheus instances scrape local targets, and a global Prometheus federates from them.
We went with federation, for a few reasons:
- Our sites are grouped into regional clusters anyway (hub-and-spoke topology).
- Federation lets each regional Prometheus operate independently — if the central instance goes down, regional monitoring continues.
- We avoid the write-path complexity of remote-write receivers at scale.
The topology looks like this:
```
[Site CPE] --metrics--> [Regional Prometheus] --federation--> [Global Prometheus] --> [Grafana] | [Alertmanager]
```
Regional Prometheus Configuration
Each regional Prometheus instance scrapes all CPE devices in its region:
```yaml
# prometheus.yml (regional instance)
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'hopbox-cpe'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/cpe-*.json'
        refresh_interval: 5m

  - job_name: 'snmp-wan-links'
    metrics_path: /snmp
    params:
      module: [if_mib]
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/snmp-*.json'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: snmp-exporter:9116

rule_files:
  - '/etc/prometheus/rules/*.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```
Global Federation Configuration
The global instance federates key aggregated metrics:
```yaml
# prometheus.yml (global instance)
scrape_configs:
  - job_name: 'federation-regional'
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job="hopbox-cpe"}'
        - '{job="snmp-wan-links"}'
        - '{__name__=~"hopbox:.*"}'  # pre-aggregated recording rules
    static_configs:
      - targets:
          - 'prom-north.internal:9090'
          - 'prom-south.internal:9090'
          - 'prom-west.internal:9090'
          - 'prom-east.internal:9090'
```
Scrape Targets and Exporters
We use three primary sources of metrics:
1. Node Exporter on the CPE
Every Hopbox device runs node_exporter (compiled for x86 OpenWrt). This gives us system-level telemetry:
- CPU usage, memory, filesystem usage
- Network interface byte/packet counters
- System uptime
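For instance, the CPU figure above is typically derived from the standard node_exporter mode counters; this is a generic PromQL idiom, not anything Hopbox-specific:

```promql
# Per-CPE CPU utilization (%), averaged across cores
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
```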
2. SNMP Exporter for WAN Links
For WAN link metrics from upstream provider equipment and the CPE’s own interfaces, we use the Prometheus SNMP exporter with the if_mib module:
- ifHCInOctets / ifHCOutOctets — bandwidth utilization
- ifOperStatus — link up/down
- ifSpeed — negotiated link speed
3. Custom WAN Probe Exporter
This is the most important one. A lightweight Go binary running on each CPE performs active probes:
- ICMP ping to regional hub (latency, packet loss)
- UDP jitter measurement (mimicking VoIP traffic patterns)
- HTTP probe to known endpoint (full reachability test)
- DNS resolution timing
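The probe-and-expose idea can be sketched in Go with only the standard library. This is a toy illustration, not the actual Hopbox code: a TCP connect stands in for the ICMP probe (ICMP needs raw-socket privileges), and the metric value in `main` is made up.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// probeLatencyMS measures TCP connect time to addr in milliseconds.
// A stand-in for the real ICMP/UDP probes described above.
func probeLatencyMS(addr string, timeout time.Duration) (float64, error) {
	start := time.Now()
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return 0, err
	}
	conn.Close()
	return float64(time.Since(start).Microseconds()) / 1000.0, nil
}

// formatGauge renders one sample in the Prometheus exposition format.
// Label order does not matter in the exposition format, so ranging
// over the map (random order in Go) is fine.
func formatGauge(name string, labels map[string]string, value float64) string {
	out := name + "{"
	first := true
	for k, v := range labels {
		if !first {
			out += ","
		}
		out += fmt.Sprintf("%s=%q", k, v)
		first = false
	}
	return out + fmt.Sprintf("} %g\n", value)
}

func main() {
	// A real exporter would serve this on /metrics via net/http and
	// run the probes on each scrape; here we render one sample with a
	// made-up latency value.
	fmt.Print("# TYPE hopbox_wan_latency_ms gauge\n")
	fmt.Print(formatGauge("hopbox_wan_latency_ms",
		map[string]string{"site": "MUM-0142"}, 4.2))
}
```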
It exposes metrics like:
```
# HELP hopbox_wan_latency_ms WAN link latency in milliseconds
# TYPE hopbox_wan_latency_ms gauge
hopbox_wan_latency_ms{site="MUM-0142",link="wan0",isp="fiber"} 4.2
hopbox_wan_latency_ms{site="MUM-0142",link="wan1",isp="broadband"} 12.8

# HELP hopbox_wan_packet_loss_ratio WAN link packet loss ratio (0-1)
# TYPE hopbox_wan_packet_loss_ratio gauge
hopbox_wan_packet_loss_ratio{site="MUM-0142",link="wan0",isp="fiber"} 0.001
hopbox_wan_packet_loss_ratio{site="MUM-0142",link="wan1",isp="broadband"} 0.023

# HELP hopbox_wan_jitter_ms WAN link jitter in milliseconds
# TYPE hopbox_wan_jitter_ms gauge
hopbox_wan_jitter_ms{site="MUM-0142",link="wan0",isp="fiber"} 0.8
hopbox_wan_jitter_ms{site="MUM-0142",link="wan1",isp="broadband"} 3.4
```
Key Metrics and Recording Rules
Raw metrics at 30-second intervals across 900 sites create a lot of data. We use recording rules to pre-aggregate at the regional level:
```yaml
groups:
  - name: hopbox_wan_aggregation
    interval: 1m
    rules:
      - record: hopbox:wan_latency:avg5m
        expr: avg_over_time(hopbox_wan_latency_ms[5m])

      - record: hopbox:wan_packet_loss:avg5m
        expr: avg_over_time(hopbox_wan_packet_loss_ratio[5m])

      - record: hopbox:sites_up:count
        expr: count(up{job="hopbox-cpe"} == 1)

      - record: hopbox:sites_down:count
        expr: count(up{job="hopbox-cpe"} == 0)

      - record: hopbox:wan_bandwidth_utilization:ratio
        expr: |
          rate(ifHCInOctets{job="snmp-wan-links"}[5m]) * 8
          / on(instance, ifIndex)
          ifSpeed{job="snmp-wan-links"}
```
Alerting
We keep alerting rules focused and actionable. Here are the critical ones:
```yaml
groups:
  - name: hopbox_critical
    rules:
      - alert: SiteDown
        expr: up{job="hopbox-cpe"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Site {{ $labels.site }} is unreachable"
          description: "CPE at {{ $labels.instance }} has been down for 5 minutes."

      - alert: AllWANLinksDown
        expr: |
          count by (site) (hopbox_wan_link_up == 0)
            == count by (site) (hopbox_wan_link_up)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "All WAN links down at {{ $labels.site }}"

      - alert: HighPacketLoss
        expr: hopbox:wan_packet_loss:avg5m > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High packet loss on {{ $labels.link }} at {{ $labels.site }}"
          description: "Packet loss is {{ $value | humanizePercentage }} over 5m average."

      - alert: HighLatency
        expr: hopbox:wan_latency:avg5m > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.link }} at {{ $labels.site }}"
```
Alertmanager routes critical alerts to PagerDuty and our internal Slack channel; warnings go to Slack only.
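That routing policy might look roughly like this in alertmanager.yml (receiver names, channel, and keys here are illustrative, not our actual config):

```yaml
route:
  receiver: slack-noc            # default: warnings and everything else go to Slack
  group_by: ['alertname', 'site']
  group_wait: 30s
  group_interval: 5m
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-and-slack

receivers:
  - name: slack-noc
    slack_configs:
      - channel: '#noc-alerts'
  - name: pagerduty-and-slack
    pagerduty_configs:
      - service_key: '<redacted>'
    slack_configs:
      - channel: '#noc-alerts'
```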
Grafana Dashboards
We maintain three primary dashboards:
1. NOC Overview
The big-screen dashboard. It shows:
- Total sites up/down (stat panels)
- Sites with active alerts (table)
- Map of India with site locations, color-coded by health (Geomap panel)
- Top 10 worst-performing links (bar gauge)
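Assuming the recording rules defined earlier, the stat and bar-gauge panels map to queries like:

```promql
# Total sites up / down (stat panels)
hopbox:sites_up:count
hopbox:sites_down:count

# Top 10 worst-performing links by 5-minute average packet loss (bar gauge)
topk(10, hopbox:wan_packet_loss:avg5m)
```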
2. Regional Drill-Down
One dashboard per region. Shows:
- All sites in the region as a table with sortable columns (latency, loss, bandwidth)
- Trend graphs for aggregated regional metrics
- ISP-level aggregation (average latency by provider)
3. Per-Site Detail
Linked from the NOC overview and regional dashboards. For a single site:
- All WAN links with real-time latency, jitter, packet loss graphs
- Bandwidth utilization per link (stacked area)
- Tunnel status
- CPE system metrics (CPU, memory, uptime)
- Event annotations (config pushes, firmware updates)
Retention and Storage
With ~50,000 active time series and a 30-second scrape interval, each regional Prometheus generates roughly:
- 50K series x 2 samples/min x 60 min x 24 hr = ~144M samples/day per region
We retain 15 days locally on each regional instance (SSD storage, ~40GB per region) and 90 days on the global instance with downsampled data.
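A quick sanity check of the samples-per-day figure; the constants simply mirror the assumptions stated above:

```go
package main

import "fmt"

func main() {
	const (
		series        = 50_000   // active time series per region
		samplesPerMin = 2        // one sample per 30s scrape
		minutesPerDay = 60 * 24
	)
	perDay := series * samplesPerMin * minutesPerDay
	fmt.Printf("%d samples/day per region\n", perDay) // 144000000
}
```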
For long-term storage, we are evaluating Thanos sidecar to push blocks to S3-compatible object storage.
Scaling Challenges and Lessons Learned
- Scrape timeouts at scale. When a regional instance scrapes 200+ CPEs, some over slow links, scrape timeouts cascade. We increased scrape_timeout to 20s and staggered scrape intervals using hash_mod relabeling.
- Cardinality explosion. Labels like site, link, isp, and interface multiply quickly. We keep label sets intentionally small and avoid high-cardinality labels like IP addresses in metric labels.
- Target discovery. Static config files don’t scale. We built a sidecar that queries the Hopbox API for active sites and writes Prometheus file-based service discovery JSON files, refreshed every 5 minutes.
- Dashboard performance. Grafana querying 900+ sites on a global dashboard can be slow. Pre-aggregated recording rules are essential — never query raw metrics on the global dashboard.
- Alert fatigue. With 900 sites, a nationwide ISP outage can fire hundreds of alerts simultaneously. We use Alertmanager’s group_by and group_wait aggressively, and route ISP-wide incidents to a separate channel.
What’s Next
We are working on:
- Thanos for long-term metric storage
- SLO-based alerting (burn rate alerts instead of threshold alerts)
- Automated anomaly detection for link quality degradation
- Exposing select metrics to customers via the Hopbox Cloud dashboard
Monitoring at scale is never “done” — but having a solid Prometheus federation architecture, well-defined metrics, and dashboards that match your operational workflow gets you most of the way there. If you are running a similar distributed infrastructure, we’d love to hear how you approach observability.