
Monitoring 900+ Sites with Prometheus and Grafana

When you have a handful of servers, a single Prometheus instance and a couple of Grafana panels will do. When you have 900+ SD-WAN sites spread across India — each with multiple WAN links, a CPE device, and a tunnel overlay — the monitoring architecture needs real thought.

This post walks through how we built our observability stack at Hopbox: the architecture decisions, the metrics that matter, the dashboards our NOC team actually uses, and the scaling challenges we hit along the way.

Each Hopbox site has:

  • 1 SD-WAN appliance (x86, running OpenWrt)
  • 2-4 WAN uplinks (fiber, broadband, 4G backup)
  • WireGuard tunnels to our hub infrastructure
  • Local DNS resolution via PowerDNS Recursor

We need to know, in near-real-time:

  • Is the site reachable?
  • Are all WAN links healthy?
  • What is the latency, jitter, and packet loss on each link?
  • Is bandwidth utilization approaching capacity?
  • Are tunnels up?
  • Is DNS resolution working?

At 900+ sites, that translates to tens of thousands of time series.

Architecture: Federation over Remote Write

We evaluated two approaches:

  1. Remote write — each site pushes metrics to a central Thanos/Cortex/Mimir cluster.
  2. Prometheus federation — regional Prometheus instances scrape local targets, and a global Prometheus federates from them.

We went with federation, for a few reasons:

  • Our sites are grouped into regional clusters anyway (hub-and-spoke topology).
  • Federation lets each regional Prometheus operate independently — if the central instance goes down, regional monitoring continues.
  • We avoid the write-path complexity of remote-write receivers at scale.

The topology looks like this:

[Site CPE] --metrics--> [Regional Prometheus] --federation--> [Global Prometheus] --> [Grafana]
                               |
                               v
                        [Alertmanager]

Each regional Prometheus instance scrapes all CPE devices in its region:

# prometheus.yml (regional instance)
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'hopbox-cpe'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/cpe-*.json'
        refresh_interval: 5m

  - job_name: 'snmp-wan-links'
    metrics_path: /snmp
    params:
      module: [if_mib]
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/snmp-*.json'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: snmp-exporter:9116

rule_files:
  - '/etc/prometheus/rules/*.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
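Each cpe-*.json file follows Prometheus's file-based service discovery format. A minimal sketch of one entry (the address and label values here are illustrative, not real Hopbox inventory):

```json
[
  {
    "targets": ["10.42.17.1:9100"],
    "labels": {
      "site": "MUM-0142",
      "region": "west"
    }
  }
]
```

Labels attached here ride along with every scraped series, which is where the site label used throughout our alerts and dashboards comes from.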

The global instance federates key aggregated metrics:

# prometheus.yml (global instance)
scrape_configs:
  - job_name: 'federation-regional'
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job="hopbox-cpe"}'
        - '{job="snmp-wan-links"}'
        - '{__name__=~"hopbox:.*"}'  # pre-aggregated recording rules
    static_configs:
      - targets:
          - 'prom-north.internal:9090'
          - 'prom-south.internal:9090'
          - 'prom-west.internal:9090'
          - 'prom-east.internal:9090'

We use three primary sources of metrics:

Every Hopbox device runs node_exporter (compiled for x86 OpenWrt). This gives us system-level telemetry:

  • CPU usage, memory, filesystem usage
  • Network interface byte/packet counters
  • System uptime

For WAN link metrics from upstream provider equipment and the CPE’s own interfaces, we use the Prometheus SNMP exporter with the if_mib module:

  • ifHCInOctets / ifHCOutOctets — bandwidth utilization
  • ifOperStatus — link up/down
  • ifSpeed — negotiated link speed

The third source, a custom probe agent, is the most important one: a lightweight Go binary running on each CPE performs active probes:

  • ICMP ping to regional hub (latency, packet loss)
  • UDP jitter measurement (mimicking VoIP traffic patterns)
  • HTTP probe to known endpoint (full reachability test)
  • DNS resolution timing

It exposes metrics like:

# HELP hopbox_wan_latency_ms WAN link latency in milliseconds
# TYPE hopbox_wan_latency_ms gauge
hopbox_wan_latency_ms{site="MUM-0142",link="wan0",isp="fiber"} 4.2
hopbox_wan_latency_ms{site="MUM-0142",link="wan1",isp="broadband"} 12.8
# HELP hopbox_wan_packet_loss_ratio WAN link packet loss ratio (0-1)
# TYPE hopbox_wan_packet_loss_ratio gauge
hopbox_wan_packet_loss_ratio{site="MUM-0142",link="wan0",isp="fiber"} 0.001
hopbox_wan_packet_loss_ratio{site="MUM-0142",link="wan1",isp="broadband"} 0.023
# HELP hopbox_wan_jitter_ms WAN link jitter in milliseconds
# TYPE hopbox_wan_jitter_ms gauge
hopbox_wan_jitter_ms{site="MUM-0142",link="wan0",isp="fiber"} 0.8
hopbox_wan_jitter_ms{site="MUM-0142",link="wan1",isp="broadband"} 3.4
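As an illustration only (the function name metricLine is ours, not the agent's), here is how one of those gauge lines can be rendered in the Prometheus text exposition format from plain Go:

```go
package main

import "fmt"

// metricLine renders one gauge sample in the Prometheus text exposition
// format, with the same label set the probe agent uses. %q quotes the
// label values; %g trims trailing zeros from the sample value.
func metricLine(name, site, link, isp string, value float64) string {
	return fmt.Sprintf("%s{site=%q,link=%q,isp=%q} %g", name, site, link, isp, value)
}

func main() {
	fmt.Println(metricLine("hopbox_wan_latency_ms", "MUM-0142", "wan0", "fiber", 4.2))
}
```

In the real agent these lines are served from an HTTP handler on /metrics (Go's stdlib net/http or the official client_golang library both work), which is what the regional Prometheus scrapes.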

Raw metrics at 30-second intervals across 900 sites create a lot of data. We use recording rules to pre-aggregate at the regional level:

rules/hopbox-aggregation.yml

groups:
  - name: hopbox_wan_aggregation
    interval: 1m
    rules:
      - record: hopbox:wan_latency:avg5m
        expr: avg_over_time(hopbox_wan_latency_ms[5m])
      - record: hopbox:wan_packet_loss:avg5m
        expr: avg_over_time(hopbox_wan_packet_loss_ratio[5m])
      - record: hopbox:sites_up:count
        expr: count(up{job="hopbox-cpe"} == 1)
      - record: hopbox:sites_down:count
        expr: count(up{job="hopbox-cpe"} == 0)
      - record: hopbox:wan_bandwidth_utilization:ratio
        expr: |
          rate(ifHCInOctets{job="snmp-wan-links"}[5m]) * 8
            / on(instance, ifIndex) ifSpeed{job="snmp-wan-links"}
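The unit handling in the last rule is easy to get wrong, so here is the same computation as plain Go (a sketch, not part of our stack): ifHCInOctets rates are octets per second, ifSpeed is bits per second, hence the multiply-by-8.

```go
package main

import "fmt"

// utilization mirrors the hopbox:wan_bandwidth_utilization:ratio rule:
// counter rate in octets/second, times 8 bits per octet, divided by the
// negotiated link speed in bits/second.
func utilization(octetsPerSecond, ifSpeedBits float64) float64 {
	return octetsPerSecond * 8 / ifSpeedBits
}

func main() {
	// 6.25 MB/s of inbound traffic on a 100 Mbit/s link is 50% utilization.
	fmt.Println(utilization(6.25e6, 100e6)) // 0.5
}
```

One caveat: ifSpeed is a 32-bit gauge and saturates at roughly 4.3 Gbit/s, so links faster than that need ifHighSpeed (reported in Mbit/s) instead.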

We keep alerting rules focused and actionable. Here are the critical ones:

rules/hopbox-alerts.yml

groups:
  - name: hopbox_critical
    rules:
      - alert: SiteDown
        expr: up{job="hopbox-cpe"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Site {{ $labels.site }} is unreachable"
          description: "CPE at {{ $labels.instance }} has been down for 5 minutes."
      - alert: AllWANLinksDown
        expr: |
          count by (site) (hopbox_wan_link_up == 0)
            == count by (site) (hopbox_wan_link_up)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "All WAN links down at {{ $labels.site }}"
      - alert: HighPacketLoss
        expr: hopbox:wan_packet_loss:avg5m > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High packet loss on {{ $labels.link }} at {{ $labels.site }}"
          description: "Packet loss is {{ $value | humanizePercentage }} over 5m average."
      - alert: HighLatency
        expr: hopbox:wan_latency:avg5m > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.link }} at {{ $labels.site }}"

Alertmanager routes critical alerts to PagerDuty and our internal Slack channel; warnings go to Slack only.
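That routing can be sketched in alertmanager.yml. The receiver names and integration settings below are placeholders, and the grouping values are illustrative rather than our production settings:

```yaml
route:
  group_by: ['site', 'alertname']
  group_wait: 30s
  group_interval: 5m
  receiver: slack-noc          # default: everything lands in Slack
  routes:
    - matchers:
        - severity = critical
      receiver: pagerduty-noc
      continue: true           # critical alerts page AND post to Slack

receivers:
  - name: pagerduty-noc
    pagerduty_configs:
      - routing_key: '<secret>'
  - name: slack-noc
    slack_configs:
      - api_url: '<secret>'
        channel: '#noc-alerts'
```

The continue: true flag is what lets a critical alert match the PagerDuty route and still fall through to the Slack default.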

We maintain three primary dashboards:

The NOC overview is the big-screen dashboard. It shows:

  • Total sites up/down (stat panels)
  • Sites with active alerts (table)
  • Map of India with site locations, color-coded by health (Geomap panel)
  • Top 10 worst-performing links (bar gauge)

The regional dashboards: one per region. Each shows:

  • All sites in the region as a table with sortable columns (latency, loss, bandwidth)
  • Trend graphs for aggregated regional metrics
  • ISP-level aggregation (average latency by provider)

Per-site dashboards are linked from the NOC overview and regional dashboards. For a single site they show:

  • All WAN links with real-time latency, jitter, packet loss graphs
  • Bandwidth utilization per link (stacked area)
  • Tunnel status
  • CPE system metrics (CPU, memory, uptime)
  • Event annotations (config pushes, firmware updates)

With ~50,000 active time series and a 30-second scrape interval, each regional Prometheus generates roughly:

  • 50K series x 2 samples/min x 60 min x 24 hr = ~144M samples/day per region

We retain 15 days locally on each regional instance (SSD storage, ~40GB per region) and 90 days on the global instance with downsampled data.
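The per-region sample budget above is straightforward to sanity-check (the series count is the one from this post; the rest is arithmetic):

```go
package main

import "fmt"

func main() {
	series := 50_000      // active series per region
	samplesPerMinute := 2 // one sample per series per 30s scrape
	perDay := series * samplesPerMinute * 60 * 24
	fmt.Println(perDay) // samples/day per region
}
```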

For long-term storage, we are evaluating Thanos sidecar to push blocks to S3-compatible object storage.

Along the way we hit five notable scaling challenges:

  1. Scrape timeouts at scale. When a regional instance scrapes 200+ CPEs, some over slow links, scrape timeouts cascade. We increased scrape_timeout to 20s and staggered scrape intervals using hashmod relabeling.

  2. Cardinality explosion. Labels like site, link, isp, and interface multiply quickly. We keep label sets intentionally small and avoid high-cardinality labels like IP addresses in metric labels.

  3. Target discovery. Static config files don’t scale. We built a sidecar that queries the Hopbox API for active sites and writes Prometheus file-based service discovery JSON files, refreshed every 5 minutes.

  4. Dashboard performance. Grafana querying 900+ sites on a global dashboard can be slow. Pre-aggregated recording rules are essential — never query raw metrics on the global dashboard.

  5. Alert fatigue. With 900 sites, a nationwide ISP outage can fire hundreds of alerts simultaneously. We use Alertmanager’s group_by and group_wait aggressively, and route ISP-wide incidents to a separate channel.

We are working on:

  • Thanos for long-term metric storage
  • SLO-based alerting (burn rate alerts instead of threshold alerts)
  • Automated anomaly detection for link quality degradation
  • Exposing select metrics to customers via the Hopbox Cloud dashboard

Monitoring at scale is never “done” — but having a solid Prometheus federation architecture, well-defined metrics, and dashboards that match your operational workflow gets you most of the way there. If you are running a similar distributed infrastructure, we’d love to hear how you approach observability.
