Automating 900+ Network Devices with Ansible
At Hopbox, we manage over 900 SD-WAN appliances deployed across India. Each device runs OpenWrt, has 2-4 WAN links, maintains WireGuard tunnels, and runs local DNS resolution. Keeping all of them consistent, updated, and correctly configured is a non-trivial operational challenge.
Ansible is the backbone of our device automation. This post covers why we chose it, how we structure our inventory and playbooks, and the hard-won lessons from running Ansible against hundreds of embedded Linux devices.
Why Ansible for Network Devices
We evaluated several options — SaltStack, custom scripts over SSH, a purpose-built agent — and landed on Ansible for these reasons:
- Agentless. Our CPE devices run OpenWrt on constrained hardware (2-4GB RAM, mSATA SSD). We don’t want a persistent agent consuming resources. Ansible needs only SSH and Python (or raw mode for devices without Python).
- SSH-based. Every Hopbox device is reachable over its WireGuard tunnel via SSH. No additional ports, no additional daemons.
- Idempotent (mostly). Ansible’s declarative model means we can re-run playbooks safely. In practice, idempotency on OpenWrt requires some care — more on that below.
- Ecosystem. The `ansible.netcommon` and `community.general` collections provide modules for network config templating, file management, and service control that work well on Linux-based network devices.
Dynamic Inventory from the Hopbox API
Static inventory files don’t work when devices are provisioned and decommissioned regularly. We wrote a custom dynamic inventory script that queries the Hopbox Cloud API:
```python
#!/usr/bin/env python3
"""Ansible dynamic inventory from Hopbox API."""

import json
import os

import requests

HOPBOX_API = os.environ.get("HOPBOX_API_URL", "https://api.hopbox.net")
HOPBOX_TOKEN = os.environ["HOPBOX_API_TOKEN"]


def get_inventory():
    headers = {"Authorization": f"Bearer {HOPBOX_TOKEN}"}
    resp = requests.get(f"{HOPBOX_API}/v1/devices", headers=headers, timeout=30)
    resp.raise_for_status()
    devices = resp.json()["devices"]

    inventory = {"_meta": {"hostvars": {}}, "all": {"children": []}}

    # Group by region
    regions = {}
    for device in devices:
        region = device["region"]
        if region not in regions:
            regions[region] = []
        regions[region].append(device)

        inventory["_meta"]["hostvars"][device["hostname"]] = {
            "ansible_host": device["tunnel_ip"],
            "ansible_user": "root",
            "ansible_ssh_private_key_file": "/etc/ansible/keys/hopbox-automation",
            "site_id": device["site_id"],
            "wan_links": device["wan_links"],
            "firmware_version": device["firmware_version"],
            "hardware_model": device["hardware_model"],
        }

    for region, region_devices in regions.items():
        group_name = f"region_{region}"
        inventory[group_name] = {
            "hosts": [d["hostname"] for d in region_devices]
        }
        inventory["all"]["children"].append(group_name)

    return inventory


if __name__ == "__main__":
    print(json.dumps(get_inventory(), indent=2))
```

This gives us groups like `region_north`, `region_south`, etc., with per-host variables including tunnel IP, site ID, current firmware version, and hardware model.
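Playbooks can then target the generated groups directly. As a minimal sketch, a reachability check against one region might look like this:

```yaml
- name: Verify tunnel reachability for the northern region
  hosts: region_north
  gather_facts: false
  tasks:
    - name: Ping each device over its WireGuard tunnel
      ansible.builtin.ping:
```

Running `ansible-inventory -i inventory/hopbox_inventory.py --list` is also a quick way to sanity-check the generated groups and hostvars before pointing a playbook at them.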
Playbook Structure
Our Ansible repository is structured as:
```
ansible/
  inventory/
    hopbox_inventory.py
  playbooks/
    firmware-upgrade.yml
    config-push.yml
    dns-zone-sync.yml
    wireguard-rekey.yml
    diagnostics.yml
  roles/
    hopbox-base/
    hopbox-wan/
    hopbox-dns/
    hopbox-tunnel/
  templates/
    network.j2
    dhcp.j2
    firewall.j2
    wireguard.j2
  group_vars/
    all.yml
    region_north.yml
    region_south.yml
  ansible.cfg
```

Firmware Upgrades: Rolling Updates
Firmware upgrades are the highest-risk operation. A bad firmware push to 900 devices simultaneously would be catastrophic. We use a staged rolling update strategy:
```yaml
---
- name: Hopbox firmware upgrade (rolling)
  hosts: "{{ target_group | default('canary') }}"
  serial: "{{ batch_size | default(10) }}"
  max_fail_percentage: 5
  gather_facts: false

  vars:
    firmware_url: "https://releases.hopbox.net/firmware/{{ firmware_version }}/hopbox-{{ hardware_model }}.img.gz"
    firmware_checksum: "sha256:{{ firmware_sha256 }}"

  tasks:
    - name: Check current firmware version
      ansible.builtin.command: cat /etc/hopbox-version
      register: current_version
      changed_when: false

    - name: Skip if already on target version
      ansible.builtin.meta: end_host
      when: current_version.stdout == firmware_version

    - name: Download firmware image
      ansible.builtin.get_url:
        url: "{{ firmware_url }}"
        dest: /tmp/firmware.img.gz
        checksum: "{{ firmware_checksum }}"
        timeout: 300

    - name: Verify available disk space
      ansible.builtin.shell: |
        available=$(df /tmp | tail -1 | awk '{print $4}')
        if [ "$available" -lt 102400 ]; then
          echo "INSUFFICIENT_SPACE"
          exit 1
        fi
      changed_when: false

    - name: Apply firmware via sysupgrade
      ansible.builtin.command: sysupgrade -v /tmp/firmware.img.gz
      async: 300
      poll: 0
      register: sysupgrade_job

    - name: Wait for device to come back online
      ansible.builtin.wait_for_connection:
        delay: 60
        timeout: 300

    - name: Verify new firmware version
      ansible.builtin.command: cat /etc/hopbox-version
      register: new_version
      changed_when: false
      failed_when: new_version.stdout != firmware_version

    - name: Run post-upgrade health check
      ansible.builtin.uri:
        url: "http://{{ ansible_host }}:8080/health"
        return_content: true
      register: health
      failed_when: health.json.status != "ok"
      retries: 3
      delay: 10
```

The rollout process:
- Canary group (5 devices across different regions) — deploy, monitor for 24 hours.
- Early adopters (50 devices) — deploy in batches of 10, with `max_fail_percentage: 5`.
- Full fleet — deploy in batches of 20-50, monitored via Grafana dashboards.
If any batch exceeds the failure threshold, Ansible halts and we investigate.
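In practice, each stage is the same playbook run with a different set of extra-vars. A hypothetical vars file for the early-adopter stage might look like this (file name and values are illustrative):

```yaml
# stage-early-adopters.yml — hypothetical extra-vars for stage 2
target_group: early_adopters    # inventory group for this stage (assumed name)
batch_size: 10
firmware_version: "2.4.1"       # example target version string
firmware_sha256: "placeholder"  # checksum of the release image
```

It would be passed with `ansible-playbook playbooks/firmware-upgrade.yml -e @stage-early-adopters.yml`, keeping the stage definitions reviewable in version control.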
Configuration Templating
OpenWrt uses UCI (Unified Configuration Interface) for configuration. We template the key config files using Jinja2:
```jinja2
{# templates/network.j2 #}
config interface 'loopback'
    option device 'lo'
    option proto 'static'
    option ipaddr '127.0.0.1'
    option netmask '255.0.0.0'

config interface 'lan'
    option device 'br-lan'
    option proto 'static'
    option ipaddr '{{ lan_ip | default("192.168.1.1") }}'
    option netmask '{{ lan_netmask | default("255.255.255.0") }}'

{% for link in wan_links %}
config interface 'wan{{ loop.index0 }}'
    option device '{{ link.device }}'
    option proto '{{ link.proto }}'
{% if link.proto == 'static' %}
    option ipaddr '{{ link.ipaddr }}'
    option netmask '{{ link.netmask }}'
    option gateway '{{ link.gateway }}'
{% endif %}
    option metric '{{ link.metric | default(loop.index0 * 10) }}'
{% endfor %}
```

The corresponding playbook pushes the template and restarts networking:
```yaml
---
- name: Push network configuration
  hosts: "{{ target_group }}"
  serial: 20
  gather_facts: false

  tasks:
    - name: Template network configuration
      ansible.builtin.template:
        src: templates/network.j2
        dest: /etc/config/network
        mode: '0644'
      register: network_config

    - name: Template firewall configuration
      ansible.builtin.template:
        src: templates/firewall.j2
        dest: /etc/config/firewall
        mode: '0644'
      register: firewall_config

    - name: Restart networking if config changed
      ansible.builtin.command: /etc/init.d/network restart
      when: network_config.changed

    - name: Reload firewall if config changed
      ansible.builtin.command: /etc/init.d/firewall reload
      when: firewall_config.changed

    - name: Wait for connectivity after restart
      ansible.builtin.wait_for_connection:
        delay: 10
        timeout: 120
      when: network_config.changed
```

DNS Zone Sync
Every Hopbox device runs a local PowerDNS Recursor for DNS resolution. We sync zone overrides and blocklists via Ansible:
```yaml
---
- name: Sync DNS zone configuration
  hosts: all
  serial: 50
  gather_facts: false

  tasks:
    - name: Sync forward zone overrides
      ansible.builtin.copy:
        src: files/dns/forward-zones.conf
        dest: /etc/pdns-recursor/forward-zones.conf
        mode: '0644'
      register: forward_zones

    - name: Sync blocklist
      ansible.builtin.copy:
        src: files/dns/blocklist.lua
        dest: /etc/pdns-recursor/blocklist.lua
        mode: '0644'
      register: blocklist

    - name: Restart recursor if config changed
      ansible.builtin.command: /etc/init.d/pdns-recursor restart
      when: forward_zones.changed or blocklist.changed
```

Idempotency Challenges with OpenWrt
Ansible’s strength is idempotency, but OpenWrt presents some challenges:
- UCI vs flat files. OpenWrt’s `uci` commands are the “correct” way to manage configuration, but Ansible’s `template` module writes flat files. We chose flat files because they’re easier to template and diff, but this means we bypass UCI’s internal state. We run `uci commit` as a post-task to keep UCI in sync.
- No systemd. OpenWrt uses init.d scripts, not systemd. The standard Ansible `service` module works with some configuration, but we often use `command` for reliability.
- Minimal Python. Some of our older devices don’t have Python installed. For those, we use `ansible.builtin.raw` for basic commands and ensure Python is installed as a bootstrap step.
- Package state. `opkg` (OpenWrt’s package manager) doesn’t have robust state management. We maintain a list of required packages and use a simple shell task to install missing ones rather than relying on the `opkg` module.
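That package-install task is roughly the following sketch, where `required_packages` is an assumed group_vars list and the grep pattern matches `opkg status` output:

```yaml
- name: Install any missing required packages (sketch)
  ansible.builtin.shell: |
    opkg update
    for pkg in {{ required_packages | join(' ') }}; do
      # opkg status prints nothing for packages that are not installed
      opkg status "$pkg" | grep -q 'Status:.*installed' || opkg install "$pkg"
    done
  register: opkg_out
  changed_when: "'Installing' in opkg_out.stdout"
```

The `changed_when` condition keeps the task idempotent-looking in run summaries: it only reports a change when `opkg install` actually ran.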
Error Handling and Rollback
Network automation failures can take a site offline. Our safety nets:
- Pre-flight checks. Every playbook starts with connectivity verification and a config backup.
- Config backups. Before any config change, we archive `/etc/config/` to a timestamped tarball on the device and pull a copy to our central backup store.
- Automatic rollback. For critical config changes (network, firewall), we use a “dead man’s switch” — a cron job scheduled 5 minutes in the future that restores the backup. The playbook cancels the cron job only after verifying connectivity post-change.
- `max_fail_percentage`. Every playbook that touches the fleet has a failure threshold. If too many devices fail, the run stops.
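The backup step looks roughly like this sketch (task names and the `/tmp` staging path are illustrative):

```yaml
- name: Archive /etc/config to a timestamped tarball on the device
  ansible.builtin.shell: |
    ts=$(date +%Y%m%d-%H%M%S)
    tar -czf "/tmp/config-backup-${ts}.tar.gz" -C / etc/config
    echo "/tmp/config-backup-${ts}.tar.gz"
  register: backup
  changed_when: true

- name: Pull a copy to the central backup store
  ansible.builtin.fetch:
    src: "{{ backup.stdout_lines[-1] }}"
    dest: backups/
```

`fetch` namespaces the pulled files per host (`backups/<hostname>/...`), so restores can be matched back to the right device.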
```yaml
# Rollback dead man's switch pattern
- name: Schedule automatic rollback in 5 minutes
  ansible.builtin.cron:
    name: config-rollback
    minute: "{{ '%M' | strftime(ansible_date_time.epoch | int + 300) }}"
    hour: "{{ '%H' | strftime(ansible_date_time.epoch | int + 300) }}"
    job: /usr/local/bin/hopbox-config-restore.sh
    state: present

# ... apply config changes ...

- name: Cancel rollback after successful verification
  ansible.builtin.cron:
    name: config-rollback
    state: absent
```

CI/CD Pipeline for Playbooks
We don’t run playbooks directly from laptops. All automation goes through a CI pipeline:
- Lint. `ansible-lint` checks for best practices and common mistakes.
- Dry run. `--check --diff` against a test group of devices.
- Canary deploy. Apply to canary devices, run health checks.
- Approval gate. Manual approval required before fleet-wide deployment.
- Rolling deploy. Apply to the fleet in batches with monitoring.
```yaml
# .gitlab-ci.yml (simplified)
stages:
  - lint
  - dry-run
  - canary
  - deploy

lint:
  stage: lint
  script:
    - ansible-lint playbooks/

dry-run:
  stage: dry-run
  script:
    - ansible-playbook playbooks/$PLAYBOOK --check --diff -l test_devices

canary:
  stage: canary
  script:
    - ansible-playbook playbooks/$PLAYBOOK -l canary
    - ./scripts/run-health-checks.sh canary
  when: manual

deploy:
  stage: deploy
  script:
    - ansible-playbook playbooks/$PLAYBOOK -l all --forks 50
  when: manual
```

Lessons Learned
- Never push to the entire fleet at once. Always use `serial` and `max_fail_percentage`. Always.
- Test on real hardware. VMs don’t catch OpenWrt-specific issues. We keep a rack of test devices that mirror production hardware.
- Backup before every change. Disk is cheap. Downtime isn’t.
- Ansible vault for secrets. WireGuard keys, API tokens, and SSH keys are all managed via Ansible Vault. No plaintext secrets in the repository.
- Keep playbooks simple. Complex logic belongs in scripts on the device, not in 200-line Jinja2 templates. Ansible should orchestrate, not compute.
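The Vault pattern we follow is the common split between a vaulted file and a plaintext file that references it (file and variable names here are illustrative):

```yaml
# group_vars/all/vault.yml — encrypted with `ansible-vault encrypt`,
# shown here as it looks before encryption
vault_wg_private_key: "placeholder-key-material"

# group_vars/all/vars.yml — plaintext, safe to read in code review
wg_private_key: "{{ vault_wg_private_key }}"
```

This keeps variable names greppable in the repository while the actual secret material stays encrypted at rest.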
Ansible isn’t a silver bullet for network automation, especially on embedded Linux. But its agentless model, SSH-based execution, and declarative approach make it the right tool for managing a fleet of OpenWrt-based SD-WAN devices at our scale. The key is investing in guardrails — rolling updates, automatic rollbacks, health checks, and CI pipelines — so that automation accelerates your operations without amplifying your mistakes.