Automating 900+ Network Devices with Ansible
At Hopbox, we manage over 900 SD-WAN appliances deployed across India. Each device runs OpenWrt, has 2-4 WAN links, maintains WireGuard tunnels, and runs local DNS resolution. Keeping all of them consistent, updated, and correctly configured is a non-trivial operational challenge.
Ansible is the backbone of our device automation. This post covers why we chose it, how we structure our inventory and playbooks, and the hard-won lessons from running Ansible against hundreds of embedded Linux devices.
Why Ansible for Network Devices
We evaluated several options — SaltStack, custom scripts over SSH, a purpose-built agent — and landed on Ansible for these reasons:
- Agentless. Our CPE devices run OpenWrt on constrained hardware (2-4GB RAM, mSATA SSD). We don’t want a persistent agent consuming resources. Ansible needs only SSH and Python (or raw mode for devices without Python).
- SSH-based. Every Hopbox device is reachable over its WireGuard tunnel via SSH. No additional ports, no additional daemons.
- Idempotent (mostly). Ansible’s declarative model means we can re-run playbooks safely. In practice, idempotency on OpenWrt requires some care — more on that below.
- Ecosystem. The `ansible.netcommon` and `community.general` collections provide modules for network config templating, file management, and service control that work well on Linux-based network devices.
Dynamic Inventory from the Hopbox API
Static inventory files don’t work when devices are provisioned and decommissioned regularly. We wrote a custom dynamic inventory script that queries the Hopbox Cloud API:
```python
#!/usr/bin/env python3
"""Ansible dynamic inventory from Hopbox API."""

import json
import os

import requests

HOPBOX_API = os.environ.get("HOPBOX_API_URL", "https://api.hopbox.net")
HOPBOX_TOKEN = os.environ["HOPBOX_API_TOKEN"]


def get_inventory():
    headers = {"Authorization": f"Bearer {HOPBOX_TOKEN}"}
    resp = requests.get(f"{HOPBOX_API}/v1/devices", headers=headers, timeout=30)
    resp.raise_for_status()
    devices = resp.json()["devices"]

    inventory = {"_meta": {"hostvars": {}}, "all": {"children": []}}

    # Group by region
    regions = {}
    for device in devices:
        region = device["region"]
        if region not in regions:
            regions[region] = []
        regions[region].append(device)

        inventory["_meta"]["hostvars"][device["hostname"]] = {
            "ansible_host": device["tunnel_ip"],
            "ansible_user": "root",
            "ansible_ssh_private_key_file": "/etc/ansible/keys/hopbox-automation",
            "site_id": device["site_id"],
            "wan_links": device["wan_links"],
            "firmware_version": device["firmware_version"],
            "hardware_model": device["hardware_model"],
        }

    for region, region_devices in regions.items():
        group_name = f"region_{region}"
        inventory[group_name] = {
            "hosts": [d["hostname"] for d in region_devices]
        }
        inventory["all"]["children"].append(group_name)

    return inventory


if __name__ == "__main__":
    print(json.dumps(get_inventory(), indent=2))
```

This gives us groups like `region_north`, `region_south`, etc., with per-host variables including tunnel IP, site ID, current firmware version, and hardware model.
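Playbooks can then target the generated groups directly. As a minimal sketch, a reachability check against one region might look like this:

```yaml
- name: Verify tunnel reachability for the northern region
  hosts: region_north
  gather_facts: false
  tasks:
    - name: Ping each device over its WireGuard tunnel
      ansible.builtin.ping:
```

Running `ansible-inventory -i inventory/hopbox_inventory.py --list` is also a quick way to sanity-check the generated groups and hostvars before pointing a playbook at them.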
Playbook Structure
Our Ansible repository is structured as:
```
ansible/
  inventory/
    hopbox_inventory.py
  playbooks/
    firmware-upgrade.yml
    config-push.yml
    dns-zone-sync.yml
    wireguard-rekey.yml
    diagnostics.yml
  roles/
    hopbox-base/
    hopbox-wan/
    hopbox-dns/
    hopbox-tunnel/
  templates/
    network.j2
    dhcp.j2
    firewall.j2
    wireguard.j2
  group_vars/
    all.yml
    region_north.yml
    region_south.yml
  ansible.cfg
```

Firmware Upgrades: Rolling Updates
Firmware upgrades are the highest-risk operation. A bad firmware push to 900 devices simultaneously would be catastrophic. We use a staged rolling update strategy:
```yaml
---
- name: Hopbox firmware upgrade (rolling)
  hosts: "{{ target_group | default('canary') }}"
  serial: "{{ batch_size | default(10) }}"
  max_fail_percentage: 5
  gather_facts: false

  vars:
    firmware_url: "https://releases.hopbox.net/firmware/{{ firmware_version }}/hopbox-{{ hardware_model }}.img.gz"
    firmware_checksum: "sha256:{{ firmware_sha256 }}"

  tasks:
    - name: Check current firmware version
      ansible.builtin.command: cat /etc/hopbox-version
      register: current_version
      changed_when: false

    - name: Skip if already on target version
      ansible.builtin.meta: end_host
      when: current_version.stdout == firmware_version

    - name: Download firmware image
      ansible.builtin.get_url:
        url: "{{ firmware_url }}"
        dest: /tmp/firmware.img.gz
        checksum: "{{ firmware_checksum }}"
        timeout: 300

    - name: Verify available disk space
      ansible.builtin.shell: |
        available=$(df /tmp | tail -1 | awk '{print $4}')
        if [ "$available" -lt 102400 ]; then
          echo "INSUFFICIENT_SPACE"
          exit 1
        fi
      changed_when: false

    - name: Apply firmware via sysupgrade
      ansible.builtin.command: sysupgrade -v /tmp/firmware.img.gz
      async: 300
      poll: 0
      register: sysupgrade_job

    - name: Wait for device to come back online
      ansible.builtin.wait_for_connection:
        delay: 60
        timeout: 300

    - name: Verify new firmware version
      ansible.builtin.command: cat /etc/hopbox-version
      register: new_version
      changed_when: false
      failed_when: new_version.stdout != firmware_version

    - name: Run post-upgrade health check
      ansible.builtin.uri:
        url: "http://{{ ansible_host }}:8080/health"
        return_content: true
      register: health
      failed_when: health.json.status != "ok"
      retries: 3
      delay: 10
```

The rollout process:
- Canary group (5 devices across different regions) — deploy, monitor for 24 hours.
- Early adopters (50 devices) — deploy in batches of 10, with `max_fail_percentage: 5`.
- Full fleet — deploy in batches of 20-50, monitored via Grafana dashboards.
If any batch exceeds the failure threshold, Ansible halts and we investigate.
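In practice, each stage is the same playbook run with a different set of extra-vars. A hypothetical vars file for the early-adopter stage might look like this (file name and values are illustrative):

```yaml
# stage-early-adopters.yml — hypothetical extra-vars for stage 2
target_group: early_adopters    # inventory group for this stage (assumed name)
batch_size: 10
firmware_version: "2.4.1"       # example target version string
firmware_sha256: "placeholder"  # checksum of the release image
```

It would be passed with `ansible-playbook playbooks/firmware-upgrade.yml -e @stage-early-adopters.yml`, keeping the stage definitions reviewable in version control.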
Configuration Templating
OpenWrt uses UCI (Unified Configuration Interface) for configuration. We template the key config files using Jinja2:
```jinja2
{# templates/network.j2 #}
config interface 'loopback'
    option device 'lo'
    option proto 'static'
    option ipaddr '127.0.0.1'
    option netmask '255.0.0.0'

config interface 'lan'
    option device 'br-lan'
    option proto 'static'
    option ipaddr '{{ lan_ip | default("192.168.1.1") }}'
    option netmask '{{ lan_netmask | default("255.255.255.0") }}'

{% for link in wan_links %}
config interface 'wan{{ loop.index0 }}'
    option device '{{ link.device }}'
    option proto '{{ link.proto }}'
{% if link.proto == 'static' %}
    option ipaddr '{{ link.ipaddr }}'
    option netmask '{{ link.netmask }}'
    option gateway '{{ link.gateway }}'
{% endif %}
    option metric '{{ link.metric | default(loop.index0 * 10) }}'
{% endfor %}
```

The corresponding playbook pushes the template and restarts networking:
```yaml
---
- name: Push network configuration
  hosts: "{{ target_group }}"
  serial: 20
  gather_facts: false

  tasks:
    - name: Template network configuration
      ansible.builtin.template:
        src: templates/network.j2
        dest: /etc/config/network
        mode: '0644'
      register: network_config

    - name: Template firewall configuration
      ansible.builtin.template:
        src: templates/firewall.j2
        dest: /etc/config/firewall
        mode: '0644'
      register: firewall_config

    - name: Restart networking if config changed
      ansible.builtin.command: /etc/init.d/network restart
      when: network_config.changed

    - name: Reload firewall if config changed
      ansible.builtin.command: /etc/init.d/firewall reload
      when: firewall_config.changed

    - name: Wait for connectivity after restart
      ansible.builtin.wait_for_connection:
        delay: 10
        timeout: 120
      when: network_config.changed
```

DNS Zone Sync
Every Hopbox device runs a local PowerDNS Recursor for DNS resolution. We sync zone overrides and blocklists via Ansible:
```yaml
---
- name: Sync DNS zone configuration
  hosts: all
  serial: 50
  gather_facts: false

  tasks:
    - name: Sync forward zone overrides
      ansible.builtin.copy:
        src: files/dns/forward-zones.conf
        dest: /etc/pdns-recursor/forward-zones.conf
        mode: '0644'
      register: forward_zones

    - name: Sync blocklist
      ansible.builtin.copy:
        src: files/dns/blocklist.lua
        dest: /etc/pdns-recursor/blocklist.lua
        mode: '0644'
      register: blocklist

    - name: Restart recursor if config changed
      ansible.builtin.command: /etc/init.d/pdns-recursor restart
      when: forward_zones.changed or blocklist.changed
```

Idempotency Challenges with OpenWrt
Ansible’s strength is idempotency, but OpenWrt presents some challenges:
- UCI vs flat files. OpenWrt’s `uci` commands are the “correct” way to manage configuration, but Ansible’s `template` module writes flat files. We chose flat files because they’re easier to template and diff, but this means we bypass UCI’s internal state. We run `uci commit` as a post-task to keep UCI in sync.
- No systemd. OpenWrt uses init.d scripts, not systemd. The standard Ansible `service` module works with some configuration, but we often use `command` for reliability.
- Minimal Python. Some of our older devices don’t have Python installed. For those, we use `ansible.builtin.raw` for basic commands and ensure Python is installed as a bootstrap step.
- Package state. `opkg` (OpenWrt’s package manager) doesn’t have robust state management. We maintain a list of required packages and use a simple shell task to install missing ones rather than relying on the `opkg` module.
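That package-install task is roughly the following sketch, where `required_packages` is an assumed group_vars list and the grep pattern matches `opkg status` output:

```yaml
- name: Install any missing required packages (sketch)
  ansible.builtin.shell: |
    opkg update
    for pkg in {{ required_packages | join(' ') }}; do
      # opkg status prints nothing for packages that are not installed
      opkg status "$pkg" | grep -q 'Status:.*installed' || opkg install "$pkg"
    done
  register: opkg_out
  changed_when: "'Installing' in opkg_out.stdout"
```

The `changed_when` condition keeps the task idempotent-looking in run summaries: it only reports a change when `opkg install` actually ran.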
Error Handling and Rollback
Network automation failures can take a site offline. Our safety nets:
- Pre-flight checks. Every playbook starts with connectivity verification and a config backup.
- Config backups. Before any config change, we archive `/etc/config/` to a timestamped tarball on the device and pull a copy to our central backup store.
- Automatic rollback. For critical config changes (network, firewall), we use a “dead man’s switch” — a cron job scheduled 5 minutes in the future that restores the backup. The playbook cancels the cron job only after verifying connectivity post-change.
- `max_fail_percentage`. Every playbook that touches the fleet has a failure threshold. If too many devices fail, the run stops.
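The backup step looks roughly like this sketch (task names and the `/tmp` staging path are illustrative):

```yaml
- name: Archive /etc/config to a timestamped tarball on the device
  ansible.builtin.shell: |
    ts=$(date +%Y%m%d-%H%M%S)
    tar -czf "/tmp/config-backup-${ts}.tar.gz" -C / etc/config
    echo "/tmp/config-backup-${ts}.tar.gz"
  register: backup
  changed_when: true

- name: Pull a copy to the central backup store
  ansible.builtin.fetch:
    src: "{{ backup.stdout_lines[-1] }}"
    dest: backups/
```

`fetch` namespaces the pulled files per host (`backups/<hostname>/...`), so restores can be matched back to the right device.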
```yaml
# Rollback dead man's switch pattern
- name: Schedule automatic rollback in 5 minutes
  ansible.builtin.cron:
    name: config-rollback
    minute: "{{ '%M' | strftime(ansible_date_time.epoch | int + 300) }}"
    hour: "{{ '%H' | strftime(ansible_date_time.epoch | int + 300) }}"
    job: /usr/local/bin/hopbox-config-restore.sh
    state: present

# ... apply config changes ...

- name: Cancel rollback after successful verification
  ansible.builtin.cron:
    name: config-rollback
    state: absent
```

CI/CD Pipeline for Playbooks
We don’t run playbooks directly from laptops. All automation goes through a CI pipeline:
- Lint. `ansible-lint` checks for best practices and common mistakes.
- Dry run. `--check --diff` against a test group of devices.
- Canary deploy. Apply to canary devices, run health checks.
- Approval gate. Manual approval required before fleet-wide deployment.
- Rolling deploy. Apply to the fleet in batches with monitoring.
```yaml
# .gitlab-ci.yml (simplified)
stages:
  - lint
  - dry-run
  - canary
  - deploy

lint:
  stage: lint
  script:
    - ansible-lint playbooks/

dry-run:
  stage: dry-run
  script:
    - ansible-playbook playbooks/$PLAYBOOK --check --diff -l test_devices

canary:
  stage: canary
  script:
    - ansible-playbook playbooks/$PLAYBOOK -l canary
    - ./scripts/run-health-checks.sh canary
  when: manual

deploy:
  stage: deploy
  script:
    - ansible-playbook playbooks/$PLAYBOOK -l all --forks 50
  when: manual
```

Lessons Learned
- Never push to the entire fleet at once. Always use `serial` and `max_fail_percentage`. Always.
- Test on real hardware. VMs don’t catch OpenWrt-specific issues. We keep a rack of test devices that mirror production hardware.
- Backup before every change. Disk is cheap. Downtime isn’t.
- Ansible vault for secrets. WireGuard keys, API tokens, and SSH keys are all managed via Ansible Vault. No plaintext secrets in the repository.
- Keep playbooks simple. Complex logic belongs in scripts on the device, not in 200-line Jinja2 templates. Ansible should orchestrate, not compute.
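The Vault pattern we follow is the common split between a vaulted file and a plaintext file that references it (file and variable names here are illustrative):

```yaml
# group_vars/all/vault.yml — encrypted with `ansible-vault encrypt`,
# shown here as it looks before encryption
vault_wg_private_key: "placeholder-key-material"

# group_vars/all/vars.yml — plaintext, safe to read in code review
wg_private_key: "{{ vault_wg_private_key }}"
```

This keeps variable names greppable in the repository while the actual secret material stays encrypted at rest.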
Ansible isn’t a silver bullet for network automation, especially on embedded Linux. But its agentless model, SSH-based execution, and declarative approach make it the right tool for managing a fleet of OpenWrt-based SD-WAN devices at our scale. The key is investing in guardrails — rolling updates, automatic rollbacks, health checks, and CI pipelines — so that automation accelerates your operations without amplifying your mistakes.