
From Alert to Resolution: How AI-Assisted Provisioning Reduces MTTR

Network operations centers have a well-known problem: alert fatigue. Monitoring systems generate hundreds of alerts per day. Most are noise. The ones that matter get buried. And when a real issue surfaces, the resolution workflow is almost entirely manual — SSH into the device, read logs, correlate with other data sources, form a hypothesis, apply a change, verify.

What if the system that detected the problem could also propose the fix?

We have been building exactly that. This post describes our AI-assisted provisioning workflow — how it works, what it can do today, and the guardrails that keep it safe.

The Problem: Manual Troubleshooting Does Not Scale


Consider a typical NOC workflow when a WAN link degrades:

  1. Monitoring system fires a threshold alert (e.g., packet loss > 1%).
  2. NOC engineer acknowledges the alert.
  3. Engineer logs into the device or dashboard, checks link metrics.
  4. Engineer checks the other WAN links at the same site.
  5. Engineer decides whether to reroute traffic — and if so, how.
  6. Engineer makes the configuration change (update failover priority, adjust SLA thresholds, modify PBR rules).
  7. Engineer verifies the change resolved the issue.
  8. Engineer documents the incident and resolution.

Steps 2 through 8 take anywhere from 20 minutes to several hours, depending on complexity, engineer availability, and whether it is 3 AM. Multiply this by dozens of incidents per week across a fleet of hundreds of sites, and you have a significant operational burden.

Our AI-assisted provisioning workflow compresses steps 2 through 8 into a largely automated pipeline with a human approval gate:

[Anomaly Detected]
        ↓
[AI Analyzes Telemetry Context]
  - Current link metrics (all WAN links at the site)
  - Historical baseline for this site
  - Traffic classification and volume
  - Device health and capacity
        ↓
[AI Proposes Remediation]
  - Specific configuration change
  - Expected impact assessment
  - Rollback plan
        ↓
[Human Review & Approval] ← The critical gate
        ↓
[Provisioning API Applies Change]
        ↓
[Automated Verification]
  - Metrics monitored for improvement
  - Automatic rollback if degradation detected
        ↓
[Incident Documented Automatically]

Each step is logged and auditable. The AI does not operate in a black box — every decision is traceable to specific telemetry data and model outputs.
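The gated pipeline above can be modeled as an explicit state machine in which no path reaches the apply step without passing through approval. A simplified sketch (the stage names and transition table are illustrative, not Hopbox's actual workflow engine):

```python
from enum import Enum, auto

class Stage(Enum):
    DETECTED = auto()
    ANALYZED = auto()
    PROPOSED = auto()
    APPROVED = auto()
    APPLIED = auto()
    VERIFIED = auto()
    DOCUMENTED = auto()

# Legal transitions. Note there is no edge from PROPOSED to APPLIED
# that skips APPROVED -- the human gate is structural, not optional.
TRANSITIONS = {
    Stage.DETECTED: {Stage.ANALYZED},
    Stage.ANALYZED: {Stage.PROPOSED},
    Stage.PROPOSED: {Stage.APPROVED},  # a rejection simply ends the workflow
    Stage.APPROVED: {Stage.APPLIED},
    Stage.APPLIED: {Stage.VERIFIED},
    Stage.VERIFIED: {Stage.DOCUMENTED},
}

def advance(current: Stage, target: Stage) -> Stage:
    """Move the workflow forward, rejecting any transition not in the table."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target

state = Stage.DETECTED
for nxt in (Stage.ANALYZED, Stage.PROPOSED, Stage.APPROVED,
            Stage.APPLIED, Stage.VERIFIED, Stage.DOCUMENTED):
    state = advance(state, nxt)
print(state.name)  # → DOCUMENTED
```

Making the approval gate a hard edge in the transition table, rather than an if-statement in application code, is one way to keep it auditable and impossible to bypass.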

The system comprises four components: a unified event pipeline, an inference engine, the provisioning API, and a human approval interface.

All telemetry data and alerts flow through a unified event pipeline. When the anomaly detection layer (described in our Hopbox AI introduction post) flags an issue, it emits a structured event that triggers the remediation workflow.

{
  "event_type": "anomaly_detected",
  "site_id": "BR-042",
  "link": "wan1",
  "anomaly_class": "gradual_degradation",
  "metrics": {
    "latency_ms": 34.2,
    "baseline_latency_ms": 14.5,
    "packet_loss_pct": 0.31,
    "baseline_loss_pct": 0.02,
    "trend_direction": "worsening",
    "trend_duration_hours": 96
  },
  "confidence": 0.94,
  "timestamp": "2026-03-25T14:32:00Z"
}

The inference engine takes the anomaly event and the full site context (all links, current traffic, device state) and determines the optimal remediation. This is not a simple rule engine — though it does incorporate rules for known failure modes. The ML component handles pattern matching against historical incidents and their successful resolutions.

For example, the engine might determine:

  • WAN1 is degraded, WAN2 is healthy with spare capacity.
  • Current traffic volume can be handled by WAN2 alone.
  • Historical data shows WAN1 degradations at this site have previously lasted 24-72 hours before ISP resolution.
  • Recommended action: shift traffic from WAN1 to WAN2 via SLA policy update.

The provisioning API is the same one used for zero-touch provisioning and manual configuration pushes. The AI workflow does not get special access — it generates a standard configuration delta that goes through the same validation, application, and verification pipeline as any other config change.

# Simplified example of the config delta the AI generates
uci set network.wan1_failover.metric='100' # Deprioritize WAN1
uci set network.wan2_failover.metric='10' # Prioritize WAN2
uci set sdwan.sla_policy.primary='wan2' # Update SD-WAN path selection
uci commit
service network reload

The approval interface presents the NOC engineer with:

  • A summary of the detected issue.
  • The proposed change in human-readable form and as a raw config diff.
  • An impact assessment (estimated traffic shift, affected applications).
  • A one-click approve or reject action.
  • An optional field for the engineer to modify the proposed change before approval.

Here is a concrete scenario that plays out regularly in our deployments:

14:00 — Hopbox AI detects gradual latency increase on WAN1 at site BR-042. Latency has crept from 14ms to 34ms over four days. Packet loss trending upward.

14:01 — Inference engine evaluates site context. WAN2 (backup LTE link) is healthy: 22ms latency, 0% loss, 78% capacity available. Current traffic volume is within WAN2’s capacity.

14:02 — System generates remediation proposal: shift SD-WAN path selection to prefer WAN2, deprioritize WAN1, open an ISP ticket for WAN1 investigation.

14:02 — NOC engineer receives notification in Slack with the proposal summary.

14:05 — Engineer reviews the proposal, sees the data backing the recommendation, clicks approve.

14:05 — Provisioning API pushes the config change to the device. Change applied and verified within 30 seconds.

14:06 — Automated verification confirms latency dropped to 22ms (WAN2 baseline), no packet loss, all applications operating normally.

14:06 — Incident documented with full timeline, metrics, proposed change, approval record, and verification results.

Total time from detection to resolution: 6 minutes. No SSH sessions. No manual log analysis. No guesswork.

The impact on mean time to resolution is significant:

Metric                      | Before | After
Detection to acknowledgment |        | < 2 minutes (automated)
Acknowledgment to diagnosis |        | < 1 minute (automated)
Diagnosis to remediation    |        | < 5 minutes (includes human approval)
Total MTTR                  |        |

Automating network changes is inherently risky. We have built multiple layers of safety into the system:

The approval gate is non-negotiable. The AI proposes; a human decides. There is no fully autonomous mode today, and we do not plan to offer one for production networks. The cost of a wrong automated change on a production WAN link is too high to remove human judgment from the loop.

The AI is constrained in what changes it can propose:

  • It can adjust SD-WAN path selection and failover priorities.
  • It can modify QoS policies within predefined bounds.
  • It cannot modify firewall rules, routing tables, or VPN configurations.
  • It cannot disable links or interfaces.
  • It cannot make changes that would leave a site with no WAN connectivity.

These limits are enforced at the provisioning API level, not just in the AI model. Even if the inference engine produced an out-of-bounds recommendation (which it should not), the API would reject it.
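Enforcing those bounds at the API layer can be as simple as an allow-list check over the keys in the generated config delta. A sketch (the uci key prefixes and the delta shape are assumptions for illustration, not the actual Hopbox policy):

```python
# Key prefixes the AI-generated delta is allowed to touch. Everything
# else -- firewall, routing, VPN, interface state -- is rejected outright,
# regardless of what the inference engine proposed.
ALLOWED_PREFIXES = ("network.wan1_failover", "network.wan2_failover",
                    "sdwan.sla_policy", "qos.")

FORBIDDEN_PREFIXES = ("firewall.", "route.", "vpn.")

def validate_delta(delta: dict[str, str]) -> list[str]:
    """Return a list of violations; an empty list means the delta passes."""
    violations = []
    for key in delta:
        if key.startswith(FORBIDDEN_PREFIXES):
            violations.append(f"forbidden scope: {key}")
        elif not key.startswith(ALLOWED_PREFIXES):
            violations.append(f"outside allowed scope: {key}")
    return violations

ok = validate_delta({"sdwan.sla_policy.primary": "wan2"})
bad = validate_delta({"firewall.rule42.enabled": "0"})
print(ok, bad)  # → [] ['forbidden scope: firewall.rule42.enabled']
```

A production validator would also need semantic checks (for example, simulating the delta to confirm the site retains at least one usable WAN link), but the key design point stands: the constraint lives in the API, not in the model.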

After a change is applied, the system monitors the affected metrics for a configurable verification window (default: 15 minutes). If the metrics degrade further — meaning the change made things worse — the system automatically rolls back to the previous configuration and notifies the NOC team.

[Change Applied] → [Monitor 15 min] → Metrics improved? → Done
→ Metrics degraded? → Auto-rollback → Alert NOC
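The verification step is essentially a bounded polling loop: sample the affected metrics across the window and revert the moment they degrade past the pre-change baseline. A simplified, time-free sketch (the sampling source and rollback callable are placeholders for the real telemetry and provisioning calls):

```python
from typing import Callable, Iterable

def verify_change(loss_samples: Iterable[float], baseline_loss_pct: float,
                  rollback: Callable[[], None]) -> bool:
    """Watch packet-loss samples over the verification window.

    Returns True if the change held for the whole window; rolls back
    and returns False the moment loss exceeds the pre-change baseline.
    """
    for loss in loss_samples:           # one sample per polling interval
        if loss > baseline_loss_pct:    # worse than before the change
            rollback()
            return False
    return True

rolled_back = []
ok = verify_change([0.0, 0.01, 0.0], baseline_loss_pct=0.31,
                   rollback=lambda: rolled_back.append("reverted"))
bad = verify_change([0.5], baseline_loss_pct=0.31,
                    rollback=lambda: rolled_back.append("reverted"))
print(ok, bad, rolled_back)  # → True False ['reverted']
```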

Every step is logged: the anomaly event, the site context snapshot, the model’s reasoning, the proposed change, the approval (who, when), the applied config diff, and the verification results. This audit trail is immutable and available for compliance review.
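One common way to make such a trail tamper-evident is hash chaining: each entry's hash covers the previous entry's hash, so any later edit to the history is detectable. This is an illustration of the technique, not necessarily how Hopbox stores its audit log:

```python
import hashlib
import json

def append_record(chain: list[dict], record: dict) -> list[dict]:
    """Append a record whose hash covers the previous entry's hash,
    so any later modification to earlier entries breaks the chain."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    payload = json.dumps(record, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    chain.append({"record": record, "prev": prev_hash, "hash": entry_hash})
    return chain

def verify_chain(chain: list[dict]) -> bool:
    """Recompute every hash from the start; False means tampering."""
    prev = "0" * 64
    for entry in chain:
        payload = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True

chain: list[dict] = []
append_record(chain, {"step": "anomaly_detected", "site": "BR-042"})
append_record(chain, {"step": "approved", "by": "noc-engineer"})
print(verify_chain(chain))  # → True
```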

AI-assisted provisioning is not about replacing network engineers. It is about changing what they spend their time on. Instead of SSH sessions and manual config edits at 3 AM, engineers review well-reasoned proposals with full context and make informed approval decisions.

The result is:

  • Faster resolution. Minutes instead of hours.
  • Consistent quality. Every remediation follows the same analysis framework, regardless of which engineer is on shift or how tired they are.
  • Better documentation. Incidents are documented automatically, completely, and consistently.
  • Reduced alert fatigue. When the system can handle routine remediations, engineers focus on genuinely complex issues.

We are expanding the range of remediations the AI can propose, starting with QoS policy adjustments based on traffic pattern changes and bandwidth capacity forecasting. We are also working on multi-site correlation — detecting when multiple sites on the same ISP are degrading simultaneously and escalating appropriately.

If you are managing a distributed network and spending too much time on reactive troubleshooting, we would like to show you what proactive, AI-assisted operations look like. Get in touch.
