Guide · 11 min read

Reproducing Carrier-Grade Outages in a Virtual Lab: The Post-Mortem Playbook

A framework for carrier network engineering teams: turn every production outage into a reproducible cloud lab within hours, not weeks. Covers the seven phases of a virtual-lab post-mortem.

David Kim
DevOps Engineer

Carrier network outages follow a predictable pattern. Someone notices traffic dropping. On-call opens a bridge. The network is restored through rollback or workaround within hours. Then the real work starts: figuring out what actually happened, building a reproduction, and validating a permanent fix. That second phase — the post-mortem — typically takes 4 to 6 weeks when the reproduction requires physical hardware. Here is how carrier engineering teams are compressing that to 1-3 days using virtual labs, and the seven-phase playbook they follow.

Why post-mortems take weeks with physical hardware

The bottleneck is rarely the investigation itself. It is the environment setup. A typical cross-vendor carrier incident requires:

  • Sourcing matching hardware (often end-of-life gear from retirement warehouses)
  • Flashing matching firmware versions (often archived images not readily available)
  • Racking, cabling, and power
  • Matching configuration of both production devices
  • Custom tooling to reproduce the trigger (malformed packet, specific timing, traffic pattern)
  • Iterating on fixes through device reloads, which take minutes each

Every one of these steps is external to the actual engineering question. The engineer solving the bug might spend 10 hours on the bug and 150 hours on the environment.

The shift: A cloud-hosted lab built from a prompt collapses weeks of logistics into minutes. The post-mortem stops being a multi-week project and becomes a normal engineering task that closes in the same sprint.

The seven-phase playbook

Phase 1: Capture the incident artifacts (within 1 hour of resolution)

Before anything else, while the incident is still fresh, grab everything:

  • Packet captures from both sides of the failing link
  • show tech-support or request support information outputs from both vendors
  • Device logs for the ~15 minutes before and after the trigger
  • BGP / OSPF / protocol session state transitions
  • Traffic graphs from monitoring (InfluxDB, Prometheus, vendor NMS)
  • The exact versions of firmware, config, and any recent changes

You are building the evidence packet the virtual lab will need to faithfully reproduce the trigger. Missing artifacts here cost days later.
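A minimal sketch of how that evidence packet can be checked for completeness before the incident bridge closes. The artifact names below are hypothetical placeholders, not a NetPilot convention:

```python
# Hypothetical Phase 1 evidence-packet checklist.
# Artifact names are illustrative; adapt to your own capture tooling.
REQUIRED_ARTIFACTS = [
    "pcap_side_a.pcap", "pcap_side_b.pcap",
    "show_tech_vendor_a.txt", "request_support_vendor_b.txt",
    "device_logs.tar.gz", "bgp_session_transitions.json",
    "traffic_graphs.png", "versions_and_recent_changes.md",
]

def missing_artifacts(collected):
    """Return the required artifacts not yet captured, in checklist order."""
    have = set(collected)
    return [name for name in REQUIRED_ARTIFACTS if name not in have]

# Flag gaps while the incident is still fresh, not days later.
print(missing_artifacts(["pcap_side_a.pcap", "device_logs.tar.gz"]))
```

Running the check as the last step of the incident bridge turns "missing artifacts cost days later" into a two-second gate.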

Phase 2: Identify the minimum viable topology (1-2 hours)

Carrier networks are huge. The bug is almost always local to a small subgraph. Isolate the smallest topology that could have exhibited the failure:

  • The failing device (the one that crashed, lost sessions, dropped traffic)
  • Its direct peers (the immediate neighbors whose behavior could have triggered it)
  • Any adjacent device with special configuration (route reflector, firewall, middlebox)
  • A traffic source / sink (Linux host or traffic generator)

For most cross-vendor outages, this is 2-4 devices. You do not need to reproduce the entire fabric — you need the interaction graph where the bug lives.
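The selection rule above is mechanical enough to script: the failing device, its direct neighbors, plus any special-role device. A sketch, with a hypothetical adjacency map standing in for your inventory system:

```python
# Sketch: derive the minimum viable topology from an adjacency map.
# Device names are hypothetical examples.
ADJACENCY = {
    "edge-r1": ["core-r2", "core-r3"],
    "core-r2": ["edge-r1", "core-r3", "rr-1"],
    "core-r3": ["edge-r1", "core-r2", "leaf-a1"],
    "rr-1":    ["core-r2"],
    "leaf-a1": ["core-r3"],
}

def minimum_viable_topology(adjacency, failing, special=()):
    """Failing device + its direct peers + any special-role devices
    (route reflector, firewall, middlebox)."""
    nodes = {failing, *adjacency.get(failing, ()), *special}
    return sorted(nodes)

# core-r2 crashed; the route reflector has special configuration.
print(minimum_viable_topology(ADJACENCY, "core-r2", special=["rr-1"]))
```

Everything outside that set is noise for the investigation and only slows the repro down.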

Phase 3: Describe the lab in plain English (5 minutes)

Open NetPilot and paste the topology as a prompt. For example, for a cross-vendor BGP-EVPN outage:

Build a 3-node lab: Cisco IOL router (representing EOL device), Juniper cRPD (representing vMX), and Arista cEOS (representing adjacent leaf). All three peer via BGP EVPN over VXLAN underlay. Cisco in AS 65001, Juniper in AS 65002, Arista in AS 65003. Route reflector at Juniper side. Add a Linux endpoint with Scapy connected to Cisco for packet crafting. Advertise 10.100.0.0/24 from Cisco and Arista.

NetPilot parses the topology, generates vendor-specific configuration for each device, and deploys the lab to a cloud VM. Two minutes later you have SSH access to every device.

Phase 4: Match production configuration (15-30 minutes)

Copy the relevant production configuration into the lab — not the whole config, just the parts involved in the incident.

  • BGP peering and policy (prefix lists, route-maps, policy-statements)
  • Protocol configuration (EVPN, OSPF, IS-IS)
  • Interface configuration for the involved links
  • Any security or QoS policies that interact with the failing protocol
  • Exact firmware version, if your NetPilot plan supports version pinning (BYOI on the enterprise plan)

This is where historical firmware matters. If the production trigger was a bug fixed in a later release, you need the older release — which is easier with container images than physical devices.
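Pulling "just the parts involved in the incident" out of a multi-thousand-line config can itself be scripted. A sketch for IOS-style indented configs; the sample config and section prefixes are illustrative:

```python
# Sketch: keep only incident-relevant top-level sections from an
# IOS-style config dump (sections are indented under a top-level line).
SAMPLE_CONFIG = """\
hostname edge-r1
router bgp 65001
 neighbor 10.0.0.2 remote-as 65002
 address-family l2vpn evpn
  neighbor 10.0.0.2 activate
interface GigabitEthernet0/1
 ip address 10.0.0.1 255.255.255.252
ntp server 192.0.2.10
"""

def extract_sections(config, prefixes):
    """Keep top-level blocks whose first line starts with any prefix."""
    keep, keeping = [], False
    for line in config.splitlines():
        if not line.startswith(" "):          # a new top-level line
            keeping = any(line.startswith(p) for p in prefixes)
        if keeping:
            keep.append(line)
    return "\n".join(keep)

print(extract_sections(SAMPLE_CONFIG, ["router bgp", "interface"]))
```

The same idea applies to Junos-style configs, just with brace-delimited blocks instead of indentation.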

Phase 5: Reproduce the trigger (30 minutes to 2 hours)

With the lab running production-equivalent configuration, reproduce the condition that caused the outage. Describe what you need to the agent:

  • Malformed packet: "Inject a malformed EVPN Type-2 route with oversized community attribute from the Linux endpoint" — agent crafts and sends the Scapy packet
  • BGP event: "Flap the BGP session between Cisco and Juniper" — agent runs clear ip bgp / clear bgp neighbor
  • Link failure: "Drop all traffic on the R2-R3 link for 30 seconds" — agent applies tc netem and removes it
  • Timing-sensitive sequences: for complex triggers (precise multi-event sequences), write a bash or Python script directly — commit it to your repo as part of the repro artifact

Direct CLI is always available for triggers that need precise control — SSH into the device or control node and run commands by hand or via script.

Each trigger is reproducible — either as an agent prompt or a committed script — so anyone on the team can rerun it.
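For the timing-sensitive case, the core of a committed trigger script is a drift-compensating scheduler. A sketch with stub actions standing in for the real SSH commands (the event names and offsets are hypothetical):

```python
import time

# Sketch of a timing-precise trigger sequence (Phase 5). The lambdas
# would normally shell out over SSH to the lab devices; here they are
# stubs so the sequencing logic itself is visible.
def run_sequence(events, now=time.monotonic, sleep=time.sleep):
    """events: list of (offset_seconds, action). Fires each action at
    its offset relative to sequence start, compensating for the time
    earlier actions consumed."""
    start = now()
    fired = []
    for offset, action in sorted(events, key=lambda e: e[0]):
        delay = offset - (now() - start)
        if delay > 0:
            sleep(delay)
        fired.append(action())
    return fired

events = [
    (0.00, lambda: "flap-bgp-session"),
    (0.05, lambda: "inject-malformed-route"),  # 50 ms after the flap
    (0.10, lambda: "clear-evpn-table"),
]
print(run_sequence(events))
```

Because the offsets are measured from sequence start rather than stacked sleeps, the gap between events does not drift as each action takes variable time to run.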

Phase 6: Iterate on the fix (1-4 hours)

Once the bug reproduces, try candidate fixes. This is the phase that kills post-mortems on hardware (every iteration is a reload) but flies in a virtual lab (every iteration is a commit).

For each candidate, tell the agent what to change:

"Apply an extended-community filter on Juniper to block malformed EVPN community attributes"

The agent generates the right syntax per vendor and applies it. You don't need to remember policy-options policy-statement syntax — describe the intent. You can still drop to the CLI to review the change, adjust it, or write it yourself:

set policy-options policy-statement drop-bad-evpn then reject
commit

Each iteration is 2-5 minutes instead of 20-60 on hardware.

Phase 7: Document and productize (1-2 hours)

Every validated repro becomes a team asset:

  • Save the prompt that describes the topology
  • Save the trigger scripts (Scapy, bash, Python)
  • Save the candidate fix as a config snippet
  • Save the baseline verification commands so anyone can confirm the lab works before running the trigger
  • Write the post-mortem with a link to the reproducible lab, not just a description

Six months later when the same bug class shows up in a different pair of devices, the repro lab is already waiting.
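One way to keep those artifacts together is a single machine-readable manifest checked into the repo alongside the scripts. The field names and paths below are a hypothetical schema, not a NetPilot format:

```python
import json

# Sketch: package the Phase 7 repro as one JSON artifact.
# All field names, IDs, and paths are illustrative placeholders.
repro = {
    "incident": "INC-2419-evpn-community-crash",
    "topology_prompt": "Build a 3-node lab: Cisco IOL router, "
                       "Juniper cRPD, and Arista cEOS peering via "
                       "BGP EVPN over a VXLAN underlay.",
    "trigger_scripts": ["triggers/malformed_evpn.py"],
    "candidate_fix": "set policy-options policy-statement "
                     "drop-bad-evpn then reject",
    "baseline_checks": ["show bgp summary", "show evpn database"],
}
print(json.dumps(repro, indent=2))
```

A manifest like this is what makes the "same bug class, different devices" scenario a one-hour exercise instead of a fresh investigation.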

What enterprise carrier teams actually do

Patterns we see from carrier engineering on the enterprise plan:

  • Incident playbook library: Every major outage root cause gets a saved NetPilot lab prompt. The playbook index maps production symptoms → lab prompts.
  • TAC case attachments: Vendor TAC cases include the NetPilot lab prompt in the initial ticket. TAC teams can recreate the exact scenario by running the prompt themselves. Case resolution drops from weeks to days.
  • Shift handoffs: Nightly oncall uses the lab library to verify workarounds during the incident itself, not just in the post-mortem.
  • New-hire training: New network engineers spend their first week running historical incident labs — faster ramp than reading after-action reports.
  • Vendor qualification: Before signing the next 5-year gear contract, run the candidate vendor's device through every historical bug repro you have. If their implementation breaks on last year's bugs, skip them.

Pairing with other validation tools

A virtual lab post-mortem is complementary to, not a replacement for, your existing tools:

  • Hardware testers (Keysight IxNetwork, Spirent TestCenter) stay relevant for line-rate traffic generation and certification labs
  • Static analysis (Batfish) stays relevant for config audit and unreachability analysis
  • Production telemetry (ThousandEyes, Kentik, vendor NMS) stays relevant for detection and capture
  • Virtual lab (NetPilot) adds the reproduction + iteration phase that was previously hardware-bottlenecked

For the full landscape see Keysight vs Spirent vs NetPilot.

FAQ

How long should a carrier outage post-mortem actually take?

With physical hardware reproduction: 4-6 weeks from incident close to post-mortem close. With a virtual lab: 1-3 days for most bugs, up to a week for complex timing-dependent or scale-dependent issues. The primary variable is environment setup time — virtual labs collapse that to minutes.

Do I need to reproduce the entire production fabric?

No. Most bugs are local to a 2-5 device subgraph. Identify the minimum viable topology that can exhibit the failure — usually the failing device, its direct peers, and any device with special configuration (route reflector, firewall, policy boundary). Adding more devices just makes the investigation slower.

What if the bug only reproduces on a specific firmware version?

Use the NetPilot enterprise plan's BYOI (bring your own image) capability to pin the exact firmware version. For container-based vendor NOS images, this is straightforward. For VM-based images, contact sales about custom image onboarding.

Can my vendor's TAC team access the same lab?

Yes. Share the prompt with vendor TAC. They can deploy the same lab in their own NetPilot instance, matching your exact topology. Enterprise plans support shared lab environments for joint TAC engagements.

How do I handle timing-sensitive bugs that are hard to reproduce?

Script the trigger. Timing-sensitive bugs look "unreproducible" because humans can't hit the exact sequence of events by hand. A bash or Python script with precise sleeps can reproduce a 5ms-window bug on the 3rd try instead of the 3000th. Scripts live in the lab prompt artifact.


Copy-paste ready: The cross-vendor EVPN bug reproduction prompt and outage forensics playbook prompt are the templates carrier teams start from.

Running enterprise carrier network engineering? The Network Research Lab is built for this workflow. Contact sales for enterprise plans with dedicated environments, BYOI for firmware pinning, and shared lab support for TAC engagements.
