Case Study · 11 min read

How to Reproduce a Cross-Vendor EVPN Bug in 10 Minutes (Not 6 Weeks)

A Tier-1 ISP outage was triggered by an end-of-life Cisco router sending malformed EVPN packets to a Juniper vMX. The traditional post-mortem would have taken six weeks, most of it sourcing hardware. Here's how to reproduce it in 10 minutes with a virtual lab.

David Kim
DevOps Engineer

A Tier-1 ISP had a production outage. The trigger: an end-of-life Cisco router sent a malformed EVPN packet to a Juniper vMX. The vMX crashed. Traffic loss cascaded across the fabric. By the time the team started the post-mortem, the network engineering group was facing a six-week forensic process — source the EOL hardware from a retirement warehouse, ship it to the lab, set up a matching topology, reproduce the exact packet, iterate on the fix. Most of that time would be waiting: for gear, for firmware images, for environment replication.

Here's how the same investigation runs in a cloud-hosted AI-built lab, start to finish, in about 10 minutes.

The traditional post-mortem timeline

| Step | Traditional approach | Time |
| --- | --- | --- |
| Source matching EOL hardware | Locate spare gear, ship to lab | 3-7 days |
| Rack + cable | Physical install, cabling, power | 1 day |
| Match firmware versions | Find archived images, flash devices | 1-2 days |
| Build matching topology | Manual config on both devices | 1-2 days |
| Reproduce the packet condition | Custom tooling, trial and error | 3-7 days |
| Iterate on the fix | Config changes, reloads, retest | 5-10 days |
| Document and close | Write-up, sign-offs | 1-2 days |
| **Total** | | **4-6 weeks** |

The total cost: weeks of engineer time, blocked roadmap, customers asking why the post-mortem is still open, and a dusty rack of EOL gear that's sat unused since the last cross-vendor bug three months ago.

The shift: What used to require six weeks of hardware logistics now fits into a 10-minute lab session. The bottleneck was never the investigation — it was the environment setup. Remove that, and bug reproduction becomes a normal part of an engineer's day instead of a project.

The NetPilot approach — start to finish

Step 1: Describe the lab (30 seconds)

Open NetPilot and paste the topology in plain English:

Build a 2-node lab: Cisco IOL router (simulating the EOL device) and Juniper cRPD (simulating the production vMX) in EVPN peering over VXLAN. Add a Linux endpoint connected to the Cisco side with Scapy installed for malformed packet generation. Use AS 65001 for Cisco and AS 65002 for Juniper. Advertise 10.100.0.0/24 from both sides.

NetPilot parses the prompt, designs the topology, generates vendor-specific configurations, and begins the deployment.

Step 2: Lab deploys in the cloud (~2 minutes)

In about 2 minutes the lab is live. The generated artifacts:

  • Cisco IOL device with EVPN BGP configuration
  • Juniper cRPD device with matching EVPN config
  • Linux endpoint with Scapy pre-installed
  • Topology diagram showing the peering
  • SSH access to every device from the browser

No image sourcing. No firmware flashing. No cabling. No license keys.

Step 3: Verify baseline (1 minute)

Ask the agent:

"Check EVPN session state — verify BGP session is established and Type-2/Type-3 routes are exchanging between Cisco and Juniper."

The agent checks both devices in parallel and returns a consolidated view: session state, route counts per type, any mismatches. Baseline confirmed green in one step across both vendors — no need to remember show bgp l2vpn evpn summary for Cisco and show route table bgp.evpn.0 for Juniper separately.

Direct CLI is always available if you want to verify by hand or dig deeper on either device:

Juniper cRPD:

show bgp summary
show evpn instance
show route table bgp.evpn.0

Cisco IOL:

show bgp l2vpn evpn summary
show bgp l2vpn evpn
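If you'd rather script the baseline check than eyeball two CLIs, the state parsing boils down to a few lines per vendor. This is an illustrative sketch, not a NetPilot feature — the column layouts are approximations of the Junos `show bgp summary` and IOS `show bgp l2vpn evpn summary` output, and would need adjusting against real captures:

```python
import re

def juniper_session_established(show_bgp_summary: str) -> bool:
    """An established Junos peer shows 'Establ' (or active/received/accepted
    counts like 2/2/2/0) in the State column of `show bgp summary`."""
    for line in show_bgp_summary.splitlines():
        if re.match(r"\s*\d+\.\d+\.\d+\.\d+\s", line):
            if "Establ" in line or re.search(r"\d+/\d+/\d+/\d+", line):
                return True
    return False

def cisco_session_established(show_bgp_evpn_summary: str) -> bool:
    """IOS shows a prefix count (a bare number) in the State/PfxRcd column
    when the session is Established, and a state name (Idle, Active) otherwise."""
    for line in show_bgp_evpn_summary.splitlines():
        fields = line.split()
        if fields and re.match(r"\d+\.\d+\.\d+\.\d+$", fields[0]):
            if fields[-1].isdigit():
                return True
    return False
```

Feed each function the raw text you collected over SSH; both returning `True` is your green baseline.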

Step 4: Craft the malformed packet (2 minutes)

From here, you're writing custom Scapy code — this is the research part that the agent doesn't automate (yet). On the Linux endpoint, Scapy is pre-installed. Write the malformed EVPN packet — for this reproduction, a Type-2 MAC/IP route with an out-of-spec extended community attribute:

from scapy.all import *
from scapy.contrib.bgp import *

# Construct a BGP UPDATE with a malformed EVPN Type-2 trigger: the
# extended-community attribute (type 16) carries 255 raw bytes, which is
# not a multiple of the 8-octet extended-community size.
bad_update = BGPHeader(type=2) / BGPUpdate(
    path_attr=[
        # Well-formed AS_PATH attribute (transitive, type 2)
        BGPPathAttr(type_flags=0x40, type_code=2, attr_len=6,
                    attribute=BGPPAAS4BytesPath(segments=[BGPPAAS4BytesPathSegment(
                        segment_type=2, segment_length=1, segment_value=[65001]
                    )])),
        # Malformed extended community (the bug trigger): optional-transitive
        # flags (0xc0) with an out-of-spec 255-byte value
        BGPPathAttr(type_flags=0xc0, type_code=16, attr_len=255, attribute=b'\xff' * 255),
    ]
)

# Send toward the peer. Note: a bare send() reaches the wire but won't be
# accepted into an already-established TCP session, because the sequence
# numbers won't match. For a faithful injection, make the Linux endpoint the
# BGP speaker itself (stop its BGP daemon and drive the session from Scapy),
# or replay with seq/ack values captured from the live session.
send(IP(dst="10.100.1.1") / TCP(dport=179) / bad_update, iface="eth1")
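Why this particular payload stresses parsers: RFC 4360 extended communities are exactly 8 octets each, so a valid EXTENDED_COMMUNITIES attribute length must be a multiple of 8, and 255 is not. A tiny standalone encoder (independent of Scapy, for illustration only) makes the violation explicit:

```python
import struct

def encode_path_attr(flags: int, type_code: int, value: bytes) -> bytes:
    """Encode a BGP path attribute TLV (RFC 4271 section 4.3).
    Flag bit 0x10 selects a 2-octet extended length field."""
    if flags & 0x10:
        return struct.pack("!BBH", flags, type_code, len(value)) + value
    return struct.pack("!BBB", flags, type_code, len(value)) + value

# Each extended community (type 16) is exactly 8 octets, so a valid
# attribute length is a multiple of 8. 255 bytes of 0xff is out of spec
# before the receiver even looks at the community contents.
malformed = encode_path_attr(0xC0, 16, b"\xff" * 255)
```

A strict receiver should treat this per RFC 7606 (treat-as-withdraw); the bug class here is a receiver that instead tears down the session or crashes.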

Step 5: Observe the behavior (2 minutes)

Ask the agent:

"Watch the Juniper BGP session — tell me if the session drops or if any error is logged in the next 30 seconds."

The agent monitors session state and log output across both devices and alerts you when the session drops. You'll see: packet arrives → session torn down → routing process behavior (restart, crash, or silent misbehavior depending on the bug class).

Direct CLI is available for raw packet-level inspection — SSH into Juniper and run:

monitor traffic interface ge-0/0/0 matching "port 179" count 20

Take a packet capture for the report. Save the Scapy script in the lab's /root/repro/ so anyone on the team can rerun it.
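The agent's watch step amounts to scanning logs for a NOTIFICATION or session state change inside the observation window. A minimal standalone sketch, assuming Junos-style syslog timestamps — the message substrings are illustrative, not exact rpd output:

```python
import re
from datetime import datetime, timedelta

def detect_session_drop(log_lines, window_start, window=timedelta(seconds=30)):
    """Return messages indicating a BGP session drop (NOTIFICATION or a peer
    going Idle) within `window` of the packet send. Timestamp format is an
    assumption (syslog default: 'Mmm dd hh:mm:ss')."""
    drops = []
    for line in log_lines:
        m = re.match(r"(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(.*)", line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%b %d %H:%M:%S")
        ts = ts.replace(year=window_start.year)
        msg = m.group(2)
        in_window = window_start <= ts <= window_start + window
        if in_window and ("NOTIFICATION" in msg or
                          ("BGP peer" in msg and "Idle" in msg)):
            drops.append(msg)
    return drops
```

Point it at the lines you pull from `show log messages` and the timestamp of your Scapy send.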

Step 6: Test the fix (2-3 minutes)

Ask the agent:

"Apply an extended-community filter on Juniper to drop malformed EVPN routes with oversized community attributes, then commit."

The agent generates the Junos policy, applies it, runs commit check, and commits. You tell it what to do; it handles the syntax.

Direct CLI is available to review the change or write it yourself:

set policy-options policy-statement drop-bad-evpn from family evpn
set policy-options policy-statement drop-bad-evpn from community-count 100 orhigher
set policy-options policy-statement drop-bad-evpn then reject
set protocols bgp group EVPN import drop-bad-evpn
commit check
commit

Rerun the Scapy script. The malformed packet arrives, the policy drops it, the BGP session stays up. Fix verified.
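The filter's intent can be modeled in a few lines — an illustrative simplification of Junos policy evaluation, not its implementation; the 100-community threshold is the one from the config above:

```python
def import_policy(route_communities: list, threshold: int = 100) -> str:
    """Model of the `community-count 100 orhigher ... then reject` term:
    routes carrying `threshold` or more communities match and are rejected;
    everything else falls through to the default accept."""
    if len(route_communities) >= threshold:
        return "reject"
    return "accept"
```

One thing worth verifying in the lab: a count-based match only fires once the route parses, so confirm the policy (or the parser's RFC 7606 handling) actually catches your specific malformed attribute rather than assuming the threshold does the work.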

Step 7: Document and close

Export the topology as a lab template. Save the Scapy repro script to your repo. Attach the packet capture to the incident ticket. Close the post-mortem.

Total elapsed time: roughly 10 minutes, including pauses to read CLI output.

Why the real device CLI matters

Every competitive approach to this problem has a tradeoff.

Static config analysis (Batfish): Tells you the configs are valid. It won't show you a runtime crash from a malformed packet — there's no runtime to crash.

Simulation (Packet Tracer): Gives you a simplified protocol model. It doesn't reproduce vendor-specific packet parsing bugs because the parser isn't the real one.

Hardware traffic generators (Keysight IxNetwork, Spirent TestCenter): Can generate malformed packets at line rate. They don't give you multi-vendor device-level behavior with real CLIs in a lab you can spin up in 2 minutes.

DIY ContainerLab or GNS3: Gets you there eventually. Requires sourcing images, configuring the host, building the topology, scripting Scapy yourself. Hours to days of setup.

AI-built cloud labs: Describe the topology, get the lab, reproduce the bug. No setup tax.

The critical property for this workflow is real device behavior with real CLI access. Cisco IOL is the actual Cisco IOS binary running in a container. Juniper cRPD is Juniper's actual routing process. When the malformed packet hits the parser, it hits the real parser — not a simulator that approximates it. That's what makes the reproduction faithful.

What teams actually do with this

A few patterns we see on the Pro and enterprise plans:

  • ISP carrier engineering: Reproduce cross-vendor bugs before filing vendor TAC cases. Attach a reproduction prompt to the ticket. Vendors take it seriously when you hand them a repro in a shared cloud lab.
  • Network equipment vendor TAC: When a customer reports a bug, stand up the customer's topology from their ticket description. Iterate on the fix against a matching lab.
  • Security research: Fuzz vendor implementations against RFC edge cases. Document protocol conformance deltas.
  • Pre-purchase vendor validation: Before committing to new gear, stress-test vendor X's parser with the adversarial packets you already saw break vendor Y.
  • Outage post-mortem: Every production outage root cause gets a saved repro lab. When the same bug class reappears, the lab is already waiting.

The category is changing

Three years ago, cross-vendor bug reproduction meant a rack of gear and a six-week forensic cycle. The shift to cloud-hosted AI-built labs doesn't eliminate that workflow — it converts it from a project into a daily task. Post-mortems that used to drag across quarters close in the same sprint they opened.

That's the actual value prop. Not "replaces Keysight." Not "cheaper than a hardware lab." It's "makes bug reproduction cheap enough that engineers do it as a normal part of investigating something."

For the full comparison of platforms that can and can't run this workflow, see Keysight vs Spirent vs NetPilot. For debugging Cisco-Juniper EVPN specifically, see Debugging Cisco-Juniper EVPN Interop Issues.

FAQ

How do I reproduce a cross-vendor network bug without physical hardware?

Describe the topology to NetPilot in plain English, including both vendor devices and a Linux endpoint for packet crafting. NetPilot builds the lab in under 2 minutes with real device CLIs. Use Scapy on the Linux endpoint to inject the specific packet that triggered the issue, then observe vendor behavior via SSH. For cross-vendor bugs specifically, include both vendors in the same topology so the interop path matches production.

Can I run Scapy in a NetPilot lab?

Yes. Every NetPilot lab includes Linux endpoint nodes with Scapy pre-installed. You SSH in, write your packet-crafting script, and send packets directly into the network. Scapy's BGP and EVPN support makes it straightforward to craft malformed protocol messages for vendor testing.

Is it legal to craft and send malformed packets like this?

Yes, when done in your own lab environment for defensive research, bug reporting, and network validation. You own the lab, you own the traffic, you're testing vendor behavior against your own workloads. Most Tier-1 vendor TAC teams welcome detailed reproductions with packet captures and lab scripts — they cut TAC case resolution time substantially.

How long does a typical bug reproduction take?

For the scenario described above — cross-vendor EVPN with a malformed Type-2 route — about 10 minutes end to end, including lab deployment, baseline verification, packet crafting, reproduction, and testing a candidate fix. More complex scenarios (partition events, timing-dependent bugs, interaction with firewalls) can take 30-60 minutes. Compare to 4-6 weeks with physical hardware.

Can I share the reproduction with my vendor's TAC team?

Yes. Save the Scapy script and the lab prompt. The vendor can recreate the exact lab in NetPilot using the same prompt. This is dramatically faster than shipping them a packet capture and hoping they can replicate the topology. Enterprise plans support shared lab environments for TAC engagements.


Copy-paste ready: The cross-vendor EVPN bug reproduction prompt is the exact prompt used in this walkthrough. Clone it, edit the vendor mix, and reproduce your own cross-vendor scenarios.

Running an enterprise network research program? The Network Research Lab is built for this workflow — multi-vendor, real CLIs, cloud, failure injection. Contact sales for enterprise plans with dedicated environments and custom vendor support.

Try NetPilot Free

Build enterprise-grade network labs in seconds with AI assistance

Get Started Free