What GPU do I need to run a 30B-class local model like Gemma with Ollama?

At 4-bit quantization a ~30-31B model is about a 20 GB download and loads on a single 24 GB GPU (e.g. RTX 4090) for single-user or pilot use. That leaves only a few gigabytes for context, so once context outgrows VRAM the runtime offloads to the CPU and throughput drops sharply. For production or long context, a 48 GB card (RTX 6000 Ada, A6000, or L40S) is the safe choice. Apple Silicon with 64 GB or more of unified memory also runs it. CPU-only works with ~32 GB of system RAM, but it is slow: a fallback, not the recommended path.

Can NetPilot On-Prem run on a single workstation GPU without a datacenter cluster?

Yes. NetPilot On-Prem runs the agent on a local LLM that you operate on a single GPU server or workstation — Ollama, vLLM, or Microsoft Foundry Local. We validated the agent on Google's Gemma 4 31B via Ollama, and it did not need a datacenter GPU. You supply and size the model host (one VM); NetPilot never distributes model weights — you pull the model yourself.

What GPU Do You Need to Run an On-Prem AI Network Engineer? (Local LLM Hardware Guide)

Q: Do I need an H100 or A100 to run an on-prem AI network tool on a local LLM?

No. A single ~31B local model — the sweet spot for an agent that designs and validates network labs — runs in roughly 17-20 GB of VRAM at 4-bit quantization, which fits on a single workstation GPU. H100/A100 (80 GB) datacenter cards exist for training and high-concurrency multi-model serving; they are not required to run one model for one agent. A single 24 GB card handles a pilot, and a 48 GB workstation card (RTX 6000 Ada or A6000) is the comfortable production spec.

When a network team scopes an on-prem, air-gapped AI tool, the first question from infrastructure and procurement is never about features. It's "what hardware do I need to run this?" — and the honest answer determines whether the project is a single requisition or a six-figure GPU build.

The good news: running an AI network agent on a local model is a one-workstation-GPU job, not a datacenter job. This guide gives the real sizing — VRAM by model size, which GPUs actually work, and where the line is — plus the model we tested for NetPilot On-Prem.

The short answer

For an agent that designs, deploys, and validates network labs, the sweet spot is a ~31B-class local model (large enough for reliable multi-step tool-calling, small enough to run on one card). Quantized to 4-bit, a model that size is roughly a 20 GB download and needs about 17–20 GB of VRAM — which fits on a single workstation GPU.

You do not need an NVIDIA H100 or A100. Those 80 GB datacenter cards exist for training and for serving many models or many concurrent users with large batch sizes. One model answering one agent's requests fits comfortably in 17–40 GB depending on quantization.

VRAM by model size (4-bit / Q4)

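A rule of thumb: a 4-bit model needs roughly half a gigabyte of VRAM per billion parameters for the weights, plus headroom for the KV cache, which grows with context length. Common Q4 formats average somewhat more than 4 bits per weight, which is why the figures below sit a little above the half-gigabyte line.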

Model size (Q4)	Approx. VRAM (weights)	Runs on
~8B (e.g. Llama 3.1 8B)	~5–6 GB	Almost any modern GPU (8–12 GB)
~12–14B	~9–10 GB	A 16 GB card
~27–31B (e.g. Gemma 4 31B)	~17–20 GB	One 24 GB card (pilot) → 48 GB (production)
~70B	~40–45 GB	One 48 GB card (tight) → 80 GB or 2× cards

The numbers are approximate — actual usage rises with context length and the KV cache, and varies by runtime version. Plan for headroom rather than treating the weight size as the ceiling.
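The rule of thumb above can be turned into a quick estimate. This is a minimal sketch, not a sizing tool: the function names are made up for illustration, the 4.8 bits-per-weight figure is an assumed effective average for mixed Q4 formats, and the 4 GB headroom budget is an arbitrary placeholder for KV cache and runtime overhead.

```python
def weight_vram_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits(params_billion: float, vram_gb: float,
         bits_per_weight: float = 4.0, headroom_gb: float = 4.0) -> bool:
    """True if the weights plus an assumed headroom budget fit in vram_gb."""
    return weight_vram_gb(params_billion, bits_per_weight) + headroom_gb <= vram_gb


# Compare the ideal 4.0 bits per weight with an assumed effective 4.8.
for size in (8, 14, 31, 70):
    ideal = weight_vram_gb(size, 4.0)
    assumed = weight_vram_gb(size, 4.8)
    print(f"{size}B: ~{ideal:.1f} GB at 4.0 bpw, ~{assumed:.1f} GB at 4.8 bpw")
```

Treat the result as a lower bound on what the card needs, then add context-dependent memory on top, as the note above says.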

Which GPU, honestly

A single 48 GB workstation card is the no-compromise pick. An NVIDIA RTX 6000 Ada or RTX A6000 (48 GB) loads a ~31B model at 4-bit with plenty of room for long context, can even run an 8-bit (near-lossless) quant, and has ECC memory — a clean fit for regulated environments. If you want one spec to put in a requisition for a defense, finance, or telco deployment, this is it.

A 24 GB card (e.g. RTX 4090) works for pilots and single users. A ~31B model at 4-bit fits, but with only a few gigabytes left for context — fine at modest context lengths, but if context grows past available VRAM the runtime offloads layers to the CPU and throughput drops sharply. Great for a proof of concept; size up to 48 GB for production or long context.
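To see why context eats the remaining gigabytes, it helps to estimate the KV cache directly. The sketch below uses the standard full-attention formula (keys and values, per layer, per KV head, per token). The layer count, KV head count, and head dimension are hypothetical placeholders, not the real figures for any particular model, and architectures that use sliding-window attention or a quantized cache will need less than this estimate.

```python
def kv_cache_gb(context_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """Full-attention KV cache size in GB (1 GB = 1e9 bytes).

    The leading factor of 2 covers keys and values; bytes_per_elem=2
    assumes a 16-bit cache.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_elem)
    return total_bytes / 1e9


# Hypothetical architecture values, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 60, 8, 128

for ctx in (8_192, 32_768, 131_072):
    size = kv_cache_gb(ctx, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"{ctx:>7} tokens: ~{size:.1f} GB of KV cache")
```

Under these assumed numbers the cache grows linearly with context, which is the pattern that pushes a 24 GB card into CPU offload at long context lengths.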

Apple Silicon is a real option. Because the GPU shares system memory, capacity equals unified RAM: a Mac with 64 GB or more runs a ~31B model locally and air-gapped, quietly and at low power. Raw throughput under load still trails a 48 GB NVIDIA card, but for a desk-side or small-team air-gapped deployment it's a clean choice.

CPU-only runs, but it's slow. With about 32 GB of system RAM, a 4-bit ~31B model will run with no GPU at all — useful to prove a fully air-gapped box can host the agent, but too slow for interactive, multi-user work. Treat it as a fallback, not the plan.

Datacenter cards are optional. The L40S (48 GB) and L4 (24 GB) are fine rack-mount options; H100/A100 (80 GB) cards only earn their keep if you want full-precision weights or high concurrency from one box. They are not useless, just overkill for a single-model agent.

The model we tested: Gemma 4 31B on Ollama

For NetPilot On-Prem we validated the agent on Google's Gemma 4 31B, deployed via Ollama. It runs at Ollama's default 4-bit quantization (about a 20 GB download) and drove the full design → deploy → verify loop on a single workstation GPU — no datacenter card involved.
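Once the model is pulled, a quick smoke test against Ollama's local REST API confirms the host can actually serve it. This is a sketch under stated assumptions: it targets Ollama's default address and its /api/generate endpoint, and the model tag is a hypothetical placeholder, so substitute whatever tag you pulled. It is unrelated to how NetPilot itself talks to the model.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default listen address
MODEL_TAG = "gemma4:31b"               # hypothetical tag; use the one you pulled


def build_generate_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = MODEL_TAG, timeout: float = 300.0) -> str:
    """Send the prompt and return the generated text (needs a running Ollama)."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the server running, calling generate("Reply with the single word: ready") should return a short completion; a long first response usually just means the weights are still loading into VRAM.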

A 31B-class model is a deliberate choice for an agent (as opposed to a chat assistant): it has to emit well-formed tool calls through a multi-step loop, not just answer one prompt, and very small models skip or malform tool calls under that load. Gemma 4 31B was large enough to be reliable and small enough to run on one card — the balance the on-prem use case needs.

A few honest notes:

You pull the model, not us. NetPilot never distributes model weights — you download Gemma (or your approved model) from Ollama or Google yourself, consistent with the bring-your-own-image model for everything else.
The model is swappable. It's configured in the admin console, so you can run whichever local model your security team approves — a different ~31B model, a smaller one for lighter hardware, or a larger one if you have the VRAM. Describe the agent's needs as "a ~31B-class local model such as Gemma 4 31B," not a single fixed dependency.
Tool-calling reliability matters more than benchmarks. The agent lives or dies on whether the model emits clean tool calls across a loop; favor models known for solid tool-calling over leaderboard trivia.
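One way to compare candidate models on the last point is to count how often their tool calls are well-formed. The check below is an illustrative sketch only: the tool names, required arguments, and the flat name/arguments JSON shape are all hypothetical, not NetPilot's actual tool schema.

```python
import json

# Hypothetical tool registry: tool name -> required argument names.
TOOLS = {
    "deploy_lab": {"topology"},
    "run_command": {"device", "command"},
}


def validate_tool_call(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the call is well-formed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["top level is not an object"]

    problems = []
    name = call.get("name")
    args = call.get("arguments")
    known = isinstance(name, str) and name in TOOLS
    if not known:
        problems.append(f"unknown tool: {name!r}")
    if not isinstance(args, dict):
        problems.append("arguments is not an object")
    elif known:
        missing = TOOLS[name] - set(args)
        if missing:
            problems.append(f"missing arguments: {sorted(missing)}")
    return problems
```

Running a fixed set of multi-step prompts through each candidate and tallying non-empty results gives a rough malformed-call rate, which is closer to what the agent depends on than a general benchmark score.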

Where NetPilot On-Prem fits

This is the part most hardware guides leave out: they tell you what your home PC can run, or quote a 4× H100 datacenter build, and nothing in between. The enterprise middle — a single workstation GPU running a real, production AI agent on-prem — is exactly where an on-prem AI network tool lives.

NetPilot On-Prem is built for that middle. The agent runs on your local model on one GPU server or workstation; it deploys real multi-vendor ContainerLab labs on your own host; and nothing leaves your network. You size one VM for the model — a single 24 GB card for a pilot, 48 GB for production — and you're done. No GPU cluster, no cloud, no phone-home.

Direct CLI is always available, too — SSH into any lab device for the real vendor CLI; the agent is the fast path, the CLI is how you verify.

FAQ

Do I need an H100 or A100 to run an on-prem AI network tool on a local LLM?

No. A ~31B model — the sweet spot for a network-lab agent — runs in about 17–20 GB of VRAM at 4-bit, which fits on a single workstation GPU. A 24 GB card handles a pilot; a 48 GB card (RTX 6000 Ada / A6000) is the comfortable production spec. H100/A100 are for training and high-concurrency serving, not a single-model agent.

What GPU do I need to run a 30B-class model like Gemma with Ollama?

At 4-bit it's about a 20 GB download and loads on a single 24 GB GPU (e.g. RTX 4090) for single-user/pilot use, or a 48 GB card for production and long context. Apple Silicon with 64 GB+ unified memory also runs it; CPU-only works with ~32 GB RAM but is slow.

Can NetPilot On-Prem run on a single workstation GPU?

Yes. The agent runs on your local model on one GPU server or workstation (Ollama, vLLM, or Foundry Local). We validated it on Gemma 4 31B via Ollama with no datacenter GPU. You supply and size the model host; NetPilot never distributes the weights.

What if I only have CPU, or a small GPU?

CPU-only runs a 4-bit ~31B model with ~32 GB RAM, but slowly: fine for proving an air-gapped box works, not for interactive use. On a small GPU, run a smaller model (an 8–14B model needs only about 5–10 GB) and accept that it will handle only simpler agent tasks; step up to ~31B on 24–48 GB when you can.

Copy-paste ready: Once your model host is up, the three-AS eBGP prompt is a good first lab to confirm the agent loop runs end to end on your hardware.

Related reading: On-Prem AI Network Lab covers the full air-gapped deployment; Run an AI Network Engineer on a Local LLM covers the model and agent side; and Building an Air-Gapped Network Lab covers the no-cloud lab itself.

Scoping an on-prem deployment? The On-Prem AI Network Lab page covers it end to end — contact sales and we'll size the hardware with you.
