Explainer

VRAM: The Only Spec That Matters for AI

VRAM for local AI: what it is, why models need it, how quantization cuts requirements, and a VRAM table for major models.

OwnRig Editorial | 11 min read | March 14, 2026

Here's what nobody tells you when you're shopping for an AI GPU: the spec that determines whether you can run a model isn't clock speed, isn't CUDA cores, isn't TDP. It's VRAM. The amount of dedicated memory on the card. Everything else is noise.

I learned this the hard way. Bought a card with great benchmarks and 8 GB of VRAM. It couldn't run a single 14B model. A cheaper card with 16 GB ran it fine. This guide exists so you don't make the same mistake.

75% VRAM savings from Q4 quantization: run models roughly 4x larger than full precision allows in the same VRAM.

01

What VRAM actually is

VRAM is your GPU's private memory. High-speed, on-board, and limited. For gaming, it holds textures and frame buffers. For AI, it holds model weights: the billions of numbers that define what a model knows.

These weights must fit entirely in VRAM for the model to run at full speed. Partially? Doesn't count. If even 10% spills to system RAM, performance craters. VRAM is binary: either the model fits, or it doesn't.

02

The VRAM formula

At full precision (FP16), every weight occupies 2 bytes, so the math is simple: VRAM in GB ≈ parameters in billions × 2, plus roughly 10 to 20% for runtime overhead. An 8B model needs about 16 GB for weights alone; a 70B model needs about 140 GB.
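As a quick sanity check, here's a minimal sketch of that estimate in Python (the function name and the 15% overhead default are illustrative, not values from our data layer):

```python
def fp16_vram_gb(params_billion: float, overhead: float = 0.15) -> float:
    """Estimate FP16 VRAM: 2 bytes per weight, plus runtime overhead."""
    weights_gb = params_billion * 2  # 16 bits = 2 bytes per parameter
    return weights_gb * (1 + overhead)

# Llama 3.1 8B Instruct (8.03B params) at full precision
print(round(fp16_vram_gb(8.03), 1))  # ~18.5 GB, far beyond an 8 GB card
```

That's why an 8 GB card can't touch even an 8B model at full precision, and why quantization (next section) matters so much.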

03

Quantization: the VRAM multiplier

Quantization is how you cheat the formula. It reduces weight precision from 16-bit to smaller formats. Less precision means less VRAM. Here's the trade-off at each level:

| Quantization | Bits per weight | VRAM savings | Quality impact |
| --- | --- | --- | --- |
| FP16 | 16 | Baseline | Full quality; the reference standard |
| Q8_0 | 8 | ~50% | Nearly identical to FP16 |
| Q6_K | 6 | ~63% | Excellent; subtle differences on edge cases |
| Q5_K_M | 5 | ~69% | Very good; minor impact on complex reasoning |
| Q4_K_M | 4 | ~75% | Good; the sweet spot for most users |
| Q3_K_M | 3 | ~81% | Noticeable degradation on harder tasks |
| Q2_K | 2 | ~87% | Significant quality loss; last resort only |

Q4_K_M is the right answer for most people. I know that's a strong claim. But after testing dozens of models at every quantization level, Q4 consistently delivers output that's indistinguishable from Q8 for chat, coding, and general use. You only notice the difference on hard reasoning benchmarks.

04

VRAM requirements by model size

Every model in our database with VRAM requirements at recommended quality. Click any model to see all quantization options and compatible hardware.

1 to 3B (tiny): needs 2 to 4 GB VRAM

| Model | Params | Min VRAM | Recommended VRAM |
| --- | --- | --- | --- |
| Llama 3.2 1B Instruct | 1.24B | 819 MB | 1.1 GB |
| Whisper Large V3 | 1.55B | 1.3 GB | 1.5 GB |
| Stable Diffusion 3 Medium | 2B | 5 GB | 5 GB |
| Llama 3.2 3B Instruct | 3.21B | 1.7 GB | 2.8 GB |
| Phi-3 Mini 3.8B Instruct | 3.82B | 2.6 GB | 3 GB |
| Phi-4 Mini | 3.82B | 2 GB | 3.3 GB |

7 to 8B (small): needs 4 to 10 GB VRAM

| Model | Params | Min VRAM | Recommended VRAM |
| --- | --- | --- | --- |
| Stable Diffusion XL 1.0 | 6.6B | 6.5 GB | 6.5 GB |
| Mistral 7B Instruct v0.3 | 7.24B | 3.6 GB | 5.3 GB |
| DeepSeek R1 Distill Qwen 7B | 7.62B | 4.4 GB | 6.6 GB |
| Qwen 2.5 7B Instruct | 7.62B | 3.9 GB | 5.5 GB |
| Qwen 2.5 Coder 7B Instruct | 7.62B | 4.4 GB | 6.6 GB |
| InternLM 2.5 7B Chat | 7.74B | 4.5 GB | 6.7 GB |
| Llama 3.1 8B Instruct | 8.03B | 4 GB | 6.7 GB |
| Stable Diffusion 3.5 Large | 8.1B | 9 GB | 12.5 GB |
| Gemma 2 9B Instruct | 9.24B | 4.6 GB | 6.6 GB |

12 to 14B (medium): needs 6 to 16 GB VRAM

| Model | Params | Min VRAM | Recommended VRAM |
| --- | --- | --- | --- |
| FLUX.1 Dev | 12B | 7.2 GB | 13 GB |
| Gemma 3 12B | 12.2B | 5.7 GB | 10.5 GB |
| LLaVA 1.6 13B | 13B | 6.2 GB | 9.1 GB |
| Phi-3 Medium 14B Instruct | 14B | 6.7 GB | 9.7 GB |
| Phi-4 14B | 14.7B | 6.8 GB | 12.6 GB |
| Qwen 2.5 14B Instruct | 14.77B | 6.9 GB | 12.7 GB |
| StarCoder 2 15B | 15.5B | 7.3 GB | 10.7 GB |
| DeepSeek Coder V2 Lite 16B | 15.7B | 7.4 GB | 10.9 GB |

22 to 34B (large): needs 12 to 24 GB VRAM

| Model | Params | Min VRAM | Recommended VRAM |
| --- | --- | --- | --- |
| Codestral 22B | 22.2B | 10.3 GB | 15.1 GB |
| Mistral Small 24B Instruct | 24B | 11.2 GB | 20.5 GB |
| Gemma 2 27B Instruct | 27.23B | 9.8 GB | 18.5 GB |
| Gemma 3 27B | 27.23B | 13.3 GB | 22.3 GB |
| DeepSeek R1 Distill Qwen 32B | 32.5B | 15.5 GB | 28 GB |
| Qwen 2.5 Coder 32B Instruct | 32.5B | 11.6 GB | 21.9 GB |
| QwQ 32B Preview | 32.5B | 11.6 GB | 21.9 GB |
| Code Llama 34B Instruct | 33.7B | 12 GB | 22.7 GB |
| Yi 1.5 34B Chat | 34.4B | 15.8 GB | 29.5 GB |
| Command R 35B | 35B | 16 GB | 30 GB |

70B+ (very large): needs about 40 GB VRAM at Q4 (see table)

| Model | Params | Min VRAM | Recommended VRAM |
| --- | --- | --- | --- |
| Llama 3.1 70B Instruct | 70.6B | 24.5 GB | 47 GB |
| Llama 3.3 70B Instruct | 70.6B | 33 GB | 61 GB |
| Qwen 2.5 72B Instruct | 72.7B | 25.3 GB | 40.5 GB |
| DeepSeek V3 | 671B | 115 GB | 360 GB |

05

Context length: the hidden VRAM cost

The tables above cover model weights. But when you actually use a model, it also needs VRAM for the KV cache: the memory storing your conversation context. Longer conversations eat more VRAM.

At 4K to 8K context (typical for interactive chat), the KV cache adds 0.5 to 2 GB. At 32K+ context, it can add 4 to 8 GB. That's why you might see "out of memory" errors during long conversations even when the model loaded fine initially.
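The cache size follows directly from the model's attention shape: one key vector and one value vector per layer, per KV head, per token of context. A rough sketch, using Llama 3.1 8B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and an FP16 cache:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: a key and a value vector per layer, per KV head, per token."""
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len  # 2 = keys + values
    return elems * bytes_per_elem / 1024**3

# Llama 3.1 8B: 32 layers, 8 KV heads, head_dim 128, FP16 cache
print(kv_cache_gb(32, 8, 128, 8192))   # 8K context: 1.0 GB
print(kv_cache_gb(32, 8, 128, 32768))  # 32K context: 4.0 GB
```

Inference engines often quantize the cache too (e.g. 8-bit KV in llama.cpp), which halves these figures, so treat this as an upper bound for FP16 defaults.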

06

Your VRAM shopping list

Every GPU and Apple Silicon device in our database, sorted by VRAM. Match your model requirements from the tables above to a device below.

| Device | Type | VRAM | Price |
| --- | --- | --- | --- |
| RTX 4060 8GB | Discrete GPU | 8 GB | $289 |
| RTX 3080 10GB | Discrete GPU | 10 GB | $399 |
| RTX 3060 12GB | Discrete GPU | 12 GB | $269 |
| RTX 4070 Super | Discrete GPU | 12 GB | $599 |
| RTX 4070 Ti 12GB | Discrete GPU | 12 GB | $749 |
| RTX 4060 Ti 16GB | Discrete GPU | 16 GB | $449 |
| RTX 4070 Ti Super | Discrete GPU | 16 GB | $779 |
| RTX 4080 Super | Discrete GPU | 16 GB | $979 |
| RTX 5080 | Discrete GPU | 16 GB | $1,099 |
| M3 Pro (18GB Unified) | Apple Silicon | 18 GB | $1,799 |
| M4 Pro (24GB Unified) | Apple Silicon | 24 GB | $1,999 |
| RTX 3090 | Discrete GPU | 24 GB | $899 |
| RTX 4090 | Discrete GPU | 24 GB | $1,799 |
| RTX 5090 | Discrete GPU | 32 GB | $2,199 |
| M4 Max (36GB Unified) | Apple Silicon | 36 GB | $2,999 |
| M4 Pro (48GB) | Apple Silicon | 48 GB | $2,499 |
| M4 Max (64GB Unified) | Apple Silicon | 64 GB | $3,499 |
| M4 Max (128GB Unified) | Apple Silicon | 128 GB | $4,499 |
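To make the matching exercise mechanical, here's a hypothetical helper (the function name is mine; the sample VRAM figures are copied from the shopping list above):

```python
def devices_that_fit(required_gb: float, devices: dict[str, float]) -> list[str]:
    """Return devices with enough VRAM, smallest capacity first."""
    fits = [name for name, vram in devices.items() if vram >= required_gb]
    return sorted(fits, key=devices.get)

# A few entries from the shopping list above
DEVICES = {"RTX 3060 12GB": 12, "RTX 4060 Ti 16GB": 16,
           "RTX 3090": 24, "RTX 5090": 32}

# Qwen 2.5 14B Instruct wants 12.7 GB at recommended quality
print(devices_that_fit(12.7, DEVICES))
```

Sorting smallest-first surfaces the cheapest capacity that clears the bar; in practice you'd also want headroom for the KV cache, per section 05.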
07

How much you actually need

Here's the practical breakdown. I'll be direct.

  • 8 GB: The bare minimum. Runs 7B models at Q4. You'll outgrow it quickly. We don't recommend it for new buyers.
  • 12 to 16 GB: The sweet spot for most users. Runs 7 to 14B models comfortably. Some 34B models at aggressive quantization. This is where we tell most people to start.
  • 24 GB: The enthusiast standard for 34B and below at strong quants. For 70B-class models in our data, 24 GB usually means Q3 and/or partial offload, not full Q4 in VRAM. RTX 4090 and RTX 3090 live here.
  • 32 GB (discrete): The RTX 5090 — the largest GeForce VRAM we catalog. Our model data still lists 70B Q4 around 40 GB, so the matrix treats 70B Q4 here as offload-heavy, not fully in VRAM.
  • 36 to 128 GB (Apple unified): M4 Max configs; 64 GB and up is where 70B Q4 gets comfortable with headroom. Check each device and model page — unified memory is shared with the system.
Common Questions
What is VRAM?
VRAM (Video RAM) is the dedicated memory on a graphics card. For AI, it's where model weights live during inference. Unlike system RAM, VRAM has much higher bandwidth, which is why GPUs are faster than CPUs for AI.
Can I add more VRAM to my GPU?
No. VRAM is soldered to the board. You can't upgrade it. The only way to get more is to buy a different GPU. This is why VRAM capacity is the most important buying decision for AI hardware.
Is Apple Silicon unified memory the same as VRAM?
For sizing purposes, yes: a single pool of memory feeds both the CPU and GPU. A 64 GB or 128 GB Mac can dedicate tens of GB to a model — useful when 70B-class Q4 weights want about 40 GB in our database and a 24 GB discrete card is tight.
What happens if my model is bigger than my VRAM?
The model either won't load, or the engine will partially offload layers to system RAM. Offloaded layers run 10 to 50x slower because system RAM bandwidth is much lower. A model doing 40 tok/s in VRAM might drop to 2 to 5 tok/s with partial offloading.
Does VRAM matter for image generation?
Yes. Stable Diffusion XL needs 6 to 8 GB for standard 1024x1024 images. Higher resolutions and larger batch sizes need proportionally more. Video generation models need even more.

Priya Krishnan

Editor, hardware & inference

Priya obsesses over the gap between box specs and what actually happens when you hit Enter in Ollama. She got here untangling friends’ builds and sticker-shock cloud bills, and she still treats every recommendation like a debt she owes the reader.

Ready to build?

Tell us what you want to run, your budget, and your use case. We'll match you to the right hardware in under a minute.

All hardware specifications, prices, and performance data referenced in this guide are sourced from OwnRig's data layer, which is based on manufacturer specifications and community benchmarks. Prices are approximate US retail as of March 2026. Performance figures may vary by configuration, driver version, and software.