Here's what nobody tells you when you're shopping for an AI GPU: the spec that determines whether you can run a model isn't clock speed, isn't CUDA cores, isn't TDP. It's VRAM. The amount of dedicated memory on the card. Everything else is noise.
I learned this the hard way. Bought a card with great benchmarks and 8 GB of VRAM. It couldn't run a single 14B model. A cheaper card with 16 GB ran it fine. This guide exists so you don't make the same mistake.
The headline number: Q4 quantization cuts VRAM use by about 75%, letting you run models roughly 4× larger than full precision allows.
What VRAM actually is
VRAM is your GPU's private memory. High-speed, on-board, and limited. For gaming, it holds textures and frame buffers. For AI, it holds model weights: the billions of numbers that define what a model knows.
These weights must fit entirely in VRAM for the model to run at full speed. Partially? Doesn't count. If even 10% spills to system RAM, performance craters. VRAM is binary: either the model fits, or it doesn't.
The VRAM formula
At full precision (FP16), the math is simple: each weight takes 2 bytes, so VRAM (GB) ≈ parameters (in billions) × 2, plus roughly 20% overhead for activations and framework buffers. A 7B model at FP16 therefore needs about 14 GB for weights alone, before overhead.
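That formula is easy to sanity-check in code. Here's a minimal sketch; the 20% overhead factor is an assumption, not a universal constant:

```python
def fp16_vram_gb(params_billions: float, overhead: float = 1.2) -> float:
    """Estimate VRAM needed to load a model's weights at FP16.

    FP16 stores each weight in 2 bytes. The overhead factor (~20%,
    an assumption) covers activations and framework buffers.
    """
    bytes_needed = params_billions * 1e9 * 2 * overhead
    return bytes_needed / 1024**3  # bytes -> GiB

# A 7B model at FP16: about 15.6 GiB with overhead included
print(round(fp16_vram_gb(7), 1))
```

That ~15.6 GB figure is why an 8 GB card can't touch a 7B model at full precision, and why quantization matters so much.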
Quantization: the VRAM multiplier
Quantization is how you cheat the formula. It reduces weight precision from 16-bit to smaller formats. Less precision means less VRAM. Here's the trade-off at each level:
| Quantization | Bits per weight | VRAM savings | Quality impact |
|---|---|---|---|
| FP16 | 16 | Baseline | Full quality; the reference standard |
| Q8_0 | 8 | ~50% | Nearly identical to FP16 |
| Q6_K | 6 | ~63% | Excellent; subtle differences on edge cases |
| Q5_K_M | 5 | ~69% | Very good; minor impact on complex reasoning |
| Q4_K_M | 4 | ~75% | Good; the sweet spot for most users |
| Q3_K_M | 3 | ~81% | Noticeable degradation on harder tasks |
| Q2_K | 2 | ~87% | Significant quality loss; last resort only |
Q4_K_M is the right answer for most people. I know that's a strong claim. But after testing dozens of models at every quantization level, Q4 consistently delivers output that's indistinguishable from Q8 for chat, coding, and general use. You only notice the difference on hard reasoning benchmarks.
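The table's savings percentages fall straight out of bits-per-weight arithmetic. This sketch extends the FP16 formula to any quantization level; real GGUF files mix precisions per tensor, so treat these as ballpark figures rather than exact file sizes:

```python
# Bits per weight for common quantization levels (from the table above)
BITS_PER_WEIGHT = {
    "FP16": 16, "Q8_0": 8, "Q6_K": 6, "Q5_K_M": 5,
    "Q4_K_M": 4, "Q3_K_M": 3, "Q2_K": 2,
}

def quantized_vram_gb(params_billions: float, quant: str = "Q4_K_M",
                      overhead: float = 1.2) -> float:
    """Rough weight-memory estimate at a given quantization level.

    The ~20% overhead factor is an assumption carried over from the
    FP16 formula; actual usage varies by runtime.
    """
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * 1e9 * (bits / 8) * overhead / 1024**3

# A 14B model: ~31 GiB at FP16 vs ~8 GiB at Q4_K_M
for q in ("FP16", "Q8_0", "Q4_K_M"):
    print(q, round(quantized_vram_gb(14, q), 1))
```

Run it for a 14B model and Q4_K_M lands around 8 GB, which is why the 12 to 16 GB cards below handle that class comfortably.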
VRAM requirements by model size
The tables below cover every model in our database, with VRAM requirements at the recommended quality level.
1 to 3B (tiny): needs 2 to 4 GB VRAM
| Model | Params | Min VRAM | Recommended VRAM |
|---|---|---|---|
| Llama 3.2 1B Instruct | 1.24B | 819 MB | 1.1 GB |
| Whisper Large V3 | 1.55B | 1.3 GB | 1.5 GB |
| Stable Diffusion 3 Medium | 2B | 5 GB | 5 GB |
| Llama 3.2 3B Instruct | 3.21B | 1.7 GB | 2.8 GB |
| Phi-3 Mini 3.8B Instruct | 3.82B | 2.6 GB | 3 GB |
| Phi-4 Mini | 3.82B | 2 GB | 3.3 GB |
7 to 8B (small): needs 4 to 10 GB VRAM
| Model | Params | Min VRAM | Recommended VRAM |
|---|---|---|---|
| Stable Diffusion XL 1.0 | 6.6B | 6.5 GB | 6.5 GB |
| Mistral 7B Instruct v0.3 | 7.24B | 3.6 GB | 5.3 GB |
| DeepSeek R1 Distill Qwen 7B | 7.62B | 4.4 GB | 6.6 GB |
| Qwen 2.5 7B Instruct | 7.62B | 3.9 GB | 5.5 GB |
| Qwen 2.5 Coder 7B Instruct | 7.62B | 4.4 GB | 6.6 GB |
| InternLM 2.5 7B Chat | 7.74B | 4.5 GB | 6.7 GB |
| Llama 3.1 8B Instruct | 8.03B | 4 GB | 6.7 GB |
| Stable Diffusion 3.5 Large | 8.1B | 9 GB | 12.5 GB |
| Gemma 2 9B Instruct | 9.24B | 4.6 GB | 6.6 GB |
12 to 14B (medium): needs 6 to 16 GB VRAM
| Model | Params | Min VRAM | Recommended VRAM |
|---|---|---|---|
| FLUX.1 Dev | 12B | 7.2 GB | 13 GB |
| Gemma 3 12B | 12.2B | 5.7 GB | 10.5 GB |
| LLaVA 1.6 13B | 13B | 6.2 GB | 9.1 GB |
| Phi-3 Medium 14B Instruct | 14B | 6.7 GB | 9.7 GB |
| Phi-4 14B | 14.7B | 6.8 GB | 12.6 GB |
| Qwen 2.5 14B Instruct | 14.77B | 6.9 GB | 12.7 GB |
| StarCoder 2 15B | 15.5B | 7.3 GB | 10.7 GB |
| DeepSeek Coder V2 Lite 16B | 15.7B | 7.4 GB | 10.9 GB |
22 to 34B (large): needs 12 to 24 GB VRAM
| Model | Params | Min VRAM | Recommended VRAM |
|---|---|---|---|
| Codestral 22B | 22.2B | 10.3 GB | 15.1 GB |
| Mistral Small 24B Instruct | 24B | 11.2 GB | 20.5 GB |
| Gemma 2 27B Instruct | 27.23B | 9.8 GB | 18.5 GB |
| Gemma 3 27B | 27.23B | 13.3 GB | 22.3 GB |
| DeepSeek R1 Distill Qwen 32B | 32.5B | 15.5 GB | 28 GB |
| Qwen 2.5 Coder 32B Instruct | 32.5B | 11.6 GB | 21.9 GB |
| QwQ 32B Preview | 32.5B | 11.6 GB | 21.9 GB |
| Code Llama 34B Instruct | 33.7B | 12 GB | 22.7 GB |
| Yi 1.5 34B Chat | 34.4B | 15.8 GB | 29.5 GB |
| Command R 35B | 35B | 16 GB | 30 GB |
70B+ (very large): needs about 40 GB VRAM at Q4 (see table)
| Model | Params | Min VRAM | Recommended VRAM |
|---|---|---|---|
| Llama 3.1 70B Instruct | 70.6B | 24.5 GB | 47 GB |
| Llama 3.3 70B Instruct | 70.6B | 33 GB | 61 GB |
| Qwen 2.5 72B Instruct | 72.7B | 25.3 GB | 40.5 GB |
| DeepSeek V3 | 671B | 115 GB | 360 GB |
Context length: the hidden VRAM cost
The tables above cover model weights. But when you actually use a model, it also needs VRAM for the KV cache: the memory storing your conversation context. Longer conversations eat more VRAM.
At 4K to 8K context (typical for interactive chat), the KV cache adds 0.5 to 2 GB. At 32K+ context, it can add 4 to 8 GB. That's why you might see "out of memory" errors during long conversations even when the model loaded fine initially.
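You can estimate the KV cache the same way as the weights. The formula is 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element. The architecture numbers in the example are illustrative assumptions for an 8B-class model with grouped-query attention; check a model's actual config for its real layer and head counts:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """Estimate KV-cache size for one sequence at FP16 (2 bytes/element)."""
    # Factor of 2: a key tensor and a value tensor per layer
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elems * bytes_per_elem / 1024**3

# Assumed 8B-class config: 32 layers, 8 KV heads, head_dim 128.
# At 8K context this works out to ~1 GiB; at 32K it's ~4 GiB.
print(round(kv_cache_gb(32, 8, 128, 8192), 2))
print(round(kv_cache_gb(32, 8, 128, 32768), 2))
```

Note how the cache scales linearly with context: quadruple the context window and you quadruple the cache, which is exactly the failure mode behind those mid-conversation out-of-memory errors.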
Your VRAM shopping list
Every GPU and Apple Silicon device in our database, sorted by VRAM. Match your model requirements from the tables above to a device below.
| Device | Type | VRAM | Price |
|---|---|---|---|
| RTX 4060 8GB | Discrete GPU | 8 GB | $289 |
| RTX 3080 10GB | Discrete GPU | 10 GB | $399 |
| RTX 3060 12GB | Discrete GPU | 12 GB | $269 |
| RTX 4070 Super | Discrete GPU | 12 GB | $599 |
| RTX 4070 Ti 12GB | Discrete GPU | 12 GB | $749 |
| RTX 4060 Ti 16GB | Discrete GPU | 16 GB | $449 |
| RTX 4070 Ti Super | Discrete GPU | 16 GB | $779 |
| RTX 4080 Super | Discrete GPU | 16 GB | $979 |
| RTX 5080 | Discrete GPU | 16 GB | $1,099 |
| M3 Pro (18GB Unified) | Apple Silicon | 18 GB | $1,799 |
| M4 Pro (24GB Unified) | Apple Silicon | 24 GB | $1,999 |
| RTX 3090 | Discrete GPU | 24 GB | $899 |
| RTX 4090 | Discrete GPU | 24 GB | $1,799 |
| RTX 5090 | Discrete GPU | 32 GB | $2,199 |
| M4 Max (36GB Unified) | Apple Silicon | 36 GB | $2,999 |
| M4 Pro (48GB) | Apple Silicon | 48 GB | $2,499 |
| M4 Max (64GB Unified) | Apple Silicon | 64 GB | $3,499 |
| M4 Max (128GB Unified) | Apple Silicon | 128 GB | $4,499 |
How much you actually need
Here's the practical breakdown. I'll be direct.
- 8 GB: The bare minimum. Runs 7B models at Q4. You'll outgrow it quickly. We don't recommend it for new buyers.
- 12 to 16 GB: The sweet spot for most users. Runs 7 to 14B models comfortably. Some 34B models at aggressive quantization. This is where we tell most people to start.
- 24 GB: The enthusiast standard for 34B and below at strong quants. For 70B-class models in our data, 24 GB usually means Q3 and/or partial offload, not full Q4 in VRAM. RTX 4090 and RTX 3090 live here.
- 32 GB (discrete): The RTX 5090 — the largest GeForce VRAM we catalog. Our model data still lists 70B Q4 around 40 GB, so the matrix treats 70B Q4 here as offload-heavy, not fully in VRAM.
- 36 to 128 GB (Apple unified): M4 Max configs; 64 GB and up is where 70B Q4 gets comfortable with headroom. Check each device and model page — unified memory is shared with the system.
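Matching a model to a card is just a filter-and-sort over the shopping list. Here's a small sketch using a handful of entries from the table above (prices in USD, as listed there):

```python
# A few entries from the shopping list above: (name, VRAM in GB, price USD)
DEVICES = [
    ("RTX 4060 8GB", 8, 289),
    ("RTX 3060 12GB", 12, 269),
    ("RTX 4060 Ti 16GB", 16, 449),
    ("RTX 3090", 24, 899),
    ("RTX 5090", 32, 2199),
]

def cheapest_fit(required_vram_gb: float):
    """Return the cheapest device whose VRAM covers the requirement,
    or None if nothing in the list is big enough."""
    fits = [d for d in DEVICES if d[1] >= required_vram_gb]
    return min(fits, key=lambda d: d[2]) if fits else None

# A 14B model at Q4_K_M recommends ~12.7 GB, so a 16 GB card is the floor
print(cheapest_fit(12.7))
```

The same logic applies at any scale: take the recommended VRAM from the model tables, not the minimum, and buy the cheapest card that clears it with a little headroom for the KV cache.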
