Here's what nobody tells you when you're shopping for an AI GPU: the spec that determines whether you can run a model isn't clock speed, isn't CUDA cores, isn't TDP. It's VRAM. The amount of dedicated memory on the card. Everything else is noise.
I learned this the hard way. Bought a card with great benchmarks and 8 GB of VRAM. It couldn't run a single 14B model. A cheaper card with 16 GB ran it fine. This guide exists so you don't make the same mistake.
The headline number: Q4 quantization cuts VRAM use by about 75%, letting you run models roughly 4× larger than full precision allows.
What VRAM actually is
VRAM is your GPU's private memory. High-speed, on-board, and limited. For gaming, it holds textures and frame buffers. For AI, it holds model weights: the billions of numbers that define what a model knows.
These weights must fit entirely in VRAM for the model to run at full speed. Partially? Doesn't count. If even 10% spills to system RAM, performance craters. VRAM is binary: either the model fits, or it doesn't.
The VRAM formula
At full precision (FP16), the math is simple: each weight takes 2 bytes, so VRAM (GB) ≈ parameters (in billions) × 2, plus roughly 20% overhead for activations and framework buffers. A 7B model at FP16 therefore needs about 14 GB for weights alone, before overhead.
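That formula is easy to sanity-check in code. Here's a minimal sketch; the 20% overhead factor is an assumption, not a universal constant:

```python
def fp16_vram_gb(params_billions: float, overhead: float = 1.2) -> float:
    """Estimate VRAM needed to load a model's weights at FP16.

    FP16 stores each weight in 2 bytes. The overhead factor (~20%,
    an assumption) covers activations and framework buffers.
    """
    bytes_needed = params_billions * 1e9 * 2 * overhead
    return bytes_needed / 1024**3  # bytes -> GiB

# A 7B model at FP16: about 15.6 GiB with overhead included
print(round(fp16_vram_gb(7), 1))
```

That ~15.6 GB figure is why an 8 GB card can't touch a 7B model at full precision, and why quantization matters so much.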
Quantization: the VRAM multiplier
Quantization is how you cheat the formula. It reduces weight precision from 16-bit to smaller formats. Less precision means less VRAM. Here's the trade-off at each level:
| Quantization | Bits per weight | VRAM savings | Quality impact |
|---|---|---|---|
| FP16 | 16 | Baseline | Full quality; the reference standard |
| Q8_0 | 8 | ~50% | Nearly identical to FP16 |
| Q6_K | 6 | ~63% | Excellent; subtle differences on edge cases |
| Q5_K_M | 5 | ~69% | Very good; minor impact on complex reasoning |
| Q4_K_M | 4 | ~75% | Good; the sweet spot for most users |
| Q3_K_M | 3 | ~81% | Noticeable degradation on harder tasks |
| Q2_K | 2 | ~87% | Significant quality loss; last resort only |
Q4_K_M is the right answer for most people. I know that's a strong claim. But after testing dozens of models at every quantization level, Q4 consistently delivers output that's indistinguishable from Q8 for chat, coding, and general use. You only notice the difference on hard reasoning benchmarks.
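The table's savings percentages fall straight out of bits-per-weight arithmetic. This sketch extends the FP16 formula to any quantization level; real GGUF files mix precisions per tensor, so treat these as ballpark figures rather than exact file sizes:

```python
# Bits per weight for common quantization levels (from the table above)
BITS_PER_WEIGHT = {
    "FP16": 16, "Q8_0": 8, "Q6_K": 6, "Q5_K_M": 5,
    "Q4_K_M": 4, "Q3_K_M": 3, "Q2_K": 2,
}

def quantized_vram_gb(params_billions: float, quant: str = "Q4_K_M",
                      overhead: float = 1.2) -> float:
    """Rough weight-memory estimate at a given quantization level.

    The ~20% overhead factor is an assumption carried over from the
    FP16 formula; actual usage varies by runtime.
    """
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * 1e9 * (bits / 8) * overhead / 1024**3

# A 14B model: ~31 GiB at FP16 vs ~8 GiB at Q4_K_M
for q in ("FP16", "Q8_0", "Q4_K_M"):
    print(q, round(quantized_vram_gb(14, q), 1))
```

Run it for a 14B model and Q4_K_M lands around 8 GB, which is why the 12 to 16 GB cards below handle that class comfortably.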
VRAM requirements by model size
The tables below cover every model in our database, with VRAM requirements at the recommended quality level.
1 to 3B (tiny): needs 2 to 4 GB VRAM
| Model | Params | Min VRAM | Recommended VRAM |
|---|---|---|---|
| Llama 3.2 1B Instruct | 1.24B | 819 MB | 1.1 GB |
| Whisper Large V3 | 1.55B | 1.3 GB | 1.5 GB |
| Stable Diffusion 3 Medium | 2B | 5 GB | 5 GB |
| Llama 3.2 3B Instruct | 3.21B | 1.7 GB | 2.8 GB |
| Phi-3 Mini 3.8B Instruct | 3.82B | 2.6 GB | 3 GB |
| Phi-4 Mini | 3.82B | 2 GB | 3.3 GB |
7 to 8B (small): needs 4 to 10 GB VRAM
| Model | Params | Min VRAM | Recommended VRAM |
|---|---|---|---|
| Stable Diffusion XL 1.0 | 6.6B | 6.5 GB | 6.5 GB |
| Mistral 7B Instruct v0.3 | 7.24B | 3.6 GB | 5.3 GB |
| DeepSeek R1 Distill Qwen 7B | 7.62B | 4.4 GB | 6.6 GB |
| Qwen 2.5 7B Instruct | 7.62B | 3.9 GB | 5.5 GB |
| Qwen 2.5 Coder 7B Instruct | 7.62B | 4.4 GB | 6.6 GB |
| InternLM 2.5 7B Chat | 7.74B | 4.5 GB | 6.7 GB |
| Llama 3.1 8B Instruct | 8.03B | 4 GB | 6.7 GB |
| Stable Diffusion 3.5 Large | 8.1B | 9 GB | 12.5 GB |
| Gemma 2 9B Instruct | 9.24B | 4.6 GB | 6.6 GB |
12 to 14B (medium): needs 6 to 16 GB VRAM
| Model | Params | Min VRAM | Recommended VRAM |
|---|---|---|---|
| FLUX.1 Dev | 12B | 7.2 GB | 13 GB |
| Gemma 3 12B | 12.2B | 5.7 GB | 10.5 GB |
| LLaVA 1.6 13B | 13B | 6.2 GB | 9.1 GB |
| Phi-3 Medium 14B Instruct | 14B | 6.7 GB | 9.7 GB |
| Phi-4 14B | 14.7B | 6.8 GB | 12.6 GB |
| Qwen 2.5 14B Instruct | 14.77B | 6.9 GB | 12.7 GB |
| StarCoder 2 15B | 15.5B | 7.3 GB | 10.7 GB |
| DeepSeek Coder V2 Lite 16B | 15.7B | 7.4 GB | 10.9 GB |
22 to 34B (large): needs 12 to 24 GB VRAM
| Model | Params | Min VRAM | Recommended VRAM |
|---|---|---|---|
| Codestral 22B | 22.2B | 10.3 GB | 15.1 GB |
| Mistral Small 24B Instruct | 24B | 11.2 GB | 20.5 GB |
| Gemma 2 27B Instruct | 27.23B | 9.8 GB | 18.5 GB |
| Gemma 3 27B | 27.23B | 13.3 GB | 22.3 GB |
| DeepSeek R1 Distill Qwen 32B | 32.5B | 15.5 GB | 28 GB |
| Qwen 2.5 Coder 32B Instruct | 32.5B | 11.6 GB | 21.9 GB |
| QwQ 32B Preview | 32.5B | 11.6 GB | 21.9 GB |
| Code Llama 34B Instruct | 33.7B | 12 GB | 22.7 GB |
| Yi 1.5 34B Chat | 34.4B | 15.8 GB | 29.5 GB |
| Command R 35B | 35B | 16 GB | 30 GB |
70B+ (very large): needs about 40 GB VRAM at Q4 (see table)
| Model | Params | Min VRAM | Recommended VRAM |
|---|---|---|---|
| Llama 3.1 70B Instruct | 70.6B | 24.5 GB | 47 GB |
| Llama 3.3 70B Instruct | 70.6B | 33 GB | 61 GB |
| Qwen 2.5 72B Instruct | 72.7B | 25.3 GB | 40.5 GB |
| DeepSeek V3 | 671B | 115 GB | 360 GB |
Context length: the hidden VRAM cost
The tables above cover model weights. But when you actually use a model, it also needs VRAM for the KV cache: the memory storing your conversation context. Longer conversations eat more VRAM.
At 4K to 8K context (typical for interactive chat), the KV cache adds 0.5 to 2 GB. At 32K+ context, it can add 4 to 8 GB. That's why you might see "out of memory" errors during long conversations even when the model loaded fine initially.
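You can estimate the KV cache the same way as the weights. The formula is 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element. The architecture numbers in the example are illustrative assumptions for an 8B-class model with grouped-query attention; check a model's actual config for its real layer and head counts:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """Estimate KV-cache size for one sequence at FP16 (2 bytes/element)."""
    # Factor of 2: a key tensor and a value tensor per layer
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elems * bytes_per_elem / 1024**3

# Assumed 8B-class config: 32 layers, 8 KV heads, head_dim 128.
# At 8K context this works out to ~1 GiB; at 32K it's ~4 GiB.
print(round(kv_cache_gb(32, 8, 128, 8192), 2))
print(round(kv_cache_gb(32, 8, 128, 32768), 2))
```

Note how the cache scales linearly with context: quadruple the context window and you quadruple the cache, which is exactly the failure mode behind those mid-conversation out-of-memory errors.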
Your VRAM shopping list
Every GPU and Apple Silicon device in our database, sorted by VRAM. Match your model requirements from the tables above to a device below.
| Device | Type | VRAM | Price |
|---|---|---|---|
| RTX 4060 8GB | Discrete GPU | 8 GB | $289 |
| RTX 3080 10GB | Discrete GPU | 10 GB | $399 |
| RTX 3060 12GB | Discrete GPU | 12 GB | $269 |
| RTX 4070 Super | Discrete GPU | 12 GB | $599 |
| RTX 4070 Ti 12GB | Discrete GPU | 12 GB | $749 |
| RTX 4060 Ti 16GB | Discrete GPU | 16 GB | $449 |
| RTX 4070 Ti Super | Discrete GPU | 16 GB | $779 |
| RTX 4080 Super | Discrete GPU | 16 GB | $979 |
| RTX 5080 | Discrete GPU | 16 GB | $1,099 |
| M3 Pro (18GB Unified) | Apple Silicon | 18 GB | $1,799 |
| M4 Pro (24GB Unified) | Apple Silicon | 24 GB | $1,999 |
| RTX 3090 | Discrete GPU | 24 GB | $899 |
| RTX 4090 | Discrete GPU | 24 GB | $1,799 |
| RTX 5090 | Discrete GPU | 32 GB | $2,199 |
| M4 Max (36GB Unified) | Apple Silicon | 36 GB | $2,999 |
| M4 Pro (48GB) | Apple Silicon | 48 GB | $2,499 |
| M4 Max (64GB Unified) | Apple Silicon | 64 GB | $3,499 |
| M4 Max (128GB Unified) | Apple Silicon | 128 GB | $4,499 |
How much you actually need
Here's the practical breakdown. I'll be direct.
- 8 GB: The bare minimum. Runs 7B models at Q4. You'll outgrow it quickly. We don't recommend it for new buyers.
- 12 to 16 GB: The sweet spot for most users. Runs 7 to 14B models comfortably. Some 34B models at aggressive quantization. This is where we tell most people to start.
- 24 GB: The enthusiast standard for 34B and below at strong quants. For 70B-class models in our data, 24 GB usually means Q3 and/or partial offload, not full Q4 in VRAM. RTX 4090 and RTX 3090 live here.
- 32 GB (discrete): The RTX 5090 — the largest GeForce VRAM we catalog. Our model data still lists 70B Q4 around 40 GB, so the matrix treats 70B Q4 here as offload-heavy, not fully in VRAM.
- 36 to 128 GB (Apple unified): M4 Max configs; 64 GB and up is where 70B Q4 gets comfortable with headroom. Check each device and model page — unified memory is shared with the system.
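Matching a model to a card is just a filter-and-sort over the shopping list. Here's a small sketch using a handful of entries from the table above (prices in USD, as listed there):

```python
# A few entries from the shopping list above: (name, VRAM in GB, price USD)
DEVICES = [
    ("RTX 4060 8GB", 8, 289),
    ("RTX 3060 12GB", 12, 269),
    ("RTX 4060 Ti 16GB", 16, 449),
    ("RTX 3090", 24, 899),
    ("RTX 5090", 32, 2199),
]

def cheapest_fit(required_vram_gb: float):
    """Return the cheapest device whose VRAM covers the requirement,
    or None if nothing in the list is big enough."""
    fits = [d for d in DEVICES if d[1] >= required_vram_gb]
    return min(fits, key=lambda d: d[2]) if fits else None

# A 14B model at Q4_K_M recommends ~12.7 GB, so a 16 GB card is the floor
print(cheapest_fit(12.7))
```

The same logic applies at any scale: take the recommended VRAM from the model tables, not the minimum, and buy the cheapest card that clears it with a little headroom for the KV cache.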
