Llama · Llama 3.2 Community License
Ultra-lightweight model that runs on virtually any GPU. Surprisingly capable for its size: good at summarization, simple coding tasks, and quick chat. The default choice when speed matters more than depth.
Llama 3.2 3B Instruct (3.21B parameters) requires 2.8 GB of VRAM at the recommended quality (Q6_K). At efficient quality (Q4_K_M), it fits in 2.1 GB of VRAM, making it compatible with the NVIDIA GeForce RTX 3060 12GB. On an NVIDIA GeForce RTX 5090, expect approximately 200 tok/s at Q8_0. For the best experience, the Starter AI Desktop ($582) is recommended.
— OwnRig methodology, data updated 2026-03-15
| Quality | Quantization | VRAM | File Size |
|---|---|---|---|
| full | Q8_0 | 3.7 GB | 3.2 GB |
| recommended | Q6_K | 2.8 GB | 2.4 GB |
| recommended | Q5_K_M | 2.5 GB | 2.1 GB |
| efficient | Q4_K_M | 2.1 GB | 1.8 GB |
| compressed | Q3_K_M | 1.7 GB | 1.4 GB |
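The file sizes above follow directly from each quantization's bits per weight. A minimal sketch of the arithmetic (the bits-per-weight figures are approximate llama.cpp values, and the 10% VRAM overhead factor is an assumption, not a measured number):

```python
# Rough GGUF file-size / weights-VRAM estimator.
# Bits-per-weight values are approximate llama.cpp figures;
# the 10% overhead factor is an assumption, not a benchmark.
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q6_K": 6.56,
    "Q5_K_M": 5.69,
    "Q4_K_M": 4.85,
    "Q3_K_M": 3.91,
}

def file_size_gib(params: float, quant: str) -> float:
    """Approximate GGUF file size in GiB for a given parameter count."""
    return params * BITS_PER_WEIGHT[quant] / 8 / 2**30

def weights_vram_gib(params: float, quant: str, overhead: float = 1.10) -> float:
    """Weights-only VRAM estimate; KV cache and activations are extra."""
    return file_size_gib(params, quant) * overhead

# Llama 3.2 3B Instruct has 3.21B parameters.
print(f"Q6_K file:   {file_size_gib(3.21e9, 'Q6_K'):.2f} GiB")
print(f"Q4_K_M file: {file_size_gib(3.21e9, 'Q4_K_M'):.2f} GiB")
```

The estimates land within about 0.1 GB of the table's file sizes, which is close enough for a go/no-go check against a given card's VRAM.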
KV cache VRAM at Q6_K quality. Longer contexts require more memory.
| Context | KV Cache | Total VRAM |
|---|---|---|
| 2K | 102 MB | 2.9 GB |
| 4K | 102 MB | 2.9 GB |
| 8K | 307 MB | 3.1 GB |
| 16K | 512 MB | 3.3 GB |
| 32K | 1 GB | 3.8 GB |
| 64K | 2 GB | 4.8 GB |
| 128K | 4.1 GB | 6.9 GB |
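The KV-cache rows scale roughly linearly with context length: the larger entries in the table work out to about 32 KiB per token for this model. A minimal sketch of that rule of thumb (the 32 KiB/token constant is read off the table above, not derived from the architecture, so treat it as an approximation):

```python
# KV-cache estimator. The ~32 KiB/token figure is inferred from the
# table above (e.g. a 32K context costs about 1 GB); it is an
# approximation, not an architectural constant.
KV_BYTES_PER_TOKEN = 32 * 1024

def kv_cache_gib(context_tokens: int) -> float:
    """Approximate KV-cache VRAM in GiB for a given context length."""
    return context_tokens * KV_BYTES_PER_TOKEN / 2**30

def total_vram_gib(context_tokens: int, weights_gib: float = 2.8) -> float:
    """Q6_K weights (2.8 GB, per the table above) plus KV cache."""
    return weights_gib + kv_cache_gib(context_tokens)

print(f"32K context:  {kv_cache_gib(32 * 1024):.1f} GiB KV cache")
print(f"128K context: {total_vram_gib(128 * 1024):.1f} GiB total")
```

This makes the tradeoff concrete: the model itself is tiny, but a full 128K context more than doubles the total VRAM footprint.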
Performance data for Llama 3.2 3B Instruct across different hardware.
| Device | Quantization | Speed | Rating | Fits in VRAM |
|---|---|---|---|---|
| NVIDIA GeForce RTX 3060 12GB | Q8_0 | 90 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 4060 Ti 16GB | Q8_0 | 75 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 4070 Ti Super | Q8_0 | 130 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 4070 Super | Q8_0 | 110 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 4080 Super | Q8_0 | 140 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 4090 | Q8_0 | 170 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 3090 | Q8_0 | 150 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 5080 | Q8_0 | 160 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 5090 | Q8_0 | 200 tok/s | Excellent | ✓ |
| Apple M4 Pro (24GB Unified) | Q8_0 | 60 tok/s | Excellent | ✓ |
| Apple M4 Pro (48GB) | Q8_0 | 60 tok/s | Excellent | ✓ |
| Apple M4 Max (36GB Unified) | Q8_0 | 100 tok/s | Excellent | ✓ |
| Apple M4 Max (64GB Unified) | Q8_0 | 100 tok/s | Excellent | ✓ |
| Apple M4 Max (128GB Unified) | Q8_0 | 100 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 4060 8GB | Q8_0 | 65 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 4070 Ti 12GB | Q8_0 | 95 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 3080 10GB | Q8_0 | 140 tok/s | Excellent | ✓ |
| Apple M3 Pro (18GB Unified) | Q8_0 | 35 tok/s | Good | ✓ |
Llama 3.2 3B Instruct is commonly used with Cursor, Continue, Ollama, LM Studio, and Open WebUI. For an AI coding workflow, pair it with an embedding model such as nomic-embed-text for local RAG.
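As a minimal sketch of driving the model through Ollama's local HTTP API (the `llama3.2:3b` tag and the default `http://localhost:11434` endpoint are Ollama's published conventions; this only constructs the request payload, so no running server is needed to inspect it):

```python
import json

# Build a request body for Ollama's /api/generate endpoint.
# "num_ctx" sets the context window, which drives KV-cache VRAM use.
OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def build_request(prompt: str, num_ctx: int = 8192) -> str:
    payload = {
        "model": "llama3.2:3b",   # Ollama's tag for Llama 3.2 3B Instruct
        "prompt": prompt,
        "stream": False,          # return one complete response
        "options": {"num_ctx": num_ctx},
    }
    return json.dumps(payload)

body = build_request("Summarize this changelog in three bullet points.")
print(body)
# POST `body` to OLLAMA_URL with Content-Type: application/json.
```

Keeping `num_ctx` modest (8K or less) is usually the right call for this model; per the KV-cache table above, large contexts cost more VRAM than the weights themselves.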
Complete PC builds that can run Llama 3.2 3B Instruct.
Data confidence: estimated. Last updated: 2026-03-15.