Llama · Llama 3.2 Community License
The smallest Llama model. Runs on integrated GPUs and even CPUs. Useful for basic classification, simple Q&A, and as a draft model for speculative decoding. Limited reasoning capability.
Llama 3.2 1B Instruct (1.24B) requires 1.1 GB VRAM at recommended quality (Q6_K). At efficient quality (Q4_K_M), it fits in 819 MB VRAM, making it compatible with the NVIDIA GeForce RTX 3060 12GB. On NVIDIA GeForce RTX 5090, expect approximately 300 tok/s at Q8_0. For the best experience, Starter AI Desktop ($582) is recommended.
— OwnRig methodology, data updated 2026-03-15
VRAM and file size for Llama 3.2 1B Instruct at each quantization level.
| Quality | Quantization | VRAM | File Size |
|---|---|---|---|
| full | Q8_0 | 1.5 GB | 1.2 GB |
| recommended | Q6_K | 1.1 GB | 0.9 GB |
| recommended | Q5_K_M | 1 GB | 0.8 GB |
| efficient | Q4_K_M | 819 MB | 0.7 GB |
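The file sizes above can be sanity-checked with a back-of-the-envelope formula: size ≈ parameter count × bits per weight ÷ 8. The bits-per-weight averages below are assumed typical values for each GGUF scheme, not exact accounting, and sizes are reported in binary GiB, which appears to be what the table uses.

```python
# Rough sanity check on the quantization table: GGUF file size is
# approximately parameters * bits-per-weight / 8. The bpw values are
# approximate averages for each scheme (assumption, not exact GGUF math).
PARAMS = 1.24e9  # Llama 3.2 1B Instruct

BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8}

def est_file_gib(quant):
    return PARAMS * BITS_PER_WEIGHT[quant] / 8 / 2**30

for q in BITS_PER_WEIGHT:
    print(f"{q}: ~{est_file_gib(q):.2f} GiB")
# Q8_0 ~1.23, Q6_K ~0.95, Q5_K_M ~0.82, Q4_K_M ~0.69, close to the table
```

VRAM use runs a bit higher than the file size because of activations, compute buffers, and the KV cache covered next.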
KV cache VRAM at Q6_K quality; longer context windows require more memory.
| Context | KV Cache | Total VRAM |
|---|---|---|
| 2K | 0 MB | 1.1 GB |
| 4K | 102 MB | 1.2 GB |
| 8K | 102 MB | 1.2 GB |
| 16K | 307 MB | 1.4 GB |
| 32K | 512 MB | 1.6 GB |
| 64K | 1 GB | 2.1 GB |
| 128K | 2 GB | 3.1 GB |
Performance data for Llama 3.2 1B Instruct across different hardware.
| Device | Quantization | Speed | Rating | Fits in VRAM |
|---|---|---|---|---|
| NVIDIA GeForce RTX 3060 12GB | Q8_0 | 140 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 4060 Ti 16GB | Q8_0 | 120 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 4070 Ti Super | Q8_0 | 190 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 4070 Super | Q8_0 | 170 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 4080 Super | Q8_0 | 200 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 4090 | Q8_0 | 250 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 3090 | Q8_0 | 220 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 5080 | Q8_0 | 230 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 5090 | Q8_0 | 300 tok/s | Excellent | ✓ |
| Apple M4 Pro (24GB Unified) | Q8_0 | 90 tok/s | Excellent | ✓ |
| Apple M4 Pro (48GB) | Q8_0 | 90 tok/s | Excellent | ✓ |
| Apple M4 Max (36GB Unified) | Q8_0 | 150 tok/s | Excellent | ✓ |
| Apple M4 Max (64GB Unified) | Q8_0 | 150 tok/s | Excellent | ✓ |
| Apple M4 Max (128GB Unified) | Q8_0 | 150 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 4060 8GB | Q8_0 | 95 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 4070 Ti 12GB | Q8_0 | 140 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 3080 10GB | Q8_0 | 180 tok/s | Excellent | ✓ |
| Apple M3 Pro (18GB Unified) | Q8_0 | 45 tok/s | Good | ✓ |
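The speeds above are consistent with single-stream decoding being roughly memory-bandwidth bound: each generated token streams approximately the whole weight file from VRAM, so bandwidth ÷ model size gives an upper bound. The bandwidth figure below is an approximate published spec, and the ceiling is optimistic; real throughput sits well below it, especially for small models where per-token launch overhead dominates.

```python
# Rough decode-speed ceiling: tok/s <= memory bandwidth / model size,
# since each token (approximately) reads the full weight file from VRAM.
# 1008 GB/s is the approximate RTX 4090 spec bandwidth (assumption).
def decode_ceiling_tok_s(bandwidth_gb_s, model_size_gb):
    return bandwidth_gb_s / model_size_gb

print(round(decode_ceiling_tok_s(1008, 1.2)))  # 840, vs ~250 measured at Q8_0
```

The table's 250 tok/s on the RTX 4090 is roughly 30% of that ceiling, a plausible efficiency once overhead is accounted for.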
Llama 3.2 1B Instruct is commonly used with Ollama and LM Studio. For an AI coding workflow, pair it with an embedding model such as nomic-embed-text for local RAG.
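The retrieval half of that pairing is a cosine-similarity ranking over chunk embeddings. In a real setup the vectors would come from nomic-embed-text (for example via Ollama's embeddings endpoint) and the top chunks would be fed into the 1B model's prompt; the hard-coded toy vectors and chunk texts here are placeholders, not real embeddings.

```python
# Minimal local-RAG retrieval step: rank document chunks by cosine
# similarity to the query embedding. Toy 3-dim vectors stand in for real
# nomic-embed-text embeddings (which are much higher-dimensional).
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, chunks, k=2):
    """chunks: list of (text, embedding) pairs; returns the k closest texts."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

chunks = [
    ("KV cache sizing", [0.9, 0.1, 0.0]),
    ("quantization levels", [0.1, 0.9, 0.1]),
    ("PC build guide", [0.0, 0.2, 0.9]),
]
print(top_k([0.8, 0.2, 0.0], chunks, k=1))  # ['KV cache sizing']
```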
Complete PC builds that can run Llama 3.2 1B Instruct.
Data confidence: estimated. Last updated: 2026-03-15.