Llama · Llama 3.1 Community License
Best-in-class 8B model with strong general capabilities and excellent coding support. The go-to small model for local inference: fast, accurate, and well supported across all major inference engines.
Llama 3.1 8B Instruct (8.03B) requires 6.7 GB VRAM at recommended quality (Q6_K). At efficient quality (Q4_K_M), it fits in 4.9 GB VRAM, making it compatible with the NVIDIA GeForce RTX 3060 12GB. On NVIDIA GeForce RTX 5090, expect approximately 170 tok/s at Q8_0. For the best experience, Starter AI Desktop ($582) is recommended.
— OwnRig methodology, data updated 2026-03-01
| Quality | Quantization | VRAM | File Size |
|---|---|---|---|
| full | Q8_0 | 8.9 GB | 8 GB |
| recommended | Q6_K | 6.7 GB | 5.8 GB |
| recommended | Q5_K_M | 5.8 GB | 4.8 GB |
| efficient | Q4_K_M | 4.9 GB | 4 GB |
| compressed | Q3_K_M | 4 GB | 3.1 GB |
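The file sizes above follow directly from parameter count and the quantization's average bits per weight. A minimal sketch, assuming typical bits-per-weight averages for llama.cpp k-quants (approximate figures; real GGUF files vary by a few percent depending on tensor layout, and the table's measured sizes may round differently):

```python
# Approximate average bits-per-weight for common llama.cpp quantizations
# (assumed values; exact averages vary per model).
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q6_K": 6.56,
    "Q5_K_M": 5.5,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
}

def estimate_file_size_gb(n_params: float, quant: str) -> float:
    """Rough on-disk size in decimal GB: parameters x bits-per-weight / 8."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

# Llama 3.1 8B Instruct has 8.03B parameters.
for q in BITS_PER_WEIGHT:
    print(f"{q}: ~{estimate_file_size_gb(8.03e9, q):.1f} GB")
```

The VRAM column is larger than the file size because the weights are loaded alongside activation buffers and the KV cache (covered in the next table).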
KV cache VRAM overhead on top of the Q6_K model weights. Longer context windows require more memory.
| Context | KV Cache | Total VRAM |
|---|---|---|
| 2K | 102 MB | 6.8 GB |
| 4K | 307 MB | 7 GB |
| 8K | 512 MB | 7.2 GB |
| 16K | 1 GB | 7.7 GB |
| 32K | 2 GB | 8.7 GB |
| 64K | 4.1 GB | 10.8 GB |
| 128K | 8.2 GB | 14.9 GB |
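The KV cache grows linearly with context length: 2 tensors (K and V) per layer, each of size kv_heads × head_dim per token. Llama 3.1 8B uses grouped-query attention with 32 layers, 8 KV heads, and a head dimension of 128 (published architecture values). A sketch of the arithmetic; note the table's figures are roughly consistent with a quantized (about 8-bit) cache rather than FP16:

```python
def kv_cache_bytes(ctx_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: float = 2.0) -> int:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x context."""
    return int(2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem)

# FP16 cache at 8K context for Llama 3.1 8B defaults:
print(kv_cache_bytes(8192) / 2**20, "MiB")                      # → 1024.0 MiB
# Quantizing the cache to 8-bit halves that:
print(kv_cache_bytes(8192, bytes_per_elem=1.0) / 2**20, "MiB")  # → 512.0 MiB
```

This is why long-context use dominates memory at the high end: at 128K context the cache alone outweighs the quantized model file.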
Performance data for Llama 3.1 8B Instruct across different hardware.
| Device | Quantization | Speed | Rating | Fits in VRAM |
|---|---|---|---|---|
| NVIDIA GeForce RTX 3060 12GB | Q5_K_M | 35 tok/s | Good | ✓ |
| NVIDIA GeForce RTX 4060 Ti 16GB | Q8_0 | 55 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 4070 Ti Super | Q8_0 | 75 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 4090 | Q8_0 | 95 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 3090 | Q8_0 | 70 tok/s | Excellent | ✓ |
| Apple M4 Pro (24GB Unified) | Q8_0 | 32 tok/s | Good | ✓ |
| Apple M4 Max (36GB Unified) | Q8_0 | 55 tok/s | Excellent | ✓ |
| Apple M4 Max (64GB Unified) | Q8_0 | 55 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 5090 | Q8_0 | 170 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 5080 | Q8_0 | 92 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 4070 Super | Q5_K_M | 55 tok/s | Excellent | ✓ |
| Apple M4 Pro (48GB) | Q8_0 | 32 tok/s | Good | ✓ |
| NVIDIA GeForce RTX 4080 Super | Q8_0 | 82 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 4060 8GB | Q4_K_M | 32 tok/s | Good | ✓ |
| NVIDIA GeForce RTX 4070 Ti 12GB | Q5_K_M | 52 tok/s | Excellent | ✓ |
| NVIDIA GeForce RTX 3080 10GB | Q5_K_M | 50 tok/s | Excellent | ✓ |
| Apple M3 Pro (18GB Unified) | Q4_K_M | 15 tok/s | Acceptable | ✓ |
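The speeds above track memory bandwidth: single-stream decode is bandwidth-bound, since generating each token reads roughly the entire weight file. A back-of-the-envelope sketch, assuming an efficiency factor of about 0.8 (an assumption fitted to the table, not a measured constant) and the RTX 4090's roughly 1008 GB/s memory bandwidth:

```python
def decode_tok_per_s(bandwidth_gb_s: float, model_size_gb: float,
                     efficiency: float = 0.8) -> float:
    """Bandwidth-roofline decode estimate: each generated token streams
    (approximately) the whole weight file through memory."""
    return efficiency * bandwidth_gb_s / model_size_gb

# RTX 4090: ~1008 GB/s memory bandwidth; Q8_0 weights ~8.5 GB.
print(round(decode_tok_per_s(1008, 8.5)))  # → 95, in line with the table
```

The same formula explains why a smaller quantization (fewer bytes per token read) decodes faster on the same GPU.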
Llama 3.1 8B Instruct is commonly used with Cursor, Continue, Aider, Open WebUI, and LM Studio. For an AI coding workflow, pair it with an embedding model such as nomic-embed-text for local RAG.
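The retrieval half of that RAG pairing reduces to ranking document embeddings by cosine similarity against a query embedding. A minimal sketch: the toy 3-dimensional vectors below are placeholders for real embeddings, which in practice would come from nomic-embed-text through whatever client your stack provides.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, docs, k=1):
    """Return the k documents whose embeddings best match the query."""
    ranked = sorted(zip(docs, doc_vecs),
                    key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [doc for doc, _ in ranked[:k]]

# Placeholder vectors standing in for nomic-embed-text output:
docs = ["GPU specs", "cake recipe"]
vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]
query = [0.8, 0.2, 0.1]
print(top_k(query, vecs, docs))  # → ['GPU specs']
```

The retrieved passages are then prepended to the prompt sent to Llama 3.1 8B, which is the whole of the RAG loop.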
Complete PC builds that can run Llama 3.1 8B Instruct:

- NVIDIA GeForce RTX 4090 · 64GB DDR5-5600 (2x32GB)
- AMD Radeon RX 7900 XTX 24GB · 32GB DDR5-5600 (2x16GB)
- NVIDIA GeForce RTX 3060 12GB · 32GB DDR4-3200 (2x16GB)
- NVIDIA GeForce RTX 4060 Ti 16GB · 32GB DDR5-5200 (2x16GB)
- NVIDIA GeForce RTX 4070 Super 12GB · 32GB DDR5-5600 (2x16GB)
- NVIDIA GeForce RTX 4090 · 64GB DDR5-6000 (2x32GB)
- 2x NVIDIA GeForce RTX 3090 24GB (Used) + NVLink Bridge · 128GB DDR5-5600 (4x32GB)
- Apple M4 Max 128GB (Mac Studio)
- NVIDIA GeForce RTX 4060 Ti 16GB · 32GB DDR5-5600 (2x16GB)
- NVIDIA GeForce RTX 3090 24GB (Used) · 64GB DDR5-5600 (2x32GB)
- NVIDIA GeForce RTX 5090 32GB · 64GB DDR5-6000 (2x32GB)
- NVIDIA GeForce RTX 4060 Ti 16GB · 32GB DDR5-5600 (2x16GB)
- NVIDIA GeForce RTX 3060 12GB · 16GB DDR4-3200 (2x8GB)
Data confidence: verified. Last updated: 2026-03-01.