
NVIDIA GeForce RTX 4060 · 8 GB GDDR6 · 272 GB/s · ~$289 · Updated 2026-03-15
The NVIDIA GeForce RTX 4060 with 8 GB of GDDR6 VRAM was evaluated against 42 AI models spanning embedding, chat, coding, image generation, and transcription workloads; 26 of them are viable on this card. Best result: all-MiniLM-L6-v2 at 8,500 tok/s (excellent). Current price: approximately $289.
— OwnRig methodology, data updated 2026-03-15
8 GB of VRAM is insufficient for most AI coding workflows; only 7B-class coder models and the MoE DeepSeek Coder V2 Lite fit.
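The viability calls in the table below follow from simple arithmetic: quantized weight size ≈ parameters × bits-per-weight ÷ 8, plus runtime overhead, compared against the card's 8 GB. A minimal sketch of that check (the bits-per-weight values and the 1.5 GB overhead are illustrative assumptions, not OwnRig's measured figures):

```python
# Rough VRAM check for a quantized model on an 8 GB card.
# Weight size = params (billions) * bits-per-weight / 8 -> GB,
# plus a fixed allowance for KV cache and CUDA runtime.
# Effective bits-per-weight below are assumed averages, not exact.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_K_M": 4.8,
    "Q5_K_M": 5.7, "Q8_0": 8.5, "FP16": 16.0,
}
OVERHEAD_GB = 1.5  # assumed KV cache + runtime overhead

def weights_gb(params_b: float, quant: str) -> float:
    """Size of the quantized weights alone, in GB."""
    return params_b * BITS_PER_WEIGHT[quant] / 8

def fits(params_b: float, quant: str, vram_gb: float = 8.0) -> bool:
    """True if weights plus overhead fit in the given VRAM budget."""
    return weights_gb(params_b, quant) + OVERHEAD_GB <= vram_gb

# Codestral 22B at Q3_K_M: ~10.7 GB of weights alone -> not viable
print(fits(22.0, "Q3_K_M"))   # False
# Llama 3.1 8B at Q4_K_M: ~4.8 GB of weights -> fits with headroom
print(fits(8.03, "Q4_K_M"))   # True
```

The same estimate explains why 14B-class models only squeak in at Q3_K_M and why everything 22B and up is marked Not Viable.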
| Model | Quant | Speed | Rating | Notes |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | FP16 | 8500 tok/s | Excellent | Tiny 23M param embedding model. Negligible VRAM (~0.25GB). Can run concurrently with any other model. |
| Codestral 22B | Q3_K_M | — | Not Viable | 22B model. Minimum Q3_K_M needs 10.3GB. 8GB insufficient. |
| Code Llama 34B Instruct | Q2_K | — | Not Viable | 33.7B params. Minimum quantizations still exceed 8GB. |
| Command R 35B | Q2_K | — | Not Viable | 35B class model. Not viable on 8GB. |
| DeepSeek Coder V2 Lite 16B | Q3_K_M | 45 tok/s | Good | MoE architecture makes this 15.7B total model run like a small model. Excellent coding performance for 8GB. |
| DeepSeek R1 Distill Qwen 32B | Q2_K | — | Not Viable | 32B class. Not viable on 8GB. |
| DeepSeek R1 Distill Qwen 7B | Q4_K_M | 32 tok/s | Good | 7.62B reasoning distillate. Q4_K_M fits with ~3GB headroom. 272 GB/s limits speed. |
| DeepSeek V3 | Q2_K | — | Not Viable | 671B parameter model. Cloud/inference API only. |
| FLUX.1 Dev | Q4_K_M | — | Marginal | Q4_K_M fits with ~0.8GB headroom. Slow image generation. FP16 and Q8_0 don't fit. |
| Gemma 2 27B Instruct | Q3_K_M | — | Not Viable | 27B model. Minimum Q3_K_M needs 13.3GB. |
| Gemma 2 9B Instruct | Q4_K_M | 28 tok/s | Good | Q4_K_M 5.6GB fits with ~2.4GB headroom. 9.24B model, 8GB is tight. |
| Gemma 3 4B | Q5_K_M | 55 tok/s | Excellent | 4.3B model. Q5_K_M 3.2GB fits easily. Fast and capable for 8GB. |
| Gemma 3 12B | Q3_K_M | 18 tok/s | Marginal | 12.2B model. Q3_K_M barely fits. Marginal quality but usable. |
| Gemma 3 27B | Q3_K_M | — | Not Viable | 27B model. Not viable on 8GB. |
| InternLM 2.5 7B Chat | Q4_K_M | 30 tok/s | Good | 7.74B model. Q4_K_M fits. Similar to other 7B models on 8GB. |
| Llama 3.1 70B Instruct | Q2_K | — | Not Viable | 70B class. Not viable on 8GB. |
| Llama 3.1 8B Instruct | Q4_K_M | 32 tok/s | Good | Q4_K_M 4.9GB fits with ~3GB headroom. 272 GB/s limits speed. Best 8B for 8GB. |
| Llama 3.2 1B Instruct | Q8_0 | 95 tok/s | Excellent | 1.24B model. Q8_0 1.5GB fits easily. Near-instant for 8GB. |
| Llama 3.2 3B Instruct | Q8_0 | 65 tok/s | Excellent | 3.21B model. Q8_0 3.7GB fits with ~4.3GB headroom. Excellent for 8GB. |
| Llama 3.3 70B Instruct | Q2_K | — | Not Viable | 70B class. Not viable on 8GB. |
| LLaVA 1.6 13B | Q3_K_M | 22 tok/s | Marginal | 13B multimodal. Q3_K_M barely fits with ~1.8GB headroom. Image processing adds overhead. |
| Mistral 7B Instruct v0.3 | Q4_K_M | 31 tok/s | Good | Q4_K_M 4.5GB fits with ~3.5GB headroom. 7.24B model. |
| Mistral Small 24B Instruct | Q3_K_M | — | Not Viable | 24B class. Q3_K_M ~11GB. Not viable. |
| Mixtral 8x7B Instruct | Q4_K_M | — | Not Viable | 46.7B total params. Not viable on 8GB. |
| nomic-embed-text v1.5 | Q8_0 | 4200 tok/s | Excellent | 137M embedding model. Q8_0 0.4GB. Can run concurrently with any LLM. |
| Phi-3 Medium 14B Instruct | Q3_K_M | 20 tok/s | Marginal | 14B model. Q3_K_M barely fits. Quality loss at Q3. |
| Phi-3 Mini 3.8B Instruct | Q5_K_M | 52 tok/s | Excellent | 3.82B model. Q5_K_M 3.0GB fits easily. Excellent for 8GB. |
| Phi-4 14B | Q3_K_M | 19 tok/s | Marginal | 14.66B model. Q3_K_M barely fits. Aggressive quantization. |
| Phi-4 Mini | Q5_K_M | 55 tok/s | Excellent | 3.82B model. Q5_K_M 2.8GB fits easily. Excellent for 8GB. |
| Qwen 2.5 14B Instruct | Q3_K_M | 17 tok/s | Marginal | 14.77B model. Q3_K_M barely fits. Tight. |
| Qwen 2.5 7B Instruct | Q4_K_M | 30 tok/s | Good | 7.62B model. Q4_K_M fits. Similar to Llama 8B on 8GB. |
| Qwen 2.5 72B Instruct | Q2_K | — | Not Viable | 72B class. Not viable on 8GB. |
| Qwen 2.5 Coder 32B Instruct | Q2_K | — | Not Viable | 32B coding model. Not viable on 8GB. |
| Qwen 2.5 Coder 7B Instruct | Q4_K_M | 31 tok/s | Good | 7.62B coding model. Q4_K_M fits. Good for coding on 8GB. |
| QwQ 32B Preview | Q2_K | — | Not Viable | 32B model. Not viable on 8GB. |
| Stable Diffusion 3 Medium | FP16 | — | Good | ~12-18 seconds per image. 5GB VRAM fits with headroom. |
| Stable Diffusion 3.5 Large | Q8_0 | — | Not Viable | Q8_0 needs 9GB, FP16 needs 12.5GB. 8GB insufficient. |
| Stable Diffusion XL 1.0 | FP16 | — | Good | ~15-25 seconds per 1024x1024 image at 30 steps. 6.5GB VRAM. Tight but fits. |
| StarCoder 2 15B | Q3_K_M | 16 tok/s | Marginal | 15.73B coding model. Q3_K_M at limit. Marginal. |
| Whisper Large V3 | Q5_K_M | — | Excellent | 1.55B transcription model. Q5_K_M 1.5GB. Real-time transcription on 8GB. |
| Whisper Large V3 Turbo | FP16 | — | Excellent | 8x faster than full Whisper. 1.6GB VRAM. Real-time on any GPU. |
| Yi 1.5 34B Chat | Q2_K | — | Not Viable | 34B model. Not viable on 8GB. |
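The tok/s figures above track memory bandwidth rather than compute: during decoding, each generated token must stream the full weight tensor from VRAM, so throughput is capped near bandwidth ÷ model size. A rough sketch of that ceiling (the 4.9 GB figure comes from the Llama 3.1 8B row; real throughput lands well below the ceiling due to KV-cache reads and kernel overhead):

```python
# Theoretical decode-speed ceiling for a memory-bandwidth-bound GPU:
# every token reads the whole quantized weight tensor from VRAM once,
# so tok/s <= memory bandwidth / weight size.
BANDWIDTH_GBPS = 272.0  # RTX 4060 memory bandwidth

def ceiling_tok_s(model_weights_gb: float) -> float:
    """Upper bound on tokens per second for the given weight size."""
    return BANDWIDTH_GBPS / model_weights_gb

# Llama 3.1 8B Instruct at Q4_K_M (~4.9 GB per the table):
print(round(ceiling_tok_s(4.9)))  # 56 tok/s ceiling vs. 32 tok/s measured
```

This is why the notes repeatedly cite "272 GB/s limits speed": a faster 8 GB card would not fit larger models, but it would generate tokens faster from the ones that do fit.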
Prices and availability vary. Inspect hardware before purchasing.
Generation: Ada Lovelace. Last updated: 2026-03-15.