Mixtral · Apache 2.0
Mixture of Experts (MoE) model: 46.7B total parameters, but only ~12.9B are active per token, giving an excellent quality-to-speed ratio. Despite the large total parameter count, inference speed is closer to that of a 13B dense model; the full weight set still has to fit in VRAM, though.
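To see where the ~12.9B active figure comes from, here is a back-of-envelope sketch. Mixtral routes each token through 2 of 8 expert FFNs per layer; the shared-versus-expert parameter split used below is a rough assumption, not the published breakdown, but it lands close to the quoted number.

```python
# Back-of-envelope: why a 46.7B MoE decodes like a ~13B dense model.
# Each token touches the shared weights plus 2 of 8 expert FFNs per
# layer. The shared/expert split below is a rough assumption.

total_params = 46.7e9
num_experts = 8
experts_per_token = 2

shared_params = 1.9e9  # assumed: attention, embeddings, norms, router
expert_params = (total_params - shared_params) / num_experts  # per expert

active = shared_params + experts_per_token * expert_params
print(f"active params per token: ~{active / 1e9:.1f}B")  # ~13.1B
```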
Mixtral 8x7B Instruct (46.7B) requires 31.4 GB of VRAM at the recommended quality (Q5_K_M). On an NVIDIA GeForce RTX 4090, expect approximately 35 tok/s at Q3_K_M. For the best experience, the High-End Home AI Server ($3,842) build is recommended.
— OwnRig methodology, data updated 2026-03-01
VRAM and file size by quantization level:
| Quality | Quantization | VRAM | File Size |
|---|---|---|---|
| recommended | Q5_K_M | 31.4 GB | 28.0 GB |
| efficient | Q4_K_M | 26.2 GB | 23.4 GB |
| compressed | Q3_K_M | 21.0 GB | 18.2 GB |
| compressed | Q2_K | 16.4 GB | 14.0 GB |
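Each file size implies an effective bits-per-weight for that quantization level. A minimal sanity-check sketch using the table's numbers (read as decimal GB; if they are actually GiB, the results come out roughly 7% higher):

```python
# Effective bits-per-weight implied by each GGUF file size in the
# table above.

PARAMS = 46.7e9  # Mixtral 8x7B total parameter count

FILE_SIZES_GB = {"Q5_K_M": 28.0, "Q4_K_M": 23.4, "Q3_K_M": 18.2, "Q2_K": 14.0}

for quant, size_gb in FILE_SIZES_GB.items():
    bpw = size_gb * 1e9 * 8 / PARAMS
    print(f"{quant}: ~{bpw:.1f} bits/weight")
```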
KV cache VRAM usage at Q5_K_M quality. Longer context windows require more memory; see the formula sketch after the table.
| Context | KV Cache | Total VRAM |
|---|---|---|
| 2K | 410 MB | 31.8 GB (exceeds 24 GB) |
| 4K | 819 MB | 32.2 GB (exceeds 24 GB) |
| 8K | 1.5 GB | 32.9 GB (exceeds 24 GB) |
| 16K | 3.1 GB | 34.5 GB (exceeds 24 GB) |
| 32K | 6.1 GB | 37.5 GB (exceeds 24 GB) |
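The KV cache grows linearly with context length: each token stores one key and one value vector per layer. A minimal sketch using Mixtral's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128); the fp16 cache is our assumption. The table's figures run somewhat higher than this raw formula, and runtimes that quantize the KV cache will land lower.

```python
# Generic KV-cache size formula. Model config values are Mixtral's
# published architecture; the fp16 cache precision is an assumption.

n_layers = 32
n_kv_heads = 8      # grouped-query attention
head_dim = 128
bytes_per_elem = 2  # fp16

def kv_cache_bytes(context_len: int) -> int:
    # 2x for keys and values
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

for ctx in (2048, 4096, 8192, 16384, 32768):
    print(f"{ctx // 1024}K context: {kv_cache_bytes(ctx) / 2**20:.0f} MiB")
```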
Performance data for Mixtral 8x7B Instruct across different hardware.
| Device | Quantization | Speed | Rating | Fits in VRAM |
|---|---|---|---|---|
| NVIDIA GeForce RTX 4090 | Q3_K_M | 35 tok/s | Good | ✓ |
| Apple M4 Max (36GB Unified) | Q4_K_M | 20 tok/s | Good | ✓ |
| Apple M4 Max (64GB Unified) | Q5_K_M | 18 tok/s | Good | ✓ |
| NVIDIA GeForce RTX 4060 8GB | Q4_K_M | — | Not Viable | ✗ (requires offload) |
| NVIDIA GeForce RTX 4070 Ti 12GB | Q4_K_M | — | Not Viable | ✗ (requires offload) |
| NVIDIA GeForce RTX 3080 10GB | Q2_K | — | Not Viable | ✗ (requires offload) |
| Apple M3 Pro (18GB Unified) | Q2_K | — | Not Viable | ✗ (requires offload) |
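Decode speed on these devices is largely memory-bandwidth bound: every generated token must stream the active weights from VRAM. A roofline sketch gives an upper bound; the bandwidth and bits-per-weight figures below are assumptions for illustration, and real throughput (e.g. the 35 tok/s measured on the RTX 4090) sits well under the bound.

```python
# Bandwidth roofline for decode speed. The bandwidth figure (RTX 4090
# spec sheet) and the effective bits-per-weight are assumptions.

active_params = 12.9e9
bits_per_weight = 3.9   # assumed ~Q3_K_M average
bandwidth_gb_s = 1008   # RTX 4090 memory bandwidth, GB/s

bytes_per_token = active_params * bits_per_weight / 8
roofline_tok_s = bandwidth_gb_s * 1e9 / bytes_per_token
print(f"theoretical upper bound: ~{roofline_tok_s:.0f} tok/s")
# Measured ~35 tok/s is far lower: routing, kernel, and KV-read costs.
```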
Complete PC builds that can run Mixtral 8x7B Instruct:

- 2x NVIDIA GeForce RTX 3090 (Used) · 128GB DDR5-5600 (4x32GB)
- NVIDIA GeForce RTX 4090 · 64GB DDR5-6000 (2x32GB)
- 2x NVIDIA GeForce RTX 3090 24GB (Used) + NVLink Bridge · 128GB DDR5-5600 (4x32GB)
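A quick way to vet a build like these: add quantized weight VRAM and KV cache, then compare against total VRAM. This is a simplified sketch; the headroom constant is an assumption, and multi-GPU splits duplicate some per-card buffers that a naive sum ignores.

```python
# Simplified fit check: weights + KV cache + headroom vs. total VRAM.
# The 1.5 GB headroom is an assumption.

def fits(weights_gb: float, kv_gb: float, vram_gb: float,
         headroom_gb: float = 1.5) -> bool:
    return weights_gb + kv_gb + headroom_gb <= vram_gb

# 2x RTX 3090 = 48 GB total; Q5_K_M weights need 31.4 GB, 8K KV ~1.5 GB
print(fits(weights_gb=31.4, kv_gb=1.5, vram_gb=48.0))  # True
# A single 24 GB card at the same settings:
print(fits(weights_gb=31.4, kv_gb=1.5, vram_gb=24.0))  # False
```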
Data confidence: verified. Last updated: 2026-03-01.