Qwen · Qwen License
Top-tier 72B model, competitive with Llama 3.1 70B. Excellent at coding and structured output. Requires 48GB+ VRAM for usable quantizations.
Qwen 2.5 72B Instruct (72.7B) requires 40.5 GB VRAM at recommended quality (Q4_K_M). On Apple M4 Max (64GB Unified), expect approximately 6 tok/s at Q3_K_M. For the best experience, High-End Home AI Server ($3,842) is recommended.
— OwnRig methodology, data updated 2026-03-01
| Quality | Quantization | VRAM | File Size |
|---|---|---|---|
| efficient | Q4_K_M | 40.5 GB | 36.4 GB |
| compressed | Q3_K_M | 32.5 GB | 28.4 GB |
| compressed | Q2_K | 25.3 GB | 21.8 GB |
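The file sizes above follow directly from parameter count and bits per weight. A minimal sketch, assuming decimal GB and treating the bits-per-weight figure as a rough effective value for each GGUF quant (the exact value varies by quant mix and model):

```python
def gguf_file_size_gb(n_params_b: float, bits_per_weight: float) -> float:
    """Approximate GGUF file size in decimal GB for a dense model.

    n_params_b: parameter count in billions (e.g. 72.7)
    bits_per_weight: effective bits per weight for the quant (assumption)
    """
    return n_params_b * bits_per_weight / 8


def vram_estimate_gb(file_size_gb: float, overhead_gb: float = 4.0) -> float:
    """Rough VRAM need: weights plus a fixed runtime overhead.

    The ~4 GB overhead is an assumption inferred from the table above
    (VRAM column minus file-size column), not a llama.cpp guarantee.
    """
    return file_size_gb + overhead_gb


# At an effective ~4.0 bits/weight, 72.7B parameters land near the
# table's Q4_K_M file size.
size = gguf_file_size_gb(72.7, 4.0)
```

Effective bits per weight for K-quants differs from the nominal number in the name (Q4_K_M stores some tensors at higher precision), so treat these as ballpark figures.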
KV cache VRAM on top of the Q4_K_M weights. Longer context = more memory.
| Context | KV Cache | Total VRAM |
|---|---|---|
| 2K | 717 MB | 41.2 GB |
| 4K | 1.3 GB | 41.8 GB |
| 8K | 2.6 GB | 43.1 GB |
| 16K | 5.3 GB | 45.8 GB |
| 32K | 10.6 GB | 51.1 GB |
| 64K | 21.1 GB | 61.6 GB |
| 128K | 42.2 GB | 82.7 GB |

Every configuration exceeds a single 24 GB consumer GPU.
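The KV cache grows linearly with context because each token stores one key and one value vector per layer per KV head. A sketch using Qwen 2.5 72B's published attention config (80 layers, 8 KV heads via GQA, head dim 128) and assuming an fp16 cache; the table's figures run slightly higher, presumably due to allocator overhead:

```python
def kv_cache_gb(
    ctx_tokens: int,
    n_layers: int = 80,      # Qwen 2.5 72B
    n_kv_heads: int = 8,     # grouped-query attention
    head_dim: int = 128,
    bytes_per_elem: int = 2, # fp16 cache (assumption)
) -> float:
    """Approximate KV cache size in decimal GB: K and V per layer per KV head."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return ctx_tokens * bytes_per_token / 1e9


# 2K context → roughly 0.67 GB at fp16, in the same range as the table's 717 MB.
small = kv_cache_gb(2048)
```

Doubling the context doubles the cache, which is why the 128K row is dominated by KV memory rather than weights.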
Performance data for Qwen 2.5 72B Instruct across different hardware.
| Device | Quantization | Speed | Rating | Fits in VRAM |
|---|---|---|---|---|
| Apple M4 Max (64GB Unified) | Q3_K_M | 6 tok/s | Acceptable | ✓ |
| Apple M4 Max (128GB Unified) | Q4_K_M | 6 tok/s | Acceptable | ✓ |
| NVIDIA GeForce RTX 4060 8GB | Q2_K | — | Not Viable | ✗ (offload) |
| NVIDIA GeForce RTX 4070 Ti 12GB | Q2_K | — | Not Viable | ✗ (offload) |
| NVIDIA GeForce RTX 3080 10GB | Q2_K | — | Not Viable | ✗ (offload) |
| Apple M3 Pro (18GB Unified) | Q2_K | — | Not Viable | ✗ (offload) |
Data confidence: verified. Last updated: 2026-03-01.