
24 GB · 1008 GB/s
$1,799
Updated 2026-03-01
The NVIDIA GeForce RTX 4090 with 24 GB of GDDR6X VRAM can handle 38 AI models across the chat, coding, and AI coding categories. Best performance: Llama 3.2 1B Instruct at 250 tok/s (Excellent). For AI coding workflows, it supports the Power AI Coding tier, running 32B coding models at good quality. Current price: approximately $1,799.
— OwnRig methodology, data updated 2026-03-01
Runs 32B coding models at good quality and can handle a coding model plus an embedding model concurrently.
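As a rough illustration of how that concurrent-model budget works out, here is a minimal Python sketch of the VRAM arithmetic. The GB figures mirror the table's estimates for Qwen 2.5 Coder 32B Q4_K_M plus nomic-embed-text; they are approximations, and real usage adds KV cache and framework overhead on top.

```python
# Rough VRAM budget check for running two models concurrently on a 24GB card.
# Sizes are the approximate figures from the table below, not measured values.

TOTAL_VRAM_GB = 24.0

resident_models = {
    "qwen2.5-coder-32b-q4_k_m": 18.5,   # coding model (~19GB total with embedder)
    "nomic-embed-text-v1.5-fp16": 0.5,  # embedding model for RAG/search
}

used_gb = sum(resident_models.values())
free_gb = TOTAL_VRAM_GB - used_gb
print(f"Used: {used_gb:.1f} GB, free: {free_gb:.1f} GB")

# Rule of thumb: keep at least ~1GB free for KV cache and CUDA overhead,
# or generation can slow sharply once memory spills to system RAM.
if free_gb < 1.0:
    print("Warning: insufficient headroom for concurrent inference.")
```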
| Model | Quant | Speed | Rating | Notes |
|---|---|---|---|---|
| Llama 3.1 8B Instruct | Q8_0 | 95 tok/s | Excellent | With nomic-embed running concurrently, speed drops ~5% to ~90 tok/s. Total VRAM: 9.3GB, leaving 14.7GB free. |
| Qwen 2.5 Coder 32B Instruct | Q4_K_M | 25 tok/s | Good | Running concurrently with nomic-embed-text (0.5GB): total ~19GB VRAM, leaving 5GB free. Speed impact negligible. |
| Llama 3.1 70B Instruct | Q3_K_M | 5 tok/s | Marginal | Q3 at 31.6GB exceeds 24GB VRAM and requires CPU offloading (see the offloading sketch after the table). Usable but slow. For 70B on a single GPU, you need 48GB+ VRAM. |
| DeepSeek Coder V2 Lite 16B | Q5_K_M | 55 tok/s | Excellent | Q5 at 10.9GB leaves 13GB free on 4090. Lightning fast for a coding model. MoE efficiency is impressive. |
| nomic-embed-text v1.5 | FP16 | — | Excellent | Running alongside Qwen 2.5 Coder 32B Q4: total ~19GB, leaving 5GB free on 4090. |
| QwQ 32B Preview | Q4_K_M | 24 tok/s | Good | Q4 at 18.4GB fits on 4090 with 5.6GB headroom. Reasoning responses are long (chain-of-thought) so tok/s matters. Good but not fast. |
| Stable Diffusion XL 1.0 | FP16 | — | Excellent | ~3-5 seconds per 1024x1024 image at 30 steps. Massive headroom for LoRA stacking and higher resolutions. |
| FLUX.1 Dev | FP16 | — | Excellent | Full FP16 FLUX at 23.8GB fits on 4090 with minimal headroom. ~10-15 seconds per image. Best local image gen quality. |
| Whisper Large V3 | FP16 | — | Excellent | At 3.1GB, runs alongside any coding model without meaningful VRAM impact. |
| Mistral 7B Instruct v0.3 | Q8_0 | 90 tok/s | Excellent | Near-instant responses at full Q8 quality. |
| Mixtral 8x7B Instruct | Q3_K_M | 35 tok/s | Good | Q3 at 21GB fits on 24GB VRAM with 3GB headroom. MoE sparsity means speed is good despite quantization level. For Q4 quality, need 36GB+ VRAM (M4 Max). |
| Phi-3 Mini 3.8B Instruct | Q8_0 | 130 tok/s | Excellent | Overkill GPU for this model — but useful as a fast secondary model alongside larger ones. |
| Gemma 2 27B Instruct | Q4_K_M | 22 tok/s | Good | Q4 at 15.5GB fits well on 4090 with 8.5GB headroom. Good quality-to-speed ratio. |
| Codestral 22B | Q5_K_M | 35 tok/s | Excellent | Q5 at 15.1GB on 4090. Fast and high quality. 8.9GB headroom for embeddings. |
| all-MiniLM-L6-v2 | FP16 | — | Excellent | Can run alongside even the largest 32B models with zero meaningful VRAM impact. |
| Gemma 2 9B Instruct | Q8_0 | 80 tok/s | Excellent | Full Q8 quality. Very fast on 4090. |
| Qwen 2.5 7B Instruct | Q8_0 | 88 tok/s | Excellent | Full Q8 quality. 128K context support gives it an edge over Gemma 2 9B. |
| Phi-3 Medium 14B Instruct | Q8_0 | 55 tok/s | Excellent | Full Q8 at 15.2GB. Excellent for structured output and reasoning tasks. |
| StarCoder 2 15B | Q8_0 | 50 tok/s | Excellent | Full Q8 quality at 16.8GB. Fast code completion. |
| Code Llama 34B Instruct | Q4_K_M | 22 tok/s | Good | Q4 at 19GB fits on 4090 with 5GB headroom. Surpassed by Qwen 2.5 Coder 32B but still solid. |
| LLaVA 1.6 13B | Q5_K_M | 30 tok/s | Good | Q5 at 9.1GB. Good for image analysis tasks. Vision encoding adds latency. |
| DeepSeek R1 Distill Qwen 7B | Q4_K_M | 92 tok/s | Excellent | Excellent fit. High bandwidth delivers near-instant responses for reasoning. |
| DeepSeek R1 Distill Qwen 32B | Q4_K_M | 24 tok/s | Good | 32B reasoning model at Q4. Good quality-to-speed ratio on 4090. |
| Llama 3.3 70B Instruct | Q3_K_M | 6 tok/s | Marginal | Q3 exceeds 24GB VRAM — requires CPU offloading. Usable but slow. |
| Gemma 3 12B | Q5_K_M | 75 tok/s | Excellent | Excellent speed at Q5. Fast inference for 12B class. |
| Qwen 2.5 Coder 7B Instruct | Q8_0 | 90 tok/s | Excellent | Full Q8 quality. Near-instant code completion on 4090. |
| Phi-4 14B | Q5_K_M | 58 tok/s | Excellent | Excellent reasoning speed. Q5 quality with plenty of headroom. |
| Mistral Small 24B Instruct | Q5_K_M | 32 tok/s | Good | 24B at Q5 fits on 4090. Good quality general-purpose model. |
| Qwen 2.5 14B Instruct | Q5_K_M | 55 tok/s | Excellent | Excellent speed at Q5. 128K context support. |
| InternLM 2.5 7B Chat | Q8_0 | 88 tok/s | Excellent | Full Q8 quality. Fast inference for 7B class. |
| Stable Diffusion 3 Medium | FP16 | — | Excellent | Full FP16 SD3 Medium. ~8-12 seconds per image. High-quality local image generation. |
| Llama 3.2 3B Instruct | Q8_0 | 170 tok/s | Excellent | 1008 GB/s is overkill for 3B. Essentially instant inference. |
| Llama 3.2 1B Instruct | Q8_0 | 250 tok/s | Excellent | 1B at 250 tok/s. Overkill GPU for this model — useful as fast secondary. |
| Phi-4 Mini | Q8_0 | 160 tok/s | Excellent | 3.8B reasoning model flies on 4090. 19GB headroom. |
| Whisper Large V3 Turbo | FP16 | — | Excellent | ~30x faster than real-time. Practically instant transcription. |
| Stable Diffusion 3.5 Large | FP16 | — | Excellent | ~3.5s per image at 1008 GB/s. Best consumer GPU for SD 3.5 Large. |
| Gemma 3 27B | Q4_K_M | 22 tok/s | Good | Q4_K_M (16.3GB) fits with 7.7GB headroom. 1008 GB/s bandwidth enables good quality and speed. |
| DeepSeek V3 | Q2_K | — | Not Viable | 671B MoE model requires 115GB+ at Q2_K. 24GB insufficient. Would need 128GB+ unified memory. |
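The GB figures in the table follow a simple rule of thumb: weight size ≈ parameter count × bits per weight ÷ 8, with KV cache and runtime overhead on top. The sketch below uses rough average bits-per-weight values for common llama.cpp GGUF quant levels; these are approximations, and actual file sizes vary by model architecture.

```python
# Approximate GGUF weight size by quantization level.
# Bits-per-weight values are rough averages for llama.cpp quants (approximate).

BITS_PER_WEIGHT = {
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

def weight_size_gb(params_billions: float, quant: str) -> float:
    """Estimated size of the weights alone, in GB (excludes KV cache)."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

print(f"32B @ Q4_K_M: {weight_size_gb(32, 'Q4_K_M'):.1f} GB")  # ~19.4 GB, near the table's ~19GB
print(f"70B @ Q3_K_M: {weight_size_gb(70, 'Q3_K_M'):.1f} GB")  # ~34 GB, well over 24GB
```

The table's 31.6GB for Llama 3.1 70B Q3_K_M sits somewhat below this estimate, but the conclusion is the same: it does not fit in 24GB without offloading.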
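For the 70B rows rated Marginal, the usual workaround is partial GPU offload: keep as many transformer layers in VRAM as fit and run the remainder on the CPU. A minimal sketch using the llama-cpp-python bindings; the model path and layer count are illustrative placeholders, not tuned values.

```python
# Partial GPU offload with llama-cpp-python: layers that don't fit in 24GB
# of VRAM run on the CPU. Expect single-digit tok/s, as the table notes.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3.1-70b-instruct.Q3_K_M.gguf",  # illustrative path
    n_gpu_layers=40,  # assumed starting point; lower it until loading succeeds
    n_ctx=8192,
)

result = llm(
    "Summarize the tradeoffs of Q3_K_M quantization in one paragraph.",
    max_tokens=256,
)
print(result["choices"][0]["text"])
```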
Prices and availability vary. Inspect hardware before purchasing.
Generation: Ada Lovelace. Last updated: 2026-03-01.