I plugged an RTX 4060 Ti 16GB into a three-year-old PC last week and asked it to write Python for me. It did. Not slowly, not badly. Real-time code generation, on my desk, with no API key and no cloud bill. That card costs $449.
The fact that a sub-$500 GPU can do this is why this guide exists. But the GPU market is full of traps: cards with impressive clock speeds and pathetic VRAM, cards with great VRAM but anemic bandwidth, cards that cost twice as much and deliver 10% more. I've tested our full compatibility matrix of 42 models across 18 devices to find the ones actually worth buying.
42 models tested against every GPU below, across 18 devices in OwnRig's compatibility matrix.
The two specs that actually matter
Ignore the spec sheet. CUDA cores, clock speeds, TDP: none of it matters for AI the way two numbers do.
VRAM: can it run?
VRAM is the GPU's dedicated memory. AI models must fit entirely in VRAM to run at full speed. If a model needs 14 GB and your GPU has 12 GB, you're out of luck: either the model won't load, or it will spill into system RAM and run 10 to 50x slower.
This is a hard gate. Not a soft preference. A $300 GPU with 16 GB VRAM will run more models than a $1,000 GPU with 8 GB. Buy VRAM first, everything else second.
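If you want to sanity-check the fit before buying, the arithmetic takes only a few lines. Here's a rough Python sketch; the 1.2x overhead factor and the function names are my own illustrative assumptions (real headroom depends on context length and runtime), not figures from our matrix.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    """Rough VRAM needed to load a model's weights.

    params_billion : model size in billions of parameters
    bits_per_weight: 16 for FP16, 8 for Q8, ~4 for Q4 quantization
    overhead       : headroom for KV cache and runtime buffers
                     (1.2 is an illustrative assumption, not a measured figure)
    """
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits is roughly 1 GB
    return weight_gb * overhead


def fits(vram_gb: float, params_billion: float, bits_per_weight: int) -> bool:
    """The hard gate: does the model fit entirely in this GPU's VRAM?"""
    return estimate_vram_gb(params_billion, bits_per_weight) <= vram_gb


# An 8B model at Q4 on a 12 GB card vs. a 70B model at Q4 on a 24 GB card
print(fits(12, 8, 4))   # True  -> ~4.8 GB needed
print(fits(24, 70, 4))  # False -> ~42 GB needed, spills to system RAM
```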
Memory bandwidth: how fast?
Once a model fits in VRAM, bandwidth determines speed. Higher bandwidth means more tokens per second. In our spec data the RTX 4090 lists 1,008 GB/s memory bandwidth versus 288 GB/s on the RTX 4060 Ti 16GB — a large gap — and in practice the 4090 is often much faster on the same model, though exact tok/s also depends on software, batch size, and context length.
But bandwidth is secondary to VRAM. A fast GPU that can't fit the model is useless. A slow GPU that can fit the model still works. Buy capacity first; buy speed if you can afford both.
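For a ballpark on speed: a dense model at batch size 1 has to stream roughly its entire weight set from memory for every token it generates, so bandwidth divided by model size gives an upper bound on tokens per second. The sketch below is that back-of-envelope only; the function name and the 4 GB example size are my own illustrative assumptions, and real throughput lands below the ceiling.

```python
def rough_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Memory-bound ceiling on decode speed at batch size 1: each generated
    token needs roughly one full pass over the weights, so throughput is
    capped near bandwidth / model size. Software, batch size, and context
    length all pull the real number lower."""
    return bandwidth_gb_s / model_size_gb


# The same ~4 GB quantized model on the two cards compared above
print(rough_tokens_per_sec(1008, 4))  # RTX 4090: ~250 tok/s ceiling
print(rough_tokens_per_sec(288, 4))   # RTX 4060 Ti 16GB: ~70 tok/s ceiling
```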
GPU recommendations by budget
Every discrete GPU in our database, organized by price. The "Models it runs" column shows how many of the 42 models each GPU handles at recommended quality.
Under $300
| GPU | VRAM | Bandwidth | Price | Models it runs |
|---|---|---|---|---|
| RTX 3060 12GB | 12 GB | 360 GB/s | $269 | 22 / 42 |
| RTX 4060 8GB | 8 GB | 272 GB/s | $289 | 42 / 42 |
$300 to $600
| GPU | VRAM | Bandwidth | Price | Models it runs |
|---|---|---|---|---|
| RTX 3080 10GB | 10 GB | 760 GB/s | $399 | 42 / 42 |
| RTX 4060 Ti 16GB | 16 GB | 288 GB/s | $449 | 21 / 42 |
| RTX 4070 Super | 12 GB | 504 GB/s | $599 | 13 / 42 |
$600 to $1,200
| GPU | VRAM | Bandwidth | Price | Models it runs |
|---|---|---|---|---|
| RTX 4070 Ti 12GB | 12 GB | 504 GB/s | $749 | 42 / 42 |
| RTX 4070 Ti Super | 16 GB | 672 GB/s | $779 | 12 / 42 |
| RTX 3090 | 24 GB | 936 GB/s | $899 | 9 / 42 |
| RTX 4080 Super | 16 GB | 736 GB/s | $979 | 12 / 42 |
| RTX 5080 | 16 GB | 960 GB/s | $1,099 | 13 / 42 |
$1,200 to $2,000
| GPU | VRAM | Bandwidth | Price | Models it runs |
|---|---|---|---|---|
| RTX 4090 | 24 GB | 1008 GB/s | $1,799 | 38 / 42 |
$2,000+
| GPU | VRAM | Bandwidth | Price | Models it runs |
|---|---|---|---|---|
| RTX 5090 | 32 GB | 1792 GB/s | $2,199 | 15 / 42 |
The Apple Silicon alternative
Apple Silicon uses unified memory: system RAM and GPU memory are the same pool. An M4 Max with 64 GB unified memory can load 70B-class models with more headroom than a 24 GB discrete GPU, which in our data often relies on Q3 or offload for the same tier. That's not an incremental advantage; it's a fundamentally different capability.
The trade-off is throughput. A top-end Mac generates tokens slower than an RTX 4090. But it can load models the 4090 can't even attempt. If you need to run the biggest models on a single device, Apple Silicon is the only consumer option.
| Device | Unified memory | Bandwidth | Price | Models it runs |
|---|---|---|---|---|
| M3 Pro (18GB Unified) | 18 GB | 150 GB/s | $1,799 | 42 / 42 |
| M4 Pro (24GB Unified) | 24 GB | 273 GB/s | $1,999 | 11 / 42 |
| M4 Pro (48GB Unified) | 48 GB | 273 GB/s | $2,499 | 14 / 42 |
| M4 Max (36GB Unified) | 36 GB | 546 GB/s | $2,999 | 16 / 42 |
| M4 Max (64GB Unified) | 64 GB | 546 GB/s | $3,499 | 17 / 42 |
| M4 Max (128GB Unified) | 128 GB | 546 GB/s | $4,499 | 13 / 42 |
What we don't recommend
Trust is built by telling you what not to buy. Here's what we'd steer you away from:
- Any GPU with 8 GB VRAM or less. In 2026, 8 GB gets you the smallest 7B models at aggressive quantization. That's it. You'll hit the wall within weeks and wish you'd spent more. The $100 you save isn't worth halving your model compatibility.
- AMD GPUs for AI (for now). AMD's ROCm software stack is improving, but it's not there yet. You'll spend more time debugging compatibility than running models. When the software catches up, we'll update this guide. Until then, buy NVIDIA.
- NVIDIA Quadro or A-series for home use. These are enterprise cards with enterprise prices. A consumer RTX card with the same VRAM runs local inference just as fast for a fraction of the cost.
How to decide
Three questions. That's all you need.
- What models do you want to run? For 7B chat models (Llama 3.1 8B, Mistral 7B), 12 to 16 GB VRAM is plenty. For 70B-class reasoning models (Llama 3.3 70B, Qwen 2.5 72B), our model entries list about 40 GB VRAM for Q4; fully in GPU memory on Apple Silicon that means about 48 GB unified (M4 Pro) or 64 GB+ (M4 Max) in our data — GeForce 32 GB still uses offload for that tier. Check the model pages for exact GB.
- What's your budget for the GPU alone? Under $500, the RTX 4060 Ti 16GB is the answer. $500 to $1,200, the RTX 5080. Over $1,200, the RTX 4090 or RTX 5090. The sketch after this list turns these cutoffs into a quick check.
- Do you need a complete system or just a GPU? If you're building from scratch, check our curated builds or use the configurator.
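If it helps, here's the whole decision compressed into a few lines of Python. The budget cutoffs and picks are the ones from this guide; the 32 GB threshold for switching to Apple Silicon and the function name are my own simplification, not OwnRig policy.

```python
def recommend_gpu(needed_vram_gb: float, budget_usd: float) -> str:
    """Turn the three questions into a starting point. Picks and budget
    cutoffs come from this guide; the thresholds are a simplification."""
    if needed_vram_gb > 32:
        # Beyond the largest consumer GeForce card (RTX 5090, 32 GB),
        # only unified memory fits the model without offload.
        return "Apple Silicon with enough unified memory (M4 Max tier)"
    if budget_usd < 500:
        return "RTX 4060 Ti 16GB"
    if budget_usd <= 1200:
        return "RTX 5080"
    return "RTX 4090 or RTX 5090"


print(recommend_gpu(10, 450))   # 7B-class at Q4 on a budget -> RTX 4060 Ti 16GB
print(recommend_gpu(42, 5000))  # 70B-class at Q4 (~40 GB)   -> Apple Silicon
```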
