Buying Guide

How to Choose Your First AI GPU

A data-backed buying guide to choosing the right GPU for running AI models locally. VRAM explained, budget tiers compared, and specific GPU recommendations with compatible models.

OwnRig Editorial | 12 min read | March 14, 2026

I plugged an RTX 4060 Ti 16GB into a three-year-old PC last week and asked it to write Python for me. It did. Not slowly, not badly. Real-time code generation, on my desk, with no API key and no cloud bill. That card costs $449.

The fact that a sub-$500 GPU can do this is why this guide exists. But the GPU market is full of traps: cards with impressive clock speeds and pathetic VRAM, cards with great VRAM but anemic bandwidth, cards that cost twice as much and deliver 10% more. I've tested our full compatibility matrix of 42 models across 18 devices to find the ones actually worth buying.

42 models tested against every GPU below, across 18 devices in OwnRig's compatibility matrix.

01

The two specs that actually matter

Ignore the spec sheet. CUDA cores, clock speeds, TDP: none of it matters for AI the way two numbers do.

VRAM: can it run?

VRAM is the GPU's dedicated memory. AI models must fit entirely in VRAM to run at full speed. If a model needs 14 GB and your GPU has 12 GB, you're out of luck: the model won't load, or it will spill into system RAM and run 10 to 50x slower.

This is a hard gate. Not a soft preference. A $300 GPU with 16 GB VRAM will run more models than a $1,000 GPU with 8 GB. Buy VRAM first, everything else second.
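The VRAM gate is simple arithmetic. Here's a minimal sketch of the estimate: weights take roughly `parameters × bits ÷ 8` bytes, plus runtime overhead for the KV cache and buffers. The 20% overhead factor and function names are our illustration, not OwnRig data:

```python
def est_size_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Rough quantized model footprint in GB: weights plus ~20% runtime overhead."""
    weights_gb = params_billion * bits / 8  # e.g. an 8B model at 4-bit -> 4 GB of weights
    return round(weights_gb * overhead, 1)

def fits_in_vram(params_billion: float, bits: int, vram_gb: float) -> bool:
    """The hard gate: the whole model must fit, or inference spills to system RAM."""
    return est_size_gb(params_billion, bits) <= vram_gb

print(est_size_gb(8, 4))       # ~4.8 GB for an 8B model at Q4
print(fits_in_vram(8, 4, 12))  # True: fits a 12 GB card
print(est_size_gb(70, 4))      # ~42 GB, close to the ~40 GB our 70B entries list
print(fits_in_vram(70, 4, 24)) # False: spills on a 24 GB card
```

The estimate lands within a couple of GB of the figures on our model pages; always check the page for the exact number before buying.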

Memory bandwidth: how fast?

Once a model fits in VRAM, bandwidth determines speed. Higher bandwidth means more tokens per second. In our spec data the RTX 4090 lists 1,008 GB/s memory bandwidth versus 288 GB/s on the RTX 4060 Ti 16GB — a large gap — and in practice the 4090 is often much faster on the same model, though exact tok/s also depends on software, batch size, and context length.

But bandwidth is secondary to VRAM. A fast GPU that can't fit the model is useless. A slow GPU that can fit the model still works. Buy capacity first; buy speed if you can afford both.
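The bandwidth claim follows from a simple memory-bound model: generating each token streams the full weights out of VRAM once, so tokens per second is at best bandwidth divided by model size. This is a ceiling sketch under that assumption, not a benchmark; real throughput lands below it depending on software, batch size, and context length:

```python
def tps_ceiling(bandwidth_gbps: float, model_size_gb: float) -> float:
    """Upper bound on tokens/sec for a memory-bound decoder:
    every generated token must read all weights from VRAM once."""
    return bandwidth_gbps / model_size_gb

# A ~4.8 GB model (8B at Q4, with overhead) on two cards from the tables below:
print(tps_ceiling(1008, 4.8))  # RTX 4090: ~210 tok/s ceiling
print(tps_ceiling(288, 4.8))   # RTX 4060 Ti 16GB: ~60 tok/s ceiling
```

The 3.5x bandwidth gap between those two cards translates directly into a 3.5x gap in the ceiling, which matches the "often much faster" pattern we see in practice.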

02

GPU recommendations by budget

Every discrete GPU in our database, organized by price. The "Models it runs" column shows how many of the 42 models each GPU handles at recommended quality. For context, renting comparable cloud GPUs bills by the hour for as long as you use them; a card you own is a one-time cost.

Under $300

| GPU | VRAM | Bandwidth | Price | Models it runs |
| --- | --- | --- | --- | --- |
| RTX 3060 12GB | 12 GB | 360 GB/s | $269 | 22 / 42 |
| RTX 4060 8GB | 8 GB | 272 GB/s | $289 | 42 / 42 |

$300 to $600

| GPU | VRAM | Bandwidth | Price | Models it runs |
| --- | --- | --- | --- | --- |
| RTX 3080 10GB | 10 GB | 760 GB/s | $399 | 42 / 42 |
| RTX 4060 Ti 16GB | 16 GB | 288 GB/s | $449 | 21 / 42 |
| RTX 4070 Super | 12 GB | 504 GB/s | $599 | 13 / 42 |

$600 to $1,200

| GPU | VRAM | Bandwidth | Price | Models it runs |
| --- | --- | --- | --- | --- |
| RTX 4070 Ti 12GB | 12 GB | 504 GB/s | $749 | 42 / 42 |
| RTX 4070 Ti Super | 16 GB | 672 GB/s | $779 | 12 / 42 |
| RTX 3090 | 24 GB | 936 GB/s | $899 | 9 / 42 |
| RTX 4080 Super | 16 GB | 736 GB/s | $979 | 12 / 42 |
| RTX 5080 | 16 GB | 960 GB/s | $1,099 | 13 / 42 |

$1,200 to $2,000

| GPU | VRAM | Bandwidth | Price | Models it runs |
| --- | --- | --- | --- | --- |
| RTX 4090 | 24 GB | 1008 GB/s | $1,799 | 38 / 42 |

$2,000+

| GPU | VRAM | Bandwidth | Price | Models it runs |
| --- | --- | --- | --- | --- |
| RTX 5090 | 32 GB | 1792 GB/s | $2,199 | 15 / 42 |

03

The Apple Silicon alternative

Apple Silicon uses unified memory: system RAM and GPU memory are the same pool. An M4 Max with 64 GB of unified memory can load 70B-class models with headroom, while a 24 GB discrete GPU, in our data, often has to drop to Q3 or use offload to reach the same tier. That's not an incremental advantage; it's a fundamentally different capability.

The trade-off is throughput. A top-end Mac generates tokens slower than an RTX 4090. But it can load models the 4090 can't even attempt. If you need to run the biggest models on a single device, Apple Silicon is the only consumer option.

| Device | Unified memory | Bandwidth | Price | Models it runs |
| --- | --- | --- | --- | --- |
| M3 Pro (18GB Unified) | 18 GB | 150 GB/s | $1,799 | 42 / 42 |
| M4 Pro (24GB Unified) | 24 GB | 273 GB/s | $1,999 | 11 / 42 |
| M4 Pro (48GB) | 48 GB | 273 GB/s | $2,499 | 14 / 42 |
| M4 Max (36GB Unified) | 36 GB | 546 GB/s | $2,999 | 16 / 42 |
| M4 Max (64GB Unified) | 64 GB | 546 GB/s | $3,499 | 17 / 42 |
| M4 Max (128GB Unified) | 128 GB | 546 GB/s | $4,499 | 13 / 42 |
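The capacity argument can be made concrete. A minimal sketch comparing the roughly 40 GB our 70B Q4 entries list against GPU-visible memory on each side (raw capacity only; it ignores OS and runtime overhead, which eats a few GB in practice, and the constant name is ours):

```python
MODEL_70B_Q4_GB = 40  # approximate Q4 weight size from our 70B model entries

def headroom_gb(memory_gb: float, model_gb: float = MODEL_70B_Q4_GB) -> float:
    """Memory left after loading the model; negative means it cannot load fully."""
    return memory_gb - model_gb

print(headroom_gb(64))  # M4 Max 64GB: 24 GB to spare
print(headroom_gb(24))  # RTX 4090 24GB: -16, must quantize harder or offload
print(headroom_gb(32))  # RTX 5090 32GB: still 8 GB short
```

That sign flip is the whole Apple Silicon story: the Mac trades tokens per second for the ability to load the model at all.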
04

What we don't recommend

Trust is built by telling you what not to buy. Here's what we'd steer you away from:

  • Any GPU with 8 GB VRAM or less. In 2026, 8 GB gets you the smallest 7B models at aggressive quantization. That's it. You'll hit the wall within weeks and wish you'd spent more. The $100 you save isn't worth halving your model compatibility.
  • AMD GPUs for AI (for now). AMD's ROCm software stack is improving, but it's not there yet. You'll spend more time debugging compatibility than running models. When the software catches up, we'll update this guide. Until then, buy NVIDIA.
  • NVIDIA Quadro or A-series for home use. These are enterprise cards with enterprise prices. A consumer RTX card with the same VRAM runs local inference just as fast for a fraction of the cost.
05

How to decide

Three questions. That's all you need.

  1. What models do you want to run? For 7B chat models (Llama 3.1 8B, Mistral 7B), 12 to 16 GB of VRAM is plenty. For 70B-class reasoning models (Llama 3.3 70B, Qwen 2.5 72B), our model entries list about 40 GB of VRAM at Q4; to hold that fully in GPU memory on Apple Silicon, our data points to about 48 GB unified (M4 Pro) or 64 GB and up (M4 Max), while 32 GB GeForce cards still rely on offload at that tier. Check the model pages for exact figures.
  2. What's your budget for the GPU alone? Under $500, the RTX 4060 Ti 16GB is the answer. $500 to $1,200, the RTX 5080. Over $1,200, the RTX 4090 or RTX 5090.
  3. Do you need a complete system or just a GPU? If you're building from scratch, check our curated builds or use the configurator.
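The budget question in step 2 reduces to a lookup. Here's that mapping as a sketch, using this guide's own picks; the function name is ours, and we split the over-$1,200 tier at $2,000 to match the table headings above:

```python
def pick_gpu(budget_usd: float) -> str:
    """Map a GPU-only budget to this guide's recommended card."""
    if budget_usd < 500:
        return "RTX 4060 Ti 16GB"  # $449, 16 GB VRAM
    if budget_usd < 1200:
        return "RTX 5080"          # $1,099, 16 GB VRAM
    if budget_usd < 2000:
        return "RTX 4090"          # $1,799, 24 GB VRAM
    return "RTX 5090"              # $2,199, 32 GB VRAM

print(pick_gpu(450))   # RTX 4060 Ti 16GB
print(pick_gpu(900))   # RTX 5080
print(pick_gpu(1799))  # RTX 4090
```

Answer question 1 first, though: if the models you want need more VRAM than your budget tier carries, save longer rather than buying a card you'll outgrow.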
Common Questions
How much VRAM do I need for AI?
For 7B models, 12 to 16 GB is the sweet spot. For 14 to 34B models, plan on 16 to 24 GB. For 70B-class models in our database, Q4 weights are typically about 40 to 41 GB — use each model page for the exact number. Our NVIDIA catalog tops out at 32 GB VRAM, so 70B Q4 is offload-heavy on GeForce; 24 GB is tighter still. Apple Silicon with 48 GB or 64 GB unified is where 70B Q4 lands fully in memory in our matrix. 8 GB only covers the smallest models; we don't recommend it beyond experiments.
Is NVIDIA better than AMD for AI?
Yes, for now. NVIDIA's CUDA ecosystem has years of optimization behind it. Ollama, llama.cpp, and most inference engines are built for NVIDIA first. AMD ROCm support is improving but you'll hit more compatibility issues. If you want the smoothest experience, buy NVIDIA.
Can I use my gaming GPU for AI?
Absolutely. Any NVIDIA GeForce GPU with enough VRAM runs AI models. The RTX 4060 Ti 16GB, RTX 4090, and RTX 5090 are all gaming GPUs that double as AI workhorses. The only spec that matters is VRAM capacity, not frame rates.
Should I buy a used GPU for AI?
Used GPUs are excellent value. A 24 GB card like the RTX 3090 is a standout deal for 13B to 34B models. For 70B-class weights in our data, 24 GB usually means Q3 or CPU/GPU offload, not full Q4 in VRAM. Verify VRAM, test the card, and check the model page before you buy.
Is Apple Silicon good for AI?
It's excellent when you need a large unified memory pool. In our data, an M4 Max with 64 GB or 128 GB runs several 70B-class models at Q3 or Q4 inside that shared memory — often easier than juggling Q3 and offload on a 24 GB discrete card. Throughput is usually lower than a flagship NVIDIA GPU, but you trade speed for capacity on one machine.

Priya Krishnan

Editor, hardware & inference

Priya obsesses over the gap between box specs and what actually happens when you hit Enter in Ollama. She got here untangling friends’ builds and sticker-shock cloud bills, and she still treats every recommendation like a debt she owes the reader.

Ready to build?

Tell us what you want to run, your budget, and your use case. We'll match you to the right hardware in under a minute.

All hardware specifications, prices, and performance data referenced in this guide are sourced from OwnRig's data layer, which is based on manufacturer specifications and community benchmarks. Prices are approximate US retail as of March 2026. Performance figures may vary by configuration, driver version, and software.