Running Gemma 4 locally: which GPU you actually need

Gemma 4 is Google's most capable open model family to date, and if the search traffic on OwnRig is any signal, thousands of people are trying to figure out the same thing right now: which GPU actually runs it?

The honest answer is: it depends on which variant you want. The family spans from tiny edge models to a 31B behemoth that will stress most consumer GPUs. Here's exactly what each one needs, with no hedging.

The Gemma 4 family, explained plainly

Google released four Gemma 4 variants in April 2026. They look like a simple lineup, but the naming trips people up. Here's what you're actually choosing between:

Model	Total params	Architecture	Min VRAM	Recommended VRAM
Gemma 4 E2B	5.1B	Dense	4 GB	4.5–5.5 GB
Gemma 4 E4B	8B	Dense	6 GB	7 GB
Gemma 4 26B-A4B	25.2B	MoE (3.8B active)	14 GB	20.5–24 GB
Gemma 4 31B	30.7B	Dense	16 GB	24–28 GB

The confusing one is the 26B-A4B. The "A4B" means only about 4 billion parameters (3.8B, to be exact) are active per token. But it's a Mixture of Experts model, so all 25.2 billion parameters still need to fit in VRAM, because the routing layer picks different experts for each token. Don't let "4B active" fool you into thinking this runs like a 4B model. It doesn't.
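To make that concrete, here's a rough sketch of the sizing arithmetic. It backs out the bits per parameter implied by the 18 GB Q4_K_M figure quoted later in this article, then shows what the model would need if only the active parameters had to be resident. The function names are illustrative, and the implied figure folds in whatever runtime overhead that 18 GB number includes, so treat it as a ballpark rather than a property of the quantization format.

```python
def implied_bits_per_param(vram_gb: float, total_params_b: float) -> float:
    """Bits per parameter implied by a VRAM figure (any overhead included)."""
    return vram_gb * 8 / total_params_b


def weight_memory_gb(params_b: float, bits_per_param: float) -> float:
    """Approximate memory in GB for params_b billion parameters."""
    return params_b * bits_per_param / 8


TOTAL_PARAMS_B = 25.2    # every expert, all of which must be resident
ACTIVE_PARAMS_B = 3.8    # parameters actually used per token
Q4_K_M_VRAM_GB = 18.0    # figure from this article

bits = implied_bits_per_param(Q4_K_M_VRAM_GB, TOTAL_PARAMS_B)
resident = weight_memory_gb(TOTAL_PARAMS_B, bits)
if_only_active = weight_memory_gb(ACTIVE_PARAMS_B, bits)

print(f"implied bits/param: {bits:.2f}")
print(f"all experts resident: {resident:.1f} GB")
print(f"if only active params counted: {if_only_active:.1f} GB")
```

The gap between the last two numbers is the whole trap: the active parameter count tells you about per-token compute, not about how much memory you have to buy.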

Gemma 4 E2B and E4B: the easy ones

If you want to run Gemma 4 and your GPU has 8 GB or more of VRAM, the E2B and E4B variants run without drama on almost any modern hardware. These are dense models in the 5 to 8 billion parameter range.

They're fast. They fit. They're Google's answer to Phi-4 Mini and the natural step up from Gemma 3 4B. If you need a local AI that runs on a laptop GPU or an older gaming card and handles most everyday tasks, start here.

Gemma 4 26B-A4B: where things get interesting

This is the model everyone wants to run, and the one causing the most confusion. Here are the numbers from our database, with honest calls on which hardware actually works:

18 GB: VRAM needed for Gemma 4 26B-A4B at Q4_K_M, the recommended quantization for this model at consumer VRAM levels.

Quantization	VRAM needed	Fits in	Quality
Q8_0	28 GB	RTX 5090 (32 GB), M4 Max 36 GB+	Full quality
Q6_K	24 GB	RTX 4090 (24 GB, tight), M4 Pro 24 GB+	Near-full quality
Q5_K_M	20.5 GB	RTX 4090 (with headroom), M4 Pro 24 GB	Excellent
Q4_K_M	18 GB	RTX 4090, M4 Pro 24 GB	Efficient tier in our model data
Q3_K_M	14 GB	RTX 4060 Ti 16 GB	Compressed tier in our model data

The RTX 4060 Ti 16 GB can run the 26B-A4B, but only at Q3_K_M, which is where you start noticing quality gaps on harder reasoning tasks. It's not unusable, but you're getting a compromised experience. The honest recommendation: the 26B-A4B wants more than 20 GB for the recommended tiers in our data. The RTX 4090 is the minimum consumer GPU I'd pair with this model if quality matters.
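If you'd rather check your own card than read the table, the lookup is simple enough to sketch. The VRAM figures below are copied from the table above; the `best_quant` helper is my own illustration, and it treats an exact match as a fit, which is what the table calls "tight".

```python
from typing import Optional

# (quantization, VRAM needed in GB) for Gemma 4 26B-A4B, from the table above,
# ordered from highest quality to most compressed.
QUANTS_26B_A4B = [
    ("Q8_0", 28.0),
    ("Q6_K", 24.0),
    ("Q5_K_M", 20.5),
    ("Q4_K_M", 18.0),
    ("Q3_K_M", 14.0),
]


def best_quant(vram_gb: float, quants=QUANTS_26B_A4B) -> Optional[str]:
    """Return the highest-quality quantization whose VRAM figure fits, or None."""
    for name, needed_gb in quants:
        if needed_gb <= vram_gb:
            return name
    return None


for gpu, vram in [("RTX 4060 8GB", 8), ("RTX 4060 Ti 16GB", 16),
                  ("RTX 4090", 24), ("RTX 5090", 32)]:
    print(f"{gpu}: {best_quant(vram)}")
```

This only compares raw numbers. It doesn't leave room for context length or anything else sharing the GPU, so a result that lands exactly on your VRAM size deserves some suspicion.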

Gemma 4 31B: the VRAM-hungry one

The 31B is a dense model, not MoE. All 30.7 billion parameters are active for every token. That makes it slower than the 26B-A4B, but its reasoning quality is more consistent. And it needs real hardware.

Quantization	VRAM needed	Fits in
Q8_0	34 GB	M4 Max 36 GB+, multi-GPU setups
Q6_K	28 GB	M4 Max 36 GB+ (with headroom)
Q5_K_M	24 GB	RTX 4090 (tight), M4 Pro 24 GB
Q4_K_M	21 GB	RTX 4090 (comfortable), M4 Pro 24 GB
Q3_K_M	16 GB	RTX 4060 Ti 16 GB (barely; degraded quality)

Q4_K_M at 21 GB is the target for most people. An RTX 4090 handles it with about 3 GB of raw memory headroom. An M4 Pro 24 GB is also a verified Q4_K_M fit in the compatibility matrix, though at a lower performance tier than the 4090.
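The headroom arithmetic is worth spelling out, because "fits" and "fits comfortably" are different things. Here's a small sketch using the 31B figures from the table above. The `reserve_gb` threshold is a hypothetical allowance I picked for illustration, not a measured KV cache size; how much you really need depends on context length and your runtime.

```python
# VRAM figures in GB for Gemma 4 31B, from the table above.
QUANTS_31B = {
    "Q8_0": 34.0,
    "Q6_K": 28.0,
    "Q5_K_M": 24.0,
    "Q4_K_M": 21.0,
    "Q3_K_M": 16.0,
}


def headroom_gb(vram_gb: float, quant: str) -> float:
    """Raw memory left after loading the model; negative means it does not fit."""
    return vram_gb - QUANTS_31B[quant]


def classify(vram_gb: float, quant: str, reserve_gb: float = 2.0) -> str:
    """Label a fit. reserve_gb is a hypothetical allowance for KV cache and overhead."""
    spare = headroom_gb(vram_gb, quant)
    if spare < 0:
        return "does not fit"
    return "comfortable" if spare >= reserve_gb else "tight"


for vram, quant in [(24, "Q4_K_M"), (24, "Q5_K_M"), (24, "Q6_K"), (16, "Q3_K_M")]:
    print(f"{vram} GB at {quant}: {headroom_gb(vram, quant):+.1f} GB, {classify(vram, quant)}")
```

With a 2 GB reserve, the labels line up with the verdicts in this article: Q4_K_M on 24 GB is comfortable, Q5_K_M on 24 GB and Q3_K_M on 16 GB are tight.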

GPU compatibility at a glance

Based on our compatibility data, here's the honest per-GPU verdict for the 26B-A4B and 31B variants:

GPU	VRAM	26B-A4B verdict	31B verdict
RTX 4060 8GB	8 GB	No (too little VRAM)	No (too little VRAM)
RTX 4060 Ti 16GB	16 GB	Q3_K_M only (degraded)	Q3_K_M only (degraded)
RTX 4070 Ti 12GB	12 GB	No	No
RTX 4070 Ti Super	16 GB	Q3_K_M (degraded)	Q3_K_M (degraded)
RTX 4090	24 GB	Q4_K_M to Q6_K (good)	Q4_K_M (comfortable)
RTX 5080	16 GB	Q3_K_M (degraded)	Q3_K_M (degraded)
RTX 5090	32 GB	Q6_K to Q8 (excellent)	Q5_K_M to Q6_K (excellent)
M4 (16GB Unified)	16 GB	Q3_K_M (tight)	No
M4 Pro (24GB Unified)	24 GB	Q4_K_M (good)	Q4_K_M (comfortable)
M4 Pro (48GB Unified)	48 GB	Q8 (full quality)	Q6_K to Q8 (excellent)

Apple Silicon vs Nvidia: the honest comparison

Apple Silicon has a structural advantage for Gemma 4: unified memory. A Mac with 24 GB doesn't split that into separate CPU and GPU pools; the model can draw on the same memory the rest of the system uses, minus whatever macOS and your other apps need. An M4 Pro 24 GB and an RTX 4090 24 GB land in roughly the same place for Gemma 4 model compatibility.

The speed is different. The RTX 4090 will be faster on tokens-per-second benchmarks for the 26B-A4B due to raw compute throughput. But the M4 Pro delivers perfectly usable performance for interactive chat and coding assistance. Check the M4 Pro 24 GB compatibility page and the RTX 4090 compatibility page for side-by-side numbers.

One more thing: if you're on a Mac with 16 GB, I won't pretend the 26B-A4B is a good experience. Stick with the E4B. It's genuinely impressive for 8 billion parameters, and it runs beautifully on 16 GB.

Common Questions

Can I run Gemma 4 26B-A4B on an RTX 4060 8 GB?

No. The RTX 4060 has 8 GB of VRAM, and Gemma 4 26B-A4B needs about 18 GB at Q4_K_M. Even the most compressed mode in our data, Q3_K_M, still needs about 14 GB. On 8 GB hardware, stick with Gemma 4 E2B (about 4 GB at Q4_K_M) or E4B (about 6 GB at Q4_K_M).
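The same question works for any card: which variants fit at Q4_K_M? A minimal sketch, using the approximate Q4_K_M figures quoted in this article (the helper name is mine):

```python
# Approximate Q4_K_M VRAM figures in GB, as quoted in this article.
Q4_K_M_VRAM_GB = {
    "Gemma 4 E2B": 4.0,
    "Gemma 4 E4B": 6.0,
    "Gemma 4 26B-A4B": 18.0,
    "Gemma 4 31B": 21.0,
}


def variants_that_fit(vram_gb: float) -> list[str]:
    """Variants whose Q4_K_M figure is at or below the given VRAM."""
    return [name for name, need in Q4_K_M_VRAM_GB.items() if need <= vram_gb]


print(variants_that_fit(8))    # ['Gemma 4 E2B', 'Gemma 4 E4B']
print(variants_that_fit(24))   # all four variants
```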

What is the Gemma 4 26B-A4B model?

Gemma 4 26B-A4B is a Mixture of Experts model with 25.2 billion total parameters but only about 3.8 billion active per token. Inference speed behaves more like that of a much smaller dense model, but VRAM sizing does not: all 25.2 billion weights still need to be loaded.

Can an RTX 4090 run Gemma 4 31B?

Yes, with the right quantization. At Q4_K_M, Gemma 4 31B needs about 21 GB of VRAM. The RTX 4090 has 24 GB, so it fits with about 3 GB of headroom for the KV cache. Q5_K_M (24 GB) is a tight fit with no real headroom, and Q6_K or above will not run on the 4090 without partial offloading. Check the compatibility page for exact performance numbers.

Does Gemma 4 run on Apple Silicon?

Yes, and Apple Silicon is actually a strong choice for Gemma 4. Unified memory means a Mac with 24 GB or more treats all of it as GPU-accessible. An M4 Pro with 24 GB handles the 26B-A4B at Q4_K_M comfortably. An M4 Pro with 48 GB runs the 31B at recommended quality without sweating.

What is the best quantization for Gemma 4 quality?

Q6_K delivers near-FP16 quality and is our recommended target if your hardware supports it. Q4_K_M is the sweet spot for most users: going by the VRAM figures in this article, it needs roughly two-thirds less memory than the FP16 weights would, with minimal quality loss on chat and coding tasks. Only drop to Q3_K_M if you have no other option; reasoning quality degrades noticeably.
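To see where a savings figure like that comes from, here's a rough check. It assumes FP16 weights take 2 bytes per parameter and compares that against the Q4_K_M VRAM figures from this article. Those figures may include some runtime overhead beyond the weights, so the weight-only savings are likely somewhat higher than what this prints.

```python
FP16_BYTES_PER_PARAM = 2  # 16 bits per parameter


def fp16_weights_gb(total_params_b: float) -> float:
    """Approximate FP16 weight size in GB for a model with total_params_b billion parameters."""
    return total_params_b * FP16_BYTES_PER_PARAM


def savings_vs_fp16(quant_vram_gb: float, total_params_b: float) -> float:
    """Fraction of FP16 weight memory saved at a given quantized VRAM figure."""
    return 1 - quant_vram_gb / fp16_weights_gb(total_params_b)


# (variant, total parameters in billions, Q4_K_M VRAM in GB), all from this article.
for name, params_b, q4_gb in [("26B-A4B", 25.2, 18.0), ("31B", 30.7, 21.0)]:
    print(f"{name}: FP16 ~{fp16_weights_gb(params_b):.1f} GB, "
          f"Q4_K_M saves ~{savings_vs_fp16(q4_gb, params_b):.0%}")
```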