How much VRAM does NVIDIA Nemotron-3-super-120B-A12B need?

NVIDIA Nemotron-3-super-120B-A12B requires 70 GB VRAM at recommended quality (Q4_K_M). At lower quality settings, it can fit in as little as 40 GB.

What is the best GPU for NVIDIA Nemotron-3-super-120B-A12B?

The NVIDIA Grace Blackwell Ultra GB300 delivers the best performance for NVIDIA Nemotron-3-super-120B-A12B, achieving 180 tok/s at Q4_K_M with an excellent rating.

Can I run NVIDIA Nemotron-3-super-120B-A12B on an RTX 4060 Ti?

No. Even at Q2_K, the smallest listed quantization, NVIDIA Nemotron-3-super-120B-A12B requires 40 GB VRAM, which exceeds the RTX 4060 Ti's 16 GB. Running it requires a GPU with more VRAM.

What quantization should I use for NVIDIA Nemotron-3-super-120B-A12B?

For the best quality, use Q4_K_M (70 GB VRAM). If your GPU has limited VRAM, Q3_K_M (50 GB) is the middle option, and Q2_K (40 GB) is the smallest option with acceptable quality.

Is NVIDIA Nemotron-3-super-120B-A12B good for coding?

Yes. NVIDIA Nemotron-3-super-120B-A12B is used with Continue, LM Studio, and Open WebUI for local AI coding. For the best coding experience, pair it with an embedding model for local RAG.

Chat · Coding · Reasoning · Multi-purpose · 120B

Chat

NVIDIA Nemotron-3-super-120B-A12B

Nemotron · NVIDIA Open Model License

Mixture of Experts (MoE): 120B total parameters, roughly 12B active per token. The full expert pool must fit in VRAM, but once loaded the model decodes more like a smaller model, because only the active parameters are used for each token. Native context is 131K tokens, with extension support up to 1M tokens.
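The gap between what has to be stored and what is used per token can be illustrated with a little arithmetic. The sketch below is a rough estimate only: it assumes 1 GB = 1e9 bytes, assumes the quantization is spread evenly across all parameters, and takes the 67 GB Q4_K_M file size listed on this page. The function names are illustrative, not part of any library.

```python
def bits_per_weight(file_size_gb: float, total_params_b: float) -> float:
    """Average stored bits per parameter (assumes 1 GB = 1e9 bytes)."""
    return file_size_gb * 8 / total_params_b


def active_gb_per_token(file_size_gb: float, total_params_b: float,
                        active_params_b: float) -> float:
    """Approximate weight data used per decoded token, in GB.

    Assumes every parameter is quantized to the same average width,
    which is a simplification for mixed-precision formats.
    """
    return file_size_gb * active_params_b / total_params_b


TOTAL_PARAMS_B = 120   # total parameters, in billions (from this page)
ACTIVE_PARAMS_B = 12   # active parameters per token, in billions
Q4_FILE_SIZE_GB = 67   # Q4_K_M file size listed on this page

print(round(bits_per_weight(Q4_FILE_SIZE_GB, TOTAL_PARAMS_B), 2))
print(round(active_gb_per_token(Q4_FILE_SIZE_GB, TOTAL_PARAMS_B, ACTIVE_PARAMS_B), 1))
```

Under these assumptions the whole 67 GB must be resident, while each decoded token uses only about a tenth of it.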

Parameters: 120B
Architecture: MoE (12B active)
Context: 1,048,576 tokens
Released: 2025-12-15
Engines: llama.cpp, vLLM
Builder Tools: Continue, LM Studio, Open WebUI

Parameters: 120B
VRAM: 70 GB
Context: 1024K


NVIDIA Nemotron-3-super-120B-A12B (120B) requires 70 GB VRAM at recommended quality (Q4_K_M). On NVIDIA Grace Blackwell Ultra GB300, expect approximately 180 tok/s at Q4_K_M. For the best experience, Mac Studio AI Builder ($3,999) is recommended.

Source: OwnRig methodology

VRAM (Recommended): 70 GB
Quantization: Q4_K_M
File Size: 67 GB
Max Context: 1024K tokens
Primary Use: Chat

Memory

VRAM Requirements

Quality	Quantization	VRAM	File Size
recommended	Q4_K_M	70 GB	67 GB
efficient	Q3_K_M	50 GB	48 GB
compressed	Q2_K	40 GB	38 GB
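The table above can be read as a simple lookup: take the highest-quality tier whose VRAM requirement fits the memory you have. A minimal sketch, assuming the listed figures are accurate for your inference engine and that all of the stated memory is usable by the model (on unified-memory systems it usually is not); `pick_quantization` is a hypothetical helper name.

```python
from typing import Optional

# (quality label, quantization, required VRAM in GB), best quality first.
# Values are copied from the VRAM Requirements table.
QUANT_TIERS = [
    ("recommended", "Q4_K_M", 70),
    ("efficient", "Q3_K_M", 50),
    ("compressed", "Q2_K", 40),
]


def pick_quantization(available_gb: float) -> Optional[str]:
    """Return the highest-quality quantization that fits, or None if none do."""
    for _label, quant, required_gb in QUANT_TIERS:
        if available_gb >= required_gb:
            return quant
    return None


for memory_gb in (128, 64, 48, 16):
    print(memory_gb, pick_quantization(memory_gb))
```

These requirements cover the weights at a short context only; long contexts add KV cache on top.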

Scaling

Context Length Impact

KV cache VRAM at Q4_K_M quality, added on top of the 70 GB needed for the weights. KV cache size grows linearly with context length.

Context	KV Cache	Total VRAM	Note
2K	512 MB	70.5 GB	exceeds 24 GB
4K	1 GB	71 GB	exceeds 24 GB
8K	2 GB	72 GB	exceeds 24 GB
16K	4 GB	74 GB	exceeds 24 GB
32K	8 GB	78 GB	exceeds 24 GB
64K	16 GB	86 GB	exceeds 24 GB
128K	32 GB	102 GB	exceeds 24 GB
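The figures in this table follow a linear rule: 256 MB of KV cache per 1K tokens, added to the 70 GB Q4_K_M baseline. The sketch below reproduces the table under those assumptions, taking 1K = 1024 tokens and 1 GB = 1024 MB. The constants are read off the table rather than from any engine, and actual usage may differ with KV cache precision and batch size.

```python
BASE_VRAM_GB = 70.0              # Q4_K_M weights, from the table
KV_CACHE_MB_PER_1K_TOKENS = 256  # inferred: 512 MB at 2K context


def kv_cache_gb(context_tokens: int) -> float:
    """Estimated KV cache size in GB for a given context length."""
    return context_tokens / 1024 * KV_CACHE_MB_PER_1K_TOKENS / 1024


def total_vram_gb(context_tokens: int) -> float:
    """Estimated total VRAM in GB: weights plus KV cache."""
    return BASE_VRAM_GB + kv_cache_gb(context_tokens)


for context_k in (2, 4, 8, 16, 32, 64, 128):
    tokens = context_k * 1024
    print(f"{context_k}K\t{kv_cache_gb(tokens):g} GB\t{total_vram_gb(tokens):g} GB")
```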

Compatible GPUs

43 devices

GPU	Quantization	Speed	Rating
NVIDIA Grace Blackwell Ultra GB300	Q4_K_M	180 tok/s	Excellent
Apple M4 Max (128GB Unified)	Q4_K_M	39 tok/s	Excellent
Apple M4 Ultra (192GB)	Q4_K_M	51 tok/s	Excellent
NVIDIA RTX PRO 6000 Blackwell	Q4_K_M	158 tok/s	Excellent
NVIDIA RTX PRO 6000 Blackwell Max-Q	Q4_K_M	145 tok/s	Excellent
Apple M4 Max (64GB Unified)	Q3_K_M	41 tok/s	Good
Apple M4 Pro (48GB)	Q2_K	50 tok/s	Good
AMD Radeon Pro W7900	Q2_K	54 tok/s	Good
Apple M4 Max (36GB Unified)	Q2_K	9 tok/s	Marginal
Apple M4 Pro (24GB Unified)	Q2_K	8 tok/s	Marginal
NVIDIA GeForce RTX 3090	Q2_K	11 tok/s	Marginal
NVIDIA GeForce RTX 4090	Q2_K	18 tok/s	Marginal
NVIDIA GeForce RTX 5090	Q2_K	23 tok/s	Marginal
Apple M3 Pro (18GB Unified)	Q2_K	–	Not viable
Apple M4 (16GB Unified)	Q2_K	–	Not viable
NVIDIA GeForce RTX 3060 12GB	Q2_K	–	Not viable
NVIDIA GeForce RTX 3080 10GB	Q2_K	–	Not viable
NVIDIA GeForce RTX 4060 8GB	Q2_K	–	Not viable
NVIDIA RTX 4060 Laptop (40-60W)	Q2_K	–	Not viable
NVIDIA GeForce RTX 4060 Ti 16GB	Q2_K	–	Not viable
NVIDIA RTX 4070 Laptop (80-115W)	Q2_K	–	Not viable
NVIDIA GeForce RTX 4070 Super	Q2_K	–	Not viable
NVIDIA GeForce RTX 4070 Ti 12GB	Q2_K	–	Not viable
NVIDIA GeForce RTX 4070 Ti Super	Q2_K	–	Not viable
NVIDIA RTX 4080 Laptop (120-150W)	Q2_K	–	Not viable
NVIDIA GeForce RTX 4080 Super	Q2_K	–	Not viable
NVIDIA RTX 4090 Laptop (150-175W)	Q2_K	–	Not viable
NVIDIA GeForce RTX 5080	Q2_K	–	Not viable
AMD Radeon RX 7600	Q2_K	–	Not viable
AMD Radeon RX 7900 XTX	Q2_K	–	Not viable
AMD Radeon RX 9070	Q2_K	–	Not viable
Apple M1 (8GB Unified)	Q2_K	–	Not viable
Apple M1 (16GB Unified)	Q2_K	–	Not viable
Apple M1 Pro (16GB Unified)	Q2_K	–	Not viable
Apple M2 (8GB Unified)	Q2_K	–	Not viable
Apple M2 (16GB Unified)	Q2_K	–	Not viable
Apple M2 Pro (16GB Unified)	Q2_K	–	Not viable
Apple M3 (8GB Unified)	Q2_K	–	Not viable
Apple M3 (16GB Unified)	Q2_K	–	Not viable
AMD Radeon RX 9060 XT 16GB	Q2_K	–	Not viable
AMD Radeon RX 9060 XT 8GB	Q2_K	–	Not viable
NVIDIA GeForce RTX 5060 8GB	Q2_K	–	Not viable
NVIDIA GeForce RTX 5060 Ti 16GB	Q2_K	–	Not viable


Builder Context

NVIDIA Nemotron-3-super-120B-A12B is commonly used with Continue, LM Studio, and Open WebUI. For an AI coding workflow, pair it with an embedding model such as nomic-embed-text for local RAG.



Data confidence: estimated.

VRAM requirements are calculated from model parameters and may vary by inference engine, context length, and batch size. Performance estimates are based on community benchmarks and should be verified for your specific configuration.

Nemotron is a trademark of its respective owner. OwnRig is not affiliated with or endorsed by the model creator.