What are diffusion language models?

Most local AI guides assume one story: load a GGUF, pick a quantization, watch tokens stream left to right. That is autoregressive decoding. It is what Llama, Qwen, and Mistral do in Ollama today.

Diffusion language models break that habit. They can still chat. They can still code. But under the hood they sometimes rewrite whole chunks of text in parallel instead of marching one token at a time.

Autoregressive vs diffusion

Autoregressive (AR): each new token conditions on everything before it. Simple mental model, predictable VRAM (weights plus KV cache), mature tooling.
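
The "weights plus KV cache" part can be made concrete with a little arithmetic. The sketch below uses the standard key/value cache size formula for a transformer; the layer count, KV head count, and head dimension are illustrative assumptions for a dense 8B-class model with grouped-query attention, not figures taken from any model in this article.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes for keys plus values across all layers at a given context length."""
    # Factor of 2: one tensor for keys, one for values.
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value


# Assumed, illustrative shape (hypothetical, not a spec for any catalog model).
per_token = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context_tokens=1)
at_8k = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context_tokens=8192)

print(f"{per_token / 1024:.0f} KiB per token")      # 128 KiB per token
print(f"{at_8k / 1024**3:.2f} GiB at 8192 tokens")  # 1.00 GiB at 8192 tokens
```

The cache grows linearly with context, which is why AR memory use is easy to budget: pick a context length, add the result to the weight size, and leave some headroom for runtime buffers.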

Diffusion LM: the model iterates on a block of tokens, refining noise into readable text. NVIDIA's Nemotron-Labs family also advertises self-speculation, switching between AR and diffusion-style steps based on attention patterns.
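
To make the contrast tangible, here is a toy sketch of the two decode loops. It is not NVIDIA's algorithm: toy_model is a hypothetical stub that returns random tokens with random confidence scores, and the block-refinement loop uses mask tokens as a stand-in for noise, committing the most confident positions on each pass. The only thing it demonstrates is how the number of forward passes differs.

```python
import random

MASK = "<mask>"
VOCAB = ["the", "cat", "sat", "on", "a", "mat"]


def toy_model(tokens, rng):
    """Stand-in for a forward pass: one (token, confidence) pair per position.

    Hypothetical stub. A real model would return logits over its vocabulary.
    """
    return [(rng.choice(VOCAB), rng.random()) for _ in tokens]


def decode_autoregressive(length, rng):
    """Append one token per forward pass, left to right."""
    tokens, passes = [], 0
    while len(tokens) < length:
        # The model sees the prefix plus one open slot to fill.
        proposals = toy_model(tokens + [MASK], rng)
        passes += 1
        tokens.append(proposals[-1][0])
    return tokens, passes


def decode_block_diffusion(length, unmask_per_step, rng):
    """Start fully masked; commit the most confident slots on each pass."""
    tokens, passes = [MASK] * length, 0
    while MASK in tokens:
        proposals = toy_model(tokens, rng)
        passes += 1
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:unmask_per_step]:
            tokens[i] = proposals[i][0]
    return tokens, passes


rng = random.Random(0)
ar_tokens, ar_passes = decode_autoregressive(16, rng)
dlm_tokens, dlm_passes = decode_block_diffusion(16, unmask_per_step=4, rng=rng)
print(ar_passes, dlm_passes)  # 16 4
```

Fewer passes does not automatically mean faster output. Each diffusion-style pass processes the whole block, so real throughput depends on how well the runtime and kernels handle that workload.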

That flexibility is the pitch. It is also the support problem. Your inference engine has to implement those modes, not just load weights.

What OwnRig tracks today

We added Nemotron-Labs Diffusion 8B to the catalog with architecture type diffusion_lm. Official BF16 weights are about 16.98 GB on disk; we estimate 19 GB total VRAM at practical context. Both figures derive from NVIDIA's published artifact size, not from measurements in our RTX 4090 benchmark lab.
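
As a rough sanity check on those two figures, the arithmetic can be sketched as follows. The 16.98 GB and 19 GB values come from the paragraph above; the fits helper and the 16 GB and 24 GB card sizes are illustrative assumptions, and the check treats advertised capacity as fully usable, which real drivers and desktops do not guarantee.

```python
WEIGHTS_GB = 16.98         # BF16 weights on disk, from the catalog entry
TOTAL_VRAM_EST_GB = 19.0   # estimated total at practical context

# Whatever is not weights: cache, activations, runtime buffers.
overhead_gb = TOTAL_VRAM_EST_GB - WEIGHTS_GB


def fits(card_vram_gb: float, needed_gb: float = TOTAL_VRAM_EST_GB) -> bool:
    """True when the estimate fits entirely in the card's VRAM (hypothetical helper)."""
    return card_vram_gb >= needed_gb


print(f"overhead beyond weights: {overhead_gb:.2f} GB")  # 2.02 GB
for card_gb in (16, 24):
    print(card_gb, "GB card:", "fits" if fits(card_gb) else "does not fit")
```

Fitting in memory is only the first gate. A model that fits still needs a runtime that implements its decode modes.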

We deliberately omit GPU compatibility rows. Ada Gate A failed the consumer runtime bar: no official GGUF, no stock Ollama path, and SGLang diffusion LM support still landing via pull requests. Listing tok/s would be theater.

How to read vendor speed claims

NVIDIA's launch materials cite multipliers versus autoregressive baselines on datacenter hardware with custom kernels. Impressive slides. Not a shopping list for a $299 GPU.

OwnRig policy: we publish speeds only when a typical builder can reproduce the command on hardware we track. Until then, editorial context only.

What to run instead right now

If you want a coding model on a 16GB card today, Qwen3.6-35B-A3B at Q3_K_M is the honest OwnRig story: MoE, Apache 2.0, community GGUF, Ollama-ready. Diffusion LMs are the next chapter, not the current homework.

Common Questions

Is a diffusion language model the same as Stable Diffusion?

No. Image diffusion models denoise pixels. Diffusion language models denoise token blocks in text space. Same broad idea (iterative refinement), different domain. Your SD workflow does not automatically run a diffusion LM.

Will Ollama run Nemotron-Labs Diffusion?

Not today in our verification. NVIDIA ships Safetensors with custom architecture code. Stock Ollama and llama.cpp paths expect autoregressive GGUF weights. OwnRig lists Nemotron as catalog-only until a reproducible consumer runtime lands.
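
If you are unsure what a downloaded checkpoint actually is, the container format can be guessed from its first bytes. The sketch below relies on two things we are confident of: GGUF files start with the ASCII magic GGUF, and Safetensors files start with an 8-byte little-endian header length followed by a JSON header. The function name and the 100 MB sanity cap are our own assumptions, and the demo files are synthetic stubs, not real checkpoints.

```python
import json
import struct
import tempfile
from pathlib import Path


def sniff_weight_format(path: Path) -> str:
    """Best-effort guess at the container format from the start of a file."""
    with path.open("rb") as f:
        head = f.read(8)
        if head[:4] == b"GGUF":
            return "gguf"
        if len(head) == 8:
            # Safetensors: 8-byte little-endian header length, then JSON.
            (header_len,) = struct.unpack("<Q", head)
            if 0 < header_len <= 100_000_000:  # assumed sanity cap
                try:
                    json.loads(f.read(header_len))
                    return "safetensors"
                except ValueError:
                    pass
    return "unknown"


# Synthetic stub files, only to exercise the function.
with tempfile.TemporaryDirectory() as tmp:
    gguf = Path(tmp) / "model.gguf"
    gguf.write_bytes(b"GGUF" + b"\x00" * 16)
    st = Path(tmp) / "model.safetensors"
    header = json.dumps({"__metadata__": {}}).encode()
    st.write_bytes(struct.pack("<Q", len(header)) + header)
    results = (sniff_weight_format(gguf), sniff_weight_format(st))

print(results)  # ('gguf', 'safetensors')
```

A matching container is necessary, not sufficient. Even a GGUF conversion would only help once the runtime implements the model's architecture and decode modes.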

Should I buy hardware for diffusion LMs right now?

Buy for what you run this month. If that is Qwen, Llama, or Mistral through Ollama, an RTX 4060 Ti 16GB or RTX 4090 still matches reality. Diffusion LMs are worth watching, not worth restructuring a build around until tooling catches up.

Does decode mode matter for hardware planning?

Yes, mostly for throughput. VRAM is set first by weight format; throughput then depends on decode mode. A model that looks like an 8B dense checkpoint on paper can behave differently in diffusion or self-speculation mode. Always check which mode a benchmark used before you compare tok/s numbers.