How we test: OwnRig's benchmark methodology

Every number on OwnRig is a promise. When we say the RTX 4090 runs Mistral 7B at 90 tok/s, or Qwen 2.5 Coder 32B at 25 tok/s, you should be able to reproduce that on your own hardware. If you can't, we've failed.

This page explains exactly how we arrive at those numbers, what our ratings mean, how we handle quantization, and where our data comes from. If you're a journalist, researcher, or YouTuber looking to cite our data, this is your reference.

How we measure tokens per second

Tokens per second (tok/s) is the speed at which a model generates text during inference. It's the single most important performance metric for local AI, because it determines whether a model feels responsive or feels like waiting for dial-up.

We measure tok/s using llama.cpp as the primary inference engine, running each model with its recommended quantization on each GPU. The measurement captures sustained generation speed, not time to first token, over a standardized 512-token output following a 128-token prompt.

512 tokens generated per test run. We report the median of 5 runs, not the fastest single pass.

Testing conditions

Inference engine: llama.cpp (latest stable release at time of test)
OS: Ubuntu 24.04 LTS for NVIDIA GPUs; macOS for Apple Silicon
Drivers: Latest stable NVIDIA driver; latest macOS release
Temperature: Ambient 22°C; GPU at thermal steady state
Prompt: Standardized 128-token input across all tests
Output: 512 tokens generated per run; median of 5 runs reported
GPU load: No other GPU workloads during testing (no display output from test GPU where possible)
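
The reporting rule above (512 tokens per run, median of 5 runs) can be sketched in a few lines of Python. This is an illustration only, not our actual harness: the function names are hypothetical and the timings in the usage note are made up.

```python
import statistics
from typing import Sequence

TOKENS_PER_RUN = 512  # output length per run, as stated above


def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Generation speed for one run: tokens produced divided by wall time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return n_tokens / elapsed_s


def median_tok_s(run_durations_s: Sequence[float],
                 n_tokens: int = TOKENS_PER_RUN) -> float:
    """Median tok/s across runs, so one unusually fast or slow pass
    does not decide the published figure."""
    speeds = [tokens_per_second(n_tokens, d) for d in run_durations_s]
    return statistics.median(speeds)
```

Called with five hypothetical durations such as `median_tok_s([5.8, 5.6, 5.7, 6.4, 5.5])`, the function returns the speed of the middle run (the 5.7-second one), so the slow 6.4-second outlier has no effect on the result.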

Performance ratings explained

Raw tok/s numbers need context. Is 15 tok/s good? It depends on what you're doing. For a chatbot, that's comfortable. For real-time code completion, it's sluggish. Our rating system accounts for this.

Rating	What it means	Typical tok/s range
Excellent	Fast, responsive, no compromise. The model runs as well as it can.	40+ tok/s
Good	Comfortable for interactive use. Most users won't notice any lag.	20 to 39 tok/s
Acceptable	Usable but you'll notice the pace. Fine for batch work or patient users.	8 to 19 tok/s
Marginal	It runs. Technically. Expect long waits. Better than nothing; worse than cloud.	1 to 7 tok/s
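
As a rough sketch, the table maps onto a simple threshold check. Two things here are assumptions, not part of the table: fractional values fall into the lower band (19.5 tok/s counts as Acceptable), and anything under 1 tok/s gets a hypothetical "Unrated" label. The ranges are typical values, so a published rating can also reflect factors this function ignores.

```python
def rating_for(tok_s: float) -> str:
    """Map a tok/s figure to the rating bands in the table above."""
    if tok_s >= 40:
        return "Excellent"
    if tok_s >= 20:
        return "Good"
    if tok_s >= 8:
        return "Acceptable"
    if tok_s >= 1:
        return "Marginal"
    # Assumption: the table has no band below 1 tok/s.
    return "Unrated"
```

For example, `rating_for(25)` falls in the Good band and `rating_for(15)` in the Acceptable band.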

Quantization and quality tiers

Quantization reduces a model's precision to fit into less VRAM. It's the difference between running a 70B model on a $1,800 GPU and not running it at all. But it comes with tradeoffs.

We test each model at multiple quantization levels and assign quality tiers:

Recommended (Q4_K_M): The sweet spot for most users. Minimal quality loss, significant VRAM savings. This is the quantization we use for our primary compatibility ratings.
High quality (Q6_K or Q8_0): Closer to the original model. Use this if you have VRAM to spare and care about output quality for professional work.
Minimum viable (Q3_K_M or Q2_K): Aggressive compression. Noticeable quality degradation, especially for reasoning and code generation. We include these so you know what's technically possible, but we don't recommend them for primary use.
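
To see why quantization matters so much for VRAM, here is a back-of-the-envelope estimate of weight memory. It is a lower bound under stated assumptions: it uses a nominal bit width (4 bits for a Q4-class format, 16 for unquantized FP16), while real GGUF formats also store scaling data, so actual files are somewhat larger. It also ignores the KV cache and runtime buffers.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes).

    params_billions * 1e9 weights * bits_per_weight / 8 bits per byte,
    divided by 1e9 bytes per GB, which simplifies to the line below.
    """
    if params_billions <= 0 or bits_per_weight <= 0:
        raise ValueError("inputs must be positive")
    return params_billions * bits_per_weight / 8
```

With these nominal figures, `weight_memory_gb(70, 16)` gives 140 GB for a 70B model at FP16, while `weight_memory_gb(70, 4)` gives 35 GB. Treat both as floors, not exact requirements.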

Where our data comes from

We combine three sources, weighted by reliability:

Our own testing on hardware we have physical access to. This covers the most popular GPU and Apple Silicon configurations.
Community benchmarks from r/LocalLLaMA, Hugging Face leaderboards, and llama.cpp performance threads. We cross-reference at least two independent sources before accepting a community figure.
Manufacturer specifications for hardware data (VRAM capacity, bandwidth, TDP). These come directly from NVIDIA, AMD, and Apple technical documentation.

When our testing and community data disagree, we publish the more conservative figure. We'd rather understate performance than overstate it. A user who gets 40 tok/s when we promised 35 is pleasantly surprised. The opposite destroys trust.
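
The "publish the more conservative figure" rule can be sketched as follows. This is a simplified illustration with hypothetical names. It assumes a community figure counts only when at least two independent sources report one, and that the lowest community value is the one compared. It does not model the reliability weighting mentioned above.

```python
from typing import Optional, Sequence

MIN_COMMUNITY_SOURCES = 2  # cross-reference threshold described above


def published_tok_s(own: Optional[float],
                    community: Sequence[float]) -> Optional[float]:
    """Pick the lowest credible tok/s figure, or None if there is none."""
    candidates = []
    if own is not None:
        candidates.append(own)
    if len(community) >= MIN_COMMUNITY_SOURCES:
        # Assumption: use the lowest community figure as the comparison point.
        candidates.append(min(community))
    return min(candidates) if candidates else None
```

With an in-house result of 40 tok/s and community reports of 36 and 38, the sketch publishes 36. A single unconfirmed community report is ignored.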

Data freshness and update cadence

Hardware prices change weekly. New models ship monthly. Our data has to keep up.

Prices: Verified against Amazon US (primary) and manufacturer sites monthly. Every price entry carries a price_updated date you can check.
Model compatibility: Re-tested when new quantization methods or inference engine versions ship. Major model releases (Llama, Mistral, Qwen) are tested within 2 weeks of public availability.
New hardware: Added within 4 weeks of retail availability, once we can verify real-world (not pre-release) performance.

We run an automated freshness checker that flags any price older than 30 days. If you see a stale price, it's a bug, not a feature. Our guides carry explicit "Updated" dates in the header.
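
A freshness check of this kind is straightforward to sketch. The example below is not our production checker. It assumes each entry is a dictionary with a `price_updated` field in ISO format (YYYY-MM-DD) and a hypothetical `name` field.

```python
from datetime import date, timedelta
from typing import Iterable, List, Mapping

MAX_PRICE_AGE = timedelta(days=30)


def stale_prices(entries: Iterable[Mapping[str, str]],
                 today: date) -> List[str]:
    """Return the names of entries whose price is older than 30 days."""
    stale = []
    for entry in entries:
        updated = date.fromisoformat(entry["price_updated"])
        if today - updated > MAX_PRICE_AGE:
            stale.append(entry["name"])
    return stale
```

A price updated exactly 30 days ago is not flagged here, because the rule is "older than 30 days".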

Known limitations

We believe in publishing what we don't know alongside what we do. Here's where our data has gaps:

AMD GPUs: We don't currently cover AMD Radeon GPUs. ROCm support for llama.cpp is improving, but the ecosystem isn't mature enough for us to publish reliable compatibility data yet. This is on our roadmap.
Multi-GPU setups: Our compatibility matrix assumes single-GPU inference. Multi-GPU configurations (NVLink, PCIe bridging) can run larger models but introduce communication overhead we haven't systematically benchmarked.
Fine-tuning: Our data covers inference only. Fine-tuning VRAM requirements are significantly higher (typically 2 to 4x inference requirements) and depend heavily on training framework and batch size.
Context length: Our standard test uses 128-token prompts with 512-token outputs. Longer context windows (32K+) require additional VRAM for the KV cache that our base figures don't account for.
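
To give a sense of scale for the context-length caveat, the KV cache for a standard transformer can be estimated as 2 (keys and values) × layers × KV heads × head dimension × tokens × bytes per element. The sketch below uses that formula. The architecture values in the usage note are illustrative assumptions for a hypothetical 8B-class model with grouped-query attention, not measurements of any model we list.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:
    """Estimated KV cache size in bytes.

    The leading 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes an FP16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem
```

With 32 layers, 8 KV heads and a head dimension of 128, our standard test (128 + 512 = 640 tokens) needs `kv_cache_bytes(32, 8, 128, 640)` bytes, which is 80 MiB. At a 32K context (32,768 tokens) the same hypothetical model needs 4 GiB for the cache alone.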

Common questions

Can I cite OwnRig data in my article or video?

Yes. Link to the relevant page and mention OwnRig as the source. We appreciate attribution but don't require permission for editorial use.

Why llama.cpp and not vLLM or TensorRT-LLM?

llama.cpp is the most widely used inference engine for local AI and runs on every platform we cover (NVIDIA, Apple Silicon). It's the common denominator. We may add vLLM benchmarks in the future for server-focused use cases.

How do you handle models that need CPU offloading?

We mark these as "requires offloading" in the compatibility matrix and rate them accordingly (usually marginal). CPU offloading works, but generation can run 10 to 100x slower depending on how the model is split between GPU and CPU.

Why don't you test every quantization level?

We focus on Q4_K_M (recommended), Q6_K (high quality), and Q3_K_M (minimum viable) because these cover the practical range. Testing every GGUF variant for every model on every GPU would be thousands of combinations with diminishing informational value.