Every number on OwnRig is a promise. When we say the RTX 4090 runs Mistral 7B at 90 tok/s, or Qwen 2.5 Coder 32B at 25 tok/s, you should be able to reproduce that on your own hardware. If you can't, we've failed.
This page explains exactly how we arrive at those numbers, what our ratings mean, how we handle quantization, and where our data comes from. If you're a journalist, researcher, or YouTuber looking to cite our data, this is your reference.
How we measure tokens per second
Tokens per second (tok/s) is the speed at which a model generates text during inference. It's the single most important performance metric for local AI, because it determines whether a model feels responsive or feels like waiting for dial-up.
We measure tok/s using llama.cpp as the primary inference engine, running each model with its recommended quantization on each GPU. The measurement captures sustained generation speed (not just the first token) over a standardized 512-token output with a 128-token prompt.
512 tokens generated per test run. Median of 5 runs reported, not the fastest single pass.
Testing conditions
- Inference engine: llama.cpp (latest stable release at time of test)
- OS: Ubuntu 24.04 LTS for NVIDIA GPUs; macOS for Apple Silicon
- Drivers: Latest stable NVIDIA driver; latest macOS release
- Temperature: Ambient 22°C; GPU at thermal steady state
- Prompt: Standardized 128-token input across all tests
- Output: 512 tokens generated per run; median of 5 runs reported
- No other GPU workloads during testing (no display output from test GPU where possible)
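As a rough illustration of the reporting rule above, the sketch below turns five timed runs into one published figure. The function names and sample timings are hypothetical, and it assumes each elapsed time covers only the generation phase of a 512-token run. It is not the actual test harness.

```python
from statistics import median

TOKENS_PER_RUN = 512  # output length used in every test run


def tokens_per_second(tokens_generated: int, elapsed_seconds: float) -> float:
    """Sustained generation speed for a single run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return tokens_generated / elapsed_seconds


def reported_tok_s(run_times_s, tokens_generated=TOKENS_PER_RUN):
    """Median tok/s across runs, so one unusually fast pass does not set the figure."""
    rates = [tokens_per_second(tokens_generated, t) for t in run_times_s]
    return median(rates)


# Hypothetical generation times (seconds) for five runs of the same test.
timings = [5.9, 5.7, 5.8, 6.4, 5.6]
print(f"{reported_tok_s(timings):.1f} tok/s")
```

Using the median rather than the best run keeps a single outlier, fast or slow, from moving the reported number.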
Performance ratings explained
Raw tok/s numbers need context. Is 15 tok/s good? It depends on what you're doing. For a chatbot, that's comfortable. For real-time code completion, it's sluggish. Our rating system accounts for this.
| Rating | What it means | Typical tok/s range |
|---|---|---|
| Excellent | Fast, responsive, no compromise. The model runs as well as it can. | 40+ tok/s |
| Good | Comfortable for interactive use. Most users won't notice any lag. | 20 to 39 tok/s |
| Acceptable | Usable but you'll notice the pace. Fine for batch work or patient users. | 8 to 19 tok/s |
| Marginal | It runs. Technically. Expect long waits. Better than nothing; worse than cloud. | 1 to 7 tok/s |
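The typical ranges in the table can be sketched as a simple lookup. This is only an approximation of the rating logic: the function name is hypothetical, the table does not say how fractional values between bands (such as 19.5 tok/s) or speeds below 1 tok/s are treated, and the choices made here for both cases are assumptions.

```python
from typing import Optional


def typical_rating(tok_s: float) -> Optional[str]:
    """Map a tok/s figure to the rating whose typical range contains it.

    Assumption: each band starts at its lower bound, so 19.5 tok/s counts
    as Acceptable. Speeds below 1 tok/s are not covered by the table, so
    this sketch returns None for them.
    """
    if tok_s >= 40:
        return "Excellent"
    if tok_s >= 20:
        return "Good"
    if tok_s >= 8:
        return "Acceptable"
    if tok_s >= 1:
        return "Marginal"
    return None


for speed in (90, 25, 15, 3):
    print(speed, typical_rating(speed))
```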
Quantization and quality tiers
Quantization reduces a model's precision to fit into less VRAM. It's the difference between running a 70B model on a $1,800 GPU and not running it at all. But it comes with tradeoffs.
We test each model at multiple quantization levels and assign quality tiers:
- Recommended (Q4_K_M): The sweet spot for most users. Minimal quality loss, significant VRAM savings. This is the quantization we use for our primary compatibility ratings.
- High quality (Q6_K or Q8_0): Closer to the original model. Use this if you have VRAM to spare and care about output quality for professional work.
- Minimum viable (Q3_K_M or Q2_K): Aggressive compression. Noticeable quality degradation, especially for reasoning and code generation. We include these so you know what's technically possible, but we don't recommend them for primary use.
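To see why the quantization level decides whether a model fits at all, a back-of-the-envelope estimate of weight size can help. The sketch below is an assumption-heavy approximation: the bits-per-weight values are rough figures that vary by model and llama.cpp version, the function name is hypothetical, and the result covers weights only (no KV cache or runtime overhead).

```python
# Rough average bits per weight for each format. These are approximations
# (assumptions for illustration), not measured values.
APPROX_BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
}


def approx_weight_gb(params_billions: float, quant: str) -> float:
    """Estimate weight size in decimal GB: parameters x bits per weight / 8."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return params_billions * bits / 8


# A nominal 32B-parameter model at each level.
for quant in APPROX_BITS_PER_WEIGHT:
    print(f"{quant:>7}: ~{approx_weight_gb(32, quant):.1f} GB")
```

Under these assumptions, a nominal 32B model needs about 64 GB of weights at F16 but roughly 19 GB at Q4_K_M, which is the gap between not fitting on a 24 GB card and fitting with some room left for context.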
Where our data comes from
We combine three sources, weighted by reliability:
- Our own testing on hardware we have physical access to. This covers the most popular GPU and Apple Silicon configurations.
- Community benchmarks from r/LocalLLaMA, Hugging Face leaderboards, and llama.cpp performance threads. We cross-reference at least two independent sources before accepting a community figure.
- Manufacturer specifications for hardware data (VRAM capacity, bandwidth, TDP). These come directly from NVIDIA, AMD, and Apple technical documentation.
When our testing and community data disagree, we publish the more conservative figure. We'd rather understate performance than overstate it. A user who gets 40 tok/s when we promised 35 is pleasantly surprised. The opposite destroys trust.
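The conservative-figure rule can be sketched in a few lines. The function names are hypothetical, and one detail is an assumption: the text does not say how multiple community reports are combined, so this sketch takes the lowest of them.

```python
from typing import Optional, Sequence

MIN_COMMUNITY_SOURCES = 2  # at least two independent sources are required


def community_figure(reports: Sequence[float]) -> Optional[float]:
    """Accept community data only when enough independent reports exist.

    Assumption: the lowest report is used as the community figure.
    """
    if len(reports) < MIN_COMMUNITY_SOURCES:
        return None
    return min(reports)


def published_figure(own_test: Optional[float],
                     reports: Sequence[float]) -> Optional[float]:
    """Publish the more conservative (lower) of the available tok/s figures."""
    candidates = [v for v in (own_test, community_figure(reports)) if v is not None]
    return min(candidates) if candidates else None


print(published_figure(40.0, [35.0, 37.5]))  # community data is lower
print(published_figure(40.0, [35.0]))        # single report is not accepted
```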
Data freshness and update cadence
Hardware prices change weekly. New models ship monthly. Our data has to keep up.
- Prices: Verified against Amazon US (primary) and manufacturer sites monthly. Every price entry carries a `price_updated` date you can check.
- Model compatibility: Re-tested when new quantization methods or inference engine versions ship. Major model releases (Llama, Mistral, Qwen) are tested within 2 weeks of public availability.
- New hardware: Added within 4 weeks of retail availability, once we can verify real-world (not pre-release) performance.
We run an automated freshness checker that flags any price older than 30 days. If you see a stale price, it's a bug, not a feature. Our guides carry explicit "Updated" dates in the header.
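A freshness check of this kind might look like the sketch below. Only the `price_updated` field and the 30-day threshold come from the text above; the entry layout, the `product` key, the ISO date format, and the function name are assumptions made for illustration.

```python
from datetime import date, timedelta

MAX_PRICE_AGE = timedelta(days=30)  # prices older than this are flagged


def find_stale_prices(entries, today):
    """Return the entries whose price_updated date is more than 30 days old.

    Assumption: price_updated is stored as an ISO date string (YYYY-MM-DD).
    """
    stale = []
    for entry in entries:
        updated = date.fromisoformat(entry["price_updated"])
        if today - updated > MAX_PRICE_AGE:
            stale.append(entry)
    return stale


# Hypothetical price entries.
sample = [
    {"product": "GPU A", "price_updated": "2025-01-02"},
    {"product": "GPU B", "price_updated": "2025-02-20"},
]
stale = find_stale_prices(sample, today=date(2025, 3, 1))
print([entry["product"] for entry in stale])
```

Passing `today` in explicitly, rather than reading the system clock inside the function, keeps the check easy to test.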
Known limitations
We believe in publishing what we don't know alongside what we do. Here's where our data has gaps:
- AMD GPUs: We don't currently cover AMD Radeon GPUs. ROCm support for llama.cpp is improving, but the ecosystem isn't mature enough for us to publish reliable compatibility data yet. This is on our roadmap.
- Multi-GPU setups: Our compatibility matrix assumes single-GPU inference. Multi-GPU configurations (NVLink, PCIe bridging) can run larger models but introduce communication overhead we haven't systematically benchmarked.
- Fine-tuning: Our data covers inference only. Fine-tuning VRAM requirements are significantly higher (typically 2 to 4x inference requirements) and depend heavily on training framework and batch size.
- Context length: Our standard test uses 128-token prompts with 512-token outputs. Longer context windows (32K+) require additional VRAM for the KV cache that our base figures don't account for.
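To give a sense of scale for the context-length gap, the KV cache for a standard transformer can be estimated from the model's shape. The sketch below uses the common formula (keys and values, per layer, per KV head, per token). The model dimensions are illustrative values in the range of a 7B model with grouped-query attention, and the 16-bit cache precision is an assumption; real usage also depends on the inference engine.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: 2 (keys and values) x layers x KV heads
    x head dimension x tokens x bytes per stored value."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


GIB = 1024 ** 3

# Illustrative model shape (assumed): 32 layers, 8 KV heads, head dimension 128.
standard_test = kv_cache_bytes(32, 8, 128, context_tokens=128 + 512)
long_context = kv_cache_bytes(32, 8, 128, context_tokens=32 * 1024)

print(f"Standard test (640 tokens): {standard_test / GIB:.3f} GiB")
print(f"32K context:                {long_context / GIB:.3f} GiB")
```

With these assumed dimensions the cache grows from well under 0.1 GiB in the standard test to 4 GiB at 32K tokens, which is why a model that fits comfortably in the base figures can still run out of VRAM at long context.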
