Tutorial

The Complete Guide to Running LLMs Locally

Run large language models locally: hardware needs, Ollama and llama.cpp, model picks by use case, and quantization.

OwnRig Editorial | 15 min read | March 14, 2026

I ran my first local LLM on a Tuesday afternoon. Installed Ollama, typed one command, waited 90 seconds for the model to download, and started chatting with an AI that lived entirely on my machine. No API key. No cloud account. No bill.

That was two months ago. Since then, I've run 42 different models on 18 devices for OwnRig's compatibility database. This guide is everything I've learned, compressed into the four steps between you and a working local AI.

4 steps from zero to your first local LLM: hardware check, model choice, engine install, generate.

01. What you need

Local AI has three hard requirements. Miss any one and you're stuck.

  1. A GPU with enough VRAM. This is the wall most people hit. Your GPU's VRAM (its dedicated memory) must hold the entire model. Our VRAM guide has the full breakdown; our GPU buying guide has specific recommendations.
  2. System RAM: 16 GB minimum, 32 GB recommended. Some models partially offload to system RAM. More headroom means you can run other applications alongside AI without slowdowns.
  3. An SSD with free space. Model files range from 2 GB (small 3B models) to 40+ GB (large 70B models). An NVMe SSD keeps model loading fast.
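A quick way to sanity-check requirement #1 before downloading anything: multiply the parameter count (in billions) by the bytes per parameter for your quantization, then add overhead. This is a rough heuristic, not a vendor formula; the ~20% overhead figure is our assumption to cover the KV cache and runtime buffers, and real GGUF files vary a little from the nominal sizes.

```shell
# Rough fit check: will a model fit in your GPU's VRAM?
# Heuristic assumptions: FP16 = 2.0 bytes/param, Q8 = 1.0, Q4 = 0.5,
# plus ~20% overhead for the KV cache and runtime buffers.
estimate_vram_gb() {
  params_b=$1   # parameters, in billions
  bytes=$2      # bytes per parameter for the quantization
  awk -v p="$params_b" -v b="$bytes" 'BEGIN { printf "%.1f", p * b * 1.2 }'
}

estimate_vram_gb 7 0.5;  echo " GB for a 7B model at Q4"
estimate_vram_gb 70 0.5; echo " GB for a 70B model at Q4"
```

If the estimate lands above your GPU's VRAM, drop to a smaller model or a lower quantization before you spend time downloading.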

02. Choose your model

What you want the AI to do determines which model you need. Here are our picks by use case, sorted smallest to largest. Smaller models run on cheaper hardware; larger models produce better output. Pick the biggest one your GPU can fit.

Chat and general assistant

Model                    | Parameters | VRAM (Q4) | Best for
Llama 3.2 1B Instruct    | 1.24B      | 819 MB    | Quick responses, lower hardware
Llama 3.2 3B Instruct    | 3.21B      | 2.1 GB    | Quick responses, lower hardware
Phi-3 Mini 3.8B Instruct | 3.82B      | 2.6 GB    | Quick responses, lower hardware
Phi-4 Mini               | 3.82B      | 2.4 GB    | Quick responses, lower hardware
Gemma 3 4B               | 4.3B       | 2.5 GB    | Quick responses, lower hardware
Mistral 7B Instruct v0.3 | 7.24B      | 4.5 GB    | Quick responses, lower hardware

Coding and development

If you're running a local coding assistant (as a Copilot replacement or IDE integration), these are the models to use. I'd recommend Qwen 2.5 Coder if your GPU can fit it; it's the best open coding model I've tested.

Model                    | Parameters | VRAM (Q4) | Specialty
Llama 3.2 1B Instruct    | 1.24B      | 819 MB    | The smallest Llama model. Runs on integrated GPUs and even C…
Llama 3.2 3B Instruct    | 3.21B      | 2.1 GB    | Ultra-lightweight model that runs on virtually any GPU. Surp…
Phi-3 Mini 3.8B Instruct | 3.82B      | 2.6 GB    | Punches above its weight — a 3.8B model that rivals many 7B…
Phi-4 Mini               | 3.82B      | 2.4 GB    | Microsoft's tiny powerhouse. Punches well above its weight a…
Gemma 3 4B               | 4.3B       | 2.5 GB    | Compact Gemma 3 model for chat and light coding on low-VRAM…

Image generation

For images, look at FLUX.1 Dev, Stable Diffusion 3.5 Large, and Stable Diffusion XL 1.0. These need 6 to 12 GB VRAM for standard generation and work well with ComfyUI and Automatic1111.

03. Install an inference engine

An inference engine loads the model onto your GPU and runs it. Three options, one clear recommendation.

Ollama (start here)

The simplest option. One installer, one command to run any supported model. It handles GPU detection, quantization selection, and model downloads automatically. Works on macOS, Linux, and Windows.

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Run your first model
ollama run llama3.1:8b

# Run a coding model
ollama run qwen2.5-coder:32b

On Windows, use the installer from ollama.com/download instead of the shell script. The commands after install are the same in PowerShell or Command Prompt.

That's it. Seriously. If you're reading this guide for the first time, install Ollama and stop here until you've run a model. Everything below is optimization.
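One optimization worth knowing about early: alongside the interactive chat, Ollama serves a local REST API on port 11434, and its documented /api/generate endpoint is how editor plugins and scripts talk to it. A minimal sketch; the model name and prompt here are just examples, and the block falls back to printing the request when no server is running:

```shell
# Ollama listens on http://localhost:11434 by default.
# /api/generate is its one-shot completion endpoint.
payload='{"model": "llama3.1:8b", "prompt": "Why is the sky blue?", "stream": false}'

if curl -fsS http://localhost:11434/api/tags >/dev/null 2>&1; then
  # Server is up: send the request and print the response JSON
  curl -s http://localhost:11434/api/generate -d "$payload"
else
  # No server running: just show what would be sent
  echo "Ollama not running; would POST: $payload"
fi
```

With "stream": false you get one JSON object back instead of a token-by-token stream, which is easier to pipe into scripts.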

llama.cpp (for power users)

The engine under Ollama's hood. Running llama.cpp directly gives you control over quantization, context length, GPU layer allocation, and batch size. Use it when you need to squeeze every last token per second out of your hardware.
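For a sense of what that control looks like, here is a representative llama-cli invocation; -m, -ngl, -c, and -p are real llama.cpp flags, while the model path is a placeholder you'd swap for your own GGUF file. The block assembles the command as a string so you can inspect it before running it:

```shell
# A representative llama.cpp invocation (model path is a placeholder):
#   -m    GGUF model file to load
#   -ngl  layers to offload to the GPU (99 = offload everything; lower it
#         for partial offload when the model doesn't quite fit)
#   -c    context length in tokens
#   -p    the prompt
cmd='./llama-cli -m models/mistral-7b-instruct-q4_k_m.gguf -ngl 99 -c 4096 -p "Hello"'
echo "$cmd"
```

The same engine also ships as llama-server if you prefer talking to it over HTTP instead of the command line.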

LM Studio (if you want a GUI)

A desktop app with a chat interface. Download models from a built-in browser, tweak settings with sliders, start chatting. Good for people who don't want to touch a terminal.

04. Understanding quantization

Quantization is the trick that makes local AI practical. It reduces model weight precision from 16-bit to smaller formats, dramatically shrinking VRAM requirements:

  • FP16 (full precision): Best quality, highest VRAM. Every 1B parameters needs roughly 2 GB.
  • Q8_0 (8-bit): Nearly identical quality, about 50% less VRAM.
  • Q4_K_M (4-bit): The sweet spot. About 75% less VRAM, quality is good for chat, coding, and general use.
  • Q2/Q3 (2-3 bit): Noticeable quality loss. Only use when you absolutely need a big model in limited VRAM.
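Those percentages fall straight out of the bytes-per-parameter arithmetic. A quick sanity check for an 8B model, counting weights only (real GGUF files run slightly larger, since some tensors stay at higher precision):

```shell
# Nominal weight sizes for an 8B model at each precision
# (weights only, excluding the KV cache): size_GB = params_B * bytes_per_param
for spec in "FP16 2.0" "Q8_0 1.0" "Q4_K_M 0.5"; do
  set -- $spec   # split "name bytes" into $1 and $2
  awk -v name="$1" -v b="$2" 'BEGIN { printf "%-6s %4.1f GB\n", name, 8 * b }'
done
```

That is 16 GB at FP16, 8 GB at Q8_0 (the 50% saving), and 4 GB at Q4_K_M (the 75% saving), which is why an 8B Q4 model fits comfortably on an 8 GB card.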

Performance tips

  • Context length: Start at 4K. Shorter context means less VRAM for the KV cache and faster responses. Increase only when you need to paste long documents.
  • GPU offloading: If a model almost fits in VRAM, offload a few layers to the CPU. Slower than full GPU, but much faster than full CPU.
  • Model format: Use GGUF. It's the standard for local inference. Avoid GPTQ or AWQ unless you have specific compatibility needs.

Troubleshooting

  • "Out of memory" errors: Model too large. Try Q4 instead of Q8, or pick a smaller model.
  • Very slow generation: Check that your GPU is actually being used (nvidia-smi). If GPU utilization is 0%, the model is running on CPU.
  • Model won't download: Check disk space. 70B models at Q4 are about 40 GB.
  • Garbled output: Try a different quantization. Aggressive quants (Q2, Q3) can degrade some models significantly.
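For the slow-generation case, here is a guarded version of that check. The --query-gpu fields are standard nvidia-smi options; on AMD hardware you would reach for rocm-smi instead, and the block simply prints a note when no NVIDIA driver is present:

```shell
# Is the GPU actually doing the work? Near-zero utilization while a
# model is "running" usually means it fell back to the CPU.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu \
             --format=csv,noheader
else
  echo "nvidia-smi not found: no NVIDIA GPU driver on this machine"
fi
```

Run it while a generation is in progress: memory.used should be near the model's size and utilization.gpu well above 0%.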
Common Questions

What is the easiest way to run an LLM locally?

Ollama. Install it, run "ollama run llama3.1:8b" in your terminal, and you're chatting with a local AI in under two minutes. It handles downloading, quantization, and GPU detection automatically.

How much disk space do I need?

A 7B model at Q4 quantization is about 4 GB. A 70B model at Q4 is about 40 GB. If you want several models downloaded at once, plan for at least 100 GB of free SSD space.

Can I run LLMs on a laptop?

Yes, especially on Apple Silicon MacBooks with 16+ GB unified memory. On Windows or Linux laptops, you need a dedicated GPU with enough VRAM. Integrated graphics are too slow for practical use.

What is quantization?

Quantization reduces model precision from 16-bit to 4-bit or 8-bit, shrinking VRAM requirements by 2 to 4x with modest quality loss. Q4_K_M is the sweet spot for most users: it cuts VRAM needs by 75% while keeping output quality good enough for chat, coding, and general use.

Why is my local LLM slow?

Three common causes. First: the model is too large for your VRAM and is spilling to system RAM (check with nvidia-smi). Second: you're running on CPU instead of GPU. Third: your context length is set very high, eating VRAM for the KV cache. Try a smaller model or lower quantization first.

Priya Krishnan

Editor, hardware & inference

Priya obsesses over the gap between box specs and what actually happens when you hit Enter in Ollama. She got here untangling friends’ builds and sticker-shock cloud bills, and she still treats every recommendation like a debt she owes the reader.

Ready to build?

Tell us what you want to run, your budget, and your use case. We'll match you to the right hardware in under a minute.

All hardware specifications, prices, and performance data referenced in this guide are sourced from OwnRig's data layer, which is based on manufacturer specifications and community benchmarks. Prices are approximate US retail as of March 2026. Performance figures may vary by configuration, driver version, and software.