I ran my first local LLM on a Tuesday afternoon. Installed Ollama, typed one command, waited 90 seconds for the model to download, and started chatting with an AI that lived entirely on my machine. No API key. No cloud account. No bill.
That was two months ago. Since then, I've run 42 different models on 18 devices for OwnRig's compatibility database. This guide is everything I've learned, compressed into the four steps between you and a working local AI.
4 steps from zero to your first local LLM: hardware check, model choice, engine install, generate
What you need
Local AI has three hard requirements. Miss any one and you're stuck.
- A GPU with enough VRAM. This is the wall most people hit. Your GPU's VRAM (its dedicated memory) must hold the entire model. Our VRAM guide has the full breakdown; our GPU buying guide has specific recommendations.
- System RAM: 16 GB minimum, 32 GB recommended. Some models partially offload to system RAM. More headroom means you can run other applications alongside AI without slowdowns.
- An SSD with free space. Model files range from 2 GB (small 3B models) to 40+ GB (large 70B models). An NVMe SSD keeps model loading fast.
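A quick way to check all three requirements at once on a Linux machine (the `nvidia-smi` query flags are standard; the VRAM step is skipped gracefully on machines without NVIDIA drivers, and macOS or AMD users will need different commands):

```shell
# Check the three requirements from the list above.
# Assumes Linux; the nvidia-smi step only runs if NVIDIA drivers exist.

# 1. VRAM: name and total dedicated memory of each GPU
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
else
  echo "nvidia-smi not found (no NVIDIA GPU or driver installed)"
fi

# 2. System RAM in gigabytes
free -g | awk '/^Mem:/ { print $2 " GB RAM" }'

# 3. Free disk space on the current filesystem
df -h . | awk 'NR==2 { print $4 " free on " $6 }'
```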
Choose your model
What you want the AI to do determines which model you need. Here are our picks by use case, sorted smallest to largest. Smaller models run on cheaper hardware; larger models produce better output. Pick the biggest one your GPU can fit.
Chat and general assistant
| Model | Parameters | VRAM (Q4) | Best for |
|---|---|---|---|
| Llama 3.2 1B Instruct | 1.24B | 819 MB | Quick responses, lower hardware |
| Llama 3.2 3B Instruct | 3.21B | 2.1 GB | Quick responses, lower hardware |
| Phi-3 Mini 3.8B Instruct | 3.82B | 2.6 GB | Quick responses, lower hardware |
| Phi-4 Mini | 3.82B | 2.4 GB | Quick responses, lower hardware |
| Gemma 3 4B | 4.3B | 2.5 GB | Quick responses, lower hardware |
| Mistral 7B Instruct v0.3 | 7.24B | 4.5 GB | Quick responses, lower hardware |
Coding and development
If you're running a local coding assistant (as a Copilot replacement or IDE integration), these are the models to use. I'd recommend Qwen 2.5 Coder if your GPU can fit it; it's the best open coding model I've tested.
| Model | Parameters | VRAM (Q4) | Specialty |
|---|---|---|---|
| Llama 3.2 1B Instruct | 1.24B | 819 MB | The smallest Llama model. Runs on integrated GPUs and even CPUs. |
| Llama 3.2 3B Instruct | 3.21B | 2.1 GB | Ultra-lightweight model that runs on virtually any GPU. |
| Phi-3 Mini 3.8B Instruct | 3.82B | 2.6 GB | Punches above its weight: a 3.8B model that rivals many 7B models. |
| Phi-4 Mini | 3.82B | 2.4 GB | Microsoft's tiny powerhouse. Punches well above its weight. |
| Gemma 3 4B | 4.3B | 2.5 GB | Compact Gemma 3 model for chat and light coding on low-VRAM GPUs. |
Image generation
For images, look at FLUX.1 Dev, Stable Diffusion 3.5 Large, and Stable Diffusion XL 1.0. These need 6 to 12 GB VRAM for standard generation and work well with ComfyUI and Automatic1111.
Install an inference engine
An inference engine loads the model onto your GPU and runs it. Three options, one clear recommendation.
Ollama (start here)
The simplest option. One installer, one command to run any supported model. It handles GPU detection, quantization selection, and model downloads automatically. Works on macOS, Linux, and Windows.
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Run your first model
ollama run llama3.1:8b
# Run a coding model
ollama run qwen2.5-coder:32b
On Windows, use the installer from ollama.com/download instead of the shell script. The commands after install are the same in PowerShell or Command Prompt.
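Once Ollama is running, it also exposes a local HTTP API on port 11434, which is how editor plugins and scripts talk to it. A minimal sketch (the model name is just an example; the snippet probes for the server first so it is safe to run anywhere):

```shell
# Query the local Ollama HTTP API. Assumes `ollama serve` (or the
# desktop app) is running on the default port 11434.
if curl -sf http://localhost:11434/api/version >/dev/null 2>&1; then
  curl -s http://localhost:11434/api/generate -d '{
    "model": "llama3.1:8b",
    "prompt": "Why is the sky blue?",
    "stream": false
  }'
else
  echo "Ollama server not running; start it with: ollama serve"
fi
```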
That's it. Seriously. If you're reading this guide for the first time, install Ollama and stop here until you've run a model. Everything below is optimization.
llama.cpp (for power users)
The engine under Ollama's hood. Direct llama.cpp gives you control over quantization, context length, GPU layer allocation, and batch size. Use it when you need to squeeze every last token per second out of your hardware.
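As a sketch of what that control looks like: a typical invocation of `llama-cli` (the CLI shipped with recent llama.cpp builds) pins down the knobs Ollama normally picks for you. The model path here is hypothetical; flag spellings may differ across llama.cpp versions:

```shell
# Explicit llama.cpp invocation with manual tuning:
#   -ngl 32   offload 32 transformer layers to the GPU
#   -c 4096   context window of 4096 tokens
#   -b 512    batch size for prompt processing
if command -v llama-cli >/dev/null 2>&1; then
  llama-cli -m ./models/mistral-7b-instruct-v0.3.Q4_K_M.gguf \
    -ngl 32 -c 4096 -b 512 \
    -p "Explain quantization in one sentence."
else
  echo "llama-cli not installed; build it from the llama.cpp repository"
fi
```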
LM Studio (if you want a GUI)
A desktop app with a chat interface. Download models from a built-in browser, tweak settings with sliders, start chatting. Good for people who don't want to touch a terminal.
Understanding quantization
Quantization is the trick that makes local AI practical. It reduces model weight precision from 16-bit to smaller formats, dramatically shrinking VRAM requirements:
- FP16 (full precision): Best quality, highest VRAM. 1B parameters needs roughly 2 GB.
- Q8_0 (8-bit): Nearly identical quality, about 50% less VRAM.
- Q4_K_M (4-bit): The sweet spot. About 75% less VRAM, quality is good for chat, coding, and general use.
- Q2/Q3 (2-3 bit): Noticeable quality loss. Only use when you absolutely need a big model in limited VRAM.
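You can sanity-check the VRAM columns in the tables above with a back-of-envelope formula: parameters times bytes per weight, plus roughly a gigabyte of overhead for the KV cache and runtime buffers. The bytes-per-weight constants below are my own approximations (Q4_K_M averages about 0.55 bytes because some tensors stay at higher precision), not official figures:

```shell
# Rough VRAM estimate for Mistral 7B Instruct (7.24B parameters):
# params (billions) x bytes per weight + ~1 GB overhead.
params_b=7.24
for quant in "FP16:2.0" "Q8_0:1.0" "Q4_K_M:0.55"; do
  name=${quant%%:*}
  bytes=${quant#*:}
  awk -v p="$params_b" -v b="$bytes" -v n="$name" \
    'BEGIN { printf "%-7s ~%.1f GB\n", n, p * b + 1 }'
done
# FP16    ~15.5 GB
# Q8_0    ~8.2 GB
# Q4_K_M  ~5.0 GB
```

The Q4 estimate lands close to the 4.5 GB in the table above; real files vary because different layers use different precisions.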
Performance tips
- Context length: Start at 4K. Shorter context means less VRAM for the KV cache and faster responses. Increase only when you need to paste long documents.
- GPU offloading: If a model barely doesn't fit, offload a few layers to CPU. Slower than full GPU, but much faster than full CPU.
- Model format: Use GGUF. It's the standard for local inference. Avoid GPTQ or AWQ unless you have specific compatibility needs.
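In Ollama, the context-length and offloading knobs live in a Modelfile: `num_ctx` sets the context window and `num_gpu` sets how many layers go to the GPU. A sketch (the variant name `llama3.1-4k` and the layer count are my own examples):

```shell
# Bake a 4K context and an explicit GPU layer count into a
# reusable Ollama model variant via a Modelfile.
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 4096
PARAMETER num_gpu 28
EOF

if command -v ollama >/dev/null 2>&1; then
  ollama create llama3.1-4k -f Modelfile
else
  echo "ollama not installed; run 'ollama create llama3.1-4k -f Modelfile' later"
fi
```

After creating the variant, `ollama run llama3.1-4k` uses those settings every time, instead of setting them per session.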
Troubleshooting
- "Out of memory" errors: Model too large. Try Q4 instead of Q8, or pick a smaller model.
- Very slow generation: Check that your GPU is actually being used (nvidia-smi). If GPU utilization is 0%, the model is running on CPU.
- Model won't download: Check disk space. 70B models at Q4 are about 40 GB.
- Garbled output: Try a different quantization. Aggressive quants (Q2, Q3) can degrade some models significantly.
