I ran my first local LLM on a Tuesday afternoon. Installed Ollama, typed one command, waited 90 seconds for the model to download, and started chatting with an AI that lived entirely on my machine. No API key. No cloud account. No bill.
That was two months ago. Since then, I've run 42 different models on 18 devices for OwnRig's compatibility database. This guide is everything I've learned, compressed into the four steps between you and a working local AI.
4 steps from zero to your first local LLM: hardware check, model choice, engine install, generate
What you need
Local AI has three hard requirements. Miss any one and you're stuck.
- A GPU with enough VRAM. This is the wall most people hit. Your GPU's VRAM (its dedicated memory) must hold the entire model. Our VRAM guide has the full breakdown; our GPU buying guide has specific recommendations.
- System RAM: 16 GB minimum, 32 GB recommended. Some models partially offload to system RAM. More headroom means you can run other applications alongside AI without slowdowns.
- An SSD with free space. Model files range from 2 GB (small 3B models) to 40+ GB (large 70B models). An NVMe SSD keeps model loading fast.
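A quick way to check all three requirements at once on a Linux machine (the `nvidia-smi` query flags are standard; the VRAM step is skipped gracefully on machines without NVIDIA drivers, and macOS or AMD users will need different commands):

```shell
# Check the three requirements from the list above.
# Assumes Linux; the nvidia-smi step only runs if NVIDIA drivers exist.

# 1. VRAM: name and total dedicated memory of each GPU
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
else
  echo "nvidia-smi not found (no NVIDIA GPU or driver installed)"
fi

# 2. System RAM in gigabytes
free -g | awk '/^Mem:/ { print $2 " GB RAM" }'

# 3. Free disk space on the current filesystem
df -h . | awk 'NR==2 { print $4 " free on " $6 }'
```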
Choose your model
What you want the AI to do determines which model you need. Here are our picks by use case, sorted smallest to largest. Smaller models run on cheaper hardware; larger models produce better output. Pick the biggest one your GPU can fit.
Chat and general assistant
| Model | Parameters | VRAM (Q4) | Best for |
|---|---|---|---|
| Llama 3.2 1B Instruct | 1.24B | 819 MB | Quick responses, lower hardware |
| Llama 3.2 3B Instruct | 3.21B | 2.1 GB | Quick responses, lower hardware |
| Phi-3 Mini 3.8B Instruct | 3.82B | 2.6 GB | Quick responses, lower hardware |
| Phi-4 Mini | 3.82B | 2.4 GB | Quick responses, lower hardware |
| Gemma 3 4B | 4.3B | 2.5 GB | Quick responses, lower hardware |
| Mistral 7B Instruct v0.3 | 7.24B | 4.5 GB | Quick responses, lower hardware |
Coding and development
If you're running a local coding assistant (as a Copilot replacement or IDE integration), these are the models to use. I'd recommend Qwen 2.5 Coder if your GPU can fit it; it's the best open coding model I've tested.
| Model | Parameters | VRAM (Q4) | Specialty |
|---|---|---|---|
| Llama 3.2 1B Instruct | 1.24B | 819 MB | The smallest Llama model. Runs on integrated GPUs and even CPUs. |
| Llama 3.2 3B Instruct | 3.21B | 2.1 GB | Ultra-lightweight model that runs on virtually any GPU. |
| Phi-3 Mini 3.8B Instruct | 3.82B | 2.6 GB | Punches above its weight: a 3.8B model that rivals many 7B models. |
| Phi-4 Mini | 3.82B | 2.4 GB | Microsoft's tiny powerhouse. Punches well above its weight. |
| Gemma 3 4B | 4.3B | 2.5 GB | Compact Gemma 3 model for chat and light coding on low-VRAM GPUs. |
Image generation
For images, look at FLUX.1 Dev, Stable Diffusion 3.5 Large, and Stable Diffusion XL 1.0. These need 6 to 12 GB VRAM for standard generation and work well with ComfyUI and Automatic1111.
Install an inference engine
An inference engine loads the model onto your GPU and runs it. Three options, one clear recommendation.
Ollama (start here)
The simplest option. One installer, one command to run any supported model. It handles GPU detection, quantization selection, and model downloads automatically. Works on macOS, Linux, and Windows.
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Run your first model
ollama run llama3.1:8b
# Run a coding model
ollama run qwen2.5-coder:32b
On Windows, use the installer from ollama.com/download instead of the shell script. The commands after install are the same in PowerShell or Command Prompt.
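Once Ollama is running, it also exposes a local HTTP API on port 11434, which is how editor plugins and scripts talk to it. A minimal sketch (the model name is just an example; the snippet probes for the server first so it is safe to run anywhere):

```shell
# Query the local Ollama HTTP API. Assumes `ollama serve` (or the
# desktop app) is running on the default port 11434.
if curl -sf http://localhost:11434/api/version >/dev/null 2>&1; then
  curl -s http://localhost:11434/api/generate -d '{
    "model": "llama3.1:8b",
    "prompt": "Why is the sky blue?",
    "stream": false
  }'
else
  echo "Ollama server not running; start it with: ollama serve"
fi
```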
That's it. Seriously. If you're reading this guide for the first time, install Ollama and stop here until you've run a model. Everything below is optimization.
llama.cpp (for power users)
The engine under Ollama's hood. Direct llama.cpp gives you control over quantization, context length, GPU layer allocation, and batch size. Use it when you need to squeeze every last token per second out of your hardware.
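As a sketch of what that control looks like: a typical invocation of `llama-cli` (the CLI shipped with recent llama.cpp builds) pins down the knobs Ollama normally picks for you. The model path here is hypothetical; flag spellings may differ across llama.cpp versions:

```shell
# Explicit llama.cpp invocation with manual tuning:
#   -ngl 32   offload 32 transformer layers to the GPU
#   -c 4096   context window of 4096 tokens
#   -b 512    batch size for prompt processing
if command -v llama-cli >/dev/null 2>&1; then
  llama-cli -m ./models/mistral-7b-instruct-v0.3.Q4_K_M.gguf \
    -ngl 32 -c 4096 -b 512 \
    -p "Explain quantization in one sentence."
else
  echo "llama-cli not installed; build it from the llama.cpp repository"
fi
```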
LM Studio (if you want a GUI)
A desktop app with a chat interface. Download models from a built-in browser, tweak settings with sliders, start chatting. Good for people who don't want to touch a terminal.
Understanding quantization
Quantization is the trick that makes local AI practical. It reduces model weight precision from 16-bit to smaller formats, dramatically shrinking VRAM requirements:
- FP16 (full precision): Best quality, highest VRAM. 1B parameters needs roughly 2 GB.
- Q8_0 (8-bit): Nearly identical quality, about 50% less VRAM.
- Q4_K_M (4-bit): The sweet spot. About 75% less VRAM, quality is good for chat, coding, and general use.
- Q2/Q3 (2-3 bit): Noticeable quality loss. Only use when you absolutely need a big model in limited VRAM.
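You can sanity-check the VRAM columns in the tables above with a back-of-envelope formula: parameters times bytes per weight, plus roughly a gigabyte of overhead for the KV cache and runtime buffers. The bytes-per-weight constants below are my own approximations (Q4_K_M averages about 0.55 bytes because some tensors stay at higher precision), not official figures:

```shell
# Rough VRAM estimate for Mistral 7B Instruct (7.24B parameters):
# params (billions) x bytes per weight + ~1 GB overhead.
params_b=7.24
for quant in "FP16:2.0" "Q8_0:1.0" "Q4_K_M:0.55"; do
  name=${quant%%:*}
  bytes=${quant#*:}
  awk -v p="$params_b" -v b="$bytes" -v n="$name" \
    'BEGIN { printf "%-7s ~%.1f GB\n", n, p * b + 1 }'
done
# FP16    ~15.5 GB
# Q8_0    ~8.2 GB
# Q4_K_M  ~5.0 GB
```

The Q4 estimate lands close to the 4.5 GB in the table above; real files vary because different layers use different precisions.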
Performance tips
- Context length: Start at 4K. Shorter context means less VRAM for the KV cache and faster responses. Increase only when you need to paste long documents.
- GPU offloading: If a model barely doesn't fit, offload a few layers to CPU. Slower than full GPU, but much faster than full CPU.
- Model format: Use GGUF. It's the standard for local inference. Avoid GPTQ or AWQ unless you have specific compatibility needs.
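In Ollama, the context-length and offloading knobs live in a Modelfile: `num_ctx` sets the context window and `num_gpu` sets how many layers go to the GPU. A sketch (the variant name `llama3.1-4k` and the layer count are my own examples):

```shell
# Bake a 4K context and an explicit GPU layer count into a
# reusable Ollama model variant via a Modelfile.
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 4096
PARAMETER num_gpu 28
EOF

if command -v ollama >/dev/null 2>&1; then
  ollama create llama3.1-4k -f Modelfile
else
  echo "ollama not installed; run 'ollama create llama3.1-4k -f Modelfile' later"
fi
```

After creating the variant, `ollama run llama3.1-4k` uses those settings every time, instead of setting them per session.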
Troubleshooting
- "Out of memory" errors: Model too large. Try Q4 instead of Q8, or pick a smaller model.
- Very slow generation: Check that your GPU is actually being used (nvidia-smi). If GPU utilization is 0%, the model is running on CPU.
- Model won't download: Check disk space. 70B models at Q4 are about 40 GB.
- Garbled output: Try a different quantization. Aggressive quants (Q2, Q3) can degrade some models significantly.
