Meta
Chat · Coding · Reasoning · Multi-purpose · 70.6B

Llama 3.3 70B Instruct

Llama · Llama 3.3 Community License

Flagship Llama 3.3 model with best-in-class general and coding performance.

Parameters: 70.6B
Architecture: Dense
Context: 131,072 tokens
Released: 2024-12-06
Engines: llama.cpp, ollama, vLLM, TGI
Builder Tools: Cursor, Continue, Aider, Open WebUI, LM Studio

Parameters: 70.6B
VRAM: 61 GB
Context: 128K
Formats: 6
GPUs: 22

Llama 3.3 70B Instruct (70.6B) requires 61 GB of VRAM at the recommended quality (Q6_K). On the NVIDIA Grace Blackwell Ultra GB300, expect approximately 55 tok/s at Q8_0. For the best experience, the Mac Studio AI Builder ($3,999) is recommended.

Source: OwnRig methodology

VRAM (Recommended): 61 GB
Quantization: Q6_K
File Size: 52 GB
Max Context: 128K tokens
Primary Use: Chat

Memory

VRAM Requirements

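| Quality | Quantization | VRAM | File Size |
| --- | --- | --- | --- |
| full | Q8_0 | 78 GB | 70 GB |
| recommended | Q6_K | 61 GB | 52 GB |
| recommended | Q5_K_M | 51 GB | 43 GB |
| efficient | Q4_K_M | 41 GB | 35 GB |
| compressed | Q3_K_M | 33 GB | 27 GB |
| compressed | Q2_K | 25.6 GB | 21 GB |
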
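The VRAM figures for each quantization can be turned into a simple lookup: given a memory budget, pick the highest-quality quantization that fits. The following is a minimal sketch, not OwnRig's method. The function name `best_quant_for` and the `headroom_gb` parameter are hypothetical, and the numbers are copied from the VRAM Requirements table on this page.

```python
from typing import Optional

# (quantization, VRAM in GB, file size in GB), copied from the table on this
# page and ordered from highest to lowest quality.
QUANT_OPTIONS = [
    ("Q8_0", 78.0, 70.0),
    ("Q6_K", 61.0, 52.0),
    ("Q5_K_M", 51.0, 43.0),
    ("Q4_K_M", 41.0, 35.0),
    ("Q3_K_M", 33.0, 27.0),
    ("Q2_K", 25.6, 21.0),
]


def best_quant_for(vram_gb: float, headroom_gb: float = 0.0) -> Optional[str]:
    """Return the highest-quality quantization whose VRAM need fits the budget.

    headroom_gb is a hypothetical safety margin reserved for the OS, the
    KV cache, or other processes. Returns None when nothing fits.
    """
    usable = vram_gb - headroom_gb
    for name, vram_needed, _file_size in QUANT_OPTIONS:
        if vram_needed <= usable:
            return name
    return None


if __name__ == "__main__":
    for budget in (24, 48, 64, 80):
        print(f"{budget} GB -> {best_quant_for(budget)}")
```

The device ratings on this page may apply additional margins (for example, for unified memory shared with the system), so this lookup will not necessarily reproduce them.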
Scaling

Context Length Impact

KV cache VRAM added on top of the 61 GB the model needs at Q6_K. The KV cache grows linearly with context length, and every configuration listed exceeds 24 GB.

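| Context | KV Cache | Total VRAM | Note |
| --- | --- | --- | --- |
| 2K | 1.2 GB | 62.2 GB | exceeds 24 GB |
| 4K | 2.3 GB | 63.3 GB | exceeds 24 GB |
| 8K | 4.6 GB | 65.6 GB | exceeds 24 GB |
| 16K | 9.2 GB | 70.2 GB | exceeds 24 GB |
| 32K | 18.4 GB | 79.4 GB | exceeds 24 GB |
| 64K | 36.9 GB | 97.9 GB | exceeds 24 GB |
| 128K | 73.7 GB | 134.7 GB | exceeds 24 GB |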
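The context figures follow a linear pattern: each doubling of context doubles the KV cache, and the total is the 61 GB Q6_K baseline plus the cache. The sketch below interpolates from the page's 128K row (73.7 GB) to estimate other context lengths. The function names are hypothetical, and the per-token rate is derived from this page's estimates rather than measured on any engine.

```python
# Figures taken from this page: Q6_K baseline and the 128K KV cache row.
BASE_VRAM_GB = 61.0
KV_GB_AT_128K = 73.7
TOKENS_128K = 131_072


def kv_cache_gb(context_tokens: int) -> float:
    """Estimate KV cache size by scaling linearly from the 128K row."""
    return KV_GB_AT_128K * context_tokens / TOKENS_128K


def total_vram_gb(context_tokens: int, base_vram_gb: float = BASE_VRAM_GB) -> float:
    """Estimate total VRAM as model baseline plus KV cache."""
    return base_vram_gb + kv_cache_gb(context_tokens)


def max_context_tokens(vram_gb: float, base_vram_gb: float = BASE_VRAM_GB) -> int:
    """Estimate the largest context that fits; 0 if the model alone does not fit."""
    spare = vram_gb - base_vram_gb
    if spare <= 0:
        return 0
    return int(spare * TOKENS_128K / KV_GB_AT_128K)


if __name__ == "__main__":
    for tokens in (2_048, 8_192, 32_768, 131_072):
        print(f"{tokens:>7} tokens -> {total_vram_gb(tokens):.1f} GB total")
    print("Context that fits in 80 GB:", max_context_tokens(80))
```

Passing a different `base_vram_gb` (for example 41.0 for Q4_K_M) gives a rough estimate for other quantizations, assuming the KV cache size itself does not change with weight quantization.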

Compatible GPUs

22 devices
| Device | Quantization | Speed | Rating |
| --- | --- | --- | --- |
| NVIDIA Grace Blackwell Ultra GB300 | Q8_0 | 55 tok/s | Excellent |
| Apple M4 Max (128GB Unified) | Q4_K_M | 18 tok/s | Acceptable |
| Apple M4 Max (64GB Unified) | Q3_K_M | 18 tok/s | Acceptable |
| Apple M4 Pro (48GB) | Q4_K_M | 12 tok/s | Acceptable |
| Apple M4 Ultra (192GB) | Q4_K_M | 24 tok/s | Acceptable |
| AMD Radeon Pro W7900 | Q4_K_M | 6 tok/s | Acceptable |
| NVIDIA RTX PRO 6000 Blackwell | Q8_0 | 14 tok/s | Acceptable |
| NVIDIA RTX PRO 6000 Blackwell Max-Q | Q8_0 | 13 tok/s | Acceptable |
| NVIDIA GeForce RTX 4090 | Q3_K_M | 6 tok/s | Marginal |
| NVIDIA GeForce RTX 5090 | Q4_K_M | 8 tok/s | Marginal |
| Apple M3 Pro (18GB Unified) | Q2_K | – | Not viable |
| NVIDIA GeForce RTX 3080 10GB | Q2_K | – | Not viable |
| NVIDIA GeForce RTX 4060 8GB | Q2_K | – | Not viable |
| NVIDIA RTX 4060 Laptop (40-60W) | Q2_K | – | Not viable |
| NVIDIA RTX 4070 Laptop (80-115W) | Q2_K | – | Not viable |
| NVIDIA GeForce RTX 4070 Ti 12GB | Q2_K | – | Not viable |
| AMD Radeon RX 7600 | Q3_K_M | – | Not viable |
| AMD Radeon RX 7900 XTX | Q3_K_M | – | Not viable |
| AMD Radeon RX 9070 | Q3_K_M | – | Not viable |
| AMD Radeon RX 9060 XT 16GB | Q3_K_M | – | Not viable |
| AMD Radeon RX 9060 XT 8GB | Q3_K_M | – | Not viable |
| NVIDIA GeForce RTX 5060 8GB | Q2_K | – | Not viable |


Builder Context

Llama 3.3 70B Instruct is commonly used with Cursor, Continue, Aider, Open WebUI, and LM Studio. For an AI coding workflow, pair it with an embedding model such as nomic-embed-text for local RAG.
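To illustrate what pairing a chat model with an embedding model for local RAG involves, here is a minimal retrieval sketch: embed the query and the documents, rank documents by cosine similarity, and build a prompt from the top matches. The `embed` argument is a hypothetical callable. In a real setup it would wrap a local embedding model such as nomic-embed-text, and the resulting prompt would be sent to Llama 3.3 70B Instruct through your inference engine. The `toy_embed` function is only a placeholder so the example runs on its own.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(
    query: str,
    documents: List[str],
    embed: Callable[[str], Vector],
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Return the k documents most similar to the query, best first."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(query: str, passages: List[str]) -> str:
    """Assemble a prompt that puts retrieved passages ahead of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {query}"


def toy_embed(text: str) -> List[float]:
    """Placeholder embedding (letter counts); swap in a real embedding model."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


if __name__ == "__main__":
    docs = ["Q6_K needs 61 GB of VRAM.", "Released on 2024-12-06."]
    hits = retrieve("How much VRAM?", docs, toy_embed, k=1)
    print(build_prompt("How much VRAM?", [doc for _, doc in hits]))
```

Keeping `embed` as an injected callable means the retrieval logic stays the same whichever engine serves the embedding model.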

Hardware

Recommended Builds

Complete PC builds that can run Llama 3.3 70B Instruct.

FAQ

Frequently Asked Questions

How much VRAM does Llama 3.3 70B Instruct need?
Llama 3.3 70B Instruct requires 61 GB of VRAM at the recommended quality (Q6_K). At the most compressed setting (Q2_K), it can fit in as little as 25.6 GB.
What is the best GPU for Llama 3.3 70B Instruct?
The NVIDIA Grace Blackwell Ultra GB300 delivers the best performance for Llama 3.3 70B Instruct, achieving 55 tok/s at Q8_0 with an excellent rating.
What quantization should I use for Llama 3.3 70B Instruct?
For the recommended balance of quality and memory, use Q6_K (61 GB VRAM); Q8_0 (78 GB) is the full-quality option. If your GPU has limited VRAM, Q2_K (25.6 GB) is the smallest option, at the cost of heavier compression.
Is Llama 3.3 70B Instruct good for coding?
Yes. Llama 3.3 70B Instruct is used with Cursor, Continue, Aider, Open WebUI, and LM Studio for local AI coding. For the best coding experience, pair it with an embedding model for local RAG.


Data confidence: estimated.

VRAM requirements are calculated from model parameters and may vary by inference engine, context length, and batch size. Performance estimates are based on community benchmarks and should be verified for your specific configuration.

Llama is a trademark of its respective owner. OwnRig is not affiliated with or endorsed by the model creator.