Choosing a model
ext-infer loads any GGUF
file llama.cpp can handle. Picking which GGUF is the most important
choice you’ll make — it dominates inference quality, memory footprint,
and latency. This page is a tour of the landscape.
What is GGUF?
GGUF (GPT-Generated Unified Format) is llama.cpp’s native model
format. A .gguf file packs:
- Weights in a specific quantization.
- Tokenizer (vocabulary + merges).
- Architecture metadata (layer count, hidden size, attention config) so llama.cpp knows how to run the model without a separate config file.
- Chat template (for instruct models) so Model::chat() knows how to render messages.
- Pooling type (for embedding models) so Model::embed() knows how to collapse hidden states.
GGUF files are self-contained: everything ext-infer needs is in the one
file, with no separate config, tokenizer, or vocab file to manage.
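To sanity-check that a download really is a GGUF file before handing it to ext-infer, you can read the fixed-size header yourself. The sketch below is not part of ext-infer; `read_gguf_header` is a hypothetical helper, and it assumes GGUF version 2 or later, where the header is the 4-byte magic `GGUF`, a little-endian uint32 version, then two uint64 counts (tensors and metadata key/value pairs).

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) for a GGUF file.

    Assumes GGUF v2+ (64-bit counts). Raises ValueError if the file is
    too short or does not start with the GGUF magic bytes.
    """
    with open(path, "rb") as f:
        head = f.read(HEADER_SIZE)
    if len(head) < HEADER_SIZE or head[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # "<" = little-endian, no padding; I = uint32, Q = uint64
    version, tensor_count, kv_count = struct.unpack("<IQQ", head[4:])
    return version, tensor_count, kv_count
```

A truncated download or an HTML error page saved under a .gguf name fails the magic check immediately, which is usually a clearer signal than a load error later on.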
Model families
There are three broad categories you’ll encounter:
| Category | What ext-infer method to use | Examples |
|---|---|---|
| Base / pretrained | raw() only | Llama 3 base, Mistral 7B base, Qwen base |
| Chat / instruct | chat(), raw() | Qwen3-Instruct, Llama 3.x Instruct, Mistral Instruct |
| Embedding / reranker | embed() | Qwen3-Embedding, BGE, E5, GTE |
A chat model loaded with 'embedding' => true will return a vector,
but it’s not what the model was optimized for — the vectors are
noisier than what a purpose-built embedding model produces. The
reverse (chat() against a pure embedding GGUF) usually fails because
embedding-only models don’t ship a chat template.
Quantization
A 7B-parameter model at 16-bit precision (F16, 2 bytes per weight) is ~14 GB on disk. Quantization trades a small amount of quality for a much smaller, faster file. The suffixes you’ll see in GGUF filenames:
| Suffix | Approx. size for a 7B model | Quality | Notes |
|---|---|---|---|
| F16 | ~14 GB | Lossless | Reference. Rarely worth the size unless you have plenty of memory. |
| Q8_0 | ~7 GB | Near-lossless | Good default when you can afford the disk. |
| Q6_K | ~5.5 GB | Excellent | |
| Q5_K_M | ~5 GB | Very good | |
| Q4_K_M | ~4.5 GB | Good | The most popular size/quality compromise. |
| Q4_K_S | ~4 GB | Solid | |
| Q3_K_M | ~3.5 GB | Noticeable degradation | Useful on memory-constrained boxes. |
| Q2_K | ~2.5 GB | Significant degradation | Last resort. |
The K-family (Q4_K_M, etc.) uses k-quants, a smarter scheme than
the legacy non-K variants (Q4_0, Q4_1). Prefer K-quants when both
are offered for the same model.
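The table only covers 7B models, but the figures can be scaled to other sizes as a rough first guess. The sketch below assumes file size grows linearly with parameter count, which ignores metadata and per-tensor quantization differences, so treat the results as ballpark numbers. `SIZE_7B_GB`, `estimate_size_gb`, and `bits_per_weight` are illustrative names, not part of ext-infer or llama.cpp.

```python
# Approximate file sizes in GB for a 7B model, copied from the table above.
SIZE_7B_GB = {
    "F16": 14.0,
    "Q8_0": 7.0,
    "Q6_K": 5.5,
    "Q5_K_M": 5.0,
    "Q4_K_M": 4.5,
    "Q4_K_S": 4.0,
    "Q3_K_M": 3.5,
    "Q2_K": 2.5,
}


def estimate_size_gb(params_billion, quant):
    """Scale the 7B reference size linearly to another parameter count."""
    return SIZE_7B_GB[quant] * params_billion / 7.0


def bits_per_weight(quant):
    """Effective bits per weight implied by the 7B reference size."""
    # GB * 8 bits per byte / 7 billion weights
    return SIZE_7B_GB[quant] * 8 / 7.0


if __name__ == "__main__":
    for quant in SIZE_7B_GB:
        size = estimate_size_gb(3, quant)
        print(f"{quant:7s} 3B model: ~{size:.1f} GB")
```

For a 0.6B model at Q8_0 this gives roughly 0.6 GB, in the same range as the ~640 MB quoted for Qwen3-0.6B on this page.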
Picking a quant
Two questions:
- How much memory can you spend? Quants below Q4_K_M save space at increasing quality cost. Above Q4_K_M, the marginal gain per GB shrinks fast.
- Is the model small enough that quantization barely matters? For sub-1B models like Qwen3-0.6B, even Q8_0 is ~640 MB, negligible by 2026 standards. Take the quality bump.
A good default rule: Q4_K_M for models > 3B, Q8_0 for smaller
models.
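If you are scripting downloads, the default rule is easy to encode. This is a minimal sketch: `params_from_name` and `default_quant` are hypothetical helpers, the name parser assumes the parameter count appears in the model name as a number followed by an uppercase B (it takes the first such match), and the 3B boundary is read strictly, so a model of exactly 3B falls on the Q8_0 side.

```python
import re


def params_from_name(name):
    """Extract a parameter count in billions from a name like 'Qwen3-0.6B'."""
    match = re.search(r"(\d+(?:\.\d+)?)B", name)
    if match is None:
        raise ValueError(f"no parameter count found in {name!r}")
    return float(match.group(1))


def default_quant(params_billion):
    """Rule of thumb from this page: Q4_K_M above 3B, Q8_0 otherwise."""
    return "Q4_K_M" if params_billion > 3 else "Q8_0"


if __name__ == "__main__":
    for name in ["Qwen3-0.6B", "Llama-3.2-3B-Instruct", "Mistral-7B-Instruct"]:
        print(name, "->", default_quant(params_from_name(name)))
```

Treat the result as a starting point only; a tight memory budget or a quality-sensitive workload is a good reason to override it.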
Recommended starting points
What we’ve actually tested against:
Chat (smallest reasonable)
Qwen/Qwen3-0.6B-GGUF — Apache-2.0, 600M params, Q8 ≈ 640 MB.
Reasoning model: emits <think>…</think> blocks through its chat
template, which Response splits
for you. Great for getting started; not great for production-quality
answers.
curl -L -o models/Qwen3-0.6B-Q8_0.gguf \
https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q8_0.gguf
Chat (production-ish)
bartowski/Qwen3-7B-Instruct-GGUF at Q4_K_M (~4.4 GB) — same
family, much better reasoning quality. Or bartowski/Llama-3.2-3B-Instruct-GGUF
at Q4_K_M (~1.9 GB) for a smaller, non-reasoning option.
Embedding (small, fast)
Qwen/Qwen3-Embedding-0.6B-GGUF: Apache-2.0, 1024-dim
embeddings, last-token pooling baked into metadata. Same size as the chat
model; quality is competitive with BGE/E5 small variants.
curl -L -o models/Qwen3-Embedding-0.6B-Q8_0.gguf \
https://huggingface.co/Qwen/Qwen3-Embedding-0.6B-GGUF/resolve/main/Qwen3-Embedding-0.6B-Q8_0.gguf
Alternative: CompendiumLabs/bge-small-en-v1.5-gguf — 384-dim,
mean pooling, ~130 MB. Lower-quality vectors but tiny.
Where to look for more
- Hugging Face GGUF tag: library=gguf filters to GGUF-format models.
- bartowski: prolific publisher of quantized GGUFs for popular models. Reliable, consistent naming.
- mradermacher: ditto.
- The model’s own official GGUF repo when one exists (e.g. Qwen/Qwen3-7B-Instruct-GGUF), which is always the most trusted source.
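Whichever publisher you pick, the direct-download URLs on this page follow one pattern: repository, then resolve/main, then the filename. A small sketch that builds such a URL, with `hf_resolve_url` as a hypothetical helper name and the `revision` parameter assumed to default to the main branch as in the curl commands:

```python
from urllib.parse import quote


def hf_resolve_url(repo, filename, revision="main"):
    """Build a direct-download URL for a file in a Hugging Face repo.

    Follows the pattern used by the curl commands on this page:
    https://huggingface.co/<repo>/resolve/<revision>/<filename>
    """
    return f"https://huggingface.co/{repo}/resolve/{revision}/{quote(filename)}"


if __name__ == "__main__":
    print(hf_resolve_url("Qwen/Qwen3-0.6B-GGUF", "Qwen3-0.6B-Q8_0.gguf"))
```

Filenames differ between publishers, so check the repository's file list for the exact name rather than guessing it from the quant suffix.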
License caveats
GGUF files inherit the underlying model’s license. Some models that are nominally “open” (Llama 3.x, Gemma) ship under custom licenses with use restrictions; others (Qwen, Mistral, several smaller players) are Apache-2.0 / MIT. Check the model card before depending on a model in a commercial deployment.
ext-infer itself is MIT-licensed — the extension doesn’t care which
GGUF you load, but downstream concerns are on you.
Next
- Embeddings: when you’ve picked an embedding model.
- Chat completions: when you’ve picked a chat model.
- Performance tuning: n_gpu_layers, mmap, mlock for the model you ended up with.