Introduction

ext-infer is a PHP 8.3+ extension that loads a GGUF model and runs LLM inference inside the PHP process via llama.cpp. PHP-native semantic search, RAG pipelines, and CLI / worker inference run without shelling out to Python or hitting a remote API.

It is written in Rust on top of ext-php-rs and the llama-cpp-2 bindings. The public PHP surface is designed to feel native: a fluent, role-aware Prompt builder; a Response that splits reasoning from answer; an Embedding that knows how to normalize itself and compute cosine similarity. You should rarely, if ever, need to think about <|im_start|> tokens.

Why an extension?

Three reasons local inference belongs in PHP rather than next to it:

  • Latency. A subprocess fork or HTTP roundtrip costs at least a few milliseconds, often tens. An in-process call is bounded only by decode time.
  • Operational surface. No Python sidecar to package, no daemon to supervise, no inference server to scale alongside FPM. The PHP process is the inference server.
  • API ergonomics. Calling a local LLM should be as natural in PHP as calling intl or pdo. The extension API is shaped to match that — see Prompts and Chat completions.

What’s here

This guide is split into five layers, navigable from the sidebar:

| Section | What you’ll find |
|---|---|
| Getting Started | Install, run hello-world, verify it loaded. |
| Guide | Conceptual walkthroughs of each public class. Read in order on first pass. |
| Recipes | Copy-paste-ready patterns: multi-turn chat, semantic search, RAG, worker pools. |
| Reference | Complete API listing, exceptions, environment variables, compatibility matrix. |
| Advanced | Threading model, Apple Metal, performance tuning. |

Status

ext-infer is pre-release — the class surface is stable but the first tagged release (v0.1.0) is still in flight. See RELEASE.md for the cut-a-release flow and PLAN.md for what’s coming next.

Conventions in this guide

  • Code blocks are runnable as written, with one exception: PHP code assumes the extension is loaded. Either install it system-wide or prepend -d extension=… to your php command. See Installation.
  • Model without a namespace prefix means Displace\Infer\Model; same for Prompt, Response, Embedding. Real code needs the use statement at the top of the file.
  • CLI snippets are written for a POSIX shell (bash / zsh). Adjust for fish / PowerShell as needed; differences are usually only quoting.

Installation

Two supported install paths:

  1. Via PIE — pulls a pre-built binary for your (php-minor, arch, os, libc) combo. No local C/C++ toolchain. Recommended for application developers.
  2. From source — builds llama.cpp locally via cargo. Needed for contributors, distros without a pre-built artifact, or anyone who wants to enable the metal cargo feature.

Via PIE

Heads up: PIE installation is wired up but the first published release (v0.1.0) is still in flight. Until then, install from source — the pie install flow becomes the recommended path the moment we ship binaries.

PIE (PHP Installer for Extensions) is the official tool for installing PHP extensions from Composer-style metadata. Get it once:

curl -L --output pie.phar \
    https://github.com/php/pie/releases/latest/download/pie.phar
chmod +x pie.phar && sudo mv pie.phar /usr/local/bin/pie

Then install ext-infer:

pie install displace/ext-infer

PIE reads composer.json to learn that ext-infer ships pre-packaged binaries, fetches the right tarball from the matching GitHub Release, extracts infer.so (or infer.dylib on macOS) into the PHP extension directory, and adds it to your php.ini.

Verify the install with php -m:

php -m | grep infer
# infer

From source

Prerequisites

| Tool | Purpose | Minimum |
|---|---|---|
| PHP CLI | host process | 8.3 |
| php-config | tells ext-php-rs where the PHP headers are | (matches PHP) |
| Rust toolchain | compiles the extension | 1.88 |
| cmake | llama.cpp builds via cmake during cargo build | 3.18+ |
| C/C++ toolchain | llama.cpp itself | Clang / GCC |
| cargo-php | used by make install to drop the artifact in PHP’s extension dir | 0.1+ |

The Rust toolchain is pinned via rust-toolchain.toml, so you don’t need to install a specific version manually — rustup will fetch it on first build. On macOS, cmake is a brew install cmake away; on Debian/Ubuntu, apt install cmake build-essential libclang-dev.

Install cargo-php once:

cargo install cargo-php

Build and install

git clone https://github.com/DisplaceTech/ext-infer
cd ext-infer
make release          # builds target/release/libinfer.{so,dylib}
make install          # cargo php install --release
php -m | grep infer

A cold build compiles llama.cpp from source, which takes a few minutes on a fresh machine. Subsequent builds reuse cargo’s incremental cache and the already-built llama.cpp object files; expect sub-minute rebuilds after the first one.

Without make install (development)

If you want to load a freshly built binary without committing to installing it system-wide, pass the path on the PHP command line:

make build       # debug build (faster compile, slower runtime)
php -d extension=$PWD/target/debug/libinfer.dylib your-script.php

Substitute .so for .dylib on Linux. This is the workflow used throughout the examples.

Apple Metal acceleration (opt-in)

The default build is CPU-only and portable. For Apple Silicon GPU acceleration:

make release FEATURES=metal
make install  FEATURES=metal

See Apple Metal for what this does and what trade-offs it implies.

Uninstalling

Via PIE:

pie uninstall displace/ext-infer

From a source install:

make uninstall    # cargo php remove

Either way, confirm with php -m | grep infer (should produce no output).

Troubleshooting

If php -m | grep infer shows nothing after install, see Verifying your install for the diagnostic checklist. It walks through the four most common failure modes (extension_dir mismatch, PHP minor mismatch, missing -undefined,dynamic_lookup on macOS, libc mismatch on Linux).

Quick start

This page assumes you’ve already installed the extension. From a cold install to a working answer in under a minute:

1. Grab a model

GGUF files are big. Even the smallest interesting ones are 600 MB quantized. For getting started, Qwen3-0.6B-Q8_0 is a good first model — Apache-2.0 licensed, ~640 MB, fast on CPU, good enough at toy questions:

mkdir -p models
curl -L -o models/Qwen3-0.6B-Q8_0.gguf \
    https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q8_0.gguf

See Choosing a model for the broader landscape.

2. Write the script

Save the following as hello.php:

<?php

declare(strict_types=1);

use Displace\Infer\Model;
use Displace\Infer\Prompt;

$model = Model::load('models/Qwen3-0.6B-Q8_0.gguf');

$response = $model->chat(
    Prompt::system('You are a helpful, concise assistant.')
        ->withUser('What is 2+2?'),
    maxTokens: 256,
    temperature: 0.0,
);

echo $response->answer(), PHP_EOL;

$model->close();

Three things going on:

  • Model::load(...) reads the GGUF into memory. Loading is the slow step — for a real app, load once and keep the handle around. See Choosing a model.
  • Prompt::system(...)->withUser(...) builds a chat prompt without any template tokens. The Prompt is immutable; each with* returns a new instance. See Prompts.
  • $model->chat($prompt, ...) renders the prompt through whatever chat template the GGUF ships, runs inference, and returns a Response. answer() is the model’s reply with any <think>...</think> reasoning stripped.

3. Run it

If you installed via PIE (or make install), just:

php hello.php

If you’re running against a make build artifact instead:

php -d extension=$(pwd)/target/debug/libinfer.dylib hello.php

Substitute .so on Linux. Expected output:

2 + 2 equals 4.

4. What just happened

llama.cpp normally spams several hundred lines to stderr per inference (model layout, KV-cache sizing, graph reservation). ext-infer silences that by default — it’s noise inside a PHP request and tends to poison structured logs. Bring it back when you need to debug:

EXT_INFER_LOG=1 php hello.php

See Environment variables for the complete list.

Verifying your install

After installing, three things should be true. If any of them isn’t, this page is the checklist.

The fast version

# 1. Is the extension loaded?
php -m | grep infer
# expected: infer

# 2. Are the classes registered?
php -r 'echo class_exists("Displace\\Infer\\Model") ? "yes\n" : "no\n";'
# expected: yes

# 3. Does inference actually work?
php -r '
$m = \Displace\Infer\Model::load("models/Qwen3-0.6B-Q8_0.gguf");
$r = $m->chat(\Displace\Infer\Prompt::user("Say hello."));
echo $r->answer(), PHP_EOL;
$m->close();
'
# expected: a one-line greeting

All three pass → you’re done. Skip to the Guide.

Diagnosis if php -m | grep infer is empty

The extension didn’t load. PHP loads extensions from a specific directory and looks for them by exact filename — usually one of these four things is off.

1. PHP can’t find the binary

Confirm where PHP is looking:

php -i | grep -E '^extension_dir|^Loaded Configuration File'

Then confirm the binary is in that directory:

ls -l $(php -r 'echo ini_get("extension_dir");')/infer.*

If the file is missing:

  • After make install, cargo-php should have placed it there. Try re-running with -v to see where it landed: make install (or cargo php install --release -v).
  • After pie install, look at PIE’s output for the install path.

If the file is in a different directory than extension_dir, either move it or update extension_dir in your php.ini.
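
The lookups above can also be gathered from inside PHP. The following is a minimal sketch: inferDiagnostics() is a hypothetical helper (not part of ext-infer), and it assumes the binary is named infer.so or infer.dylib as described in Installation. It uses only standard PHP functions, so it runs whether or not the extension loaded.

```php
<?php

declare(strict_types=1);

/**
 * Collect the facts the checks above look at, in one place.
 * inferDiagnostics() is a hypothetical helper, not part of ext-infer.
 *
 * @return array<string, mixed>
 */
function inferDiagnostics(): array
{
    $extensionDir = (string) ini_get('extension_dir');

    // Look for the binary under the file names this guide mentions.
    $binaries = [];
    foreach (['infer.so', 'infer.dylib'] as $name) {
        $path = rtrim($extensionDir, '/\\') . DIRECTORY_SEPARATOR . $name;
        if (is_file($path)) {
            $binaries[] = $path;
        }
    }

    return [
        'php_minor'     => PHP_MAJOR_VERSION . '.' . PHP_MINOR_VERSION,
        'ini_file'      => php_ini_loaded_file() ?: null, // false when no php.ini is loaded
        'extension_dir' => $extensionDir,
        'binaries'      => $binaries,
        'loaded'        => extension_loaded('infer'),
    ];
}

foreach (inferDiagnostics() as $key => $value) {
    echo str_pad($key, 14), json_encode($value, JSON_UNESCAPED_SLASHES), PHP_EOL;
}
```

An empty binaries list together with loaded being false points at this section; a non-empty list with loaded still false points at one of the later sections.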

2. PHP minor mismatch

A binary built against PHP 8.4 will not load into PHP 8.5 (and vice versa). Confirm both:

php --version | head -1
# e.g. PHP 8.4.20 (cli)

# For PIE-installed binaries, the tarball filename encodes the PHP
# minor — check the GitHub Release you installed from:
ls -l $(php -r 'echo ini_get("extension_dir");')/infer.*

Cross-check that the binary’s PHP minor matches your running PHP minor. If they disagree, re-install with the right tarball (PIE handles this automatically; manual installs may need pie install --force).

3. macOS: missing -undefined,dynamic_lookup

The extension uses dlopen-style undefined-symbol resolution against the host PHP binary. If you built from source on macOS and skipped the extension’s own build.rs, the linker errors out at build time with Undefined symbols for architecture arm64. From-source builds via make build / make release configure this automatically. If you invoked cargo build from somewhere unusual (e.g. an IDE), repeat the build via make to be safe.

4. Linux: libc mismatch

The released binaries target glibc Linux. Alpine (musl) is not in the v0.1 release matrix. Confirm your libc:

ldd --version 2>&1 | head -1
# expected: ldd (GNU libc) 2.x
# if you see musl: rebuild from source — see Installation

Building from source on musl works; .cargo/config.toml carries the needed crt-static opt-out.

Diagnosis if classes are missing

If php -m shows infer but the class_exists("Displace\\Infer\\Model") check prints no, the namespace is probably misspelled somewhere in the calling code. The full list:

Displace\Infer\Model
Displace\Infer\Prompt
Displace\Infer\Message
Displace\Infer\Response
Displace\Infer\Embedding
Displace\Infer\InferException
Displace\Infer\ModelLoadException
Displace\Infer\InferenceException

All eight should exist after a successful load. If only some do, you likely have an ext-infer install left over from an older API surface. Uninstall the old version (pie uninstall or make uninstall) and reinstall.

Diagnosis if inference fails

If Model::load throws ModelLoadException:

  • “no such file” — the GGUF path is wrong. PHP resolves relative paths against the working directory, not the script’s directory.
  • “failed to load model: …” — check that the file isn’t truncated (du -h should match what the publisher lists) and that it really is a GGUF (file <path> should mention “data” or similar; if it says “ASCII text” it’s probably an HTML 404 from a failed download).
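
Both checks in the list above can be done from PHP before calling Model::load. This sketch relies on the fact that GGUF files begin with the four ASCII bytes "GGUF"; looksLikeGguf() is a hypothetical helper, not part of ext-infer, and the model path is only the one used in Quick start.

```php
<?php

declare(strict_types=1);

/**
 * Return true when the file starts with the 4-byte GGUF magic ("GGUF").
 * looksLikeGguf() is a hypothetical helper, not part of ext-infer.
 */
function looksLikeGguf(string $path): bool
{
    if (!is_file($path) || !is_readable($path)) {
        return false;
    }
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    $magic = fread($handle, 4);
    fclose($handle);

    return $magic === 'GGUF';
}

// Resolve against the script's directory rather than the working directory.
$path = __DIR__ . '/models/Qwen3-0.6B-Q8_0.gguf';
echo looksLikeGguf($path) ? "looks like a GGUF\n" : "missing or not a GGUF\n";
```

A passing magic check does not rule out truncation, so the size comparison against the publisher's listing still applies.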

If Model::chat throws InferenceException with "model has no embedded chat template", you’ve picked a base model rather than an instruct/chat variant. See Choosing a model or use Model::raw() with your own templating.

If the script segfaults rather than throwing, please open an issue at github.com/DisplaceTech/ext-infer/issues with the model name, PHP version, and OS. That’s a bug.

Enabling verbose logging

llama.cpp’s own diagnostic chatter is silenced by default. To see it (model layout, KV cache sizing, graph reservation, …):

EXT_INFER_LOG=1 php hello.php

A noisy log can sometimes point straight at the issue — e.g. “n_ctx exceeds model’s training context” tells you the model is being asked to handle longer input than it was trained for.

Prompts

Displace\Infer\Prompt is the input to Model::chat(). It represents an ordered list of role-tagged messages — system, user, assistant — that the extension renders into whatever chat-template format the underlying model expects. You never write <|im_start|> (or its Llama 3 / Mistral / Gemma equivalent) by hand.

Two-stage construction

A Prompt starts with a factory — either system() or user() — and grows via with* calls. Each with* returns a new Prompt; the receiver is never modified.

use Displace\Infer\Prompt;

// Start with a system message:
$p = Prompt::system('You are a helpful assistant.')
    ->withUser('What is 2+2?');

// Or start with a user message (no system instruction):
$p = Prompt::user('Hello!');

// Multi-turn replays:
$p = Prompt::system('You are a poet.')
    ->withUser('Write a haiku about Rust.')
    ->withAssistant("Code runs cold and fast,\nMemory safe by the borrow,\nNo crashes today.")
    ->withUser('Now translate it to French.');

Direct new Prompt() is refused at runtime:

new Prompt();
// Displace\Infer\InferException: use Displace\Infer\Prompt::system()
// or Prompt::user() to start a prompt

Why immutable?

The shape mirrors DateTimeImmutable. Two practical consequences:

  • A Prompt you’ve built once is safe to share across multiple chat() calls, hand to a queue worker, or stash in a class property. Nothing downstream can mutate it.

  • Branching is free. The multi-turn chat recipe keeps a $base Prompt around (system-message-only) so /reset can drop conversation history without re-rendering the system prompt:

    $base         = Prompt::system($systemMessage);
    $conversation = $base;
    // … many turns …
    if ($userTyped === '/reset') {
        $conversation = $base;   // immutable; $base is untouched no
                                 // matter how many turns went through it
    }
    

Inspecting a Prompt

$p->messages();    // list<Displace\Infer\Message>
$p->count();       // int — number of messages
$p->isEmpty();     // bool
$p->lastRole();    // ?string — role of the most recent message, or null

Each Message is read-only:

foreach ($p->messages() as $msg) {
    printf("[%s] %s\n", $msg->role(), $msg->content());
}
// [system] You are a helpful assistant.
// [user] What is 2+2?

role() is always one of 'system', 'user', or 'assistant'. Because roles are chosen by method name on the construction side (withSystem, withUser, withAssistant), a typo fails as an undefined-method error instead of silently creating a fictional role.

Role ordering

ext-infer does not enforce role ordering at construction time. You can build:

Prompt::user('hi')->withSystem('be terse');  // legal
Prompt::system('a')->withSystem('b');         // also legal

…and they will be rendered as written. Whether the model accepts the result is a chat-template decision: most modern chat templates require exactly one leading system message (or none) followed by alternating user / assistant turns. Build sequences that match that convention and the chat template will render them; deviate and you may get an error from Model::chat() at call time.
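
If you would rather catch an unusual sequence before it reaches the model, the convention described above is easy to check in plain PHP. This is a sketch under assumptions: followsCommonRoleOrder() is a hypothetical helper, not part of ext-infer, and it encodes only the "one optional leading system message, then alternating user / assistant" convention. Individual chat templates may be stricter or looser.

```php
<?php

declare(strict_types=1);

/**
 * Check a role sequence against the common convention: an optional single
 * leading 'system', then 'user' / 'assistant' alternating, starting with 'user'.
 * followsCommonRoleOrder() is a hypothetical helper, not part of ext-infer.
 *
 * @param list<string> $roles
 */
function followsCommonRoleOrder(array $roles): bool
{
    if ($roles !== [] && $roles[0] === 'system') {
        array_shift($roles);            // drop the one allowed system message
    }

    $expected = 'user';
    foreach ($roles as $role) {
        if ($role !== $expected) {
            return false;               // extra system message, or two same-role turns in a row
        }
        $expected = $expected === 'user' ? 'assistant' : 'user';
    }

    return true;
}

var_dump(followsCommonRoleOrder(['system', 'user', 'assistant', 'user'])); // bool(true)
var_dump(followsCommonRoleOrder(['user', 'system']));                       // bool(false)
var_dump(followsCommonRoleOrder(['system', 'system']));                     // bool(false)
```

For a real Prompt, the role list can be collected with array_map(fn ($m) => $m->role(), $p->messages()).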

Composition patterns

Pre-baked system prompts

If your application has a few stock personalities, define them once:

final class Personas
{
    public static function poet(): Prompt
    {
        return Prompt::system(
            'You are a haiku poet. Respond in three lines. ' .
            'Five syllables, then seven, then five.'
        );
    }

    public static function reviewer(): Prompt
    {
        return Prompt::system(
            'You review code. Always cite specific line numbers ' .
            'and prefer questions over assertions when uncertain.'
        );
    }
}

$response = $model->chat(Personas::poet()->withUser('Tell me about autumn.'));

Because Prompt is immutable, returning a Prompt from a helper method is safe — callers can’t mutate the cached base.

Replaying history

When you have stored history (e.g. fetched from a database), rebuild the Prompt from scratch each turn:

$prompt = Prompt::system($systemMessage);
foreach ($historyFromDb as $row) {
    $prompt = match ($row['role']) {
        'user'      => $prompt->withUser($row['content']),
        'assistant' => $prompt->withAssistant($row['content']),
    };
}
$prompt = $prompt->withUser($newUserInput);

This is the canonical multi-turn-chat shape. See the multi-turn chat recipe.

Feeding Response::answer() back, not text()

When you append the assistant’s reply to the prompt for the next turn, use Response::answer() (reasoning stripped), not Response::text():

$response = $model->chat($prompt);
$prompt   = $prompt->withAssistant($response->answer());
//                                          ^^^^^^^^^
//                            not ->text(), which includes <think>…</think>

Feeding <think> blocks back as conversation history derails reasoning models — they see their own thoughts in the transcript and get confused. See Reasoning models.

Chat completions

Model::chat() is the main inference entry point. It takes a Prompt and returns a Response:

public function chat(
    \Displace\Infer\Prompt $prompt,
    int   $maxTokens   = 128,
    int   $nCtx        = 2048,
    float $temperature = 0.0,
    int   $seed        = 1234,
): \Displace\Infer\Response;

All four optional arguments can be passed as PHP 8 named arguments; there is no options array. See Options reference for what each one does.

What chat() does

Three steps happen between the call and the return value:

  1. Render. The Prompt’s messages are fed through llama_chat_apply_template, using the chat template embedded in the GGUF. Qwen3, Llama 3, Mistral, Gemma — each ships its own Jinja template inside the model file. ext-infer reads it and uses it verbatim.

  2. Decode. The rendered prompt is tokenized, decoded through the model in a single batch, then a sampler generates output tokens one by one until either the model emits an end-of-generation token (finishReason = 'eos') or the maxTokens budget is exhausted (finishReason = 'length').

  3. Split. If the generated text contains <think>...</think> blocks (Qwen3 / DeepSeek R1 / other reasoning models), they’re captured into Response::reasoning() and stripped from Response::answer(). See Reasoning models for the details.

Inspecting a Response

Response is read-only. Six getters:

$response->text();              // string — full output, <think>…</think> + answer
$response->reasoning();         // ?string — captured <think>…</think>, or null
$response->answer();            // string — text() minus reasoning, leading WS trimmed
$response->hasReasoning();      // bool
$response->finishReason();      // string — 'eos' | 'length' | 'stop'
$response->tokensGenerated();   // int — generated tokens only, not prompt

Response is created internally — new Response() throws.

text() vs answer()

For non-reasoning models, the two are byte-identical. For reasoning models invoked through their chat template:

text():     <think>Okay so 2+2…</think>\n\n2 + 2 equals 4.
answer():   2 + 2 equals 4.
reasoning(): Okay so 2+2…

answer() is what end users want to read; reasoning() is what you’d log for debugging or display behind a “show thinking” toggle.

finishReason()

Three possible values:

| Value | Meaning |
|---|---|
| 'eos' | Model emitted an end-of-generation token. Output is complete. |
| 'length' | maxTokens was hit before EOS. Output is likely truncated mid-thought. |
| 'stop' | Reserved for future stop-string support. Currently only reachable when the prompt produced zero tokens (a degenerate input). |

When you see 'length', surface it to the user — “hit the token budget, bump maxTokens to see more”. Silently truncating is a bad UX.
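
One way to surface truncation is to map the finish reason to an optional notice. The sketch below is illustrative only: finishNotice() is a hypothetical helper, not part of ext-infer, and the message wording is an assumption.

```php
<?php

declare(strict_types=1);

/**
 * Map a finish reason to an optional user-facing notice.
 * finishNotice() is a hypothetical helper, not part of ext-infer.
 */
function finishNotice(string $finishReason, int $maxTokens): ?string
{
    return match ($finishReason) {
        'eos'    => null, // complete output, nothing to report
        'length' => sprintf('Reply cut off at the %d-token budget; raise maxTokens to see more.', $maxTokens),
        'stop'   => 'Generation stopped without reaching an end-of-generation token.',
        default  => sprintf('Unexpected finish reason "%s".', $finishReason),
    };
}

echo finishNotice('length', 256), PHP_EOL;
```

In application code the first argument would come from $response->finishReason(), and the second from whatever maxTokens you passed to chat().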

tokensGenerated()

Counts generated tokens only, not the prompt’s tokens. Useful for billing-like accounting, latency analysis, or capping conversation length.

Calling chat()

A minimal call uses every default:

$response = $model->chat(Prompt::user('Hello!'));

A fully-specified one:

$response = $model->chat(
    Prompt::system('You are a helpful, concise assistant.')
        ->withUser('What is the capital of Antarctica?'),
    maxTokens: 256,
    nCtx: 4096,
    temperature: 0.7,
    seed: 42,
);

Sampling defaults — temperature: 0.0, seed: 1234 — give greedy, deterministic output: the same prompt always produces the same reply. Crank temperature up for varied / creative output; the seed only matters when temperature > 0.

Errors

Model::chat() raises InferenceException for any failure between “the model exists” and “we got tokens back”. The most common message strings:

| Substring | Meaning |
|---|---|
| model has been closed | You called chat() after $model->close(). Reload the model. |
| model has no embedded chat template | The GGUF is a base model, not an instruct/chat variant. Either pick a chat-tuned model or use Model::raw(). |
| apply_chat_template failed | The chat template rendered but llama.cpp rejected the result. Usually means the message-role sequence is one the template doesn’t support (e.g. multiple system messages). |
| prompt is N tokens but n_ctx is only M | The rendered prompt is longer than nCtx. Bump nCtx or shorten the prompt. |
| chat message contains a null byte | A Prompt’s content has an embedded \0. Strip it before constructing the prompt. |

Chat-template errors

If you load a model that doesn’t ship a chat template — typically a “base” or “pretrained” model rather than an instruct variant — you’ll see:

InferenceException: model has no embedded chat template — use
Model::raw() for this model: …

Model::raw() lets you do your own templating. See Raw completions.

Streaming

chat() is currently synchronous: it returns the complete Response once decoding finishes. A streaming variant, likely Model::chatStream(): \Generator, is on the roadmap.

For long generations under a request/response model where blocking is unacceptable, the workable shortcut today is to set a tight maxTokens and call chat() repeatedly with the previous turn appended to the Prompt. That sacrifices KV-cache reuse but works.

Raw completions

Model::raw() is the escape hatch for callers who want full control over the prompt string instead of going through Prompt + Model::chat().

public function raw(
    string $prompt,
    int    $maxTokens   = 128,
    int    $nCtx        = 2048,
    float  $temperature = 0.0,
    int    $seed        = 1234,
    bool   $addBos      = true,
): string;

Returns a plain string — no Response wrapper, no reasoning split. If you want either of those, use chat().

When to use raw()

Three legitimate use cases:

1. Models without a chat template

“Base” / “pretrained” / “foundation” models — Llama 3 base, Mistral base, Qwen base — ship GGUFs that haven’t been instruction-tuned and have no embedded chat template. Model::chat() rejects them with:

InferenceException: model has no embedded chat template — use
Model::raw() for this model

For these models, raw() is the only path. Build prompts in whatever shape the model expects — typically just free-form text-continuation:

$text = $model->raw(
    "The capital of France is",
    maxTokens: 8,
    temperature: 0.0,
);
// " Paris."

2. Custom chat templates

Maybe the model’s embedded chat template doesn’t match what you want — e.g. you want to add a tool-result message that the embedded template doesn’t know about, or you’re injecting RAG context in a non-standard slot. Build the prompt string yourself:

$prompt = <<<TXT
<|im_start|>system
You are a calculator. Only emit JSON: {"result": <number>}.
<|im_end|>
<|im_start|>user
What is 2+2?
<|im_end|>
<|im_start|>assistant

TXT;

$text = $model->raw($prompt, maxTokens: 32, temperature: 0.0);
// '{"result": 4}'

The trade-off: you own template correctness. The chat template that chat() uses is the one the model author tested with; hand-rolling means hand-checking.

3. Stop-sequence simulation

Stop-string support is on the roadmap but not in v0.1. If you need a generation to halt at a specific marker, raw() plus post-processing is the workaround:

$text = $model->raw($promptEndingWithMarker, maxTokens: 256);
$pos  = strpos($text, '</answer>');   // int offset (possibly 0), or false when absent
$text = $pos === false ? $text : substr($text, 0, $pos);

What raw() does NOT do

  • No reasoning split. raw() returns a string, not a Response. If the model emits <think>…</think> blocks, they end up in your string verbatim. You can strip them yourself with a regex if it matters; the canonical case for that is Reasoning models.
  • No chat-template rendering. What you pass in is what gets tokenized.
  • No finish-reason or token-count metadata. If you need those, use chat().
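
For the first point, a regex split over the raw string might look like the sketch below. splitReasoning() is a hypothetical helper; it approximates the split that Response performs and is not guaranteed to match the extension's behaviour in edge cases such as an unclosed <think> tag.

```php
<?php

declare(strict_types=1);

/**
 * Split <think>...</think> blocks out of raw model output.
 * splitReasoning() is a hypothetical helper, not part of ext-infer.
 *
 * @return array{reasoning: ?string, answer: string}
 */
function splitReasoning(string $text): array
{
    // /s lets "." span newlines; the lazy ".*?" keeps separate blocks separate.
    $pattern = '/<think>(.*?)<\/think>/s';

    preg_match_all($pattern, $text, $matches);
    $blocks = array_filter(
        array_map('trim', $matches[1]),
        fn (string $block): bool => $block !== ''
    );

    return [
        'reasoning' => $blocks === [] ? null : implode("\n", $blocks),
        'answer'    => ltrim((string) preg_replace($pattern, '', $text)),
    ];
}

$parts = splitReasoning("<think>Okay so 2+2…</think>\n\n2 + 2 equals 4.");
echo $parts['answer'], PHP_EOL;      // 2 + 2 equals 4.
```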

addBos

The addBos: true default tells the tokenizer to prepend the model’s beginning-of-sequence token (whatever it is for that family). For most models that’s right. Set addBos: false when:

  • You’re building a prompt that already starts with the BOS token explicitly (rare).
  • The model’s tokenizer rejects BOS prepending (also rare).
  • You’re feeding raw() mid-conversation and don’t want a new BOS in the middle (very rare and probably a sign you should be using a Prompt).

The other named arguments — maxTokens, nCtx, temperature, seed — behave the same as in chat(). See Options reference.

When NOT to use raw()

If the model has a chat template and you’re sending the “system / user / assistant” shape: use chat(). It’s shorter, safer, and the Response it returns gives you reasoning splitting + metadata for free.

raw() exists so escape hatches don’t require dropping the extension entirely. Treat it as the lower-level layer, not the default.

Embeddings

Model::embed() turns a piece of text into a fixed-length vector of floats. Cosine similarity between two such vectors approximates semantic similarity between the texts they came from — that’s the foundation of every semantic-search / RAG pipeline.

public function embed(string $text): \Displace\Infer\Embedding;

Enable embedding mode at load time

Embedding generation requires a context built with with_embeddings(true) under the hood. Because that conflicts with generation mode for a given context, ext-infer makes the choice explicit at load:

use Displace\Infer\Model;

$model = Model::load('models/embedding-model.gguf', [
    'embedding' => true,
]);

With embedding: true, embed() works. Without it, embed() throws:

InferenceException: Model::embed() requires loading with ['embedding' => true]

chat() and raw() still work on an embedding-loaded handle — they build their own per-call context for generation. So one handle can do both, but you opt in to embed() explicitly.

Pooling

Sentence embeddings need a way to collapse the per-token hidden states into a single vector. Different model families do this differently:

| Pooling | Used by |
|---|---|
| mean | BGE, GTE, E5: average across tokens |
| cls | original BERT: uses the [CLS] token’s hidden state |
| last | Qwen3-Embedding: uses the last token’s hidden state |
| rank | rerankers: emits a single score, not a vector |
| none | per-token vectors, no pooling |

Modern embedding GGUFs declare their pooling type in metadata. ext-infer’s default is 'unspecified' (trust the metadata):

$model = Model::load($path, ['embedding' => true]);
// pooling: whatever the GGUF says (almost always correct)

Override if a GGUF ships without the metadata or you want to experiment:

$model = Model::load($path, [
    'embedding' => true,
    'pooling'   => 'mean',   // 'unspecified' | 'none' | 'mean' | 'cls' | 'last' | 'rank'
]);

An unknown pooling string is rejected at load time, not at first embed() call:

InferException: invalid option pooling: expected one of
unspecified/none/mean/cls/last/rank, got "weighted"

Generating embeddings

$emb = $model->embed('The cat sat on the mat.');

$emb->vector();        // list<float> — length matches the model's n_embd
$emb->dimensions();    // int — same as count($emb->vector())

Vectors are returned as PHP arrays of floats (doubles); internally we hold Vec<f32> and let ext-php-rs convert f32 → f64 at the boundary, which is lossless.

Vector math, built in

Embedding carries the math you need most of the time so you don’t have to write a numpy-equivalent in PHP:

$emb->norm();              // float — L2 norm: sqrt(sum_i x_i^2)
$emb->normalize();         // new Embedding scaled to unit length
$a->cosineSimilarity($b);  // float in [-1, 1]

normalize() returns a new Embedding — the original is not modified. This matters for caching: cache the normalized form once, then every subsequent cosineSimilarity call is just a dot product.

cosineSimilarity() throws on a dimension mismatch:

InferenceException: cannot compare embeddings of different
dimensions: 1024 vs 384

That’s deliberate — comparing across model families is almost always a bug, and silently returning a number would hide it.

Why normalize before comparing?

Cosine similarity ignores magnitude — it compares direction. If either vector has magnitude zero, the answer is undefined; we return 0.0 rather than NaN. If both are non-zero, cosineSimilarity does the right thing on un-normalized vectors too. But:

  • For a fixed corpus you query against, normalizing once is cheap and makes the inner loop a single dot product: array_sum(array_map(fn($x, $y) => $x * $y, $a, $b)).
  • For pgvector / sqlite-vec storage, you usually want normalized vectors stored so the database can use the inner-product operator (<#> in pgvector) instead of the cosine operator (<=>).
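
To see why normalizing first reduces the comparison to a dot product, here is a small worked example on plain PHP arrays. The functions unit(), dot(), and cosine() are illustrative stand-ins written for this sketch; they are not the extension's Embedding implementation, though cosine() mirrors the zero-magnitude rule described above.

```php
<?php

declare(strict_types=1);

/** Scale a vector to unit length (a zero vector is returned unchanged). */
function unit(array $v): array
{
    $norm = sqrt(array_sum(array_map(fn (float $x): float => $x * $x, $v)));
    return $norm == 0.0 ? $v : array_map(fn (float $x): float => $x / $norm, $v);
}

/** Plain dot product of two equal-length vectors. */
function dot(array $a, array $b): float
{
    return (float) array_sum(array_map(fn (float $x, float $y): float => $x * $y, $a, $b));
}

/** Cosine similarity on un-normalized vectors; 0.0 when either has zero magnitude. */
function cosine(array $a, array $b): float
{
    $na = sqrt(dot($a, $a));
    $nb = sqrt(dot($b, $b));
    return ($na == 0.0 || $nb == 0.0) ? 0.0 : dot($a, $b) / ($na * $nb);
}

$a = [3.0, 4.0];
$b = [4.0, 3.0];

printf("cosine on raw vectors:     %.4f\n", cosine($a, $b));          // 0.9600
printf("dot on normalized vectors: %.4f\n", dot(unit($a), unit($b))); // 0.9600
```

Both lines print the same value: dividing by the two norms up front is the only difference between the two computations.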

A canonical pipeline:

$query = $model->embed($userQuestion)->normalize();
$best  = null;
$bestScore = -INF;
foreach ($corpusEmbeddings as $docId => $docEmb) {
    // $docEmb is also pre-normalized
    $score = $query->cosineSimilarity($docEmb);
    if ($score > $bestScore) {
        $best = $docId;
        $bestScore = $score;
    }
}

For real-world indexing — even at a few thousand documents — push the storage into a database. See Semantic search and RAG over markdown.

Choosing an embedding model

The chat-tuned models people download for completions (Qwen3-0.6B, Llama 3.2 3B, Mistral 7B) can be loaded with embedding: true and will return a vector — but it’s not what they were trained for, and similarity numbers are noisier than what a purpose-built embedding model produces.

| Model family | Dims | Notes |
|---|---|---|
| Qwen3-Embedding (0.6B) | 1024 | Apache-2.0. Same architecture as Qwen3-0.6B, retrained for embeddings. Strong default. |
| BGE-small / BGE-large | 384 / 1024 | Beijing Academy of AI. Widely used, mean pooling. |
| E5-small / E5-large | 384 / 1024 | Microsoft. Trained on text similarity tasks. |
| GTE-small / GTE-large | 384 / 1024 | Alibaba. |

See Choosing a model for more on GGUF quants and what size to start with.

Choosing a model

ext-infer loads any GGUF file llama.cpp can handle. Picking which GGUF is the most important choice you’ll make — it dominates inference quality, memory footprint, and latency. This page is a tour of the landscape.

What is GGUF?

GGUF (GPT-Generated Unified Format) is llama.cpp’s native model format. A .gguf file packs:

  • Weights in a specific quantization.
  • Tokenizer (vocabulary + merges).
  • Architecture metadata (layer count, hidden size, attention config) so llama.cpp knows how to run the model without a separate config file.
  • Chat template (for instruct models) so Model::chat() knows how to render messages.
  • Pooling type (for embedding models) so Model::embed() knows how to collapse hidden states.

GGUF files self-contain everything ext-infer needs. There is no separate config / tokenizer / vocab file to manage.

Model families

There are three broad categories you’ll encounter:

| Category | ext-infer method to use | Examples |
|---|---|---|
| Base / pretrained | raw() only | Llama 3 base, Mistral 7B base, Qwen base |
| Chat / instruct | chat(), raw() | Qwen3-Instruct, Llama 3.x Instruct, Mistral Instruct |
| Embedding / reranker | embed() | Qwen3-Embedding, BGE, E5, GTE |

A chat model loaded with 'embedding' => true will return a vector, but it’s not what the model was optimized for — the vectors are noisier than what a purpose-built embedding model produces. The reverse (chat() against a pure embedding GGUF) usually fails because embedding-only models don’t ship a chat template.

Quantization

A 7B-parameter model stored as 16-bit floats (F16, two bytes per weight) is ~14 GB on disk. Quantization trades a small amount of quality for a much smaller, faster file. The suffixes you’ll see in GGUF filenames:

| Suffix | Approx. size for a 7B model | Quality | Notes |
|---|---|---|---|
| F16 | ~14 GB | Lossless | Reference. Rarely worth the size unless you have plenty of memory. |
| Q8_0 | ~7 GB | Near-lossless | Good default when you can afford the disk. |
| Q6_K | ~5.5 GB | Excellent | |
| Q5_K_M | ~5 GB | Very good | |
| Q4_K_M | ~4.5 GB | Good | The most popular size/quality compromise. |
| Q4_K_S | ~4 GB | Solid | |
| Q3_K_M | ~3.5 GB | Noticeable degradation | Useful on memory-constrained boxes. |
| Q2_K | ~2.5 GB | Significant degradation | Last resort. |

The K-family (Q4_K_M, etc.) uses k-quants, a smarter scheme than the legacy non-K variants (Q4_0, Q4_1). Prefer K-quants when both are offered for the same model.

Picking a quant

Two questions:

  1. How much memory can you spend? Quants below Q4_K_M save space at increasing quality cost. Above Q4_K_M, the marginal gain per GB shrinks fast.
  2. Is the model small enough that quantization barely matters? For sub-1B models like Qwen3-0.6B, even Q8_0 is ~640 MB — negligible by 2026 standards. Take the quality bump.

A good default rule: Q4_K_M for models > 3B, Q8_0 for smaller models.
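
The table sizes follow roughly from parameter count times bits per weight. As a minimal sketch of that arithmetic and of the default rule above: the helper names below are hypothetical (not part of ext-infer), and the bit widths are nominal. Real GGUF files run somewhat larger because quantized blocks also store scale factors and the file carries tokenizer and metadata.

```php
<?php
declare(strict_types=1);

// Hypothetical helpers for illustration; not part of ext-infer.

/** Rough file size in GB: parameters (in billions) x bits per weight / 8. */
function approxSizeGb(float $paramsBillions, float $bitsPerWeight): float
{
    return $paramsBillions * $bitsPerWeight / 8.0;
}

/** The default rule from the text: Q4_K_M above 3B parameters, Q8_0 otherwise. */
function defaultQuant(float $paramsBillions): string
{
    return $paramsBillions > 3.0 ? 'Q4_K_M' : 'Q8_0';
}

printf("7B at 16 bits: ~%.1f GB\n", approxSizeGb(7.0, 16.0));
printf("7B at 8 bits:  ~%.1f GB\n", approxSizeGb(7.0, 8.0));
printf("0.6B default:  %s\n", defaultQuant(0.6));
printf("7B default:    %s\n", defaultQuant(7.0));
```

Treat the result as a lower bound when budgeting disk or RAM.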

What we’ve actually tested against:

Chat (smallest reasonable)

Qwen/Qwen3-0.6B-GGUF — Apache-2.0, 600M params, Q8 ≈ 640 MB. Reasoning model: emits <think>…</think> blocks through its chat template, which Response splits for you. Great for getting started; not great for production-quality answers.

curl -L -o models/Qwen3-0.6B-Q8_0.gguf \
    https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q8_0.gguf

Chat (production-ish)

bartowski/Qwen3-7B-Instruct-GGUF at Q4_K_M (~4.4 GB) — same family, much better reasoning quality. Or bartowski/Llama-3.2-3B-Instruct-GGUF at Q4_K_M (~1.9 GB) for a smaller, non-reasoning option.

Embedding (small, fast)

Qwen/Qwen3-Embedding-0.6B-GGUF — Apache-2.0, 1024-dim embeddings, last-token pooling baked into metadata. Same size as the chat model; quality is competitive with BGE/E5 small variants.

curl -L -o models/Qwen3-Embedding-0.6B-Q8_0.gguf \
    https://huggingface.co/Qwen/Qwen3-Embedding-0.6B-GGUF/resolve/main/Qwen3-Embedding-0.6B-Q8_0.gguf

Alternative: CompendiumLabs/bge-small-en-v1.5-gguf — 384-dim, CLS pooling, ~130 MB. Lower-quality vectors but tiny.

Where to look for more

  • Hugging Face GGUF tag: library=gguf filters to GGUF-format models.
  • bartowski — prolific publisher of quantized GGUFs for popular models. Reliable, consistent naming.
  • mradermacher — ditto.
  • The model’s own official GGUF repo when one exists (e.g. Qwen/Qwen3-7B-Instruct-GGUF) — always the most trusted source.

License caveats

GGUF files inherit the underlying model’s license. Some models that are nominally “open” (Llama 3.x, Gemma) ship under custom licenses with use restrictions; others (Qwen, Mistral, several smaller players) are Apache-2.0 / MIT. Check the model card before depending on a model in a commercial deployment.

ext-infer itself is MIT-licensed — the extension doesn’t care which GGUF you load, but downstream concerns are on you.

Options reference

Every option that any ext-infer method accepts, in one table per method. For conceptual context on individual options, follow the links in the rightmost column.

Model::load($path, $options)

The second argument is an associative array. Keys are kept as snake-case strings (like PHP ini settings) because load-time tuning is rare and the array form composes well with config arrays loaded from disk.

| Key | Type | Default | See |
|---|---|---|---|
| n_gpu_layers | int | 0 | Performance tuning |
| use_mmap | bool | true | Performance tuning |
| use_mlock | bool | false | Performance tuning |
| embedding | bool | false | Embeddings |
| pooling | string | 'unspecified' | Embeddings |

Validation rules

  • Unknown keys are not rejected; they are silently ignored. This is deliberate (forward-compatibility for callers loading config from files), but it means a typo such as n_gpu_layer has no effect and raises no error. If an option seems to do nothing, var_dump() the options array and compare each key against the table above before reporting a bug.
  • Type mismatches are rejected, with a clear message: invalid option n_gpu_layers: expected integer.
  • Negative integers and out-of-range values for n_gpu_layers are rejected: invalid option n_gpu_layers: must be non-negative.
  • pooling accepts only the six strings listed in Embeddings → Pooling.
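
Because unknown keys are ignored, a small pre-flight check in application code can catch typos before they reach Model::load(). The sketch below is one hypothetical way to do it. The function name and the "within two edits" suggestion threshold are assumptions; only the key list comes from the table above.

```php
<?php
declare(strict_types=1);

// Keys copied from the Model::load options table above.
const KNOWN_LOAD_OPTIONS = ['n_gpu_layers', 'use_mmap', 'use_mlock', 'embedding', 'pooling'];

/**
 * Hypothetical pre-flight check (not part of ext-infer): maps each
 * unrecognized key to its closest documented key, or null if none is close.
 *
 * @param  array<string, mixed> $options
 * @return array<string, ?string>
 */
function unknownLoadOptions(array $options): array
{
    $unknown = [];
    foreach (array_keys($options) as $key) {
        $key = (string) $key;
        if (in_array($key, KNOWN_LOAD_OPTIONS, true)) {
            continue;
        }
        $best = null;
        $bestDistance = 3; // only suggest keys within two edits
        foreach (KNOWN_LOAD_OPTIONS as $candidate) {
            $distance = levenshtein($key, $candidate);
            if ($distance < $bestDistance) {
                $best = $candidate;
                $bestDistance = $distance;
            }
        }
        $unknown[$key] = $best;
    }
    return $unknown;
}

$typos = unknownLoadOptions(['n_gpu_layer' => 20, 'use_mmap' => true, 'threads' => 8]);
foreach ($typos as $key => $suggestion) {
    echo $suggestion === null
        ? "unknown option '{$key}'\n"
        : "unknown option '{$key}' (did you mean '{$suggestion}'?)\n";
}
```

Running this against config loaded from disk keeps the extension's forward-compatible behavior while still surfacing misspelled keys in your own logs.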

Model::chat($prompt, ...)

Named arguments, not an array. PHP 8.0+ named-argument syntax uses the parameter name verbatim, so you write maxTokens: 256 (camelCase, matching the method signature).

| Argument | Type | Default | See |
|---|---|---|---|
| $prompt | \Displace\Infer\Prompt | required | Prompts |
| maxTokens | int | 128 | Chat completions |
| nCtx | int | 2048 | Chat completions |
| temperature | float | 0.0 | Chat completions |
| seed | int | 1234 | Chat completions |

Behavior

  • temperature = 0.0 is greedy (deterministic). > 0.0 samples, controlled by seed.
  • seed is only consulted when temperature > 0.
  • maxTokens caps generation. Hitting it sets Response::finishReason() to 'length'.
  • nCtx is the context window for this call. If the rendered prompt exceeds it, InferenceException is raised before generation starts.

Model::raw($prompt, ...)

Same named-argument shape as chat() plus addBos.

| Argument | Type | Default | See |
|---|---|---|---|
| $prompt | string | required | Raw completions |
| maxTokens | int | 128 | Chat completions |
| nCtx | int | 2048 | Chat completions |
| temperature | float | 0.0 | Chat completions |
| seed | int | 1234 | Chat completions |
| addBos | bool | true | Raw completions → addBos |

Model::embed($text)

Just the text. Pooling and embedding-mode are configured at load time (see Model::load above).

| Argument | Type | Default | See |
|---|---|---|---|
| $text | string | required | Embeddings |

Embedding math

Embedding is read-only; the math methods return new instances rather than mutating.

| Method | Returns |
|---|---|
| vector() | list<float> |
| dimensions() | int |
| norm() | float |
| normalize() | new Embedding |
| cosineSimilarity(Embedding $other) | float (in [-1, 1]) |

cosineSimilarity throws InferenceException on a dimension mismatch — see Embeddings → vector math.

Prompt

Static factories + immutable with* builders.

| Method | Returns |
|---|---|
| Prompt::system($content) | new Prompt |
| Prompt::user($content) | new Prompt |
| withSystem($content) | new Prompt |
| withUser($content) | new Prompt |
| withAssistant($content) | new Prompt |
| messages() | list<Message> |
| lastRole() | ?string |
| count() | int |
| isEmpty() | bool |

See Prompts for the immutability semantics.

Response

Read-only. Six getters.

| Method | Returns |
|---|---|
| text() | string |
| reasoning() | ?string |
| answer() | string |
| hasReasoning() | bool |
| finishReason() | string: 'eos' / 'length' / 'stop' |
| tokensGenerated() | int |

See Chat completions → Inspecting a Response.

Environment

Not strictly an option, but bears mentioning here:

| Variable | Effect |
|---|---|
| EXT_INFER_LOG=1 | Restore llama.cpp’s verbose stderr logging (silenced by default). |

See Environment variables.

Multi-turn chat

The pattern: keep the system message stable, append user/assistant turns as the conversation grows, regenerate the prompt on each user input. Lifted directly from examples/chat-interactive/.

The shape

use Displace\Infer\Model;
use Displace\Infer\Prompt;

$model = Model::load('models/Qwen3-0.6B-Q8_0.gguf');

$base         = Prompt::system('You are a helpful, concise assistant.');
$conversation = $base;

while (($line = readline('> ')) !== false) {
    $line = trim($line);
    if ($line === '' || $line === '/exit') {
        break;
    }

    // /reset is trivial because Prompt is immutable.
    if ($line === '/reset') {
        $conversation = $base;
        continue;
    }

    $conversation = $conversation->withUser($line);

    $response = $model->chat(
        $conversation,
        maxTokens: 512,
        temperature: 0.7,
    );

    echo $response->answer(), PHP_EOL;

    // Feed answer() back, NOT text(). See "Reasoning models" below.
    $conversation = $conversation->withAssistant($response->answer());
}

$model->close();

Three things this gets right

1. The system message is stable

$base is built once and never mutated. Every /reset re-seats the conversation at the original system instruction without re-allocating or re-rendering. If you change the system prompt mid-conversation elsewhere in your app, the immutable shape means concurrent uses of $base aren’t affected.

2. Conversation grows by immutable append

$conversation = $conversation->withUser($line);

Every with* returns a new Prompt. The old $conversation is still valid (and still has the previous turn count); the local just points at the new one. There’s no shared mutable state, so this code is safe to put behind a queue worker or run in parallel.

3. Response::answer() goes back, not text()

$conversation = $conversation->withAssistant($response->answer());

This matters for reasoning models. answer() is the reply with <think>...</think> blocks stripped; text() is the raw output including the thoughts. Feeding text() back means the model sees its own internal monologue on the next turn — and reasoning models tend to treat that as instruction, not history. The output derails fast.

For non-reasoning models, answer() and text() are byte-identical, so the rule is “use answer() always” rather than “use answer() for some models”.

Persisting conversations

If you need to save and restore conversations (e.g. per-user chat history in a database), serialize the message list and rebuild the Prompt:

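use Displace\Infer\Prompt;

function loadConversation(string $system, array $history): Prompt
{
    $p = Prompt::system($system);
    foreach ($history as $row) {
        $p = match ($row['role']) {
            'user'      => $p->withUser($row['content']),
            'assistant' => $p->withAssistant($row['content']),
            // The system message is already seeded above, so a saved
            // 'system' row is skipped rather than appended a second time.
            'system'    => $p,
            // Without a default arm, match throws UnhandledMatchError.
            default     => throw new InvalidArgumentException("unknown role: {$row['role']}"),
        };
    }
    return $p;
}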

function saveConversation(Prompt $p): array
{
    $rows = [];
    foreach ($p->messages() as $msg) {
        $rows[] = ['role' => $msg->role(), 'content' => $msg->content()];
    }
    return $rows;
}

Prompt::messages() walks in chronological order, so saving and re-loading round-trips faithfully.
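
The rows are plain arrays, so they can be stored in a single JSON or TEXT column. A minimal sketch, assuming the row shape that saveConversation() returns; the helper names encodeHistory() and decodeHistory() are hypothetical, and the sample content is made up:

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: encode history rows for a TEXT/JSON column. */
function encodeHistory(array $rows): string
{
    return json_encode($rows, JSON_THROW_ON_ERROR | JSON_UNESCAPED_UNICODE);
}

/** Hypothetical helper: decode and validate rows before rebuilding a Prompt. */
function decodeHistory(string $json): array
{
    $rows = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
    if (!is_array($rows)) {
        throw new UnexpectedValueException('history must be a JSON array');
    }
    foreach ($rows as $row) {
        if (!is_array($row)
            || !is_string($row['role'] ?? null)
            || !is_string($row['content'] ?? null)) {
            throw new UnexpectedValueException('malformed history row');
        }
    }
    return $rows;
}

// Rows in the shape saveConversation() returns.
$rows = [
    ['role' => 'user',      'content' => 'What is GGUF?'],
    ['role' => 'assistant', 'content' => 'A single-file model format used by llama.cpp.'],
];

$restored = decodeHistory(encodeHistory($rows));
var_dump($restored === $rows);
```

Validating on the way in means a corrupted row fails loudly at load time instead of producing a confusing prompt later.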

Common shape: an HTTP turn

For a request/response API where every HTTP call is one turn:

// Inside your controller — assumes $model is injected and reused.
final class ChatController
{
    public function __construct(private Model $model, private HistoryStore $history) {}

    public function turn(Request $req): Response
    {
        $conversationId = $req->session('conversation_id');
        $history        = $this->history->load($conversationId);
        $system         = $req->user()->systemPrompt() ?? 'You are helpful.';

        $prompt = loadConversation($system, $history)
            ->withUser($req->json('message'));

        $reply = $this->model->chat(
            $prompt,
            maxTokens: 1024,
            temperature: 0.5,
        );

        $this->history->append($conversationId, 'user', $req->json('message'));
        $this->history->append($conversationId, 'assistant', $reply->answer());

        return new JsonResponse([
            'answer'    => $reply->answer(),
            'reasoning' => $reply->reasoning(),
            'truncated' => $reply->finishReason() === 'length',
            'tokens'    => $reply->tokensGenerated(),
        ]);
    }
}

The $model is loaded once at FPM-worker boot — not per request — and chat() is called per request. With current ext-infer (no KV-cache reuse yet), each turn re-tokenizes and re-decodes the full history, which is slow for long conversations. A Session object that reuses the underlying llama.cpp context is on the roadmap.
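
Until context reuse lands, one way to bound the per-turn cost is to cap how much history is replayed. The sketch below is a hypothetical helper, not part of ext-infer. It works on the stored rows and uses a character budget as a rough proxy for tokens, which is an assumption of this example.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: keep the most recent rows whose combined content
 * fits a character budget. Characters are only a rough proxy for tokens.
 *
 * @param  list<array{role: string, content: string}> $history oldest first
 * @return list<array{role: string, content: string}>
 */
function trimHistory(array $history, int $maxChars): array
{
    $kept = [];
    $used = 0;
    // Walk newest to oldest so recent turns win.
    for ($i = count($history) - 1; $i >= 0; $i--) {
        $used += strlen($history[$i]['content']);
        if ($used > $maxChars) {
            break;
        }
        array_unshift($kept, $history[$i]);
    }
    // Start on a user turn so the replayed transcript still alternates cleanly.
    while ($kept !== [] && $kept[0]['role'] !== 'user') {
        array_shift($kept);
    }
    return $kept;
}

$history = [
    ['role' => 'user',      'content' => 'aaaa'],
    ['role' => 'assistant', 'content' => 'bbbb'],
    ['role' => 'user',      'content' => 'cc'],
    ['role' => 'assistant', 'content' => 'dd'],
];
print_r(trimHistory($history, 9));
```

In the controller above, the trimmed rows would be passed to loadConversation() in place of the full history. The system message is unaffected because it is seeded separately.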

When to use Model::raw() instead

If you have a very specific prompt shape — tool calls, RAG context injected at a non-standard slot, custom format — see Raw completions. The Prompt builder doesn’t support tool-call messages today, so tool-aware conversations need raw() until tool calling lands.

Semantic search

Embed a corpus once, embed user queries on demand, return the closest matches by cosine similarity. The foundation of every “search by meaning, not keywords” pipeline.

Minimal in-memory version

use Displace\Infer\Model;

$model = Model::load('models/Qwen3-Embedding-0.6B-Q8_0.gguf', [
    'embedding' => true,
]);

// Embed the corpus once. In real code, do this offline and cache.
$corpus = [
    'doc-1' => 'PHP is a server-side scripting language.',
    'doc-2' => 'Cats are popular pets known for their independence.',
    'doc-3' => 'Rust provides memory safety without garbage collection.',
    'doc-4' => 'Dogs are descendants of wolves, domesticated millennia ago.',
];
$index = [];
foreach ($corpus as $id => $text) {
    // Normalize once so the search loop is a plain dot product.
    $index[$id] = $model->embed($text)->normalize();
}

// Search.
function search(Model $model, array $index, string $query, int $k = 3): array
{
    $q = $model->embed($query)->normalize();
    $hits = [];
    foreach ($index as $id => $emb) {
        $hits[$id] = $q->cosineSimilarity($emb);
    }
    arsort($hits);
    return array_slice($hits, 0, $k, preserve_keys: true);
}

print_r(search($model, $index, 'a typesafe language'));
// Illustrative output; exact scores depend on the model:
// Array
// (
//     [doc-3] => 0.7421
//     [doc-1] => 0.4567
//     [doc-2] => 0.1234
// )

$model->close();

Three things to know

Normalize when you index

Embedding::normalize() returns a unit vector. With both sides normalized, cosine similarity simplifies to a dot product:

cos(a, b) = (a · b) / (||a|| · ||b||)
          = a_unit · b_unit            // if both are normalized

Normalize once at index time so the per-query work is just the dot product. Embedding::cosineSimilarity() does the normalization internally if you skip the explicit step — but you pay for it on every call, which adds up across thousands of documents.
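
To make the identity concrete, here is the same math in plain PHP on list<float> values. This is a reference sketch for checking intuition, not the extension's implementation; the function names are chosen for this example only.

```php
<?php
declare(strict_types=1);

// Plain-PHP reference math on list<float>; not the extension's implementation.

function dot(array $a, array $b): float
{
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('dimension mismatch');
    }
    $sum = 0.0;
    foreach ($a as $i => $x) {
        $sum += $x * $b[$i];
    }
    return $sum;
}

function norm(array $v): float
{
    return sqrt(dot($v, $v));
}

function normalize(array $v): array
{
    $n = norm($v);
    if ($n === 0.0) {
        throw new InvalidArgumentException('cannot normalize a zero vector');
    }
    return array_map(fn (float $x): float => $x / $n, $v);
}

function cosine(array $a, array $b): float
{
    return dot($a, $b) / (norm($a) * norm($b));
}

$a = [3.0, 4.0];
$b = [4.0, 3.0];

printf("cosine of raw vectors:     %.4f\n", cosine($a, $b));
printf("dot of normalized vectors: %.4f\n", dot(normalize($a), normalize($b)));
```

Both lines print the same value: the two square roots and the division move from query time to index time.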

Pick an embedding model, not a chat model

A chat-tuned model loaded with 'embedding' => true will return a vector, but the similarity numbers cluster too tightly to be useful at scale. Use a purpose-built embedding model — see Choosing a model.

What “useful” looks like with a real embedding model (Qwen3-Embedding-0.6B):

cat-mat ↔ feline-rug:      0.72   (paraphrase)
cat-mat ↔ grocery-shop:    0.29   (unrelated)
feline-rug ↔ grocery-shop: 0.26   (unrelated)

Same query with the chat-tuned Qwen3-0.6B (loaded in embedding mode):

cat-mat ↔ feline-rug:      0.66
cat-mat ↔ grocery-shop:    0.51
feline-rug ↔ grocery-shop: 0.50

The chat model preserves the ordering — the related pair scores highest — but the gap is much narrower, so the cut-off threshold between “match” and “not match” is harder to draw.

Cache the index

In production, the in-memory dictionary in the example above doesn’t scale past a few thousand documents — the search loop is O(corpus size). Two upgrade paths:

  • Persist embeddings to disk (a JSON file, a SQLite blob column, or a serialize() dump). Saves the embed-time cost on subsequent runs.
  • Index with a vector database: pgvector (PostgreSQL extension), sqlite-vec, Qdrant, Pinecone. They handle the nearest-neighbor search far more efficiently than a PHP loop.
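
For the first path, a minimal sketch of an on-disk cache. The helper names and file layout are assumptions of this example: one JSON file mapping document ids to base64-encoded float32 blobs. The vectors would come from Embedding::vector().

```php
<?php
declare(strict_types=1);

// Hypothetical helpers; not part of ext-infer.

/** Write id => list<float> pairs to one JSON file of base64 float32 blobs. */
function saveIndex(string $file, array $index): void
{
    $rows = [];
    foreach ($index as $id => $vector) {
        // 'g' is little-endian float32 regardless of host byte order.
        $rows[$id] = base64_encode(pack('g*', ...$vector));
    }
    file_put_contents($file, json_encode($rows, JSON_THROW_ON_ERROR));
}

/** Read the file back into id => list<float>. */
function loadIndex(string $file): array
{
    $json = file_get_contents($file);
    if ($json === false) {
        throw new RuntimeException("cannot read {$file}");
    }
    $index = [];
    foreach (json_decode($json, true, 512, JSON_THROW_ON_ERROR) as $id => $blob) {
        // unpack() returns a 1-based array; array_values() restores a list.
        $index[$id] = array_values(unpack('g*', base64_decode($blob)));
    }
    return $index;
}

$file = tempnam(sys_get_temp_dir(), 'idx');
saveIndex($file, ['doc-1' => [0.5, -0.25], 'doc-2' => [1.0, 2.0]]);
$loaded = loadIndex($file);
unlink($file);
print_r($loaded);
```

Values are stored at float32 precision, so a vector with components that are not exactly representable in 32 bits will come back slightly rounded.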

See RAG over markdown for a worked example using sqlite-vec.

Re-ranking with a chat model

For higher-quality top-K, embed-rank-then-rerank-with-a-chat-model is the canonical pattern:

// 1. Coarse retrieval — embedding similarity, top 20.
$hits = search($embedModel, $index, $query, k: 20);

// 2. Fine reranking — ask a chat model to score each candidate.
$prompt = Prompt::system(
    'You are a relevance judge. Given a query and a document, ' .
    'respond with a single number between 0 and 1 indicating ' .
    'how relevant the document is to the query.'
);
$rerank = [];
foreach (array_keys($hits) as $docId) {
    $r = $chatModel->chat(
        $prompt->withUser("Query: {$query}\n\nDocument: {$corpus[$docId]}"),
        maxTokens: 8,
        temperature: 0.0,
    );
    $rerank[$docId] = (float) trim($r->answer());
}
arsort($rerank);

That’s two model loads — one embedding, one chat. Reuse handles across requests; loading is the expensive step.
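
One caveat in the loop above: a plain (float) cast yields 0.0 whenever the reply starts with anything other than a number ("Score: 0.9", for instance). A slightly more forgiving parser is sketched below; parseScore() is a hypothetical helper for this recipe, not part of ext-infer.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: pull the first number out of a model reply and
 * clamp it to [0, 1]. Returns null when the reply contains no number.
 */
function parseScore(string $answer): ?float
{
    if (preg_match('/-?\d+(?:\.\d+)?|-?\.\d+/', $answer, $m) !== 1) {
        return null;
    }
    return max(0.0, min(1.0, (float) $m[0]));
}

foreach (['0.85', 'Score: 0.7.', 'irrelevant', '7'] as $reply) {
    var_dump(parseScore($reply));
}
```

Returning null for unparseable replies lets the caller decide whether to drop the candidate or fall back to its embedding score.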

Reasoning models

Qwen3, DeepSeek R1, and other reasoning-tuned models think out loud before answering. When invoked through their chat template, they emit <think>…</think> blocks containing the internal monologue, then the actual reply. ext-infer understands this convention and exposes the two streams separately on Response.

The split, in three calls

use Displace\Infer\Model;
use Displace\Infer\Prompt;

$model = Model::load('models/Qwen3-0.6B-Q8_0.gguf');

$response = $model->chat(
    Prompt::user('What is 2+2?'),
    maxTokens: 512,
);

echo $response->text(), PHP_EOL;
// <think>
// Okay, the user is asking what 2+2 is. This is basic arithmetic.
// I should respond with the correct sum, which is 4. Let me also
// verify there's no trick here — adding two and two definitely
// equals four.
// </think>
//
// 2 + 2 equals 4.

echo $response->answer(), PHP_EOL;
// 2 + 2 equals 4.

echo $response->reasoning(), PHP_EOL;
// Okay, the user is asking what 2+2 is. This is basic arithmetic.
// I should respond with the correct sum, which is 4. ...

echo $response->hasReasoning() ? 'yes' : 'no', PHP_EOL;
// yes

$model->close();

For a non-reasoning model:

  • reasoning() returns null
  • answer() equals text() byte-for-byte
  • hasReasoning() returns false

The split is always on: there’s no flag to disable it. If the output doesn’t contain <think>…</think> tags, nothing changes.

When the budget runs out mid-thought

Reasoning chains can be long. If maxTokens exhausts inside a <think> block — before the closing </think> — the split fails gracefully:

  • text() contains the partial reasoning verbatim, with the open <think> tag and no closing tag.
  • reasoning() returns any previous completed reasoning blocks, or null if none.
  • answer() is the raw output with completed blocks removed and the partial thought left in place. Leaving the partial thought in answer() is intentional: silently swallowing it would hide the budget problem.
  • finishReason() returns 'length'.
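
The rules above can be illustrated with a few lines of plain PHP. This is only a sketch that mimics the described behavior, not ext-infer's actual implementation; the function name, the whitespace trimming, and the way multiple blocks are joined are assumptions of this example.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative only: mimics the split described above, not ext-infer's code.
 *
 * @return array{reasoning: ?string, answer: string}
 */
function splitThink(string $text): array
{
    // Only closed <think>...</think> blocks match; an unterminated one is left alone.
    $pattern = '~<think>(.*?)</think>~s';

    $reasoning = null;
    if (preg_match_all($pattern, $text, $m) > 0) {
        $reasoning = implode("\n\n", array_map('trim', $m[1]));
    }
    $answer = trim(preg_replace($pattern, '', $text) ?? $text);

    return ['reasoning' => $reasoning, 'answer' => $answer];
}

$complete  = splitThink("<think>\nBasic arithmetic.\n</think>\n\n2 + 2 equals 4.");
$truncated = splitThink("<think>\nOkay, the user is asking");

var_dump($complete, $truncated);
```

In the truncated case the reasoning is null and the open <think> tag stays visible in the answer, which is the signal to raise maxTokens.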

The fix is always “bump maxTokens”. A useful pattern is to surface the truncation explicitly:

$response = $model->chat($prompt, maxTokens: 256);

if ($response->finishReason() === 'length') {
    error_log(sprintf(
        'truncated: model wanted more than 256 tokens for "%s..."',
        substr($prompt->messages()[0]->content(), 0, 40),
    ));
}

The interactive chat example uses a softer hint: “(truncated — bump --max-tokens to see more)”.

When you DON’T want reasoning at all

Two strategies, depending on what “don’t want” means.

Strategy A — hide it in the UI, keep it under the hood

Default everywhere. Show $response->answer() to the end user. Log $response->reasoning() for debugging or display behind a “show thinking” toggle. No model-level change.

Strategy B — tell the model to skip thinking

Qwen3 has a /no_think directive that, when included as a system-message suffix, suppresses the monologue. The model still emits an empty <think></think> block (which the split handles; reasoning() ends up being an empty string), but the block has no content:

$prompt = Prompt::system('You are helpful. /no_think')
    ->withUser('What is 2+2?');

$response = $model->chat($prompt);

$response->hasReasoning();   // true (empty block)
$response->reasoning();      // "" (empty string)
$response->answer();         // "2 + 2 equals 4."

This is Qwen3-specific. Other reasoning models (DeepSeek R1 included) may use a different switch or offer none at all. Check the model card.

Feeding history back

When building multi-turn conversations against a reasoning model, feed Response::answer() back as the assistant’s reply, not Response::text():

$conversation = $conversation->withAssistant($response->answer());
//                                          ^^^^^^^^^^^^^^^^^^^
//                                          not ->text()

text() includes the <think>…</think> block. Adding it to the conversation means the model sees its own reasoning on the next turn and tends to treat it as instruction rather than history — output quality drops fast.

This is the single most-common mistake when wiring up reasoning models in ext-infer. See Multi-turn chat for the full pattern.

Performance note

Reasoning models spend many tokens on their internal monologue. A typical Qwen3-0.6B answer to “what is 2+2?” generates ~150 tokens of thinking before the 5-token answer. That’s an order of magnitude more work than a non-reasoning model would do for the same question.

If latency matters more than the highest-quality answer:

  • Use /no_think (Strategy B above) to skip the monologue.
  • Pick a non-reasoning model — Llama 3.x Instruct, Mistral Instruct, Qwen 2.5 Instruct (not Qwen3) all chat without thinking out loud.

See Performance tuning for more knobs.

RAG over markdown

Retrieval-Augmented Generation: instead of asking the model what it knows (and getting whatever its training data captured, possibly incorrectly), embed your own documents into a vector store, retrieve the most relevant ones at query time, and feed them to the model as context. The model answers from your data.

This recipe walks through the smallest practical version: a folder of markdown files, indexed once into sqlite-vec, queried on demand.

Prerequisites

  • An embedding model — Qwen3-Embedding-0.6B works well.
  • A chat model — Qwen3-7B-Instruct or similar.
  • The sqlite-vec extension, loadable into PHP’s PDO SQLite driver (PHP 8.4+ provides Pdo\Sqlite::loadExtension()) or usable through the sqlite3 CLI tools.

# macOS — Homebrew has it
brew install asg017/sqlite-vec/sqlite-vec
# Linux — see the sqlite-vec README for distro packages

Schema

A single table holds documents and their embeddings:

CREATE TABLE IF NOT EXISTS docs (
    id    INTEGER PRIMARY KEY AUTOINCREMENT,
    path  TEXT UNIQUE NOT NULL,
    body  TEXT NOT NULL
);

-- sqlite-vec virtual table for k-nearest-neighbor search.
-- 1024 dimensions matches Qwen3-Embedding-0.6B.
CREATE VIRTUAL TABLE IF NOT EXISTS doc_vecs USING vec0(
    id    INTEGER PRIMARY KEY,
    embed FLOAT[1024]
);

Indexing

Walk a directory, embed each file, persist:

declare(strict_types=1);

use Displace\Infer\Model;

$embedder = Model::load('models/Qwen3-Embedding-0.6B-Q8_0.gguf', [
    'embedding' => true,
]);

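// Loading a SQLite extension through PDO needs PHP 8.4+ (Pdo\Sqlite::loadExtension()).
// sqliteCreateFunction() registers PHP callbacks and cannot load sqlite-vec.
$pdo = new Pdo\Sqlite('sqlite:rag.db');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// The library path is an assumption; see the sqlite-vec docs for your platform.
$pdo->loadExtension('/usr/local/lib/vec0.so');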

$pdo->exec(file_get_contents('schema.sql'));

$insertDoc = $pdo->prepare(
    'INSERT INTO docs (path, body) VALUES (:path, :body)
     ON CONFLICT(path) DO UPDATE SET body = excluded.body
     RETURNING id'
);
$insertVec = $pdo->prepare(
    'INSERT OR REPLACE INTO doc_vecs (id, embed) VALUES (:id, :embed)'
);

$root = $argv[1] ?? './notes';
foreach (new RecursiveIteratorIterator(new RecursiveDirectoryIterator($root)) as $f) {
    if ($f->getExtension() !== 'md') {
        continue;
    }
    $body = file_get_contents($f->getPathname());
    $insertDoc->execute([':path' => $f->getPathname(), ':body' => $body]);
    $id = (int) $insertDoc->fetchColumn();

    // Pre-normalize so search is a dot product.
    $vector = $embedder->embed($body)->normalize()->vector();

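    // sqlite-vec reads raw float32 bytes only when they arrive as a BLOB.
    // execute([...]) binds strings as TEXT, so bind explicitly instead.
    $insertVec->bindValue(':id', $id, PDO::PARAM_INT);
    $insertVec->bindValue(':embed', pack('f*', ...$vector), PDO::PARAM_LOB);
    $insertVec->execute();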

    echo "indexed: {$f->getPathname()} ({$id})\n";
}

$embedder->close();

Run once to build the index, again whenever your notes change. For larger corpora, chunk each file into ~500-token sections and embed each chunk separately; chunk-level vectors retrieve more precisely than whole-file vectors.
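
A minimal chunker might look like the sketch below. It is a hypothetical helper, not part of ext-infer: it splits on blank lines and groups paragraphs up to a word budget. Counting words instead of tokens is an assumption of this example (the 350-word default is a rough stand-in for ~500 tokens of English text), and it does not respect code blocks or headers.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical chunker: groups paragraphs until a word budget is reached.
 * Word count is a crude stand-in for tokens; a real tokenizer will differ.
 *
 * @return list<string>
 */
function chunkMarkdown(string $body, int $maxWords = 350): array
{
    $paragraphs = preg_split('/\R{2,}/', trim($body), -1, PREG_SPLIT_NO_EMPTY);
    $chunks  = [];
    $current = [];
    $words   = 0;

    foreach ($paragraphs as $paragraph) {
        $n = count(preg_split('/\s+/', $paragraph, -1, PREG_SPLIT_NO_EMPTY));
        // Close the current chunk before it would overflow the budget.
        if ($current !== [] && $words + $n > $maxWords) {
            $chunks[] = implode("\n\n", $current);
            $current  = [];
            $words    = 0;
        }
        $current[] = $paragraph;
        $words    += $n;
    }
    if ($current !== []) {
        $chunks[] = implode("\n\n", $current);
    }
    return $chunks;
}

$sample = "one two three\n\nfour five\n\nsix seven eight nine";
print_r(chunkMarkdown($sample, maxWords: 5));
```

A single paragraph longer than the budget becomes its own oversized chunk. With chunking in place, the docs table would need one row per chunk rather than one per file.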

Retrieval + generation

declare(strict_types=1);

use Displace\Infer\Model;
use Displace\Infer\Prompt;

$embedder = Model::load('models/Qwen3-Embedding-0.6B-Q8_0.gguf', [
    'embedding' => true,
]);
$chat = Model::load('models/Qwen3-7B-Instruct-Q4_K_M.gguf');

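// PHP 8.4+: Pdo\Sqlite can load SQLite extensions.
$pdo = new Pdo\Sqlite('sqlite:rag.db');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// vec_distance_cosine() only exists once sqlite-vec is loaded on this connection.
// The library path is an assumption; adjust it for your install.
$pdo->loadExtension('/usr/local/lib/vec0.so');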

$query = $argv[1] ?? 'What did I decide about the migration?';

// 1. Embed the query.
$qvec = $embedder->embed($query)->normalize()->vector();

// 2. Top-k retrieval via sqlite-vec.
$stmt = $pdo->prepare(<<<SQL
    SELECT
        docs.path,
        docs.body,
        vec_distance_cosine(doc_vecs.embed, :qvec) AS distance
    FROM doc_vecs
    JOIN docs ON docs.id = doc_vecs.id
    ORDER BY distance ASC
    LIMIT 4
SQL);
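// Bind the packed float32 bytes as a BLOB; execute([...]) would bind them as TEXT.
$stmt->bindValue(':qvec', pack('f*', ...$qvec), PDO::PARAM_LOB);
$stmt->execute();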
$hits = $stmt->fetchAll(PDO::FETCH_ASSOC);

// 3. Build a context-injected prompt.
$context = '';
foreach ($hits as $i => $hit) {
    $context .= sprintf("--- Document %d (%s) ---\n%s\n\n", $i + 1, $hit['path'], $hit['body']);
}

$prompt = Prompt::system(<<<SYS
You answer questions strictly using the provided documents.
If the documents don't contain the answer, say so — do not invent.
Cite the document number when you quote.
SYS)
    ->withUser("Documents:\n\n{$context}\n\nQuestion: {$query}");

// 4. Ask.
$response = $chat->chat($prompt, maxTokens: 1024, temperature: 0.3);

echo $response->answer(), PHP_EOL;

$embedder->close();
$chat->close();

What good output looks like

For a corpus of personal notes, a query like “what did I decide about the migration?” returns:

Based on Document 2 (notes/migration.md), you decided to defer the
schema change to Q3 in favor of shipping the redirect layer first.
The reasoning cited there was that the redirect layer was lower-risk
and would surface the migration's actual hot paths before you
committed to the column rename.

If the corpus doesn’t contain the answer:

The provided documents don't address the migration decision directly.
Document 1 mentions a planned schema change but doesn't record what
was decided. I'd need more context to answer.

That “I don’t know” behavior is what the system prompt enforces. Models will happily make up plausible answers without it.

Knobs worth tuning

| Knob | Effect |
|---|---|
| Top-k (the LIMIT 4 above) | More context = better answers but slower + risks the model conflating unrelated documents. 3–5 is a good default. |
| Chunk size at index time | Whole-file is simple but coarse. 500-token chunks give finer retrieval at the cost of ~10x more vectors. |
| temperature on the chat model | Set low (0.0–0.3) for factual answers; the model should be quoting, not improvising. |
| System prompt strictness | “Cite documents” + “say so if unknown” is the difference between RAG and a model that just sometimes incorporates your context. |

What this recipe doesn’t cover

  • Reranking — top-k by embedding similarity is fast but coarse. See the Semantic search recipe for the chat-model-as-reranker pattern.
  • Streaming responses: Model::chat() is currently synchronous. See the roadmap.
  • Production-grade chunking — markdown-aware splitting that respects code blocks, headers, lists. Worth a library; not in scope for ext-infer itself.

Worker pools

LLM inference is slow: tens of milliseconds at best, often seconds. Running it inline in an FPM worker means that worker is unavailable for any other request until the model is done. For any non-trivial deployment, you want a pool of workers — process-based, thread-based, or queue-based — that absorbs the latency without starving the rest of your app.

ext-infer is designed to slot into all three patterns.

Pattern 1 — FPM workers (process-based)

The simplest production setup: PHP-FPM with pm.max_children set high enough to absorb concurrent slow inference requests.

; php-fpm.d/www.conf
pm = dynamic
pm.max_children = 16
pm.start_servers = 4
pm.min_spare_servers = 2
pm.max_spare_servers = 8
pm.process_idle_timeout = 60s

Each FPM worker is its own OS process. They each load their own Model once at warm-up and reuse it for the lifetime of the worker. The model weights are mmap’d, so the OS shares physical memory across workers — 16 workers loading the same 4 GB model use ~4 GB of RAM total, not 64.

// Shared service container — boot once per worker.
$model = Displace\Infer\Model::load($cfg['model_path']);

// In your request handler:
$response = $model->chat($prompt, maxTokens: 512);

The downside: each worker can handle one inference at a time. If you hit pm.max_children concurrent requests, the (max_children + 1)st request waits. Bump max_children if you have the RAM (the model is shared via mmap; only the KV cache scales with concurrency); push inference to a queue if you don’t.

Sizing

A rough sizing heuristic for FPM with ext-infer:

max_children ≈ (RAM_budget - model_size) / per_request_memory

Where per_request_memory is the KV cache footprint plus PHP’s working set — usually 100–500 MB per worker depending on nCtx.
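
A worked instance of the heuristic, with made-up figures. maxChildren() is a hypothetical helper for this example and all sizes are in MB:

```php
<?php
declare(strict_types=1);

/** Hypothetical helper applying the heuristic above; all sizes in MB. */
function maxChildren(int $ramBudgetMb, int $modelSizeMb, int $perRequestMb): int
{
    if ($perRequestMb <= 0) {
        throw new InvalidArgumentException('per-request memory must be positive');
    }
    // Round down, and never report a negative worker count.
    return max(0, intdiv($ramBudgetMb - $modelSizeMb, $perRequestMb));
}

// Example figures (made up): 16 GB budget, 4.5 GB model, 300 MB per worker.
echo maxChildren(16_384, 4_608, 300), PHP_EOL;
```

With those figures, (16384 - 4608) / 300 rounds down to 39 workers. Measure per_request_memory on your own workload before trusting the result, since it grows with nCtx.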

Pattern 2 — Job queue (process-based, decoupled)

For inference that takes long enough that you don’t want it in the request path at all:

// In the request handler — enqueue, return immediately.
$jobId = $queue->push(InferJob::class, [
    'prompt'  => $prompt,
    'options' => ['maxTokens' => 512],
]);
return new JsonResponse(['job_id' => $jobId, 'status' => 'queued']);

// Client polls /jobs/{id} until status = 'done'.
// In your queue worker — long-lived, model loaded once at boot.
final class InferWorker
{
    public function __construct(private \Displace\Infer\Model $model) {}

    public function process(InferJob $job): InferResult
    {
        $r = $this->model->chat($job->prompt, ...$job->options);
        return new InferResult($r->answer(), $r->finishReason());
    }
}

Any queue runner works — Symfony Messenger, Laravel Horizon, ReactPHP’s react/event-loop, a bespoke pcntl_fork + proc_open script. The pattern is the same: one Model::load() per worker process, reuse across many jobs.
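
The ...$job->options call in the worker relies on PHP 8.0+ mapping string keys to parameter names when an array is unpacked into a call. The sketch below shows that behavior with a stand-in function; fakeChat() is hypothetical and only borrows the parameter names of Model::chat().

```php
<?php
declare(strict_types=1);

// Stand-in with the same named parameters as Model::chat(); not the real method.
function fakeChat(
    string $prompt,
    int $maxTokens = 128,
    int $nCtx = 2048,
    float $temperature = 0.0,
    int $seed = 1234,
): array {
    return compact('prompt', 'maxTokens', 'nCtx', 'temperature', 'seed');
}

// Options as they might arrive from a decoded job payload.
$options = ['maxTokens' => 512, 'temperature' => 0.7];

// String keys are matched to parameter names; the rest keep their defaults.
$call = fakeChat('hello', ...$options);
print_r($call);

// A key that matches no parameter raises an Error, so typos surface immediately.
$message = null;
try {
    fakeChat('hello', ...['maxToken' => 512]);
} catch (Error $e) {
    $message = $e->getMessage();
    echo get_class($e), ': ', $message, PHP_EOL;
}
```

Unlike the Model::load() options array, misspelled named arguments fail loudly, so job payloads are worth validating at enqueue time rather than in the worker.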

This pattern shines when:

  • Inference latency is unpredictable and you don’t want to hold HTTP connections open.
  • You want to scale inference workers independently of web workers.
  • You want to route inference traffic across heterogeneous workers (CPU-only on cheap nodes, GPU-equipped on others).

Pattern 3 — ZTS + parallel (thread-based)

For latency-sensitive workloads where the IPC overhead of pattern 2 is too much, ext-infer supports concurrent calls within a single process under ZTS PHP with the parallel extension.

This works because ext-infer is thread-safe by design:

  • LlamaBackend is a Sync-guarded process-global singleton.
  • LlamaModel (the weights) is immutable after load; llama.cpp explicitly supports many contexts on one model.
  • Each chat() / raw() / embed() call builds its own per-call LlamaContext and drops it after.

Two threads calling Model::chat() simultaneously on the same handle is the supported, intended shape.

use Displace\Infer\Model;
use Displace\Infer\Prompt;
use parallel\Runtime;

// Load the model once in the main thread.
$model = Displace\Infer\Model::load('models/qwen3.gguf');

// Spin up a pool of runtimes.
$runtimes = array_map(fn() => new Runtime(), range(1, 4));

// Dispatch concurrent inferences.
$futures = [];
foreach ($prompts as $i => $prompt) {
    $rt = $runtimes[$i % 4];
    $futures[$i] = $rt->run(function (Model $m, Prompt $p) {
        return $m->chat($p, maxTokens: 512)->answer();
    }, [$model, $prompt]);
}

// Collect.
$answers = array_map(fn($f) => $f->value(), $futures);

$model->close();

Caveats

  • ZTS PHP is uncommon. Most distros ship NTS by default; you’ll have to build ZTS PHP from source (./configure --enable-zts) or use a ZTS-shipping Docker image. PIE’s pre-built binaries target NTS for v0.1; ZTS binaries are on the roadmap.
  • parallel itself requires ZTS. Can’t use it on a standard NTS install.
  • CI doesn’t exercise this yet. ZTS support is enabled in composer.json because the code is thread-safe by construction — but the maintainers have not yet run multi-threaded stress tests in CI. Treat it as “should work, please report bugs” until that changes. See Threading & ZTS for the current state.

Choosing between the patterns

| Concern | FPM workers | Job queue | ZTS + parallel |
|---|---|---|---|
| Easy to set up | ✅ trivial | ⚠️ some IPC | ⚠️ ZTS build |
| Holds HTTP connection during inference | yes | no | yes |
| Survives PHP being NTS | yes | yes | no |
| Shares one model across all concurrency | via mmap | per-worker | within process |
| Scales to many concurrent inferences | ⚠️ workers eat RAM | ✅ horizontal | ⚠️ one process |
| Production-tested in ext-infer | | | ⚠️ unexercised |

For most teams, FPM with a generous max_children is the right starting point. Move to a queue when latency variance gets too high for the request path. Reach for parallel last, when you’ve measured that IPC overhead is the bottleneck.

API surface

The complete public PHP API in one place. Every method, every argument, every return type. Read this when you know what you’re looking for and just want the signature; read the Guide when you want to understand why.

For an authoritative copy in PHP-stub form (consumed by IDEs and static analyzers like PHPStan), see stubs/infer.stubs.php.

Displace\Infer\Model

final class Model
{
    public static function load(
        string $path,
        array  $options = [],
    ): self;

    public function chat(
        \Displace\Infer\Prompt $prompt,
        int   $maxTokens   = 128,
        int   $nCtx        = 2048,
        float $temperature = 0.0,
        int   $seed        = 1234,
    ): \Displace\Infer\Response;

    public function raw(
        string $prompt,
        int    $maxTokens   = 128,
        int    $nCtx        = 2048,
        float  $temperature = 0.0,
        int    $seed        = 1234,
        bool   $addBos      = true,
    ): string;

    public function embed(
        string $text,
    ): \Displace\Infer\Embedding;

    public function close(): void;
}

new Model() throws — use Model::load(). close() is idempotent (safe to call from finally blocks).

See Choosing a model, Chat completions, Raw completions, Embeddings, and Options reference.

Displace\Infer\Prompt

final class Prompt
{
    public static function system(string $content): self;
    public static function user(string $content): self;

    public function withSystem(string $content): self;
    public function withUser(string $content): self;
    public function withAssistant(string $content): self;

    /** @return list<\Displace\Infer\Message> */
    public function messages(): array;

    public function lastRole(): ?string;
    public function count(): int;
    public function isEmpty(): bool;
}

Immutable. new Prompt() throws — use a factory. See Prompts.

Displace\Infer\Message

final class Message
{
    public function role(): string;    // 'system' | 'user' | 'assistant'
    public function content(): string;
}

Read-only. Constructed only by Prompt; new Message() throws.

Displace\Infer\Response

final class Response
{
    public function text(): string;
    public function reasoning(): ?string;
    public function answer(): string;
    public function hasReasoning(): bool;
    public function finishReason(): string;  // 'eos' | 'length' | 'stop'
    public function tokensGenerated(): int;
}

Read-only. Constructed only by Model::chat(); new Response() throws. See Chat completions.

Displace\Infer\Embedding

final class Embedding
{
    /** @return list<float> */
    public function vector(): array;

    public function dimensions(): int;
    public function norm(): float;
    public function normalize(): self;
    public function cosineSimilarity(\Displace\Infer\Embedding $other): float;
}

Read-only. Constructed only by Model::embed(); new Embedding() throws. See Embeddings.
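
As a reference for what norm(), normalize(), and cosineSimilarity() conventionally compute, here is the textbook math on plain PHP arrays. This is a sketch that assumes the extension follows the standard definitions (Euclidean norm, dot product divided by the product of the norms); the function names are illustrative and not part of the API, and the real cosineSimilarity() raises InferenceException on a dimension mismatch rather than the exception used here.

```php
<?php

/** Euclidean (L2) norm of a vector. */
function l2_norm(array $v): float
{
    return sqrt(array_sum(array_map(fn(float $x): float => $x * $x, $v)));
}

/** Returns a new unit-length vector; the input is left untouched. */
function l2_normalize(array $v): array
{
    $norm = l2_norm($v);
    if ($norm == 0.0) {
        throw new InvalidArgumentException('cannot normalize a zero vector');
    }
    return array_map(fn(float $x): float => $x / $norm, $v);
}

/** Cosine similarity: dot(a, b) / (|a| * |b|). */
function cosine_similarity(array $a, array $b): float
{
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('dimension mismatch');
    }
    $dot = 0.0;
    foreach ($a as $i => $x) {
        $dot += $x * $b[$i];
    }
    return $dot / (l2_norm($a) * l2_norm($b));
}

$a = [3.0, 4.0];
$b = [4.0, 3.0];

printf("norm(a) = %.1f\n", l2_norm($a));                // norm(a) = 5.0
printf("cosine  = %.2f\n", cosine_similarity($a, $b));  // cosine  = 0.96
```

For vectors that are already normalized, the denominator is 1 and cosine similarity reduces to a plain dot product, which is why normalizing once before storage is a common choice for semantic search.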

Exception hierarchy

\RuntimeException
└── Displace\Infer\InferException
    ├── Displace\Infer\ModelLoadException
    └── Displace\Infer\InferenceException

InferException extends PHP’s built-in \RuntimeException, so any generic catch (\RuntimeException $e) clause sees ext-infer errors. See Exceptions for which methods raise which subclass.

Conventions

  • Direct construction is refused on Prompt, Message, Response, Embedding, and Model. Each one throws InferException from its __construct with a hint at the right factory. This is so an arbitrary new Embedding() can’t lie about which model produced it.
  • All with* methods on Prompt return a new instance. They never mutate. Prompt is the only class that exposes the “build by chaining” pattern, although Embedding::normalize() likewise returns a new instance rather than mutating its receiver.
  • Sampling args are named, never positional. Model::chat() and Model::raw() use PHP 8 named arguments (maxTokens: 256, temperature: 0.7) — not an options array. Load options are an array because they’re rare and compose with config-from-disk patterns.
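
The practical consequence of the second convention (with* never mutates) is that the return value must be captured. The sketch below uses a hypothetical stand-in class, Draft, that follows the same rule so the snippet runs without the extension; it is not the real Prompt.

```php
<?php

/** Hypothetical stand-in that follows the same "with* returns a copy" rule. */
final class Draft
{
    /** @param list<array{role: string, content: string}> $messages */
    private function __construct(private array $messages) {}

    public static function user(string $content): self
    {
        return new self([['role' => 'user', 'content' => $content]]);
    }

    public function withAssistant(string $content): self
    {
        // Build a new instance instead of touching $this.
        return new self([...$this->messages, ['role' => 'assistant', 'content' => $content]]);
    }

    public function count(): int
    {
        return count($this->messages);
    }
}

$base = Draft::user('Hello');

$base->withAssistant('Hi!');            // result discarded: $base is unchanged
$next = $base->withAssistant('Hi!');    // correct: keep the returned instance

printf("base=%d next=%d\n", $base->count(), $next->count());   // base=1 next=2
```

The same property makes it safe to branch one conversation prefix into several continuations without defensive copying.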

Exceptions

ext-infer raises exceptions for every error condition — no silent false returns, no error codes. The hierarchy is small enough that you can catch precisely or broadly depending on what you’re after.

Hierarchy

\RuntimeException
└── Displace\Infer\InferException
    ├── Displace\Infer\ModelLoadException
    └── Displace\Infer\InferenceException

  • InferException extends PHP’s \RuntimeException. Catching \RuntimeException in generic top-level handlers (e.g. a PSR-15 middleware) sees every ext-infer error.
  • ModelLoadException is raised exclusively from Model::load().
  • InferenceException is raised from Model::chat(), Model::raw(), Model::embed(), and Embedding::cosineSimilarity().
  • InferException itself (the base class, not just an instance of a subclass) is raised for “this method should never have been called” errors — see Direct construction.

Which method raises what

| Method | Class | Common causes |
|---|---|---|
| Model::load() | ModelLoadException | Missing file, malformed GGUF, backend init failure. |
| Model::load() | InferException | Invalid option type/value (e.g. pooling set to "weighted"). |
| Model::chat() | InferenceException | Model closed, no chat template, decode failure, prompt over nCtx. |
| Model::raw() | InferenceException | Model closed, decode failure, prompt over nCtx. |
| Model::embed() | InferenceException | Model closed, model not loaded with embedding: true, decode failure. |
| Embedding::cosineSimilarity() | InferenceException | Dimension mismatch between the two embeddings. |
| new Model() / new Prompt() / new Message() / new Response() / new Embedding() | InferException | Direct construction is refused; use the appropriate factory. |

Direct construction

Model, Prompt, Message, Response, and Embedding all refuse direct new. Each throws InferException (the base class) with a hint pointing at the right factory:

new Embedding();
// Displace\Infer\InferException:
//   Displace\Infer\Embedding is produced by Model::embed();
//   do not instantiate directly

This is deliberate: a new Embedding() from PHP code could lie about which model produced it and what pooling strategy was applied — silent mistakes that are hard to debug later. Forcing factory construction keeps the invariants tight.

Catching strategies

Catch broadly at the top

For a request handler that wants to convert any ext-infer failure into a 5xx response:

try {
    $reply = $model->chat($prompt);
} catch (\Displace\Infer\InferException $e) {
    $log->error('inference failed', ['error' => $e->getMessage()]);
    // Response here is your framework's HTTP response class, not Displace\Infer\Response.
    return new Response(500, [], 'Inference temporarily unavailable.');
}

Distinguish load failures from inference failures

For a CLI tool that wants different exit codes:

try {
    $model = Model::load($path);
} catch (\Displace\Infer\ModelLoadException $e) {
    fwrite(STDERR, "model: " . $e->getMessage() . PHP_EOL);
    exit(2);
}

try {
    $r = $model->chat($prompt);
} catch (\Displace\Infer\InferenceException $e) {
    fwrite(STDERR, "inference: " . $e->getMessage() . PHP_EOL);
    exit(3);
}

Retry vs surface

InferenceException covers two flavors of failure:

  • Transient: out-of-memory under load, e.g. use_mlock plus a large prompt. Often resolved by reducing nCtx or splitting the work.
  • Permanent: model has no chat template, prompt has null bytes, invalid option. Retrying makes no sense.

The message string is the only signal you have today; structured error codes are on the roadmap. For now, a pragmatic split:

try {
    $r = $model->chat($prompt, maxTokens: $budget);
} catch (\Displace\Infer\InferenceException $e) {
    if (str_contains($e->getMessage(), 'n_ctx')) {
        // prompt too long — surface to caller, don't retry
        throw $e;
    }
    // other inference failure — log + maybe retry
    $log->warning('chat failed, retrying once', ['error' => $e->getMessage()]);
    $r = $model->chat($prompt, maxTokens: $budget);
}
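
The pragmatic split above generalizes to a small bounded-retry helper. A sketch only: retry_inference() and the list of "permanent" message fragments are hypothetical. Only the 'n_ctx' fragment comes from the example above; 'chat template' is a placeholder to replace with messages you have actually observed. \RuntimeException is used as a stand-in so the snippet runs without the extension.

```php
<?php

/**
 * Hypothetical helper: runs $op up to $maxAttempts times, rethrowing
 * immediately when the message matches a known-permanent failure.
 *
 * @param callable(): mixed $op
 * @param list<string>      $permanentFragments
 */
function retry_inference(callable $op, int $maxAttempts, array $permanentFragments): mixed
{
    for ($attempt = 1; ; $attempt++) {
        try {
            return $op();
        } catch (\RuntimeException $e) {   // InferenceException in real code
            foreach ($permanentFragments as $fragment) {
                if (str_contains($e->getMessage(), $fragment)) {
                    throw $e;              // retrying cannot help
                }
            }
            if ($attempt >= $maxAttempts) {
                throw $e;                  // retry budget exhausted
            }
        }
    }
}

// Stand-in operation: fails once with a transient-looking error, then succeeds.
$calls = 0;
$result = retry_inference(
    function () use (&$calls) {
        if (++$calls === 1) {
            throw new \RuntimeException('decode failed');
        }
        return 'ok';
    },
    maxAttempts: 2,
    permanentFragments: ['n_ctx', 'chat template'],
);

echo "{$result} after {$calls} calls\n";   // ok after 2 calls
```

Keeping the fragment list in one place means there is a single spot to update once structured error codes land.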

Always-safe patterns

Model::close() is idempotent — calling it on an already-closed model is a no-op. Safe inside finally:

$model = Model::load($path);
try {
    return $model->chat($prompt);
} finally {
    $model->close();
}

After close(), every other method on that Model raises InferenceException with "model has been closed".

Environment variables

The extension reads exactly one environment variable today. We’ll add more as they earn their keep; the conservative approach is to keep configuration in PHP (named arguments, load options) rather than sprinkled across the environment.

EXT_INFER_LOG

Restores llama.cpp’s verbose stderr logging, which is silenced by default.

| Value | Effect |
|---|---|
| (unset) | llama.cpp logs are silenced. This is the default. |
| Any value | llama.cpp logs are passed through to stderr verbatim. |

EXT_INFER_LOG=1 php hello.php

Why silence by default?

A single Model::load() + chat() pair against a typical GGUF produces several hundred lines of stderr — model metadata, KV-cache sizing, graph reservation, attention layout, sampler config, and more. For a CLI tool drilling into a problem it’s useful; for a PHP extension running inside a request, it’s structured-log poison.

When to enable it

  • Diagnosing a ModelLoadException. The verbose log dumps the GGUF header before failing, which usually points at the cause (wrong architecture, wrong quant, truncated file).
  • Diagnosing a slow load. The log shows where the time goes — reading from disk, mmap setup, weight copy.
  • Reporting an issue. The first thing maintainers will ask for is the verbose log; capture it once with EXT_INFER_LOG=1 and paste.

How it works

The extension hooks llama_log_set at backend init time, replacing llama.cpp’s default callback with a no-op. The hook is process-global — once installed, it covers every subsequent call. EXT_INFER_LOG is checked only at backend init (the first time Model::load() is called); changing the variable mid-process has no effect.

Reserved for future use

These names are not consumed by the extension today but may be in future versions. Avoid using them as application env vars to keep your forward-upgrade path clean:

  • EXT_INFER_DEFAULT_NCTX
  • EXT_INFER_DEFAULT_TEMPERATURE
  • EXT_INFER_BACKEND (CPU / Metal / CUDA selection at runtime)

If you want any of these to land sooner rather than later, open an issue with the use case.

Compatibility matrix

PHP versions

| Version | Status | Notes |
|---|---|---|
| 8.3 | ✅ supported | Security-only upstream through end of 2027. |
| 8.4 | ✅ supported | Active support. |
| 8.5 | ✅ supported | Current release. |
| 8.2 and earlier | ❌ not supported | composer.json declares php: ^8.3. |

Every released binary is built against a specific PHP minor. A binary built for PHP 8.4 will not load into PHP 8.5 or 8.3. PIE handles this automatically (it picks the right tarball); manual installs need to match versions explicitly.

Operating systems

| Platform | Status | Notes |
|---|---|---|
| macOS arm64 | ✅ supported | Apple Silicon. Tested on macOS 14+. |
| macOS x86_64 | ⚠️ not in release matrix | Builds from source. We don’t ship binaries. |
| Linux x86_64 (glibc) | ✅ supported | Ubuntu 22.04+, Debian 12+, RHEL 9+. Most modern distros. |
| Linux arm64 (glibc) | ✅ supported | Ubuntu 24.04 arm64, Debian 12 arm64, AWS Graviton. |
| Linux musl (Alpine) | ⚠️ builds from source | .cargo/config.toml has the right crt-static opt-out; no released binary. |
| FreeBSD / OpenBSD | ⚠️ builds from source | Untested but should work; the build script treats any platform other than Linux or macOS as Linux. |
| Windows | ❌ excluded | os-families-exclude: ["windows"] in composer.json. Out of scope for v0.1. |

Threading

ext-infer is thread-safe by design — the LlamaBackend singleton is guarded by a Sync mutex, the underlying LlamaModel’s weights are read-only after load (llama.cpp explicitly supports many contexts on one model), and each chat() / raw() / embed() call builds its own per-call LlamaContext. Two threads calling Model::chat() concurrently on the same handle is the supported, intended shape.

| PHP build | Status | Notes |
|---|---|---|
| NTS | ✅ supported, the default | What every release binary targets today. |
| ZTS | ✅ supported (support-zts: true in composer.json) | Not yet exercised in CI. See Threading & ZTS. |

Acceleration backends

| Backend | Status | Notes |
|---|---|---|
| CPU (default) | ✅ supported, the default | Portable, no hardware requirements. |
| Apple Metal | ⚠️ opt-in via cargo feature | make release FEATURES=metal. See Apple Metal. |
| CUDA (NVIDIA GPU) | ❌ not yet | llama-cpp-2 supports it via a cargo feature; we haven’t exposed or tested it. |
| ROCm / Vulkan | ❌ not yet | Same: supported upstream, not surfaced. |

If you want CUDA or other GPU acceleration sooner rather than later, open an issue describing your use case — surfacing the feature is small work; testing it across the GPU landscape is the hard part.

Tested model families

What the maintainers have actually exercised end-to-end. Other GGUF-supported families almost certainly work; this is the “we’ve seen it produce sensible output” list.

| Family | Used for |
|---|---|
| Qwen3 (Instruct) | Chat completions, reasoning splitting. |
| Qwen3-Embedding | Embeddings, cosine similarity. |
| Llama 3 / 3.1 / 3.2 | Chat completions. No reasoning. |
| Mistral | Chat completions. |
| BGE / E5 / GTE | Embeddings. |

Versioning policy

  • Pre-1.0 (0.x.y), breaking changes happen between minors (0.1.x → 0.2.x), not patches.
  • Once v1.0.0 ships, the class / method / argument surface is frozen. New features land additively; behavioral changes that affect existing callers wait for the next major.
  • See RELEASE.md for the cut-a-release flow.

Reporting compatibility issues

If you hit a “should work but doesn’t” combination on this matrix, the issue template asks for:

  • PHP version (php --version)
  • OS / arch (uname -a)
  • libc (Linux: ldd --version | head -1)
  • ZTS or NTS (php -i | grep 'Thread Safety')
  • Whether the extension was installed via PIE, make install, or loaded with -d extension=…

Three of those five are usually enough to triage.
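
Most of that checklist can be gathered from inside PHP. A small sketch using only built-in functions and constants; the extension name infer is taken from the extension=infer line used elsewhere in this guide, and libc is left to the ldd command above because PHP does not expose it.

```php
<?php

// Collect the triage facts the issue template asks for.
$report = [
    'php_version'   => PHP_VERSION,
    'os_arch'       => php_uname('s') . ' ' . php_uname('r') . ' ' . php_uname('m'),
    'thread_safety' => PHP_ZTS ? 'ZTS' : 'NTS',
    'infer_loaded'  => extension_loaded('infer') ? 'yes' : 'no',
    'ini_file'      => php_ini_loaded_file() ?: '(none)',
];

foreach ($report as $key => $value) {
    printf("%-14s %s\n", $key . ':', $value);
}
```

Run it with the same php binary and the same -d flags you use for the failing script, so the report reflects the configuration that actually misbehaves.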

Threading & ZTS

ext-infer is thread-safe by design. This page documents what that actually means: where the synchronization happens, what the runtime expectations are, and where the rough edges still are.

The thread-safety story, top to bottom

1. LlamaBackend is a Sync-guarded singleton

llama.cpp’s LlamaBackend::init() is process-global state. Initializing it twice is undefined behavior; not initializing it at all means no inference. ext-infer resolves this with:

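use std::sync::{Mutex, OnceLock};

// Process-global backend handle, populated once by the first Model::load().
static BACKEND: OnceLock<LlamaBackend> = OnceLock::new();
// Serializes the first-time LlamaBackend::init() across threads.
static BACKEND_INIT: Mutex<()> = Mutex::new(());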

The first Model::load() call (from any thread) acquires the mutex, checks the OnceLock, calls LlamaBackend::init() if needed, and publishes the result. Every subsequent call sees a populated OnceLock and returns immediately without re-acquiring. The mutex is contended only during cold startup.

OnceLock<T> is Sync as long as T: Send + Sync, which LlamaBackend is.

2. LlamaModel weights are immutable after load

llama.cpp explicitly supports running multiple contexts in parallel against a single loaded model. The weights are read-only after load_from_file returns; only the per-context state (KV cache, sampler state) mutates during inference.

This is what makes the “load once, use from many threads” pattern work without any locking on the model itself.

3. Per-call LlamaContext

Model::chat(), Model::raw(), and Model::embed() each build a fresh LlamaContext for the duration of the call and drop it on the way out. Two threads calling chat() simultaneously get two independent contexts that share the same underlying weights via references.

// Inside run_completion:
let ctx_params = LlamaContextParams::default().with_n_ctx(Some(n_ctx));
let mut ctx = model.new_context(backend, ctx_params)?;
// ... decode, sample, decode, sample ...
// ctx dropped at function exit

No state survives the call. No cleanup is required. No two threads ever touch the same LlamaContext.

4. Model::close() is the one &mut self method

PHP’s runtime serializes calls into the same object method via its own object lock, so close() from one thread while another calls chat() should be safe by the runtime’s invariants. It is, however, the one place where the Rust code mutates the Model itself (self.inner = None). The worst case is the use-after-close error, which is what close() is supposed to provoke anyway.

When you actually get concurrency

Three deployment shapes use this thread-safety:

  • PHP-FPM workers (process-based) — each worker is independent; the thread-safety story doesn’t matter, but the mmap-sharing story does. See Worker pools.
  • ZTS PHP + parallel (thread-based) — one PHP process, multiple OS threads, each calling chat() on a shared Model. This is what the thread-safety story is for.
  • Swoole / ReactPHP coroutines (single-threaded but context-switching) — not actually concurrent at the OS level, so thread-safety isn’t strictly required; you’ll still benefit from the per-call context pattern because no global state survives.

ZTS-specific notes

ZTS (Zend Thread Safe) is a PHP build mode that puts engine globals in thread-local storage (TLS) so multiple PHP interpreters can run in one process. It’s required for pthreads (EOL) and the more modern parallel extension.

Detecting ZTS

php -i | grep 'Thread Safety'
# expected: Thread Safety => enabled

Or from PHP:

if (PHP_ZTS) {
    // ZTS build
}

Installing ZTS PHP

Most distros ship NTS PHP. To get ZTS:

  • Ubuntu / Debian: build from source with ./configure --enable-zts. Some PPAs (ondrej/php) ship a ZTS variant under php{X}.{Y}-zts but coverage is spotty.
  • macOS: Homebrew’s php@* formulas are NTS. Use phpbrew install +zts +parallel or build from source.
  • Docker: official php:*-cli images are NTS. The community silkeh/php images include ZTS variants.

ext-infer v0.1 ships NTS-only release binaries. ZTS users need to build from source. The composer.json declares support-zts: true so a future ZTS release can ship without changing the install story.

Loading ext-infer into ZTS PHP

Same extension=infer line in php.ini, plus parallel if you want threading:

extension=infer.so
extension=parallel.so

A minimal parallel test

<?php
use parallel\Runtime;
use Displace\Infer\Model;
use Displace\Infer\Prompt;

$model = Model::load('models/Qwen3-0.6B-Q8_0.gguf');

$rt1 = new Runtime();
$rt2 = new Runtime();

$f1 = $rt1->run(function (Model $m) {
    return $m->chat(Prompt::user('What is the capital of France?'))->answer();
}, [$model]);
$f2 = $rt2->run(function (Model $m) {
    return $m->chat(Prompt::user('What is the capital of Italy?'))->answer();
}, [$model]);

echo "F: ", $f1->value(), PHP_EOL;
echo "I: ", $f2->value(), PHP_EOL;

$model->close();

If this works, you have concurrent inference. If it crashes — please open an issue with the model name, PHP version, build flags, and the crash output. CI doesn’t exercise this path yet, so user reports are the canary.

Future work

Two threading-related items on the roadmap:

CI exercise for ZTS

Add a parallel-driven stress test to CI. Today the matrix only covers NTS. Adding ZTS will require:

  • Building a ZTS-PHP runner image (the maintainers haven’t picked one yet).
  • Adding a ZTS leg to ci.yml and the release matrix in release.yml.

Reusable session contexts

Today, every chat() call rebuilds the LlamaContext from scratch. That drops the KV cache, so multi-turn conversations re-prefill on every turn. A Session abstraction that owns a long-lived context would let users opt into KV-cache reuse for back-to-back turns of the same conversation. Tracked in PLAN.md.

This wouldn’t change the thread-safety story — each Session would be owned by one thread (or guarded by a mutex if shared) — but it would significantly improve multi-turn performance.

Apple Metal

Metal is Apple’s low-level GPU API. On Apple Silicon hardware (M1 / M2 / M3 / M4), llama.cpp uses Metal to offload weight-matrix multiplications to the integrated GPU, which substantially outpaces the CPU for medium-to-large models.

ext-infer exposes Metal as an opt-in cargo feature. It is not enabled by default — the default build is CPU-only and portable to non-Apple platforms.

When Metal helps

Order-of-magnitude rule of thumb on an M-series Mac:

| Model size | CPU tokens/sec | Metal tokens/sec | Speedup |
|---|---|---|---|
| 0.6B | ~80 | ~120 | 1.5× |
| 3B | ~25 | ~70 | 2.8× |
| 7B | ~12 | ~50 | ~4× |
| 13B+ | (memory-limited) | ~25 | dramatic |

Numbers are rough — they depend on quant level, M-series generation, prompt length, and what else the machine is doing. The pattern is clear though: Metal’s value grows with model size.

For small models on a fast CPU, Metal can actually be slower on the first few tokens because of the shader compilation overhead. If you’re running 600M-param models in batch mode, the CPU build is likely fine.

Enabling Metal

The cargo feature is named metal:

make release FEATURES=metal
make install  FEATURES=metal

Or via raw cargo:

cargo build --release --features metal

The release binary is now Metal-enabled. No runtime flag — Metal is used automatically when the cargo feature is on.

Per-layer offload

The Model::load() option n_gpu_layers controls how many transformer layers are offloaded to the GPU. It defaults to 0 (CPU only); set it to a high number (the model’s total layer count, or just 999 as an “all of them” shortcut) to offload everything:

$model = Model::load($path, [
    'n_gpu_layers' => 999,   // offload all layers to Metal
]);

For models that fit entirely in unified memory, full offload is almost always what you want. For models that don’t fit, partial offload lets you put the hot lower layers on the GPU and keep the upper layers on CPU. Tune empirically; the upstream llama.cpp Metal docs have more.

Why isn’t it the default?

Three reasons we ship CPU-by-default and Metal-by-opt-in for v0.1:

  1. The release matrix builds on the GitHub macos-14 runner. Its hardware revision and MACOSX_DEPLOYMENT_TARGET are not fully pinned, and we haven’t validated that a Metal-enabled binary built there actually loads on every customer Mac.
  2. CI doesn’t test Metal output for correctness. Different precision behavior on GPU vs CPU could surface as different sampler output, and we haven’t caught that drift end-to-end yet.
  3. Cold-start cost. Metal shader compilation adds ~1s to the first inference. Acceptable for long-running workers, awkward for a CLI tool people run once.

Making Metal the default for macos-arm64 release tarballs is on the roadmap once those three concerns are resolved.

Verifying Metal is actually being used

Enable EXT_INFER_LOG and look for Metal-specific lines:

EXT_INFER_LOG=1 php hello.php 2>&1 | grep -i metal | head

You should see something like:

ggml_metal_init: GPU name:   Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 48318.38 MiB

If you see no Metal lines at all, the cargo feature didn’t get applied — re-check the make release FEATURES=metal invocation.

Memory considerations

Apple Silicon has unified memory — the GPU and CPU share the same physical RAM. There is no “host-to-device” copy step like on discrete GPUs. The trade-off is that GPU memory pressure shows up as overall system memory pressure: a 13B model in Metal mode uses ~8 GB of the same RAM your other apps need.

recommendedMaxWorkingSetSize in the log above is what macOS thinks you should keep the GPU footprint under. Loading a model larger than that will work — Metal pages weights in and out as needed — but performance drops sharply.

Cross-platform note

#[cfg(feature = "metal")] only enables Metal on Apple targets. Building with --features metal on Linux is harmless (the feature is a no-op there), but there’s no reason to do it.

For GPU acceleration on non-Apple hardware (CUDA on NVIDIA, ROCm on AMD, Vulkan as a portable option) — the llama-cpp-2 crate supports all three, but ext-infer hasn’t surfaced them as cargo features yet. If you want one, open an issue.

Next

  • Performance tuning — once Metal is on, the next bottleneck is usually nCtx or maxTokens.
  • Choosing a model — Metal opens up larger models you might not have considered.

Performance tuning

A Model::chat() call has three dominant costs:

  1. Loading — one-time on the first call. Dominated by disk I/O (or mmap setup) for the GGUF.
  2. Prompt prefill — tokenize + forward-pass on the prompt. Scales roughly linearly with prompt length.
  3. Token generation — sample, decode, sample, decode, … Scales linearly with maxTokens (or wherever the model chooses to stop).

This page walks through each, what knobs affect it, and what trade-offs each knob carries.

Reducing load time

Load is the slow part — for a 4 GB model from a cold cache, expect 1–3 seconds on SSD, longer on spinning rust.

use_mmap (default true)

Memory-mapping the GGUF skips the explicit read() syscall and lets the OS page weights in lazily. Always leave this on unless you’re diagnosing a specific mmap issue. Without it, load reads the entire file upfront — slower for large models, identical for small ones once cached.

$model = Model::load($path, ['use_mmap' => true]);   // default

use_mlock (default false)

mlock pins the model’s pages in physical RAM so the OS can’t page them out. Useful when:

  • You’re on a memory-constrained machine and would rather OOM than thrash.
  • You’re serving a large model under unpredictable load and want predictable latency.

The cost: that physical memory is unavailable to anything else on the system. Don’t turn it on unless you know you want it.

$model = Model::load($path, ['use_mlock' => true]);

On Linux, mlock has a per-process limit (RLIMIT_MEMLOCK). For models larger than 64 MB (basically all of them), you’ll need to raise it via /etc/security/limits.conf or ulimit -l unlimited. macOS doesn’t enforce the same limit but may swap aggressively under memory pressure.

Sharing one load across many workers

If you’re running multiple FPM workers, the OS automatically deduplicates mmap’d pages across them. 16 workers loading the same 4 GB model consume ~4 GB of physical memory total, not 64. This is why use_mmap matters even on machines with abundant RAM.

Reducing prompt prefill cost

Prefill cost scales with the number of prompt tokens. The longest prompts come from RAG pipelines that inject document context — see RAG over markdown.

nCtx (default 2048)

The context window for a single call. The rendered prompt plus the generated tokens must fit. Lower is cheaper because llama.cpp sizes the KV cache for the full nCtx up front, so a 32k context needs 16× the KV-cache memory of a 2k context even when most of it goes unused.

$model->chat($prompt, nCtx: 4096, maxTokens: 1024);

For typical RAG/chat use cases, nCtx = 2048 to 4096 is plenty. Go higher only when the model has been trained for it and you’ve measured a quality benefit.
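
To see how KV-cache memory scales with nCtx, the usual back-of-the-envelope estimate is 2 (keys and values) × layers × context length × KV heads × head dimension × bytes per element. The sketch below applies it. The model dimensions are illustrative assumptions (read the real ones from your GGUF metadata), a 16-bit cache is assumed, and kv_cache_bytes() is a hypothetical helper, not part of ext-infer.

```php
<?php

/**
 * Rough KV-cache size in bytes. An estimate only: it ignores allocator
 * padding and any cache quantization.
 */
function kv_cache_bytes(int $nCtx, int $nLayers, int $nKvHeads, int $headDim, int $bytesPerElement = 2): int
{
    return 2 * $nLayers * $nCtx * $nKvHeads * $headDim * $bytesPerElement;
}

// Illustrative dimensions for a small model (assumed, not read from a file).
foreach ([2048, 4096, 32768] as $nCtx) {
    $mib = kv_cache_bytes($nCtx, nLayers: 28, nKvHeads: 8, headDim: 128) / (1024 * 1024);
    printf("nCtx=%5d  ~%d MiB\n", $nCtx, $mib);
}
// Prints 224, 448, and 3584 MiB for these dimensions.
```

The estimate is linear in nCtx, so going from 2048 to 32768 multiplies the cache by exactly 16 regardless of the other dimensions.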

Prompt length

The fastest prompt is a short prompt. Common ways to compress without losing fidelity:

  • Drop boilerplate from system messages. “You are a helpful assistant. Answer truthfully. Don’t make things up. Be concise. Use markdown formatting. …” is mostly cargo-culted. Test what’s actually load-bearing.
  • Truncate conversation history. Keep the last N turns rather than every turn since the dawn of the conversation. For most chatbots, N = 6–10 is plenty.
  • Summarize old turns. Replace turns 1–50 with “Earlier, the user asked about X and you said Y.” This is what production chatbots do above a certain length.
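
History truncation is straightforward to sketch on plain arrays. The helper below is hypothetical, not part of ext-infer: it keeps every system message plus the last N conversational messages, and the result would then be replayed through Prompt::system(), withUser(), and withAssistant().

```php
<?php

/**
 * Hypothetical helper: keep system messages plus the most recent $keep
 * conversational messages.
 *
 * @param list<array{role: string, content: string}> $messages
 * @return list<array{role: string, content: string}>
 */
function truncate_history(array $messages, int $keep): array
{
    $system = array_values(array_filter($messages, fn(array $m): bool => $m['role'] === 'system'));
    $turns  = array_values(array_filter($messages, fn(array $m): bool => $m['role'] !== 'system'));

    // array_slice with a negative offset takes from the end of the list.
    $recent = $keep > 0 ? array_slice($turns, -$keep) : [];

    return [...$system, ...$recent];
}

$history = [
    ['role' => 'system',    'content' => 'Be concise.'],
    ['role' => 'user',      'content' => 'Q1'],
    ['role' => 'assistant', 'content' => 'A1'],
    ['role' => 'user',      'content' => 'Q2'],
    ['role' => 'assistant', 'content' => 'A2'],
    ['role' => 'user',      'content' => 'Q3'],
];

$trimmed = truncate_history($history, keep: 3);
echo implode(' | ', array_column($trimmed, 'content')), "\n";
// Be concise. | Q2 | A2 | Q3
```

When the history ends on a user message, an odd value of $keep makes the retained window start on a user message too, which is what most chat templates expect.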

Reducing token-generation cost

Once prefill is done, each generated token costs roughly the same. Two knobs.

maxTokens (default 128)

The maximum number of generated tokens. Lower is faster. The default is conservative on purpose — bump it for any non-trivial generation:

$model->chat($prompt, maxTokens: 512);   // ~4× the default budget

Set it high enough that legitimate answers complete, low enough that runaway generations (which happen) don’t wedge the worker for minutes. For reasoning models, you’ll want at least 512 — they spend many tokens thinking.

When finishReason() === 'length', you hit this budget. Surface it to the caller so they can decide whether to bump or live with the truncation.
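
One way to act on a 'length' finish is to retry with a larger budget, up to a hard cap. A minimal sketch: next_budget() is a hypothetical helper, and doubling is just one reasonable policy.

```php
<?php

/**
 * Hypothetical policy: returns the maxTokens to retry with, or null when
 * the answer completed or the cap has been reached.
 */
function next_budget(string $finishReason, int $current, int $cap): ?int
{
    if ($finishReason !== 'length' || $current >= $cap) {
        return null;
    }
    return min($current * 2, $cap);
}

var_dump(next_budget('length', 512, 2048));   // int(1024)
var_dump(next_budget('length', 2048, 2048));  // NULL
var_dump(next_budget('eos', 512, 2048));      // NULL
```

Because every chat() call currently rebuilds its context, each retry pays the full prefill again; the cap keeps that worst case bounded.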

temperature

temperature = 0.0 is greedy — sample the single highest-probability token at every step. It’s also the fastest because the sampler is trivial.

temperature > 0.0 enables the random sampler (with optional seed for reproducibility), which is marginally slower per token. The difference is small enough that you should pick based on output quality, not speed.

Hardware-side knobs

Quantization

A Q4_K_M model is roughly 2× faster than Q8_0 of the same model — fewer bits to fetch from memory per matrix multiply. See Choosing a model for the size/quality table.

If Q4_K_M answers are good enough for your use case, prefer it over Q8_0. The space and speed savings are real; the quality drop is usually small for chat workloads.

GPU offload

The biggest single speedup is moving compute off CPU. On Apple Silicon, see Apple Metal; setting n_gpu_layers to 999 typically gives a 3–4× speedup for medium models.

On Linux + NVIDIA, CUDA support exists in llama-cpp-2 but isn’t surfaced as an ext-infer cargo feature yet. Open an issue if you want it.

Pinning threads to cores

llama.cpp respects the OMP_NUM_THREADS environment variable. Setting it explicitly is sometimes faster than the default (which uses all available cores, including hyperthreads that hurt more than help). For a 4-physical-core box:

OMP_NUM_THREADS=4 php hello.php

Experimentally find the sweet spot for your CPU.

Measuring before tuning

A useful pattern: log latency per call and look for the actual bottleneck before reaching for any of these knobs.

$start = hrtime(as_number: true);
$r = $model->chat($prompt, maxTokens: 512);
$elapsed_ms = (hrtime(true) - $start) / 1_000_000;

error_log(sprintf(
    'chat: %.0fms, %d tokens, %.1f tok/s, finish=%s',
    $elapsed_ms,
    $r->tokensGenerated(),
    $r->tokensGenerated() / ($elapsed_ms / 1000),
    $r->finishReason(),
));

If tokens/sec is low (< 20 on a modern CPU), you’re hardware-bound — quantize down or enable GPU offload. If it’s reasonable (50+) but total time is high, you’re generating too many tokens — reduce maxTokens or compress the prompt.
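
The rule of thumb above can be folded into a small helper for log analysis. This is a sketch: the function names and bucket labels are hypothetical, and the thresholds are the rough figures from the paragraph above, not measured constants.

```php
<?php

/** Tokens per second, guarding against a zero elapsed time. */
function tokens_per_second(int $tokens, float $elapsedMs): float
{
    return $elapsedMs > 0.0 ? $tokens / ($elapsedMs / 1000) : 0.0;
}

/** Buckets one measurement using the rule-of-thumb thresholds. */
function diagnose(int $tokens, float $elapsedMs): string
{
    $rate = tokens_per_second($tokens, $elapsedMs);

    return match (true) {
        $rate < 20.0  => 'hardware-bound',   // quantize down or offload to GPU
        $rate >= 50.0 => 'token-bound',      // reduce maxTokens or the prompt
        default       => 'inconclusive',     // in between: keep measuring
    };
}

echo diagnose(100, 10000.0), "\n";   // hardware-bound (10 tok/s)
echo diagnose(512, 8000.0), "\n";    // token-bound (64 tok/s)
echo diagnose(300, 10000.0), "\n";   // inconclusive (30 tok/s)
```

Note that elapsed time here includes prefill, so a very long prompt with a short answer will look slower than the hardware really is; compare calls with similar prompt lengths.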

Future work

Two performance items on the roadmap that aren’t shipping in v0.1 but would change the picture significantly:

  • Reusable session contexts — KV-cache reuse across chat() calls. Multi-turn conversations would skip the prefill cost on every turn after the first.
  • Continuous batching — process N prompts together so the GPU stays saturated. Necessary for any serious inference-as-a-service workload.

Tracked in PLAN.md.

Building from source

The development build is what make build produces — a debug-mode shared library you can load via -d extension=…. The release build is what ships in PIE tarballs.

Prerequisites

  • PHP 8.3+ with php-config on PATH.
  • Rust — installed via rustup. The repo pins the toolchain via rust-toolchain.toml; rustup will fetch it on first build.
  • cmake 3.18+ — llama.cpp’s build system.
  • A C/C++ compiler — Clang (macOS / Linux) or GCC. The build script honors CC / CXX if you need to override.
  • libclang (Linux only) — apt install libclang-dev or distro equivalent. Used by bindgen for the PHP header parse.
  • cargo-php: run cargo install cargo-php once.

Verify everything:

php --version
php-config --version
rustup --version
cmake --version
cargo php --version
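
If one of those commands is missing, the failure can be easy to overlook in a long terminal scroll. As a sketch, a hypothetical check_tools helper (not part of the repo) can report every missing binary in one pass. It only checks that each name resolves on PATH, so cargo-php, which is a cargo subcommand, still needs the cargo php --version check above.

```shell
# Hypothetical helper: report which of the named tools are on PATH.
# Usage: check_tools php php-config rustup cmake cargo
check_tools() (
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok       $tool"
        else
            echo "MISSING  $tool"
            missing=1
        fi
    done
    # Non-zero exit when at least one tool was not found.
    exit "$missing"
)
```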

Cloning

git clone https://github.com/DisplaceTech/ext-infer
cd ext-infer

The repo includes a models/ directory (gitignored) where you can drop GGUFs for testing. The PHPT suite and examples both default to models/Qwen3-0.6B-Q8_0.gguf.

Debug build

make build
# -> target/debug/libinfer.{so,dylib}

Debug builds compile faster but run slower. Use them for iterative development; switch to make release when you’re benchmarking or shipping.

A cold make build takes a few minutes because cargo compiles llama-cpp-sys-2 from source (it vendors all of llama.cpp). Cached incremental rebuilds are sub-minute on a modern laptop.

Release build

make release
# -> target/release/libinfer.{so,dylib}

Use this for installing system-wide via make install, for the performance numbers you’d quote in benchmarks, and for any “production-like” testing.

Optional features

Feature | Effect | When to use
metal | Enables Apple Metal GPU offload on macOS-arm64. | When you have an Apple Silicon Mac and want GPU acceleration. See Apple Metal.

make release FEATURES=metal

Loading your build into PHP

Two options.

Without installing

Pass -d extension=… on every PHP invocation:

php -d extension=$PWD/target/debug/libinfer.dylib your-script.php

Substitute .so on Linux. This is what every script in examples/ assumes — you can drop the flag once you make install.
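
To avoid hand-editing the suffix per platform, the path can be derived from uname. This is a sketch under the assumption that the library is named libinfer and lives under target/debug or target/release as shown in the build sections; ext_path is a hypothetical helper name.

```shell
# Hypothetical helper: print the path of the built library for this OS.
# Usage: php -d extension="$(ext_path debug)" your-script.php
ext_path() (
    profile=${1:-debug}          # "debug" or "release"
    case "$(uname -s)" in
        Darwin) suffix=dylib ;;  # macOS
        *)      suffix=so ;;     # Linux and other ELF platforms
    esac
    echo "$PWD/target/$profile/libinfer.$suffix"
)
```

Run it from the repository root, since the path is built from the current directory.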

Installing system-wide

make install runs cargo php install --release, which:

  1. Builds release-mode if it hasn’t already.
  2. Drops the binary into PHP’s extension_dir.
  3. Adds extension=infer.so (or .dylib) to a config file in php.ini’s scan directory.

make install
php -m | grep infer
# infer

To revert:

make uninstall

Editor / IDE setup

Rust analyzer

The Rust code lives in src/. Pointing rust-analyzer at Cargo.toml (default) Just Works.

PHP autocomplete

Use the hand-authored stubs at stubs/infer.stubs.php, for example through the autoload-dev section of composer.json:

{
  "autoload-dev": {
    "files": ["stubs/infer.stubs.php"]
  }
}

Or symlink it into your project. The stubs include full PHPDoc on every method so hovering in your IDE shows the option semantics without flipping to the docs.

Regenerating stubs (rare)

Stubs are hand-authored today because we want richer docblocks than cargo php stubs emits. To regenerate from scratch (e.g. to confirm the stub signatures match what’s actually registered):

make stubs
git diff stubs/infer.stubs.php

Reconcile the generated output with the hand-authored version manually.

Troubleshooting common build failures

Error | Likely fix
linker 'cc' not found / cc: command not found | Install Xcode CLT (xcode-select --install) or build-essential (Ubuntu).
cmake: command not found | brew install cmake or apt install cmake.
libclang.so: cannot open shared object | apt install libclang-dev (Linux). On macOS, libclang comes with the CLT.
php-config: command not found | Install PHP CLI; on macOS via Homebrew use brew link php@<version> --force.
cargo install cargo-php fails | Check your Rust version. rustup update may help.
undefined symbol: _spl_ce_RuntimeException | The dynamic-lookup link flag didn’t apply. Check build.rs ran; the usual cause is a stale target/, so run cargo clean and rebuild.

Next

  • Testing — running PHPT and Rust unit tests.
  • Releasing — cut-a-release process.

Testing

ext-infer has two test layers:

  • PHPT — integration tests that exercise the extension from PHP. This is where the real correctness coverage lives.
  • Rust unit tests — for pure-Rust helpers (currently none; see Why no Rust unit tests? below).

Plus formatting and clippy. CI runs all of the above on every push.

Running PHPT locally

The test harness lives in tests/phpt/. make test runs the full suite against a debug build:

make test

What that command actually does:

  1. Build (cargo build).
  2. Sanity-load — confirm the extension actually loaded into PHP.
  3. Fetch run-tests.php from php-src matching the current PHP minor version (if not already cached).
  4. Run php run-tests.php -q --show-diff tests/phpt/ with TEST_PHP_EXECUTABLE and TEST_PHP_ARGS set so the freshly built .so / .dylib is loaded.
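
Step 4 can be approximated by hand when you want to bypass make. The sketch below is not the Makefile's actual recipe; run_phpt is a hypothetical wrapper that assumes run-tests.php is already in the current directory and that you pass the path to the built library yourself.

```shell
# Hypothetical wrapper around step 4 of the list above.
# Usage: run_phpt "$PWD/target/debug/libinfer.so"
run_phpt() (
    ext=$1                        # path to the freshly built library
    runner=${2:-run-tests.php}    # assumed to be fetched already
    TEST_PHP_EXECUTABLE=$(command -v php) || {
        echo "php not found on PATH" >&2
        exit 1
    }
    export TEST_PHP_EXECUTABLE
    # run-tests.php passes these arguments to every PHP process it spawns.
    export TEST_PHP_ARGS="-d extension=$ext"
    exec php "$runner" -q --show-diff tests/phpt/
)
```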

Tests gated on a real model use the INFER_TEST_MODEL environment variable:

INFER_TEST_MODEL=$PWD/models/Qwen3-0.6B-Q8_0.gguf make test

Without the variable, model-gated tests skip cleanly. CI runs in this “no model” mode by default; setting INFER_TEST_MODEL runs the full suite.

Writing a PHPT test

Files in tests/phpt/ follow the standard PHPT format:

--TEST--
Model::chat() returns a Response with the model's answer
--SKIPIF--
<?php
if (!extension_loaded('infer')) {
    echo 'skip ext-infer not loaded';
    exit;
}
$path = getenv('INFER_TEST_MODEL');
if (!$path || !is_file($path)) {
    echo 'skip INFER_TEST_MODEL not set to an existing GGUF file';
}
?>
--FILE--
<?php
$model = \Displace\Infer\Model::load(getenv('INFER_TEST_MODEL'));
$r = $model->chat(\Displace\Infer\Prompt::user('hi'), maxTokens: 32);
echo $r->finishReason() === 'eos' || $r->finishReason() === 'length' ? "ok\n" : "bad\n";
$model->close();
?>
--EXPECT--
ok

Filename convention: NNN-short-description.phpt. NNN ordering is loose — it determines the order run-tests.php runs them in, which doesn’t really matter.
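
When adding a test, the next free prefix can be computed instead of guessed. This is a sketch only: next_phpt_name is a hypothetical helper, and it assumes NNN means exactly three digits followed by a hyphen.

```shell
# Hypothetical helper: suggest the next NNN-slug.phpt name in a directory.
# Usage: next_phpt_name tests/phpt chat-basic
next_phpt_name() (
    dir=$1
    slug=$2
    max=0
    for f in "$dir"/[0-9][0-9][0-9]-*.phpt; do
        [ -e "$f" ] || continue      # glob matched nothing
        n=${f##*/}                   # drop the directory part
        n=${n%%-*}                   # keep the NNN prefix
        # Strip up to two leading zeros so "010" is not read as octal.
        n=${n#0}
        n=${n#0}
        [ "$n" -gt "$max" ] && max=$n
    done
    printf '%03d-%s.phpt\n' "$((max + 1))" "$slug"
)
```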

Three sections every model-gated test needs:

  • --SKIPIF--: skip if extension_loaded('infer') is false (the harness invocation always passes -d extension=…, so this catches setup mistakes) and skip if INFER_TEST_MODEL is unset.
  • --FILE-- — the actual PHP under test.
  • --EXPECT-- or --EXPECTF-- — expected output. Use --EXPECTF-- if you need wildcards (%s, %d).

For tests that DON’T need a model, drop the INFER_TEST_MODEL check from --SKIPIF--. They’ll run in CI’s no-model leg.

Running Rust unit tests

cargo test --lib

…would be the command, but see the next section.

Why no Rust unit tests?

Earlier versions had Rust unit tests in src/response.rs and src/embedding.rs covering pure-Rust helpers. They were dropped because cargo test --lib builds an executable that statically links the crate, which pulls in references to the ext-php-rs runtime symbols (zend_throw_exception, _emalloc, …) — symbols only resolved when loaded into a real PHP host. On a clean checkout, cargo test --lib fails to link.

PHPT covers the same correctness ground end-to-end, so this is a net win for CI simplicity. If a pure-Rust helper grows complex enough to warrant unit tests in isolation, the path forward is to factor it into a sibling crate that has no ext-php-rs dependency.

Linting

make fmt-check       # cargo fmt --all --check
make clippy          # cargo clippy --all-targets -- -D warnings

CI runs both checks, with clippy warnings promoted to errors via -D warnings. Local lints are pinned to the same Rust toolchain as the build (via rust-toolchain.toml).

CI structure

.github/workflows/ci.yml runs on every push and PR:

  • rustfmt + clippy on ubuntu-latest with PHP 8.4. Fast (~1 minute warm-cache).
  • Test matrix — 6 legs: {ubuntu-latest, macos-14} × {8.3, 8.4, 8.5}. Each builds the extension, loads it, runs the no-model PHPT legs. Cache is scoped per-PHP-minor (see the comment in ci.yml about why this matters for ext-php-rs binding regeneration).

What CI does not do:

  • Run model-gated PHPT tests. Adding a fixture model to CI is on the roadmap; for now, run them locally before tagging.
  • Exercise ZTS PHP. See Threading & ZTS.

Pre-flight checklist

Before opening a PR, the maintainers run:

cargo fmt --all --check                              # no diff
cargo clippy --all-targets -- -D warnings            # clean
INFER_TEST_MODEL=$PWD/models/Qwen3-0.6B-Q8_0.gguf make test  # all green

If any of those fail, the PR will fail CI for the same reason — fix locally first.
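
The three commands can be chained so that the first failure stops the run. A minimal sketch, assuming it is invoked from the repository root; preflight is a hypothetical function name, and the default model path is the one used throughout this page.

```shell
# Hypothetical helper: run the pre-flight checks, stopping at the first failure.
# Usage: preflight            (or: INFER_TEST_MODEL=/path/to/model.gguf preflight)
preflight() (
    model=${INFER_TEST_MODEL:-$PWD/models/Qwen3-0.6B-Q8_0.gguf}
    # "&&" chaining is used instead of "set -e" so the helper also stops
    # correctly when called from an "if" or "||" context.
    cargo fmt --all --check &&
    cargo clippy --all-targets -- -D warnings &&
    INFER_TEST_MODEL=$model make test &&
    echo "preflight: all green"
)
```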

Next

  • Releasing — what runs in the release workflow (a different beast than CI).
  • Building from source — getting to the point where make test can even run.

Releasing

The full cut-a-release process lives in RELEASE.md at the repo root. This page is the one-screen version with pointers back into that document.

The five-step shape

# 1. Bump versions
$EDITOR Cargo.toml                   # set [package].version = "0.1.0"

# 2. Verify locally
cargo fmt --all --check
cargo clippy --all-targets -- -D warnings
INFER_TEST_MODEL=$PWD/models/Qwen3-0.6B-Q8_0.gguf make test
composer validate composer.json

# 3. Land the bump
git commit -am "chore(release): v0.1.0"
git push

# 4. Tag — this is what triggers the release workflow
git tag v0.1.0
git push --tags

# 5. Edit and publish the draft Release on GitHub

Step 4 is the only user-facing action. The release workflow takes it from there.
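
Because the tag is what triggers the build, it is worth confirming that it agrees with Cargo.toml before pushing. The sketch below is an assumption about how such a guard could look, not something the repo ships; cargo_version and check_tag are hypothetical names, and the parser only handles a plain version = "x.y.z" line inside [package].

```shell
# Hypothetical helper: print [package].version from a Cargo manifest.
cargo_version() {
    awk '
        /^\[/ { in_pkg = ($0 == "[package]") }   # track the current table
        in_pkg && /^version[ \t]*=/ {
            gsub(/^[^"]*"|".*$/, "")             # keep only the quoted value
            print
            exit
        }
    ' "$1"
}

# Hypothetical helper: succeed only if the tag is "v" + the manifest version.
# Usage: check_tag v0.1.0 Cargo.toml && git tag v0.1.0
check_tag() (
    tag=$1
    manifest=${2:-Cargo.toml}
    version=$(cargo_version "$manifest")
    if [ "$tag" = "v$version" ]; then
        echo "ok: $tag matches $manifest"
    else
        echo "mismatch: tag $tag, $manifest has $version" >&2
        exit 1
    fi
)
```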

What the workflow does

For each (PHP minor, OS, arch) in the 9-leg matrix:

  1. Install system deps (cmake, build-essential, …).
  2. Install the matrix PHP via shivammathur/setup-php@v2.
  3. cargo build --release.
  4. Stage infer.so / infer.dylib in the right shape.
  5. Tarball as php_infer-{version}_php{minor}-{arch}-{os}[-{libc}].tar.gz per PIE’s filename convention.
  6. Compute a .sha256 sidecar.
  7. Upload both to a draft GitHub Release.

The first matrix leg creates the draft Release; later legs add files to the same one.
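
Steps 5 and 6 can be illustrated in a few lines of shell. This is a sketch of the naming pattern quoted above rather than the workflow's real script: tarball_name and write_sha256 are hypothetical helpers, and the arch, os, and libc values in the usage line are assumed examples.

```shell
# Hypothetical helper: build a tarball name from the pattern in step 5.
# The libc argument is optional, mirroring the [-{libc}] part of the pattern.
# Usage: tarball_name 0.1.0 8.3 x86_64 linux glibc
tarball_name() (
    version=$1
    minor=$2
    arch=$3
    os=$4
    libc=${5:-}
    echo "php_infer-${version}_php${minor}-${arch}-${os}${libc:+-$libc}.tar.gz"
)

# Hypothetical helper: write FILE.sha256 next to FILE.
# The checksum is recorded against the bare file name so it can be verified
# from whatever directory the two files are later downloaded into.
write_sha256() (
    cd "$(dirname "$1")" || exit 1
    name=$(basename "$1")
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$name" > "$name.sha256"       # GNU coreutils
    else
        shasum -a 256 "$name" > "$name.sha256"   # macOS default
    fi
)
```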

Why “draft”?

Releases ship draft so a maintainer can:

  • Verify all 18 files (9 tarballs + 9 sidecars) are attached.
  • Write release notes — the workflow doesn’t auto-generate them.
  • Spot-check one tarball locally with pie install before exposing it to users.

After the manual review, hit Publish release in the GitHub UI.
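
The attachment check from the list above can be scripted once the draft's assets are in a local directory. A minimal sketch, assuming each sidecar is named after its tarball with .sha256 appended; verify_assets is a hypothetical helper and the default of 9 comes from the matrix size stated on this page.

```shell
# Hypothetical helper: confirm every tarball has a sidecar and the count is right.
# Usage: verify_assets ./downloaded-assets 9
verify_assets() (
    dir=$1
    expected=${2:-9}
    count=0
    for t in "$dir"/*.tar.gz; do
        [ -e "$t" ] || continue      # glob matched nothing
        if [ ! -f "$t.sha256" ]; then
            echo "missing sidecar: $t.sha256" >&2
            exit 1
        fi
        count=$((count + 1))
    done
    if [ "$count" -ne "$expected" ]; then
        echo "expected $expected tarballs, found $count" >&2
        exit 1
    fi
    echo "ok: $count tarballs, $count sidecars"
)
```

This only checks presence and count; verifying the checksums themselves and a trial pie install remain separate steps.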

Versioning policy

Pre-1.0 (0.x.y), breaking changes happen between minors (for example 0.1 to 0.2), not patches. Once v1.0.0 ships, the class / method / argument surface is frozen.

composer.json does NOT carry a version key — that would conflict with the tag-derived version Composer infers. The branch-alias under extra exists only so dev-main resolves to 0.1.x-dev for users pinning a dev branch.

What RELEASE.md covers in more detail

  • Pre-flight checklist (the verify-locally step expanded).
  • Release-notes template.
  • Post-publish smoke test (install via PIE, run hello-world).
  • Hotfix / patch process.
  • Yanking a broken release.
  • Caveats (Windows excluded, ZTS untested, etc.).
  • Symptom → first-thing-to-check table for release failures.

If you’re cutting a release, read RELEASE.md first. This page is the index, not the manual.

Caveats

Three things v0.1 explicitly doesn’t ship and that you should know about before cutting one:

  • No Windows binaries. os-families-exclude: ["windows"] in composer.json makes PIE skip Windows hosts cleanly.
  • No ZTS binaries. The composer.json declares support-zts: true because the code is thread-safe by construction, but the release matrix doesn’t include a ZTS runner. ZTS users need to build from source for now.
  • No musl Linux binaries. The release matrix is glibc only. Musl users build from source; the .cargo/config.toml carries the needed crt-static opt-out.

All three are tracked in PLAN.md.

Next

  • RELEASE.md — the full process document.
  • PLAN.md — what’s in flight after v0.1.