Introduction

ext-infer is a PHP 8.3+ extension that loads a GGUF model and runs LLM inference inside the PHP process via llama.cpp. That lets PHP-native semantic search, RAG pipelines, and CLI / worker inference run without shelling out to Python or calling a remote API.

It is written in Rust on top of ext-php-rs and the llama-cpp-2 bindings. The public PHP surface is designed to feel native: a fluent, role-aware Prompt builder; a Response that splits reasoning from answer; an Embedding that knows how to normalize itself and compute cosine similarity. You should rarely, if ever, need to think about <|im_start|> tokens.

Why an extension?

Three reasons local inference belongs in PHP rather than next to it:

Latency. A subprocess fork or HTTP round trip costs at least a few milliseconds, often tens. An in-process call is bounded only by decode time.
Operational surface. No Python sidecar to package, no daemon to supervise, no inference server to scale alongside FPM. The PHP process is the inference server.
API ergonomics. Calling a local LLM should be as natural in PHP as calling intl or pdo. The extension API is shaped to match that — see Prompts and Chat completions.

What’s here

This guide is split into five layers, navigable from the sidebar:

Section	What you’ll find
Getting Started	Install, run hello-world, verify it loaded.
Guide	Conceptual walkthroughs of each public class. Read in order on first pass.
Recipes	Copy-paste-ready patterns: multi-turn chat, semantic search, RAG, worker pools.
Reference	Complete API listing, exceptions, environment variables, compatibility matrix.
Advanced	Threading model, Apple Metal, performance tuning.

Status

ext-infer is pre-release — the class surface is stable but the first tagged release (v0.1.0) is still in flight. See RELEASE.md for the cut-a-release flow and PLAN.md for what’s coming next.

Conventions in this guide

Code blocks are runnable as written, with one exception: PHP code assumes the extension is loaded. Either install it system-wide or prepend -d extension=… to your php command. See Installation.
Model without a namespace prefix means Displace\Infer\Model; same for Prompt, Response, Embedding. Real code needs the use statement at the top of the file.
CLI snippets are written for a POSIX shell (bash / zsh). Adjust for fish / PowerShell as needed; differences are usually only quoting.

Installation

Two supported install paths:

Via PIE — pulls a pre-built binary for your (php-minor, arch, os, libc) combo. No local C/C++ toolchain. Recommended for application developers.
From source — builds llama.cpp locally via cargo. Needed for contributors, distros without a pre-built artifact, or anyone who wants to enable the metal cargo feature.

Via PIE

Heads up: PIE installation is wired up but the first published release (v0.1.0) is still in flight. Until then, install from source — the pie install flow becomes the recommended path the moment we ship binaries.

PIE (PHP Installer for Extensions) is the official tool for installing PHP extensions from Composer-style metadata. Get it once:

curl -L --output pie.phar \
    https://github.com/php/pie/releases/latest/download/pie.phar
chmod +x pie.phar && sudo mv pie.phar /usr/local/bin/pie

Then install ext-infer:

pie install displace/ext-infer

PIE reads composer.json to learn that ext-infer ships pre-packaged binaries, fetches the right tarball from the matching GitHub Release, extracts infer.so (or infer.dylib on macOS) into the PHP extension directory, and adds it to your php.ini.

Verify the install with php -m:

php -m | grep infer
# infer

From source

Prerequisites

Tool	Purpose	Minimum
PHP CLI	host process	8.3
`php-config`	tells `ext-php-rs` where the PHP headers are	(matches PHP)
Rust toolchain	compiles the extension	1.88
`cmake`	llama.cpp builds via cmake during cargo build	3.18+
C/C++ toolchain	llama.cpp itself	Clang / GCC
`cargo-php`	invoked by `make install` to drop the artifact in PHP’s extension dir	0.1+

The Rust toolchain is pinned via rust-toolchain.toml, so you don’t need to install a specific version manually — rustup will fetch it on first build. On macOS, cmake is a brew install cmake away; on Debian/Ubuntu, apt install cmake build-essential libclang-dev.

Install cargo-php once:

cargo install cargo-php

Build and install

git clone https://github.com/DisplaceTech/ext-infer
cd ext-infer
make release          # builds target/release/libinfer.{so,dylib}
make install          # cargo php install --release
php -m | grep infer

A cold build compiles llama.cpp from source, which takes a few minutes on a fresh machine. Subsequent builds reuse cargo’s incremental cache and the already-built llama.cpp object files; expect sub-minute rebuilds after the first one.

Without `make install` (development)

If you want to load a freshly built binary without committing to installing it system-wide, pass the path on the PHP command line:

make build       # debug build (faster compile, slower runtime)
php -d extension=$PWD/target/debug/libinfer.dylib your-script.php

Substitute .so for .dylib on Linux. This is the workflow used throughout the examples.

Apple Metal acceleration (opt-in)

The default build is CPU-only and portable. For Apple Silicon GPU acceleration:

make release FEATURES=metal
make install  FEATURES=metal

See Apple Metal for what this does and what trade-offs it implies.

Uninstalling

Via PIE:

pie uninstall displace/ext-infer

From a source install:

make uninstall    # cargo php remove

Either way, confirm with php -m | grep infer (should produce no output).

Troubleshooting

If php -m | grep infer shows nothing after install, see Verifying your install for the diagnostic checklist. It walks through the four most common failure modes (extension_dir mismatch, PHP minor mismatch, missing -undefined dynamic_lookup on macOS, libc mismatch on Linux).

Quick start

This page assumes you’ve already installed the extension. From a cold install to a working answer in under a minute:

1. Grab a model

GGUF files are big. Even the smallest interesting ones are 600 MB quantized. For getting started, Qwen3-0.6B-Q8_0 is a good first model — Apache-2.0 licensed, ~640 MB, fast on CPU, good enough at toy questions:

mkdir -p models
curl -L -o models/Qwen3-0.6B-Q8_0.gguf \
    https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q8_0.gguf

See Choosing a model for the broader landscape.

2. Write the script

Save the following as hello.php:

<?php

declare(strict_types=1);

use Displace\Infer\Model;
use Displace\Infer\Prompt;

$model = Model::load('models/Qwen3-0.6B-Q8_0.gguf');

$response = $model->chat(
    Prompt::system('You are a helpful, concise assistant.')
        ->withUser('What is 2+2?'),
    maxTokens: 256,
    temperature: 0.0,
);

echo $response->answer(), PHP_EOL;

$model->close();

Three things going on:

Model::load(...) reads the GGUF into memory. Loading is the slow step — for a real app, load once and keep the handle around. See Choosing a model.
Prompt::system(...)->withUser(...) builds a chat prompt without any template tokens. The Prompt is immutable; each with* returns a new instance. See Prompts.
$model->chat($prompt, ...) renders the prompt through whatever chat template the GGUF ships, runs inference, and returns a Response. answer() is the model’s reply with any <think>...</think> reasoning stripped.

3. Run it

If you installed via PIE (or make install), just:

php hello.php

If you’re running against a make build artifact instead:

php -d extension=$(pwd)/target/debug/libinfer.dylib hello.php

Substitute .so on Linux. Expected output:

2 + 2 equals 4.

4. What just happened

llama.cpp normally spams several hundred lines to stderr per inference (model layout, KV-cache sizing, graph reservation). ext-infer silences that by default — it’s noise inside a PHP request and tends to poison structured logs. Bring it back when you need to debug:

EXT_INFER_LOG=1 php hello.php

See Environment variables for the complete list.

Next steps

Verifying your install — the canonical diagnostic checklist if the script above doesn’t work.
Prompts — multi-turn chat, system messages, immutability semantics.
Embeddings — Model::embed() plus cosine similarity.
Multi-turn chat recipe — a ready-to-lift implementation of conversational state.
examples/chat-interactive/ — a Symfony Console standalone app that takes the above further.

Verifying your install

After installing, three things should be true. If any of them isn’t, this page is the checklist.

The fast version

# 1. Is the extension loaded?
php -m | grep infer
# expected: infer

# 2. Are the classes registered?
php -r 'echo class_exists("Displace\\Infer\\Model") ? "yes\n" : "no\n";'
# expected: yes

# 3. Does inference actually work?
php -r '
$m = \Displace\Infer\Model::load("models/Qwen3-0.6B-Q8_0.gguf");
$r = $m->chat(\Displace\Infer\Prompt::user("Say hello."));
echo $r->answer(), PHP_EOL;
$m->close();
'
# expected: a one-line greeting

All three pass → you’re done. Skip to the Guide.

Diagnosis if `php -m | grep infer` is empty

The extension didn’t load. PHP loads extensions from a specific directory and looks for them by exact filename — usually one of these four things is off.

1. PHP can’t find the binary

Confirm where PHP is looking:

php -i | grep -E '^extension_dir|^Loaded Configuration File'

Then confirm the binary is in that directory:

ls -l $(php -r 'echo ini_get("extension_dir");')/infer.*

If the file is missing:

After make install, cargo-php should have placed it there. Try re-running with -v to see where it landed: make install (or cargo php install --release -v).
After pie install, look at PIE’s output for the install path.

If the file is in a different directory than extension_dir, either move it or update extension_dir in your php.ini.

2. PHP minor mismatch

A binary built against PHP 8.4 will not load into PHP 8.5 (and vice versa). Confirm both:

php --version | head -1
# e.g. PHP 8.4.20 (cli)

# For PIE-installed binaries, the tarball filename encodes the PHP
# minor — check the GitHub Release you installed from:
ls -l $(php -r 'echo ini_get("extension_dir");')/infer.*

Cross-check that the binary’s PHP minor matches your running PHP minor. If they disagree, re-install with the right tarball (PIE handles this automatically; manual installs may need pie install --force).

3. macOS: `-undefined dynamic_lookup` missing from the link

The extension uses dlopen-style undefined-symbol resolution against the host PHP binary. If you built from source on macOS and skipped the extension’s own build.rs, the linker errors out at build time with Undefined symbols for architecture arm64. From-source builds via make build / make release configure this automatically. If you invoked cargo build from somewhere unusual (e.g. an IDE), repeat the build via make to be safe.

4. Linux: libc mismatch

The released binaries target glibc Linux. Alpine (musl) is not in the v0.1 release matrix. Confirm your libc:

ldd --version 2>&1 | head -1
# expected: ldd (GNU libc) 2.x
# if you see musl: rebuild from source — see Installation

Building from source on musl works; .cargo/config.toml carries the needed crt-static opt-out.

Diagnosis if classes are missing

If php -m shows infer but the class_exists("Displace\\Infer\\Model") check prints no, the class name being checked is probably misspelled somewhere in the calling code. The full list of registered classes:

Displace\Infer\Model
Displace\Infer\Prompt
Displace\Infer\Message
Displace\Infer\Response
Displace\Infer\Embedding
Displace\Infer\InferException
Displace\Infer\ModelLoadException
Displace\Infer\InferenceException

All eight should exist after a successful load. If only some do, you likely have an ext-infer install left over from an older API surface: uninstall the old version (pie uninstall or make uninstall) and reinstall.
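To see at a glance which of the eight names are missing, a short script along these lines may help. It is a sketch in plain PHP that relies only on class_exists(); the function name missingInferClasses is made up for this example.

```php
<?php

declare(strict_types=1);

/**
 * Returns the expected ext-infer class names that are not registered
 * in the running PHP process. An empty list means all eight are present.
 *
 * @return list<string>
 */
function missingInferClasses(): array
{
    $expected = [
        'Displace\\Infer\\Model',
        'Displace\\Infer\\Prompt',
        'Displace\\Infer\\Message',
        'Displace\\Infer\\Response',
        'Displace\\Infer\\Embedding',
        'Displace\\Infer\\InferException',
        'Displace\\Infer\\ModelLoadException',
        'Displace\\Infer\\InferenceException',
    ];

    // Autoloading is disabled on purpose: extension classes are
    // registered at startup, so no autoloader should be involved.
    return array_values(array_filter(
        $expected,
        static fn (string $class): bool => !class_exists($class, false),
    ));
}

$missing = missingInferClasses();
echo $missing === []
    ? "all classes registered\n"
    : "missing:\n  " . implode("\n  ", $missing) . "\n";
```

Run it with the same php binary (and the same -d extension=… flag, if any) that your application uses, since a different binary may load a different set of extensions.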

Diagnosis if inference fails

If Model::load throws ModelLoadException:

“no such file” — the GGUF path is wrong. PHP resolves relative paths against the working directory, not the script’s directory.
“failed to load model: …” — check that the file isn’t truncated (du -h should match what the publisher lists) and that it really is a GGUF (file <path> should mention “data” or similar; if it says “ASCII text” it’s probably an HTML 404 from a failed download).
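Both checks can be approximated from PHP before calling Model::load. The sketch below assumes only that a GGUF file begins with the four ASCII bytes GGUF; the helper name looksLikeGguf is hypothetical, and passing the check does not prove the download is complete.

```php
<?php

declare(strict_types=1);

/**
 * Cheap pre-flight check: does the file start with the 4-byte GGUF magic?
 * This does not prove the file is complete, only that it is not, say,
 * an HTML error page saved under a .gguf name.
 */
function looksLikeGguf(string $path): bool
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return false; // missing or unreadable
    }
    $magic = fread($handle, 4);
    fclose($handle);

    return $magic === 'GGUF';
}

// Resolve against the script's directory rather than the working directory.
$path = __DIR__ . '/models/Qwen3-0.6B-Q8_0.gguf';
var_dump(looksLikeGguf($path));
```

Building the path from __DIR__ sidesteps the working-directory surprise described in the first bullet.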

If Model::chat throws InferenceException with "model has no embedded chat template", you’ve picked a base model rather than an instruct/chat variant. See Choosing a model or use Model::raw() with your own templating.

If the script segfaults rather than throwing — please open an issue at github.com/DisplaceTech/ext-infer/issues with the model name, PHP version, and OS. That’s a bug.

Enabling verbose logging

llama.cpp’s own diagnostic chatter is silenced by default. To see it (model layout, KV cache sizing, graph reservation, …):

EXT_INFER_LOG=1 php hello.php

A noisy log can sometimes point straight at the issue — e.g. “n_ctx exceeds model’s training context” tells you the model is being asked to handle longer input than it was trained for.

Prompts

Displace\Infer\Prompt is the input to Model::chat(). It represents an ordered list of role-tagged messages — system, user, assistant — that the extension renders into whatever chat-template format the underlying model expects. You never write <|im_start|> (or its Llama 3 / Mistral / Gemma equivalent) by hand.

Two-stage construction

A Prompt starts with a factory — either system() or user() — and grows via with* calls. Each with* returns a new Prompt; the receiver is never modified.

use Displace\Infer\Prompt;

// Start with a system message:
$p = Prompt::system('You are a helpful assistant.')
    ->withUser('What is 2+2?');

// Or start with a user message (no system instruction):
$p = Prompt::user('Hello!');

// Multi-turn replays:
$p = Prompt::system('You are a poet.')
    ->withUser('Write a haiku about Rust.')
    ->withAssistant("Code runs cold and fast,\nMemory safe by the borrow,\nNo crashes today.")
    ->withUser('Now translate it to French.');

Direct new Prompt() is refused at runtime:

new Prompt();
// Displace\Infer\InferException: use Displace\Infer\Prompt::system()
// or Prompt::user() to start a prompt

Why immutable?

The shape mirrors DateTimeImmutable. Two practical consequences:

A Prompt you’ve built once is safe to share across multiple chat() calls, hand to a queue worker, or stash in a class property. Nothing downstream can mutate it.

Branching is free. The multi-turn chat recipe keeps a $base Prompt around (system-message-only) so /reset can drop conversation history without re-rendering the system prompt:

$base         = Prompt::system($systemMessage);
$conversation = $base;
// … many turns …
if ($userTyped === '/reset') {
    $conversation = $base;   // immutable; $base is untouched no
                             // matter how many turns went through it
}
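For readers who want to see why the reset is safe, here is a toy stand-in written in plain PHP. SketchPrompt is a hypothetical class invented for this illustration; it is not how ext-infer implements Prompt, only the same build-a-new-instance shape.

```php
<?php

declare(strict_types=1);

/** Toy stand-in for the immutable builder shape; not ext-infer's Prompt. */
final class SketchPrompt
{
    /** @param list<array{role: string, content: string}> $messages */
    private function __construct(private readonly array $messages)
    {
    }

    public static function system(string $content): self
    {
        return new self([['role' => 'system', 'content' => $content]]);
    }

    public function withUser(string $content): self
    {
        // Build a new instance; $this->messages is never written to.
        return new self([...$this->messages, ['role' => 'user', 'content' => $content]]);
    }

    public function count(): int
    {
        return count($this->messages);
    }
}

$base         = SketchPrompt::system('You are terse.');
$conversation = $base->withUser('hi')->withUser('still there?');

echo $base->count(), ' ', $conversation->count(), PHP_EOL; // 1 3
```

However many turns are chained onto $conversation, $base still holds exactly one message, which is what makes assigning it back on /reset sufficient.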

Inspecting a Prompt

$p->messages();    // list<Displace\Infer\Message>
$p->count();       // int — number of messages
$p->isEmpty();     // bool
$p->lastRole();    // ?string — role of the most recent message, or null

Each Message is read-only:

foreach ($p->messages() as $msg) {
    printf("[%s] %s\n", $msg->role(), $msg->content());
}
// [system] You are a helpful assistant.
// [user] What is 2+2?

role() is always one of 'system', 'user', or 'assistant'. Because roles are chosen by method name on the construction side (withSystem, withUser, withAssistant) rather than passed as strings, a typo surfaces as an undefined-method error instead of silently creating a fictional role.

Role ordering

ext-infer does not enforce role ordering at construction time. You can build:

Prompt::user('hi')->withSystem('be terse');  // legal
Prompt::system('a')->withSystem('b');         // also legal

…and they will be rendered as written. Whether the model accepts the result is a chat-template decision: most modern chat templates allow at most one leading system message, followed by alternating user / assistant turns. Build sequences that match that convention and the chat template will render them; deviate and you may get an error from Model::chat() at call time.
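If you would rather catch an unusual sequence before calling Model::chat(), a check like the following can be run on the list of roles. It is a sketch of the common convention only (optional leading system message, then user / assistant alternating and starting with user); individual chat templates may be stricter or looser, and the function name is made up for this example.

```php
<?php

declare(strict_types=1);

/**
 * Checks a role sequence against the common chat-template convention:
 * an optional single leading 'system', then 'user' / 'assistant'
 * strictly alternating, starting with 'user'.
 *
 * @param list<string> $roles e.g. ['system', 'user', 'assistant', 'user']
 */
function followsCommonRoleOrder(array $roles): bool
{
    if ($roles !== [] && $roles[0] === 'system') {
        array_shift($roles); // at most one leading system message
    }

    $expected = 'user';
    foreach ($roles as $role) {
        if ($role !== $expected) {
            return false;
        }
        $expected = $expected === 'user' ? 'assistant' : 'user';
    }

    return true;
}

var_dump(followsCommonRoleOrder(['system', 'user', 'assistant', 'user'])); // bool(true)
var_dump(followsCommonRoleOrder(['user', 'system']));                      // bool(false)
var_dump(followsCommonRoleOrder(['system', 'system']));                    // bool(false)
```

With the extension loaded, the role list can be derived from a Prompt via messages() and role(), for example with array_map over $p->messages().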

Composition patterns

Pre-baked system prompts

If your application has a few stock personalities, define them once:

final class Personas
{
    public static function poet(): Prompt
    {
        return Prompt::system(
            'You are a haiku poet. Respond in three lines. ' .
            'Five syllables, then seven, then five.'
        );
    }

    public static function reviewer(): Prompt
    {
        return Prompt::system(
            'You review code. Always cite specific line numbers ' .
            'and prefer questions over assertions when uncertain.'
        );
    }
}

$response = $model->chat(Personas::poet()->withUser('Tell me about autumn.'));

Because Prompt is immutable, returning a Prompt from a helper method is safe — callers can’t mutate the cached base.

Replaying history

When you have stored history (e.g. fetched from a database), rebuild the Prompt from scratch each turn:

$prompt = Prompt::system($systemMessage);
foreach ($historyFromDb as $row) {
    $prompt = match ($row['role']) {
        'user'      => $prompt->withUser($row['content']),
        'assistant' => $prompt->withAssistant($row['content']),
    };
}
$prompt = $prompt->withUser($newUserInput);

This is the canonical multi-turn-chat shape. See the multi-turn chat recipe.

Feeding `Response::answer()` back, not `text()`

When you append the assistant’s reply to the prompt for the next turn, use Response::answer() (reasoning stripped), not Response::text():

$response = $model->chat($prompt);
$prompt   = $prompt->withAssistant($response->answer());
//                                          ^^^^^^^^^
//                            not ->text(), which includes <think>…</think>

Feeding <think> blocks back as conversation history derails reasoning models — they see their own thoughts in the transcript and get confused. See Reasoning models.

See also

Chat completions — feeding a Prompt to the model.
Model::raw() — when you want full control over the prompt string instead.

Chat completions

Model::chat() is the main inference entry point. It takes a Prompt and returns a Response:

public function chat(
    \Displace\Infer\Prompt $prompt,
    int   $maxTokens   = 128,
    int   $nCtx        = 2048,
    float $temperature = 0.0,
    int   $seed        = 1234,
): \Displace\Infer\Response;

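All four tuning arguments are ordinary parameters that you pass as PHP 8 named arguments; there is no options array for them. Grammar constraints are the exception: they travel in an options argument, covered under Structured output. See Options reference for what each argument does.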

What chat() does

Three steps happen between the call and the return value:

Render. The Prompt’s messages are fed through llama_chat_apply_template, using the chat template embedded in the GGUF. Qwen3, Llama 3, Mistral, Gemma — each ships its own Jinja template inside the model file. ext-infer reads it and uses it verbatim.
Decode. The rendered prompt is tokenized, decoded through the model in a single batch, then a sampler generates output tokens one by one until either the model emits an end-of-generation token (finishReason = 'eos') or the maxTokens budget is exhausted (finishReason = 'length').
Split. If the generated text contains <think>...</think> blocks (Qwen3 / DeepSeek R1 / other reasoning models), they’re captured into Response::reasoning() and stripped from Response::answer(). See Reasoning models for the details.
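The decode step's two stopping conditions can be pictured with a toy loop. This is an illustration in plain PHP, not the extension's Rust implementation: the callable $sampleNext stands in for "run the model and sample one token", and the token IDs are invented.

```php
<?php

declare(strict_types=1);

/**
 * Toy generation loop. $sampleNext receives the tokens generated so far
 * and returns the next token ID.
 *
 * @param callable(list<int>): int $sampleNext
 * @return array{tokens: list<int>, finishReason: string}
 */
function generateTokens(callable $sampleNext, int $eosToken, int $maxTokens): array
{
    $tokens = [];
    while (count($tokens) < $maxTokens) {
        $token = $sampleNext($tokens);
        if ($token === $eosToken) {
            // The model signalled it is done; in this sketch the
            // end-of-generation token itself is not kept as output.
            return ['tokens' => $tokens, 'finishReason' => 'eos'];
        }
        $tokens[] = $token;
    }

    // Budget exhausted before the model chose to stop.
    return ['tokens' => $tokens, 'finishReason' => 'length'];
}

// A fake "model" that emits 10, 11, 12 and then EOS (token 0).
$script = [10, 11, 12, 0];
$fake   = static fn (array $soFar): int => $script[count($soFar)] ?? 0;

echo generateTokens($fake, eosToken: 0, maxTokens: 8)['finishReason'], PHP_EOL; // eos
echo generateTokens($fake, eosToken: 0, maxTokens: 2)['finishReason'], PHP_EOL; // length
```

The same scripted model yields 'eos' or 'length' depending only on the budget, which is why a 'length' result usually means the answer was cut short rather than finished.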

Inspecting a Response

Response is read-only. Six getters:

$response->text();              // string — full output, <think>…</think> + answer
$response->reasoning();         // ?string — captured <think>…</think>, or null
$response->answer();            // string — text() minus reasoning, leading WS trimmed
$response->hasReasoning();      // bool
$response->finishReason();      // string — 'eos' | 'length' | 'stop'
$response->tokensGenerated();   // int — generated tokens only, not prompt

Response is created internally — new Response() throws.

`text()` vs `answer()`

For non-reasoning models, the two are byte-identical. For reasoning models invoked through their chat template:

text():     <think>Okay so 2+2…</think>\n\n2 + 2 equals 4.
answer():   2 + 2 equals 4.
reasoning(): Okay so 2+2…

answer() is what end users want to read; reasoning() is what you’d log for debugging or display behind a “show thinking” toggle.

`finishReason()`

Three possible values:

Value	Meaning
`'eos'`	Model emitted an end-of-generation token. Output is complete.
`'length'`	`maxTokens` was hit before EOS. Output is likely truncated mid-thought.
`'stop'`	Reserved for future stop-string support. Currently only reachable when the prompt produced zero tokens (a degenerate input).

When you see 'length', surface it to the user (“hit the token budget, bump maxTokens to see more”). Silently truncating is bad UX.

`tokensGenerated()`

Counts generated tokens only, not the prompt’s tokens. Useful for billing-like accounting, latency analysis, or capping conversation length.

Calling chat()

A minimal call uses every default:

$response = $model->chat(Prompt::user('Hello!'));

A fully-specified one:

$response = $model->chat(
    Prompt::system('You are a helpful, concise assistant.')
        ->withUser('What is the capital of Antarctica?'),
    maxTokens: 256,
    nCtx: 4096,
    temperature: 0.7,
    seed: 42,
);

Sampling defaults — temperature: 0.0, seed: 1234 — give greedy, deterministic output: the same prompt always produces the same reply. Crank temperature up for varied / creative output; the seed only matters when temperature > 0.
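Why the seed is irrelevant at temperature 0.0 is easiest to see in a toy sampler. The sketch below is plain PHP over a made-up list of logits; it is not llama.cpp's sampler chain, and it reseeds on every call purely to keep the example short.

```php
<?php

declare(strict_types=1);

/**
 * Picks an index from $logits. temperature <= 0 means greedy (argmax);
 * otherwise sample from softmax(logits / temperature) using a seeded RNG.
 *
 * @param non-empty-list<float> $logits
 */
function sampleIndex(array $logits, float $temperature, int $seed): int
{
    if ($temperature <= 0.0) {
        // Greedy: the seed is never consulted.
        return array_search(max($logits), $logits, true);
    }

    // Subtract the max before exponentiating for numerical stability.
    $max     = max($logits);
    $weights = array_map(
        static fn (float $l): float => exp(($l - $max) / $temperature),
        $logits,
    );

    mt_srand($seed);
    $target = (mt_rand() / mt_getrandmax()) * array_sum($weights);

    $running = 0.0;
    foreach ($weights as $i => $w) {
        $running += $w;
        if ($target <= $running) {
            return $i;
        }
    }

    return count($logits) - 1; // guard against floating-point rounding
}

$logits = [0.1, 2.5, 1.0];
var_dump(sampleIndex($logits, 0.0, 1) === sampleIndex($logits, 0.0, 999)); // bool(true)
```

At a positive temperature the result depends on the seed, but the same seed still reproduces the same pick.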

Errors

Model::chat() raises InferenceException for any failure between “the model exists” and “we got tokens back”. The most common message strings:

Substring	Meaning
`model has been closed`	You called `chat()` after `$model->close()`. Reload the model.
`model has no embedded chat template`	The GGUF is a base model, not an instruct/chat variant. Either pick a chat-tuned model or use `Model::raw()`.
`apply_chat_template failed`	llama.cpp could not render the messages through the chat template. Usually means the message-role sequence is one the template doesn’t support (e.g. multiple system messages).
`prompt is N tokens but n_ctx is only M`	The rendered prompt is longer than `nCtx`. Bump `nCtx` or shorten the prompt.
`chat message contains a null byte`	A `Prompt`’s content has an embedded `\0`. Strip it before constructing the prompt.
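If you want to branch on these failures, matching on the substrings from the table is one option. The sketch below works on the message string alone so it runs without the extension; the category names are invented, and it assumes N and M appear as plain decimal integers in the real message.

```php
<?php

declare(strict_types=1);

/**
 * Maps an InferenceException message to a coarse category by looking
 * for the substrings listed in the table above.
 */
function classifyChatError(string $message): string
{
    return match (true) {
        str_contains($message, 'model has been closed') => 'closed',
        str_contains($message, 'model has no embedded chat template') => 'no-template',
        str_contains($message, 'apply_chat_template failed') => 'bad-role-sequence',
        // Assumption: the token counts are rendered as decimal integers.
        preg_match('/prompt is \d+ tokens but n_ctx is only \d+/', $message) === 1 => 'context-overflow',
        str_contains($message, 'chat message contains a null byte') => 'null-byte',
        default => 'unknown',
    };
}

echo classifyChatError('prompt is 5000 tokens but n_ctx is only 2048'), PHP_EOL; // context-overflow
```

In application code the input would come from $e->getMessage() inside a catch block for InferenceException. Message text can change between releases, so treat 'unknown' as a normal outcome.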

Chat-template errors

If you load a model that doesn’t ship a chat template — typically a “base” or “pretrained” model rather than an instruct variant — you’ll see:

InferenceException: model has no embedded chat template — use
Model::raw() for this model: …

Model::raw() lets you do your own templating. See Raw completions.

Streaming

chat() is currently synchronous: it returns the complete Response once decoding finishes. A streaming variant (likely Model::chatStream(): \Generator) is on the roadmap.

For long generations under a request/response model where blocking is unacceptable, the workable shortcut today is to set a tight maxTokens and call chat() repeatedly with the previous turn appended to the Prompt. That sacrifices KV-cache reuse but works.

See also

Raw completions — the escape hatch for templates the model didn’t bake in.
Choosing a model — chat-tuned vs base, quantization, size.
Multi-turn chat recipe — the immutable-Prompt accumulation pattern.
Reasoning models recipe — making reasoning() / answer() work for you.

Raw completions

Model::raw() is the escape hatch for callers who want full control over the prompt string instead of going through Prompt + Model::chat().

public function raw(
    string $prompt,
    int    $maxTokens   = 128,
    int    $nCtx        = 2048,
    float  $temperature = 0.0,
    int    $seed        = 1234,
    bool   $addBos      = true,
): string;

Returns a plain string — no Response wrapper, no reasoning split. If you want either of those, use chat().

When to use raw()

Three legitimate use cases:

1. Models without a chat template

“Base” / “pretrained” / “foundation” models — Llama 3 base, Mistral base, Qwen base — ship GGUFs that haven’t been instruction-tuned and have no embedded chat template. Model::chat() rejects them with:

InferenceException: model has no embedded chat template — use
Model::raw() for this model

For these models, raw() is the only path. Build prompts in whatever shape the model expects — typically just free-form text-continuation:

$text = $model->raw(
    "The capital of France is",
    maxTokens: 8,
    temperature: 0.0,
);
// " Paris."

2. Custom chat templates

Maybe the model’s embedded chat template doesn’t match what you want — e.g. you want to add a tool-result message that the embedded template doesn’t know about, or you’re injecting RAG context in a non-standard slot. Build the prompt string yourself:

$prompt = <<<TXT
<|im_start|>system
You are a calculator. Only emit JSON: {"result": <number>}.
<|im_end|>
<|im_start|>user
What is 2+2?
<|im_end|>
<|im_start|>assistant

TXT;

$text = $model->raw($prompt, maxTokens: 32, temperature: 0.0);
// '{"result": 4}'

The trade-off: you own template correctness. The chat template that chat() uses is the one the model author tested with; hand-rolling means hand-checking.

3. Stop-sequence simulation

Stop-string support is on the roadmap but not in v0.1. If you need a generation to halt at a specific marker, raw() plus post-processing is the workaround:

$text = $model->raw($promptEndingWithMarker, maxTokens: 256);
// Compare against false explicitly: a marker at offset 0 is a valid match.
$pos  = strpos($text, '</answer>');
$text = $pos === false ? $text : substr($text, 0, $pos);
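When more than one marker should end the output, the same post-processing idea generalizes to "cut at whichever marker appears first". A minimal sketch in plain PHP follows; truncateAtStop is a hypothetical helper, not part of the extension.

```php
<?php

declare(strict_types=1);

/**
 * Cuts $text at the earliest occurrence of any stop marker.
 * Returns $text unchanged when no marker is present.
 *
 * @param list<string> $stops marker strings; empty strings are ignored
 */
function truncateAtStop(string $text, array $stops): string
{
    $cut = strlen($text);
    foreach ($stops as $stop) {
        if ($stop === '') {
            continue; // an empty needle would match at offset 0
        }
        $pos = strpos($text, $stop);
        if ($pos !== false && $pos < $cut) {
            $cut = $pos;
        }
    }

    return substr($text, 0, $cut);
}

echo truncateAtStop('42</answer> and then some rambling', ['</answer>', "\n\n"]), PHP_EOL; // 42
```

Note that this trims the string after generation has finished; the tokens past the marker were still generated and still count against maxTokens.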

What raw() does NOT do

No reasoning split. raw() returns a string, not a Response. If the model emits <think>…</think> blocks, they end up in your string verbatim. You can strip them yourself with a regex if it matters; the canonical case for that is Reasoning models.
No chat-template rendering. What you pass in is what gets tokenized.
No finish-reason or token-count metadata. If you need those, use chat().
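If you do need the reasoning split on raw() output, a regex gets close. The sketch below is an approximation written for this guide, not the logic Response uses internally: it trims whitespace on both sides of the answer and does not handle an unterminated <think> block from a truncated generation.

```php
<?php

declare(strict_types=1);

/**
 * Splits <think>…</think> blocks out of raw model output.
 *
 * @return array{reasoning: ?string, answer: string}
 */
function splitThink(string $text): array
{
    // Non-greedy match with /s so blocks may span multiple lines.
    $found = preg_match_all('~<think>(.*?)</think>~s', $text, $matches);

    $reasoning = $found > 0
        ? trim(implode("\n", $matches[1]))
        : null;

    // Remove the blocks, then trim the whitespace they leave behind.
    $answer = trim(preg_replace('~<think>.*?</think>~s', '', $text) ?? $text);

    return ['reasoning' => $reasoning, 'answer' => $answer];
}

$out = splitThink("<think>Okay so 2+2…</think>\n\n2 + 2 equals 4.");
echo $out['answer'], PHP_EOL;    // 2 + 2 equals 4.
echo $out['reasoning'], PHP_EOL; // Okay so 2+2…
```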

addBos

The addBos: true default tells the tokenizer to prepend the model’s beginning-of-sequence token (whatever it is for that family). For most models that’s right. Set addBos: false when:

You’re building a prompt that already starts with the BOS token explicitly (rare).
The model’s tokenizer rejects BOS prepending (also rare).
You’re feeding raw() mid-conversation and don’t want a new BOS in the middle (very rare and probably a sign you should be using a Prompt).

The other named arguments — maxTokens, nCtx, temperature, seed — behave the same as in chat(). See Options reference.

When NOT to use raw()

If the model has a chat template and you’re sending the “system / user / assistant” shape: use chat(). It’s shorter, safer, and the Response it returns gives you reasoning splitting + metadata for free.

raw() exists so escape hatches don’t require dropping the extension entirely. Treat it as the lower-level layer, not the default.

See also

Chat completions — the higher-level surface most code should use.
Options reference — every argument explained side by side.

Structured output

Free-text generation is the wrong tool the moment your code needs to parse what the model says. Grammar-constrained generation flips the guarantee: instead of asking nicely for JSON and validating after the fact, the sampler is constrained so that every token the model can emit keeps the output inside your schema. Malformed output isn’t retried — it’s impossible.

This is the reliability unlock for small local models: a 0.6B model that rambles in free text becomes a dependable structured extractor when it physically cannot produce anything but the requested shape.

JSON Schema (the common case)

Pass a JSON Schema via the schema option — as a PHP array or a JSON string — to chat() or raw():

use Displace\Infer\Model;
use Displace\Infer\Prompt;

$model = Model::load('models/Qwen3-0.6B-Q8_0.gguf');

$response = $model->chat(
    Prompt::system('Extract the data. Output JSON only.')
        ->withUser('Maria is 31 years old and lives in Lisbon.'),
    maxTokens: 128,
    options: ['schema' => [
        'type' => 'object',
        'properties' => [
            'name' => ['type' => 'string'],
            'age'  => ['type' => 'integer'],
            'city' => ['type' => 'string'],
        ],
    ]],
);

$data = json_decode($response->answer(), true, flags: JSON_THROW_ON_ERROR);
// ['name' => 'Maria', 'age' => 31, 'city' => 'Lisbon'] — guaranteed shape

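json_decode cannot fail here as long as generation finishes before maxTokens runs out: the grammar permits only valid JSON matching the schema, with properties in declaration order. A run cut off by maxTokens is the one case where the document can be incomplete; see the finishReason() note under “How it interacts with everything else”.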

Classification is one enum away:

$sentiment = json_decode($model->raw(
    "Sentiment of: I love this!\n",
    options: ['schema' => ['enum' => ['positive', 'negative', 'neutral']]],
));

The supported schema subset

The converter is strict by design: a keyword it doesn’t implement throws InferException naming the keyword. Silently ignoring a constraint would hand you output that looks validated but isn’t.

Supported	Notes
`type: object` + `properties`	All properties are generated, in declaration order. `required`, when present, must list every property.
`type: array` + `items`	`minItems` of `0` (default) or `1`.
`type: string` / `integer` / `number` / `boolean` / `null`	JSON-strict lexical forms.
`enum` / `const`	Strings, numbers, booleans, null.
`anyOf` / `oneOf`	Compiled as alternation — `['anyOf' => [['type' => 'string'], ['type' => 'null']]]` is the nullable-field idiom.
`type: ["string", "null"]`	Multi-type shorthand, same alternation.

Notably not supported (throws): $ref/$defs, optional properties (a required list that’s a proper subset), pattern, minLength/maxLength, numeric ranges, additionalProperties: true free-form objects, minItems > 1. Annotation-only keywords (title, description, default, examples) are accepted and ignored.
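A schema can be checked against this list before it ever reaches the model. The following is a rough pre-flight sketch in plain PHP, not the extension's converter: the function name is invented, and mapping “numeric ranges” to minimum / maximum / exclusiveMinimum / exclusiveMaximum is this example's interpretation.

```php
<?php

declare(strict_types=1);

/**
 * Walks a JSON Schema (as a PHP array) and collects keywords that the
 * list above says are rejected. Paths are dotted for readability.
 *
 * @param array<string, mixed> $schema
 * @return list<string> e.g. ['properties.name.pattern']
 */
function findUnsupportedKeywords(array $schema, string $path = ''): array
{
    $rejected = [
        '$ref', '$defs', 'pattern', 'minLength', 'maxLength',
        'minimum', 'maximum', 'exclusiveMinimum', 'exclusiveMaximum',
    ];
    $prefix = $path === '' ? '' : $path . '.';
    $found  = [];

    foreach ($rejected as $keyword) {
        if (array_key_exists($keyword, $schema)) {
            $found[] = $prefix . $keyword;
        }
    }
    if (($schema['additionalProperties'] ?? false) === true) {
        $found[] = $prefix . 'additionalProperties';
    }
    if (($schema['minItems'] ?? 0) > 1) {
        $found[] = $prefix . 'minItems';
    }

    $properties = is_array($schema['properties'] ?? null) ? $schema['properties'] : [];
    if (is_array($schema['required'] ?? null)
        && array_diff(array_keys($properties), $schema['required']) !== []) {
        $found[] = $prefix . 'required'; // proper subset means optional properties
    }

    // Recurse into sub-schemas only, never into property names themselves.
    $children = [];
    foreach ($properties as $name => $sub) {
        $children["{$prefix}properties.{$name}"] = $sub;
    }
    $children[$prefix . 'items'] = $schema['items'] ?? null;
    foreach (['anyOf', 'oneOf'] as $combinator) {
        foreach ((array) ($schema[$combinator] ?? []) as $i => $sub) {
            $children["{$prefix}{$combinator}.{$i}"] = $sub;
        }
    }
    foreach ($children as $childPath => $sub) {
        if (is_array($sub)) {
            $found = [...$found, ...findUnsupportedKeywords($sub, (string) $childPath)];
        }
    }

    return $found;
}

print_r(findUnsupportedKeywords([
    'type'       => 'object',
    'properties' => ['name' => ['type' => 'string', 'pattern' => '^A']],
]));
```

An empty result does not guarantee the extension accepts the schema; the InferException raised at call time remains the authority.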

Raw GBNF (full control)

For shapes JSON Schema can’t express, hand llama.cpp a GBNF grammar directly:

$verdict = $model->raw(
    'Is the sky blue? Answer: ',
    maxTokens: 8,
    options: ['grammar' => 'root ::= "yes" | "no"'],
);
// $verdict is exactly "yes" or "no" — nothing else can be sampled

The grammar’s start rule must be named root. grammar and schema are mutually exclusive; passing both throws.
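For simple pick-one grammars, the GBNF text can be generated from a PHP list instead of written by hand. This sketch assumes that GBNF string literals accept backslash escapes for the double quote and the backslash; the helper name gbnfAlternation is made up, and choices containing newlines or other control characters are not handled.

```php
<?php

declare(strict_types=1);

/**
 * Builds a GBNF grammar whose root rule accepts exactly one of the
 * given literal strings.
 *
 * @param non-empty-list<string> $choices
 */
function gbnfAlternation(array $choices): string
{
    if ($choices === []) {
        throw new InvalidArgumentException('at least one choice is required');
    }

    $literals = array_map(
        // Assumption: \\ and \" are the escapes GBNF expects inside literals.
        static fn (string $c): string => '"' . strtr($c, ['\\' => '\\\\', '"' => '\\"']) . '"',
        $choices,
    );

    return 'root ::= ' . implode(' | ', $literals);
}

echo gbnfAlternation(['yes', 'no']), PHP_EOL; // root ::= "yes" | "no"
```

The returned string is what would go into options: ['grammar' => …] in the raw() call shown above.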

How it interacts with everything else

Reasoning models: the grammar applies from the first generated token, so a Qwen3-style model is prevented from opening a <think> block and must start emitting the constrained shape immediately. For extraction tasks that’s what you want.
temperature: works as usual; sampling happens over the tokens the grammar allows. 0.0 (greedy) is the right default for extraction.
finishReason(): once the grammar’s root rule is fully matched, only end-of-generation remains legal, so completed constrained runs report 'eos'. A run that hits maxTokens mid-structure reports 'length' and the output is a truncated (invalid) document, so size maxTokens generously.
Quality: the grammar guarantees shape, not truth. A model too small for the task will fill your schema with confident nonsense. The schema is the seatbelt, not the driver.

Errors

Condition	Exception
Unsupported schema keyword	`InferException`, names the keyword
Schema not valid JSON / not array-or-string	`InferException`
`grammar` + `schema` together	`InferException`
GBNF llama.cpp can’t parse	`InferException`

See the Options reference for the full option tables.

Embeddings

Model::embed() turns a piece of text into a fixed-length vector of floats. Cosine similarity between two such vectors approximates semantic similarity between the texts they came from — that’s the foundation of every semantic-search / RAG pipeline.

public function embed(string $text): \Displace\Infer\Embedding;
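For intuition about what cosine similarity measures, here is the formula on plain PHP float lists. It is a standalone sketch and does not use the Embedding class; the function name is chosen for this example.

```php
<?php

declare(strict_types=1);

/**
 * Cosine similarity of two equal-length float lists:
 * dot(a, b) / (|a| * |b|), in the range [-1, 1].
 *
 * @param list<float> $a
 * @param list<float> $b
 */
function cosineSimilarity(array $a, array $b): float
{
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('vectors must be non-empty and equal length');
    }

    $dot = $normA = $normB = 0.0;
    foreach ($a as $i => $x) {
        $dot   += $x * $b[$i];
        $normA += $x * $x;
        $normB += $b[$i] * $b[$i];
    }
    if ($normA === 0.0 || $normB === 0.0) {
        throw new InvalidArgumentException('a zero vector has no direction');
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

echo cosineSimilarity([1.0, 0.0], [1.0, 0.0]), PHP_EOL; // 1
echo cosineSimilarity([1.0, 0.0], [0.0, 1.0]), PHP_EOL; // 0
```

Vectors pointing the same way score 1, orthogonal vectors score 0, and length plays no part, which is why the measure suits comparing texts of different sizes.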

Enable embedding mode at load time

Embedding generation requires a context built with with_embeddings(true) under the hood. Because that conflicts with generation mode for a given context, ext-infer makes the choice explicit at load:

use Displace\Infer\Model;

$model = Model::load('models/embedding-model.gguf', [
    'embedding' => true,
]);

With embedding: true, embed() works. Without it, embed() throws:

InferenceException: Model::embed() requires loading with ['embedding' => true]

chat() and raw() still work on an embedding-loaded handle — they build their own per-call context for generation. So one handle can do both, but you opt in to embed() explicitly.

Pooling

Sentence embeddings need a way to collapse the per-token hidden states into a single vector. Different model families do this differently:

Pooling	Used by
`mean`	GTE, E5 — average across tokens
`cls`	original BERT, BGE — uses the `[CLS]` token’s hidden state
`last`	Qwen3-Embedding — uses the last token’s hidden state
`rank`	rerankers — emits a single score, not a vector
`none`	per-token vectors, no pooling
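To make the three vector-producing modes concrete, here is a plain-PHP illustration over a toy list of per-token hidden states. It is only a sketch of the arithmetic: the real pooling happens inside llama.cpp, pool() is a hypothetical name, and padding and attention masks are ignored.

```php
<?php
declare(strict_types=1);

/**
 * Collapse per-token hidden states into one sentence vector.
 *
 * @param list<list<float>> $tokenStates one vector per token, in order
 * @param string            $mode        'mean', 'cls', or 'last'
 * @return list<float>
 */
function pool(array $tokenStates, string $mode): array
{
    if ($tokenStates === []) {
        throw new InvalidArgumentException('need at least one token');
    }

    switch ($mode) {
        case 'cls':
            // [CLS] is the first token in BERT-style inputs.
            return $tokenStates[0];
        case 'last':
            return $tokenStates[count($tokenStates) - 1];
        case 'mean':
            $dims = count($tokenStates[0]);
            $sum  = array_fill(0, $dims, 0.0);
            foreach ($tokenStates as $state) {
                foreach ($state as $i => $x) {
                    $sum[$i] += $x;
                }
            }
            $n = count($tokenStates);
            return array_map(fn (float $s): float => $s / $n, $sum);
        default:
            throw new InvalidArgumentException("unsupported pooling: {$mode}");
    }
}

// Three tokens, two dimensions each.
$states = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]];
```

With these toy states, 'cls' picks the first row, 'last' the final row, and 'mean' averages each column.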

Modern embedding GGUFs declare their pooling type in metadata. ext-infer’s default is 'unspecified' (trust the metadata):

$model = Model::load($path, ['embedding' => true]);
// pooling: whatever the GGUF says (almost always correct)

Override if a GGUF ships without the metadata or you want to experiment:

$model = Model::load($path, [
    'embedding' => true,
    'pooling'   => 'mean',   // 'unspecified' | 'none' | 'mean' | 'cls' | 'last' | 'rank'
]);

An unknown pooling string is rejected at load time, not at first embed() call:

InferException: invalid option pooling: expected one of
unspecified/none/mean/cls/last/rank, got "weighted"

Generating embeddings

$emb = $model->embed('The cat sat on the mat.');

$emb->vector();        // list<float> — length matches the model's n_embd
$emb->dimensions();    // int — same as count($emb->vector())

Vectors are returned as PHP arrays of floats (doubles); internally we hold Vec<f32> and let ext-php-rs convert f32 → f64 at the boundary, which is lossless.

Vector math, built in

Embedding carries the math you need most of the time so you don’t have to write a numpy-equivalent in PHP:

$emb->norm();              // float — L2 norm: sqrt(sum_i x_i^2)
$emb->normalize();         // new Embedding scaled to unit length
$a->cosineSimilarity($b);  // float in [-1, 1]

normalize() returns a new Embedding — the original is not modified. This matters for caching: cache the normalized form once, then every subsequent cosineSimilarity call is just a dot product.

cosineSimilarity() throws on a dimension mismatch:

InferenceException: cannot compare embeddings of different
dimensions: 1024 vs 384

That’s deliberate — comparing across model families is almost always a bug, and silently returning a number would hide it.

Packed output for vector indexes

packed() returns the vector as a packed little-endian float32 binary string — byte-identical to pack('g*', ...$emb->vector()) and the format every Displace vector API speaks (ext-turbovec indexes, the ai-contracts Embedder interface):

// embed → index, no PHP float arrays anywhere in between:
$index->addWithIds($model->embed($text)->normalize()->packed(), [$id]);

The bytes are produced straight from the float32 vector held on the Rust side, so coordinates are never inflated into PHP values — at 1024 dimensions that’s the difference between one 4KB string and a thousand zvals per document. Prefer packed() over vector() whenever the destination wants bytes.
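The byte layout can be checked with plain PHP, no extension required. In this sketch $vector is a stand-in for $emb->vector(), with values chosen so they are exactly representable in float32:

```php
<?php
declare(strict_types=1);

// Stand-in for $emb->vector(); every value is exact in float32.
$vector = [0.5, -1.25, 2.0, 0.0];

// 'g' is PHP's pack code for a little-endian 32-bit float.
$packed = pack('g*', ...$vector);

// Four bytes per dimension.
$byteLength = strlen($packed);

// unpack() returns a 1-indexed array; array_values() restores list shape.
$restored = array_values(unpack('g*', $packed));
```

The same unpack('g*', ...) call is how a consumer would turn packed bytes back into PHP floats when it needs them.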

Why normalize before comparing?

Cosine similarity ignores magnitude — it compares direction. If either vector has magnitude zero, the answer is undefined; we return 0.0 rather than NaN. If both are non-zero, cosineSimilarity does the right thing on un-normalized vectors too. But:

For a fixed corpus you query against, normalizing once is cheap and makes the inner loop a single dot product: array_sum(array_map(fn($x, $y) => $x * $y, $a, $b)).
For pgvector / sqlite-vec storage, you usually want normalized vectors stored so the database can use the inner-product operator (<#> in pgvector) instead of the cosine operator (<=>).
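The relationship between the two forms can be verified in plain PHP. This is a reference sketch, not the extension's implementation; dot(), unit(), and cosine() are hypothetical helpers that work on arrays such as those returned by vector().

```php
<?php
declare(strict_types=1);

/** @param list<float> $a @param list<float> $b */
function dot(array $a, array $b): float
{
    return (float) array_sum(array_map(fn (float $x, float $y): float => $x * $y, $a, $b));
}

/** @param list<float> $v @return list<float> */
function unit(array $v): array
{
    $norm = sqrt(dot($v, $v));
    // A zero vector has no direction; return it unchanged rather than divide by zero.
    return $norm === 0.0 ? $v : array_map(fn (float $x): float => $x / $norm, $v);
}

/** @param list<float> $a @param list<float> $b */
function cosine(array $a, array $b): float
{
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('dimension mismatch');
    }
    $denominator = sqrt(dot($a, $a)) * sqrt(dot($b, $b));
    // Mirror the documented behavior: 0.0 instead of NaN for zero-magnitude input.
    return $denominator === 0.0 ? 0.0 : dot($a, $b) / $denominator;
}

$a = [3.0, 4.0];
$b = [4.0, 3.0];
```

On normalized inputs, dot(unit($a), unit($b)) and cosine($a, $b) agree up to floating-point rounding, which is why storing unit vectors lets the inner loop skip the norms.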

A canonical pipeline:

$query = $model->embed($userQuestion)->normalize();
$best  = null;
$bestScore = -INF;
foreach ($corpusEmbeddings as $docId => $docEmb) {
    // $docEmb is also pre-normalized
    $score = $query->cosineSimilarity($docEmb);
    if ($score > $bestScore) {
        $best = $docId;
        $bestScore = $score;
    }
}

For real-world indexing — even at a few thousand documents — push the storage into a database. See Semantic search and RAG over markdown.

Choosing an embedding model

The chat-tuned models people download for completions (Qwen3-0.6B, Llama 3.2 3B, Mistral 7B) can be loaded with embedding: true and will return a vector — but it’s not what they were trained for, and similarity numbers are noisier than what a purpose-built embedding model produces.

Model family	Dims	Notes
Qwen3-Embedding (0.6B)	1024	Apache-2.0. Same architecture as Qwen3-0.6B, retrained for embeddings. Strong default.
BGE-small / BGE-large	384 / 1024	Beijing Academy of AI. Widely used, CLS pooling.
E5-small / E5-large	384 / 1024	Microsoft. Trained on text similarity tasks.
GTE-small / GTE-large	384 / 1024	Alibaba.

See Choosing a model for more on GGUF quants and what size to start with.

Semantic search recipe — embed a corpus, query, sort by similarity.
RAG over markdown — semantic search feeding into Model::chat().
Choosing a model — chat vs embedding, sizes, formats.

Reranking

Embedding similarity is fast but coarse: it compares a query against each document separately, through the bottleneck of two pooled vectors. A reranker reads the query and a candidate document together and scores their actual relevance — far more accurate, far more expensive. The canonical two-stage pipeline plays them to their strengths:

Recall — vector search over the whole corpus returns ~20–50 candidates in microseconds.
Precision — the reranker scores only that short list and reorders it.

RerankModel

RerankModel targets the Qwen3-Reranker GGUF family (0.6B / 4B / 8B):

use Displace\Infer\RerankModel;

$reranker = RerankModel::load('models/Qwen3-Reranker-0.6B-Q8_0.gguf');

// Score one pair — a calibrated probability in (0, 1).
$score = $reranker->score(
    'how do I reset my password?',
    'To reset your password, open Settings > Security > Reset password.',
);  // ≈ 0.999

// Rank a candidate list — best-first ['index' => int, 'score' => float]
// rows, where index points back into your input array.
$rows = $reranker->rank($query, $candidateTexts, topK: 5);

foreach ($rows as $row) {
    printf("%.3f  %s\n", $row['score'], $candidateTexts[$row['index']]);
}

rank()’s row shape is deliberately identical to Displace\AI\Contracts\Reranker::rerank(), so wrapping it in the framework-facing contract is a pass-through.

How scoring works

Qwen3-Reranker is a causal LM fine-tuned to answer a fixed yes/no judgment prompt. ext-infer renders the model card’s template around each (query, document) pair, decodes it, and reads the logits of the single next token: the score is the binary softmax P("yes") / (P("yes") + P("no")).

Because the score is a calibrated probability rather than an arbitrary similarity, thresholding works: “drop everything under 0.3” is a meaningful, corpus-independent filter — something cosine scores can’t give you.
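The arithmetic is small enough to sketch in plain PHP. ext-infer does this internally; yesProbability() and aboveThreshold() are hypothetical helpers, and the two logits are made-up inputs rather than values read from a model.

```php
<?php
declare(strict_types=1);

/**
 * Two-way softmax over the "yes" and "no" logits.
 * Algebraically equal to exp(yes) / (exp(yes) + exp(no)), written as a
 * sigmoid of the difference so large logits never produce INF / INF.
 */
function yesProbability(float $yesLogit, float $noLogit): float
{
    return 1.0 / (1.0 + exp($noLogit - $yesLogit));
}

/**
 * Keep only rows at or above a relevance threshold.
 *
 * @param list<array{index: int, score: float}> $rows rank()-shaped rows
 * @return list<array{index: int, score: float}>
 */
function aboveThreshold(array $rows, float $threshold): array
{
    return array_values(array_filter(
        $rows,
        fn (array $row): bool => $row['score'] >= $threshold,
    ));
}
```

Equal logits give exactly 0.5, and the filter accepts rows in the shape rank() returns, so a cut-off like 0.3 can be applied directly to its output.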

The instruction option

The judgment prompt embeds a task instruction. The default is the generic one the model was trained with (“Given a web search query, retrieve relevant passages that answer the query”); tailoring it to your corpus measurably improves separation:

$reranker = RerankModel::load($path, [
    'instruction' => 'Given a customer support question, retrieve KB articles that resolve it',
]);

Sizing and budgets

Every score() / rank() pair renders template + query + document and must fit in n_ctx (default 4096; raise at load time for long documents, or chunk them — the same chunking you indexed with).
Cost scales linearly with candidates: reranking 20 candidates is ~20 forward passes. Keep stage-1 k in the tens, not hundreds.
The 0.6B reranker is the sweet spot for CPU latency; the same code loads the 4B/8B GGUFs when accuracy is worth the milliseconds.

A complete two-stage pipeline

// Stage 1: recall — ext-turbovec over packed embeddings.
$candidateRows = $index->search($embedder->embed($query)->packed(), k: 20);
$candidates = array_map(fn (array $r): string => $texts[$r['id']], iterator_to_array($candidateRows));

// Stage 2: precision — rerank the short list, keep the best 5.
$best = $reranker->rank($query, $candidates, topK: 5);

Errors

Condition	Exception
Model file missing/unreadable	`ModelLoadException`
Vocabulary can’t express single-token yes/no	`ModelLoadException` (not a Qwen3-Reranker-family GGUF)
Pair overflows `n_ctx`	`InferenceException`, suggests raising `n_ctx` or chunking
`topK < 1`	`InferException`
Use after `close()`	`InferenceException`

Choosing a model

ext-infer loads any GGUF file llama.cpp can handle. Picking which GGUF is the most important choice you’ll make — it dominates inference quality, memory footprint, and latency. This page is a tour of the landscape.

What is GGUF?

GGUF (GPT-Generated Unified Format) is llama.cpp’s native model format. A .gguf file packs:

Weights in a specific quantization.
Tokenizer (vocabulary + merges).
Architecture metadata (layer count, hidden size, attention config) so llama.cpp knows how to run the model without a separate config file.
Chat template (for instruct models) so Model::chat() knows how to render messages.
Pooling type (for embedding models) so Model::embed() knows how to collapse hidden states.

GGUF files self-contain everything ext-infer needs. There is no separate config / tokenizer / vocab file to manage.

Model families

There are three broad categories you’ll encounter:

Category	What `ext-infer` method to use	Examples
Base / pretrained	`raw()` only	Llama 3 base, Mistral 7B base, Qwen base
Chat / instruct	`chat()`, `raw()`	Qwen3-Instruct, Llama 3.x Instruct, Mistral Instruct
Embedding / reranker	`embed()`	Qwen3-Embedding, BGE, E5, GTE

A chat model loaded with 'embedding' => true will return a vector, but it’s not what the model was optimized for — the vectors are noisier than what a purpose-built embedding model produces. The reverse (chat() against a pure embedding GGUF) usually fails because embedding-only models don’t ship a chat template.

Quantization

A 7B-parameter model at 16-bit precision (F16) is ~14 GB on disk. Quantization trades a small amount of quality for a much smaller, faster file. The suffixes you’ll see in GGUF filenames:

Suffix	Approx. size for a 7B model	Quality	Notes
`F16`	~14 GB	Lossless	Reference. Rarely worth the size unless you have plenty of memory.
`Q8_0`	~7 GB	Near-lossless	Good default when you can afford the disk.
`Q6_K`	~5.5 GB	Excellent
`Q5_K_M`	~5 GB	Very good
`Q4_K_M`	~4.5 GB	Good	The most popular size/quality compromise.
`Q4_K_S`	~4 GB	Solid
`Q3_K_M`	~3.5 GB	Noticeable degradation	Useful on memory-constrained boxes.
`Q2_K`	~2.5 GB	Significant degradation	Last resort.

The K-family (Q4_K_M, etc.) uses k-quants, a smarter scheme than the legacy non-K variants (Q4_0, Q4_1). Prefer K-quants when both are offered for the same model.

Picking a quant

Two questions:

How much memory can you spend? Quants below Q4_K_M save space at increasing quality cost. Above Q4_K_M, the marginal gain per GB shrinks fast.
Is the model small enough that quantization barely matters? For sub-1B models like Qwen3-0.6B, even Q8_0 is ~640 MB — negligible by 2026 standards. Take the quality bump.

A good default rule: Q4_K_M for models > 3B, Q8_0 for smaller models.
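The rule and the size table can be turned into a quick estimator. This sketch assumes file size scales linearly with parameter count, which is only a rough approximation; the 7B reference sizes are copied from the table above, and defaultQuant() and estimateSizeGb() are hypothetical helper names.

```php
<?php
declare(strict_types=1);

// Approximate sizes in GB for a 7B model, copied from the quantization table.
const SIZE_AT_7B_GB = [
    'F16' => 14.0, 'Q8_0' => 7.0, 'Q6_K' => 5.5, 'Q5_K_M' => 5.0,
    'Q4_K_M' => 4.5, 'Q4_K_S' => 4.0, 'Q3_K_M' => 3.5, 'Q2_K' => 2.5,
];

/** The default rule: Q4_K_M above 3B parameters, Q8_0 otherwise. */
function defaultQuant(float $paramsBillions): string
{
    return $paramsBillions > 3.0 ? 'Q4_K_M' : 'Q8_0';
}

/** Rough file size in GB, assuming size scales linearly with parameter count. */
function estimateSizeGb(string $quant, float $paramsBillions): float
{
    if (!array_key_exists($quant, SIZE_AT_7B_GB)) {
        throw new InvalidArgumentException("unknown quant: {$quant}");
    }
    return SIZE_AT_7B_GB[$quant] * $paramsBillions / 7.0;
}
```

For a 0.6B model at Q8_0 the estimate comes out near 0.6 GB, in line with the ~640 MB figure quoted for Qwen3-0.6B.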

Recommended starting points

What we’ve actually tested against:

Chat (smallest reasonable)

Qwen/Qwen3-0.6B-GGUF — Apache-2.0, 600M params, Q8 ≈ 640 MB. Reasoning model: emits <think>…</think> blocks through its chat template, which Response splits for you. Great for getting started; not great for production-quality answers.

curl -L -o models/Qwen3-0.6B-Q8_0.gguf \
    https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q8_0.gguf

Chat (production-ish)

bartowski/Qwen3-7B-Instruct-GGUF at Q4_K_M (~4.4 GB) — same family, much better reasoning quality. Or bartowski/Llama-3.2-3B-Instruct-GGUF at Q4_K_M (~1.9 GB) for a smaller, non-reasoning option.

Embedding (small, fast)

Qwen/Qwen3-Embedding-0.6B-GGUF — Apache-2.0, 1024-dim embeddings, last pooling baked into metadata. Same size as the chat model; quality is competitive with BGE/E5 small variants.

curl -L -o models/Qwen3-Embedding-0.6B-Q8_0.gguf \
    https://huggingface.co/Qwen/Qwen3-Embedding-0.6B-GGUF/resolve/main/Qwen3-Embedding-0.6B-Q8_0.gguf

Alternative: CompendiumLabs/bge-small-en-v1.5-gguf — 384-dim, CLS pooling, ~130 MB. Lower-quality vectors but tiny.

Where to look for more

Hugging Face GGUF tag — library=gguf filters to GGUF-format models.
bartowski — prolific publisher of quantized GGUFs for popular models. Reliable, consistent naming.
mradermacher — ditto.
The model’s own official GGUF repo when one exists (e.g. Qwen/Qwen3-7B-Instruct-GGUF) — always the most trusted source.

License caveats

GGUF files inherit the underlying model’s license. Some models that are nominally “open” (Llama 3.x, Gemma) ship under custom licenses with use restrictions; others (Qwen, Mistral, several smaller players) are Apache-2.0 / MIT. Check the model card before depending on a model in a commercial deployment.

ext-infer itself is MIT-licensed — the extension doesn’t care which GGUF you load, but downstream concerns are on you.

Embeddings — when you’ve picked an embedding model.
Chat completions — when you’ve picked a chat model.
Performance tuning — n_gpu_layers, mmap, mlock for the model you ended up with.

Options reference

Every option that any ext-infer method accepts, in one table per method. For conceptual context on individual options, follow the links in the rightmost column.

`Model::load($path, $options)`

The second argument is an associative array. Keys are snake_case strings (like PHP ini settings) because load-time tuning is rare and the array form composes well with config arrays loaded from disk.

Key	Type	Default	See
`n_gpu_layers`	`int`	`0`	Performance tuning
`use_mmap`	`bool`	`true`	Performance tuning
`use_mlock`	`bool`	`false`	Performance tuning
`embedding`	`bool`	`false`	Embeddings
`pooling`	`string`	`'unspecified'`	Embeddings

Validation rules

Unknown keys are not rejected; they are silently ignored. This is deliberate (forward compatibility for callers loading config from files), but it means a misspelled key has no effect and raises no error. If an option seems to do nothing, check the key's spelling against the table above before reporting a bug.
Type mismatches are rejected, with a clear message: invalid option n_gpu_layers: expected integer.
Negative integers and out-of-range values for n_gpu_layers are rejected: invalid option n_gpu_layers: must be non-negative.
pooling accepts only the six strings listed in Embeddings → Pooling.
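Because unknown keys pass silently, a small pre-flight check in your own code can catch typos before they reach Model::load(). A sketch: unknownLoadKeys() is a hypothetical helper, and the key list is copied from the table above, so it needs updating if the extension gains options.

```php
<?php
declare(strict_types=1);

// Keys documented for Model::load() in the table above.
const MODEL_LOAD_KEYS = ['n_gpu_layers', 'use_mmap', 'use_mlock', 'embedding', 'pooling'];

/**
 * Return the option keys that Model::load() would silently ignore.
 *
 * @param array<string, mixed> $options
 * @return list<string>
 */
function unknownLoadKeys(array $options): array
{
    return array_values(array_diff(array_keys($options), MODEL_LOAD_KEYS));
}

$options = ['embedding' => true, 'n_gpu_layer' => 32]; // typo: missing "s"
$unknown = unknownLoadKeys($options);
```

Logging or throwing when $unknown is non-empty keeps the forward-compatible behavior in the extension while making typos loud in your application.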

`Model::chat($prompt, ...)`

Named arguments, no array. PHP 8.0+ named-argument syntax uses the parameter name verbatim, so you write maxTokens: 256 (camelCase, matching the method names).

Argument	Type	Default	See
`$prompt`	`\Displace\Infer\Prompt`	required	Prompts
`maxTokens`	`int`	`128`	Chat completions
`nCtx`	`int`	`2048`	Chat completions
`temperature`	`float`	`0.0`	Chat completions
`seed`	`int`	`1234`	Chat completions
`options`	`array`	`[]`	Structured output

The trailing options array accepts (keys are mutually exclusive):

Key	Type	Effect
`grammar`	`string`	GBNF grammar constraining every sampled token
`schema`	`array\|string`	JSON Schema (PHP array or JSON text) compiled to GBNF

Behavior

temperature = 0.0 is greedy (deterministic). > 0.0 samples, controlled by seed.
seed is only consulted when temperature > 0.
maxTokens caps generation. Hitting it sets Response::finishReason() to 'length'.
nCtx is the context window for this call. If the rendered prompt exceeds it, InferenceException is raised before generation starts.

`Model::raw($prompt, ...)`

Same named-argument shape as chat() plus addBos.

Argument	Type	Default	See
`$prompt`	`string`	required	Raw completions
`maxTokens`	`int`	`128`	Chat completions
`nCtx`	`int`	`2048`	Chat completions
`temperature`	`float`	`0.0`	Chat completions
`seed`	`int`	`1234`	Chat completions
`addBos`	`bool`	`true`	Raw completions → addBos
`options`	`array`	`[]`	Same `grammar`/`schema` keys as `chat()`

`Model::embed($text)`

Just the text. Pooling and embedding-mode are configured at load time (see Model::load above).

Argument	Type	Default	See
`$text`	`string`	required	Embeddings

`Embedding` math

Embedding is read-only; the math methods return new instances rather than mutating.

Method	Returns
`vector()`	`list<float>`
`packed()`	`string` — little-endian float32, `pack('g*')`-identical
`dimensions()`	`int`
`norm()`	`float`
`normalize()`	new `Embedding`
`cosineSimilarity(Embedding $other)`	`float` (in `[-1, 1]`)

cosineSimilarity throws InferenceException on a dimension mismatch — see Embeddings → vector math.

`RerankModel::load($path, $options)`

Same array shape as Model::load, plus reranker-specific keys.

Key	Type	Default	See
`n_gpu_layers`	`int`	`0`	Performance tuning
`use_mmap`	`bool`	`true`	Performance tuning
`use_mlock`	`bool`	`false`	Performance tuning
`n_ctx`	`int`	`4096`	Reranking → sizing
`instruction`	`string`	model-card default	Reranking → instruction

`RerankModel` methods

Method	Returns
`score(string $query, string $document)`	`float` in `(0, 1)`
`rank(string $query, array $documents, ?int $topK = null)`	`list<array{index: int, score: float}>`, best-first
`close()`	`void` (idempotent)

`Prompt`

Static factories + immutable with* builders.

Method	Returns
`Prompt::system($content)`	new `Prompt`
`Prompt::user($content)`	new `Prompt`
`withSystem($content)`	new `Prompt`
`withUser($content)`	new `Prompt`
`withAssistant($content)`	new `Prompt`
`messages()`	`list<Message>`
`lastRole()`	`?string`
`count()`	`int`
`isEmpty()`	`bool`

See Prompts for the immutability semantics.

`Response`

Read-only. Six getters.

Method	Returns
`text()`	`string`
`reasoning()`	`?string`
`answer()`	`string`
`hasReasoning()`	`bool`
`finishReason()`	`string` — `'eos'`/`'length'`/`'stop'`
`tokensGenerated()`	`int`

See Chat completions → Inspecting a Response.

Environment

Not strictly an option, but bears mentioning here:

Variable	Effect
`EXT_INFER_LOG=1`	Restore llama.cpp’s verbose stderr logging (silenced by default).

See Environment variables.

Multi-turn chat

The pattern: keep the system message stable, append user/assistant turns as the conversation grows, regenerate the prompt on each user input. Lifted directly from examples/chat-interactive/.

The shape

use Displace\Infer\Model;
use Displace\Infer\Prompt;

$model = Model::load('models/Qwen3-0.6B-Q8_0.gguf');

$base         = Prompt::system('You are a helpful, concise assistant.');
$conversation = $base;

while (($line = readline('> ')) !== false) {
    $line = trim($line);
    if ($line === '' || $line === '/exit') {
        break;
    }

    // /reset is trivial because Prompt is immutable.
    if ($line === '/reset') {
        $conversation = $base;
        continue;
    }

    $conversation = $conversation->withUser($line);

    $response = $model->chat(
        $conversation,
        maxTokens: 512,
        temperature: 0.7,
    );

    echo $response->answer(), PHP_EOL;

    // Feed answer() back, NOT text(). See "Reasoning models" below.
    $conversation = $conversation->withAssistant($response->answer());
}

$model->close();

Three things this gets right

1. The system message is stable

$base is built once and never mutated. Every /reset re-seats the conversation at the original system instruction without re-allocating or re-rendering. If you change the system prompt mid-conversation elsewhere in your app, the immutable shape means concurrent uses of $base aren’t affected.

2. Conversation grows by immutable append

$conversation = $conversation->withUser($line);

Every with* returns a new Prompt. The old $conversation is still valid (and still has the previous turn count); the local just points at the new one. There’s no shared mutable state, so this code is safe to put behind a queue worker or run in parallel.

3. `Response::answer()` goes back, not `text()`

$conversation = $conversation->withAssistant($response->answer());

This matters for reasoning models. answer() is the reply with <think>...</think> blocks stripped; text() is the raw output including the thoughts. Feeding text() back means the model sees its own internal monologue on the next turn — and reasoning models tend to treat that as instruction, not history. The output derails fast.

For non-reasoning models, answer() and text() are byte-identical, so the rule is “use answer() always” rather than “use answer() for some models”.

Persisting conversations

If you need to save and restore conversations (e.g. per-user chat history in a database), serialize the message list and rebuild the Prompt:

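// $history holds only user/assistant rows; the system message is passed separately.
function loadConversation(string $system, array $history): Prompt
{
    $p = Prompt::system($system);
    foreach ($history as $row) {
        $p = match ($row['role']) {
            'user'      => $p->withUser($row['content']),
            'assistant' => $p->withAssistant($row['content']),
            // Without a default arm, an unexpected role raises an opaque UnhandledMatchError.
            default     => throw new InvalidArgumentException("unexpected role: {$row['role']}"),
        };
    }
    return $p;
}

function saveConversation(Prompt $p): array
{
    $rows = [];
    foreach ($p->messages() as $msg) {
        // Skip the system message: loadConversation() re-adds it from $system.
        // Assumes role() returns the plain strings 'system' / 'user' / 'assistant'.
        if ($msg->role() === 'system') {
            continue;
        }
        $rows[] = ['role' => $msg->role(), 'content' => $msg->content()];
    }
    return $rows;
}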

Prompt::messages() walks in chronological order, so saving and re-loading round-trips faithfully.
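For the storage step itself, JSON is usually enough. A sketch of encoding and validating the row list: encodeHistory() and decodeHistory() are hypothetical helpers operating on arrays of ['role' => ..., 'content' => ...] rows, and the set of accepted roles is an assumption based on the Prompt builder's three with* methods.

```php
<?php
declare(strict_types=1);

/** @param list<array{role: string, content: string}> $rows */
function encodeHistory(array $rows): string
{
    return json_encode($rows, JSON_THROW_ON_ERROR | JSON_UNESCAPED_UNICODE);
}

/** @return list<array{role: string, content: string}> */
function decodeHistory(string $json): array
{
    $rows = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
    if (!is_array($rows)) {
        throw new UnexpectedValueException('history must be a JSON array');
    }
    foreach ($rows as $row) {
        // Reject anything that is not a well-formed chat turn.
        if (!is_array($row)
            || !in_array($row['role'] ?? null, ['system', 'user', 'assistant'], true)
            || !is_string($row['content'] ?? null)) {
            throw new UnexpectedValueException('malformed history row');
        }
    }
    return $rows;
}
```

Validating on the way back in means a corrupted or hand-edited row fails at load time rather than in the middle of building a Prompt.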

Common shape: an HTTP turn

For a request/response API where every HTTP call is one turn:

// Inside your controller — assumes $model is injected and reused.
final class ChatController
{
    public function __construct(private Model $model, private HistoryStore $history) {}

    public function turn(Request $req): Response
    {
        $conversationId = $req->session('conversation_id');
        $history        = $this->history->load($conversationId);
        $system         = $req->user()->systemPrompt() ?? 'You are helpful.';

        $prompt = loadConversation($system, $history)
            ->withUser($req->json('message'));

        $reply = $this->model->chat(
            $prompt,
            maxTokens: 1024,
            temperature: 0.5,
        );

        $this->history->append($conversationId, 'user', $req->json('message'));
        $this->history->append($conversationId, 'assistant', $reply->answer());

        return new JsonResponse([
            'answer'    => $reply->answer(),
            'reasoning' => $reply->reasoning(),
            'truncated' => $reply->finishReason() === 'length',
            'tokens'    => $reply->tokensGenerated(),
        ]);
    }
}

The $model is loaded once at FPM-worker boot — not per request — and chat() is called per request. With current ext-infer (no KV-cache reuse yet), each turn re-tokenizes and re-decodes the full history, which is slow for long conversations. A Session object that reuses the underlying llama.cpp context is on the roadmap.
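Until context reuse exists, one way to bound per-turn cost is to cap how much history is replayed. A sketch under that assumption: recentTurns() is a hypothetical helper over the stored user/assistant rows, and dropping older turns does discard context the model might have needed.

```php
<?php
declare(strict_types=1);

/**
 * Keep only the most recent turns so the re-decoded prompt stays short.
 *
 * @param list<array{role: string, content: string}> $history oldest first
 * @param int $maxTurns number of user/assistant pairs to keep
 * @return list<array{role: string, content: string}>
 */
function recentTurns(array $history, int $maxTurns): array
{
    if ($maxTurns < 1) {
        return [];
    }
    $tail = array_slice($history, -2 * $maxTurns);
    // Drop a leading assistant row so the window always opens on a user turn.
    if ($tail !== [] && $tail[0]['role'] === 'assistant') {
        array_shift($tail);
    }
    return $tail;
}
```

The trimmed list can be passed to loadConversation() in place of the full history; the system message is unaffected because it is supplied separately.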

When to use `Model::raw()` instead

If you have a very specific prompt shape — tool calls, RAG context injected at a non-standard slot, custom format — see Raw completions. The Prompt builder doesn’t support tool-call messages today, so tool-aware conversations need raw() until tool calling lands.

Semantic search

Embed a corpus once, embed user queries on demand, return the closest matches by cosine similarity. The foundation of every “search by meaning, not keywords” pipeline.

Minimal in-memory version

use Displace\Infer\Model;

$model = Model::load('models/Qwen3-Embedding-0.6B-Q8_0.gguf', [
    'embedding' => true,
]);

// Embed the corpus once. In real code, do this offline and cache.
$corpus = [
    'doc-1' => 'PHP is a server-side scripting language.',
    'doc-2' => 'Cats are popular pets known for their independence.',
    'doc-3' => 'Rust provides memory safety without garbage collection.',
    'doc-4' => 'Dogs are descendants of wolves, domesticated millennia ago.',
];
$index = [];
foreach ($corpus as $id => $text) {
    // Normalize once so the search loop is a plain dot product.
    $index[$id] = $model->embed($text)->normalize();
}

// Search.
function search(Model $model, array $index, string $query, int $k = 3): array
{
    $q = $model->embed($query)->normalize();
    $hits = [];
    foreach ($index as $id => $emb) {
        $hits[$id] = $q->cosineSimilarity($emb);
    }
    arsort($hits);
    return array_slice($hits, 0, $k, preserve_keys: true);
}

print_r(search($model, $index, 'a typesafe language'));
// Array
// (
//     [doc-3] => 0.7421
//     [doc-1] => 0.4567
//     [doc-2] => 0.1234
// )

$model->close();

Three things to know

Normalize when you index

Embedding::normalize() returns a unit vector. With both sides normalized, cosine similarity simplifies to a dot product:

cos(a, b) = (a · b) / (||a|| · ||b||)
          = a_unit · b_unit            // if both are normalized

Normalize once at index time so the per-query work is just the dot product. Embedding::cosineSimilarity() does the normalization internally if you skip the explicit step — but you pay for it on every call, which adds up across thousands of documents.

Pick an embedding model, not a chat model

A chat-tuned model loaded with 'embedding' => true will return a vector, but the similarity numbers cluster too tightly to be useful at scale. Use a purpose-built embedding model — see Choosing a model.

What “useful” looks like with a real embedding model (Qwen3-Embedding-0.6B):

cat-mat ↔ feline-rug:      0.72   (paraphrase)
cat-mat ↔ grocery-shop:    0.29   (unrelated)
feline-rug ↔ grocery-shop: 0.26   (unrelated)

The same pairs with the chat-tuned Qwen3-0.6B (loaded in embedding mode):

cat-mat ↔ feline-rug:      0.66
cat-mat ↔ grocery-shop:    0.51
feline-rug ↔ grocery-shop: 0.50

The chat model preserves the ordering — the related pair scores highest — but the gap is much narrower, so the cut-off threshold between “match” and “not match” is harder to draw.

Cache the index

In production, the in-memory dictionary in the example above doesn’t scale past a few thousand documents — the search loop is O(corpus size). Three upgrade paths:

Stay in-process with ext-turbovec — a quantized, SIMD-accelerated index from the same stack. 100K 1024-dim documents fit in ~50MB resident, persist with write()/load(), and search in microseconds. The natural next step when this recipe outgrows the PHP loop.
Persist embeddings to disk (a JSON file, SQLite blob column). Saves the embed-time cost on subsequent runs but keeps the O(n) scan.
Index with a vector database: pgvector (PostgreSQL extension), sqlite-vec, MySQL 9 VECTOR. Right when the database must remain the system of record for vectors too.
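The persist-to-disk path can be as simple as a JSON file of base64-encoded packed vectors. A sketch: saveIndex() and loadIndex() are hypothetical helpers, the file name is arbitrary, and the pack() calls stand in for $model->embed($text)->normalize()->packed().

```php
<?php
declare(strict_types=1);

/** @param array<string, string> $index doc id => packed float32 bytes */
function saveIndex(string $file, array $index): void
{
    // base64 keeps the binary vectors safe inside JSON.
    $encoded = array_map('base64_encode', $index);
    file_put_contents($file, json_encode($encoded, JSON_THROW_ON_ERROR), LOCK_EX);
}

/** @return array<string, list<float>> doc id => vector */
function loadIndex(string $file): array
{
    $encoded = json_decode((string) file_get_contents($file), true, 512, JSON_THROW_ON_ERROR);
    $index = [];
    foreach ($encoded as $id => $b64) {
        // unpack() is 1-indexed; array_values() restores list shape.
        $index[$id] = array_values(unpack('g*', base64_decode($b64, true)));
    }
    return $index;
}

// Stand-ins for packed embeddings; values are exact in float32.
$index = ['doc-1' => pack('g*', 0.5, -0.25), 'doc-2' => pack('g*', 1.0, 0.0)];
$file  = sys_get_temp_dir() . '/ext-infer-index.json';
saveIndex($file, $index);
$loaded = loadIndex($file);
```

This removes the embed-time cost on later runs but, as noted, the search itself is still a linear scan.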

See Semantic search with ext-infer for the ext-turbovec pairing, or RAG over markdown for a worked example using sqlite-vec.

Re-ranking with a chat model

For higher-quality top-K, embed-rank-then-rerank-with-a-chat-model is the canonical pattern:

// 1. Coarse retrieval — embedding similarity, top 20.
$hits = search($embedModel, $index, $query, k: 20);

// 2. Fine reranking — ask a chat model to score each candidate.
$prompt = Prompt::system(
    'You are a relevance judge. Given a query and a document, ' .
    'respond with a single number between 0 and 1 indicating ' .
    'how relevant the document is to the query.'
);
$rerank = [];
foreach (array_keys($hits) as $docId) {
    $r = $chatModel->chat(
        $prompt->withUser("Query: {$query}\n\nDocument: {$corpus[$docId]}"),
        maxTokens: 8,
        temperature: 0.0,
    );
    $rerank[$docId] = (float) trim($r->answer());
}
arsort($rerank);

That’s two model loads — one embedding, one chat. Reuse handles across requests; loading is the expensive step.
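One caveat with the (float) cast: PHP turns a reply like "Score: 0.8" into 0.0, because the string does not start with a number. A more forgiving parser is sketched below; parseScore() is a hypothetical helper and the regex is one reasonable choice, not the only one.

```php
<?php
declare(strict_types=1);

/**
 * Pull the first number out of a model reply and clamp it to [0, 1].
 * Returns null when the reply contains no number at all.
 */
function parseScore(string $reply): ?float
{
    if (preg_match('/\d*\.?\d+/', $reply, $m) !== 1) {
        return null;
    }
    return max(0.0, min(1.0, (float) $m[0]));
}
```

Returning null for unparseable replies lets the caller decide whether to drop the candidate or fall back to its embedding score, instead of silently ranking it at zero.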

RAG over markdown — semantic search feeding into Model::chat().
Embeddings guide — the underlying API.
Choosing a model — picking an embedding model.

Reasoning models

Qwen3, DeepSeek R1, and other reasoning-tuned models think out loud before answering. When invoked through their chat template, they emit <think>…</think> blocks containing the internal monologue, then the actual reply. ext-infer understands this convention and exposes the two streams separately on Response.

The split, in three calls

use Displace\Infer\Model;
use Displace\Infer\Prompt;

$model = Model::load('models/Qwen3-0.6B-Q8_0.gguf');

$response = $model->chat(
    Prompt::user('What is 2+2?'),
    maxTokens: 512,
);

echo $response->text(), PHP_EOL;
// <think>
// Okay, the user is asking what 2+2 is. This is basic arithmetic.
// I should respond with the correct sum, which is 4. Let me also
// verify there's no trick here — adding two and two definitely
// equals four.
// </think>
//
// 2 + 2 equals 4.

echo $response->answer(), PHP_EOL;
// 2 + 2 equals 4.

echo $response->reasoning(), PHP_EOL;
// Okay, the user is asking what 2+2 is. This is basic arithmetic.
// I should respond with the correct sum, which is 4. ...

echo $response->hasReasoning() ? 'yes' : 'no', PHP_EOL;
// yes

$model->close();

For a non-reasoning model:

reasoning() returns null
answer() equals text() byte-for-byte
hasReasoning() returns false

The split is always on: there’s no flag to disable it. If the model output doesn’t contain <think>…</think> tags, nothing changes.

When the budget runs out mid-thought

Reasoning chains can be long. If maxTokens exhausts inside a <think> block — before the closing </think> — the split fails gracefully:

text() contains the partial reasoning verbatim, with the open <think> tag and no closing tag.
reasoning() returns any previous completed reasoning blocks, or null if none.
answer() is the raw text with completed blocks removed and the partial thought left in place. The partial thought is intentionally left in answer(): silently swallowing it would hide the budget problem.
finishReason() returns 'length'.
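These rules can be mimicked in a few lines of plain PHP, which is handy for unit-testing code that consumes a Response. This is an illustration of the documented behavior, not the extension's implementation; splitThinking() is a hypothetical name, and the whitespace trimming and the joining of multiple blocks are assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Split raw model output into reasoning and answer.
 *
 * @return array{reasoning: ?string, answer: string}
 */
function splitThinking(string $text): array
{
    $pattern = '/<think>(.*?)<\/think>/s';

    // Only blocks with a closing tag count as completed reasoning.
    $found = preg_match_all($pattern, $text, $matches);
    $reasoning = $found > 0
        ? implode("\n\n", array_map('trim', $matches[1]))
        : null;

    // Remove completed blocks; an unterminated <think> stays visible.
    $answer = trim((string) preg_replace($pattern, '', $text));

    return ['reasoning' => $reasoning, 'answer' => $answer];
}
```

An unterminated block yields a null reasoning and an answer that still starts with the open <think> tag, which matches the truncation behavior listed above.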

The fix is always “bump maxTokens”. A useful pattern is to surface the truncation explicitly:

$response = $model->chat($prompt, maxTokens: 256);

if ($response->finishReason() === 'length') {
    error_log(sprintf(
        'truncated: model wanted more than 256 tokens for "%s..."',
        substr($prompt->messages()[0]->content(), 0, 40),
    ));
}

The interactive chat example uses a softer hint: “(truncated — bump --max-tokens to see more)”.

When you DON’T want reasoning at all

Two strategies, depending on what “don’t want” means.

Strategy A — hide it in the UI, keep it under the hood

Default everywhere. Show $response->answer() to the end user. Log $response->reasoning() for debugging or display behind a “show thinking” toggle. No model-level change.

Strategy B — tell the model to skip thinking

Qwen3 has a /no_think directive that, when included as a system-message suffix, suppresses the monologue. The model still emits an empty <think></think> block (which the split handles: reasoning() ends up being an empty string), but the block has no content:

$prompt = Prompt::system('You are helpful. /no_think')
    ->withUser('What is 2+2?');

$response = $model->chat($prompt);

$response->hasReasoning();   // true (empty block)
$response->reasoning();      // "" (empty string)
$response->answer();         // "2 + 2 equals 4."

This is Qwen3-specific. Other reasoning models, DeepSeek R1 included, handle this differently or not at all. Check the model card.

Feeding history back

When building multi-turn conversations against a reasoning model, feed Response::answer() back as the assistant’s reply, not Response::text():

$conversation = $conversation->withAssistant($response->answer());
//                                          ^^^^^^^^^^^^^^^^^^^
//                                          not ->text()

text() includes the <think>…</think> block. Adding it to the conversation means the model sees its own reasoning on the next turn and tends to treat it as instruction rather than history — output quality drops fast.

This is the single most common mistake when wiring up reasoning models in ext-infer. See Multi-turn chat for the full pattern.
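
If a transcript was already persisted from text() and the Response object is gone, the reasoning can be stripped before replay. This is a defensive sketch only: stripThinkBlocks() is a hypothetical helper, it assumes Qwen3-style <think> tags, and the extension's own split may treat edge cases (such as an unterminated block) differently. Prefer answer() whenever the Response is still available.

```php
<?php
declare(strict_types=1);

/**
 * Remove <think>...</think> blocks from raw model text.
 * Assumes Qwen3-style tags; other reasoning models may delimit differently.
 */
function stripThinkBlocks(string $text): string
{
    // /s lets "." span newlines; the lazy quantifier stops at the first closing tag.
    $stripped = preg_replace('/<think>.*?<\/think>/s', '', $text);

    // preg_replace() returns null on a regex engine error; fall back to the input.
    return trim($stripped ?? $text);
}

$raw = "<think>\nThe user wants a sum.\n</think>\n\n2 + 2 equals 4.";

echo stripThinkBlocks($raw), PHP_EOL;
```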

Performance note

Reasoning models spend many tokens on their internal monologue. A typical Qwen3-0.6B answer to “what is 2+2?” generates ~150 tokens of thinking before the 5-token answer. That’s an order of magnitude more work than a non-reasoning model would do for the same question.
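
To put that in wall-clock terms, here is the arithmetic as a small sketch. The decode rate of 50 tokens per second is an assumption chosen for illustration; measure your own hardware before relying on the numbers.

```php
<?php
declare(strict_types=1);

/** Seconds spent decoding, given a token count and a decode rate. */
function decodeSeconds(int $tokens, float $tokensPerSecond): float
{
    return $tokens / $tokensPerSecond;
}

$rate = 50.0;  // assumed decode speed in tokens/second, not a measured figure

$withThinking    = decodeSeconds(150 + 5, $rate);  // monologue plus answer
$withoutThinking = decodeSeconds(5, $rate);        // answer only

printf(
    "with thinking: %.2fs, without: %.2fs\n",
    $withThinking,
    $withoutThinking,
);
```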

If latency matters more than the highest-quality answer:

Use /no_think (Strategy B above) to skip the monologue.
Pick a non-reasoning model — Llama 3.x Instruct, Mistral Instruct, Qwen 2.5 Instruct (not Qwen3) all chat without thinking out loud.

See Performance tuning for more knobs.

RAG over markdown

Retrieval-Augmented Generation: instead of asking the model what it knows (and getting whatever its training data captured, possibly incorrectly), embed your own documents into a vector store, retrieve the most relevant ones at query time, and feed them to the model as context. The model answers from your data.

This recipe walks through the smallest practical version: a folder of markdown files, indexed once into sqlite-vec, queried on demand.

Prerequisites

An embedding model — Qwen3-Embedding-0.6B works well.
A chat model — Qwen3-7B-Instruct or similar.
The sqlite-vec extension loaded into PHP’s PDO SQLite (or use the sqlite3 CLI tools).

# macOS — Homebrew has it
brew install asg017/sqlite-vec/sqlite-vec
# Linux — see the sqlite-vec README for distro packages

Schema

A single table holds documents and their embeddings:

CREATE TABLE IF NOT EXISTS docs (
    id    INTEGER PRIMARY KEY AUTOINCREMENT,
    path  TEXT UNIQUE NOT NULL,
    body  TEXT NOT NULL
);

-- sqlite-vec virtual table for k-nearest-neighbor search.
-- 1024 dimensions matches Qwen3-Embedding-0.6B.
CREATE VIRTUAL TABLE IF NOT EXISTS doc_vecs USING vec0(
    id    INTEGER PRIMARY KEY,
    embed FLOAT[1024]
);

Indexing

Walk a directory, embed each file, persist:

declare(strict_types=1);

use Displace\Infer\Model;

$embedder = Model::load('models/Qwen3-Embedding-0.6B-Q8_0.gguf', [
    'embedding' => true,
]);

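// sqlite-vec must be loaded into this connection before the vec0 table can be created.
// Pdo\Sqlite::loadExtension() requires PHP 8.4+; 'vec0' is a placeholder for the
// path to your sqlite-vec shared library (see the sqlite-vec docs).
$pdo = new Pdo\Sqlite('sqlite:rag.db');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->loadExtension('vec0');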

$pdo->exec(file_get_contents('schema.sql'));

$insertDoc = $pdo->prepare(
    'INSERT INTO docs (path, body) VALUES (:path, :body)
     ON CONFLICT(path) DO UPDATE SET body = excluded.body
     RETURNING id'
);
$insertVec = $pdo->prepare(
    'INSERT OR REPLACE INTO doc_vecs (id, embed) VALUES (:id, :embed)'
);

$root = $argv[1] ?? './notes';
foreach (new RecursiveIteratorIterator(new RecursiveDirectoryIterator($root)) as $f) {
    if ($f->getExtension() !== 'md') {
        continue;
    }
    $body = file_get_contents($f->getPathname());
    $insertDoc->execute([':path' => $f->getPathname(), ':body' => $body]);
    $id = (int) $insertDoc->fetchColumn();

    // Pre-normalize so search is a dot product.
    $vector = $embedder->embed($body)->normalize()->vector();

    $insertVec->execute([
        ':id'    => $id,
        ':embed' => pack('f*', ...$vector),   // sqlite-vec wants float32 bytes
    ]);

    echo "indexed: {$f->getPathname()} ({$id})\n";
}

$embedder->close();
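
A note on the pack('f*', ...) call: the f format uses the machine's native byte order, which is little-endian on x86_64 and arm64. The g format pins little-endian explicitly, and the API reference describes Embedding::packed() as pack('g*')-identical. The sketch below shows the round trip in plain PHP; packFloat32() and unpackFloat32() are hypothetical helper names, not extension functions.

```php
<?php
declare(strict_types=1);

/**
 * Pack floats as little-endian float32, 4 bytes per dimension.
 *
 * @param list<float> $vector
 */
function packFloat32(array $vector): string
{
    return pack('g*', ...$vector);
}

/**
 * Inverse of packFloat32(): returns a zero-indexed list of floats.
 *
 * @return list<float>
 */
function unpackFloat32(string $blob): array
{
    // unpack() keys its result from 1; array_values() restores a list.
    return array_values(unpack('g*', $blob));
}

$vector = [0.5, -0.25, 1.0];   // values exactly representable in float32
$blob   = packFloat32($vector);

echo strlen($blob), " bytes\n";               // 3 dimensions x 4 bytes each
var_dump(unpackFloat32($blob) === $vector);   // exact only because of the values chosen
```

At 1024 dimensions each stored vector is 4096 bytes. Values that are not exactly representable in float32 lose precision on the way in, so compare with a tolerance rather than === in real code.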

Run once to build the index, again whenever your notes change. For larger corpora, chunk each file and embed each chunk separately — section-level granularity gives better retrieval than whole-file vectors. displace/ai-toolkit ships the chunkers so you don’t hand-roll the splitting:

use Displace\AI\Toolkit\Text\RecursiveCharacterChunker;

$chunker = new RecursiveCharacterChunker(size: 2000, overlap: 200);

foreach ($chunker->chunk($body) as $i => $chunk) {
    $vector = $embedder->embed($chunk)->normalize()->vector();
    // store with a composite key: document id + chunk position
}

The recursive chunker splits on paragraphs first and only falls back to finer boundaries when a paragraph alone exceeds the budget, so chunks track the document’s own structure. (~2000 characters ≈ 500 tokens of English text.)

Retrieval + generation

declare(strict_types=1);

use Displace\Infer\Model;
use Displace\Infer\Prompt;

$embedder = Model::load('models/Qwen3-Embedding-0.6B-Q8_0.gguf', [
    'embedding' => true,
]);
$chat = Model::load('models/Qwen3-7B-Instruct-Q4_K_M.gguf');

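// vec_distance_cosine() comes from sqlite-vec, so this connection must load it too.
// Pdo\Sqlite::loadExtension() requires PHP 8.4+; 'vec0' is a placeholder for the
// path to your sqlite-vec shared library (see the sqlite-vec docs).
$pdo = new Pdo\Sqlite('sqlite:rag.db');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->loadExtension('vec0');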

$query = $argv[1] ?? 'What did I decide about the migration?';

// 1. Embed the query.
$qvec = $embedder->embed($query)->normalize()->vector();

// 2. Top-k retrieval via sqlite-vec.
$stmt = $pdo->prepare(<<<SQL
    SELECT
        docs.path,
        docs.body,
        vec_distance_cosine(doc_vecs.embed, :qvec) AS distance
    FROM doc_vecs
    JOIN docs ON docs.id = doc_vecs.id
    ORDER BY distance ASC
    LIMIT 4
SQL);
$stmt->execute([':qvec' => pack('f*', ...$qvec)]);
$hits = $stmt->fetchAll(PDO::FETCH_ASSOC);

// 3. Build a context-injected prompt.
$context = '';
foreach ($hits as $i => $hit) {
    $context .= sprintf("--- Document %d (%s) ---\n%s\n\n", $i + 1, $hit['path'], $hit['body']);
}

$prompt = Prompt::system(<<<SYS
You answer questions strictly using the provided documents.
If the documents don't contain the answer, say so — do not invent.
Cite the document number when you quote.
SYS)
    ->withUser("Documents:\n\n{$context}\n\nQuestion: {$query}");

// 4. Ask.
$response = $chat->chat($prompt, maxTokens: 1024, temperature: 0.3);

echo $response->answer(), PHP_EOL;

$embedder->close();
$chat->close();
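
The script above injects whole document bodies, which can overflow the context window: the API reference lists nCtx as defaulting to 2048, and a prompt over nCtx raises InferenceException. One way to stay inside the window is to cap the injected context by size. The sketch below is a hypothetical helper (buildContext() is not part of ext-infer) that uses a character budget as a rough proxy for tokens.

```php
<?php
declare(strict_types=1);

/**
 * Concatenate retrieved documents into a context string without exceeding
 * a character budget. Hits are assumed to be ordered best-first.
 *
 * @param list<array{path: string, body: string}> $hits
 */
function buildContext(array $hits, int $maxChars): string
{
    $context = '';
    foreach ($hits as $i => $hit) {
        $header = sprintf("--- Document %d (%s) ---\n", $i + 1, $hit['path']);
        // Reserve room for the header and the trailing "\n\n" separator.
        $room = $maxChars - strlen($context) - strlen($header) - 2;
        if ($room <= 0) {
            break;  // budget spent; drop the remaining, lower-ranked hits
        }
        // Byte-based cut for brevity; prefer mb_strcut() for multibyte text.
        $context .= $header . substr($hit['body'], 0, $room) . "\n\n";
    }
    return $context;
}

$hits = [
    ['path' => 'notes/a.md', 'body' => str_repeat('a', 100)],
    ['path' => 'notes/b.md', 'body' => str_repeat('b', 100)],
];

$context = buildContext($hits, 200);
echo strlen($context), " bytes of context\n";
```

Using the guide's rule of thumb of about 4 characters per token, a budget of roughly 4000 characters leaves about half of a 2048-token window for the answer. That ratio is an approximation for English text, so leave headroom.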

What good output looks like

For a corpus of personal notes, a query like “what did I decide about the migration?” returns:

Based on Document 2 (notes/migration.md), you decided to defer the
schema change to Q3 in favor of shipping the redirect layer first.
The reasoning cited there was that the redirect layer was lower-risk
and would surface the migration's actual hot paths before you
committed to the column rename.

If the corpus doesn’t contain the answer:

The provided documents don't address the migration decision directly.
Document 1 mentions a planned schema change but doesn't record what
was decided. I'd need more context to answer.

That “I don’t know” behavior is what the system prompt enforces. Models will happily make up plausible answers without it.

Knobs worth tuning

Knob	Effect
Top-k (the `LIMIT 4` above)	More context = better answers but slower + risks the model conflating unrelated documents. 3–5 is a good default.
Chunk size at index time	Whole-file is simple but coarse. 500-token chunks give finer retrieval at the cost of ~10x more vectors.
`temperature` on the chat model	Set low (`0.0`–`0.3`) for factual answers; the model should be quoting, not improvising.
System prompt strictness	“Cite documents” + “say so if unknown” is the difference between RAG and a model that just sometimes incorporates your context.

What this recipe doesn’t cover

Reranking — top-k by embedding similarity is fast but coarse. See the Semantic search recipe for the chat-model-as-reranker pattern.
Streaming responses — Model::chat() is currently synchronous. See the roadmap.
Markdown-aware chunking — splitting that respects code blocks, headers, and lists. The structure-aware chunkers in displace/ai-toolkit get most of the way there by preferring paragraph boundaries; heading-aware splitting is on its roadmap.
Engine-agnostic wiring — if you want application code that doesn’t name Displace\Infer\Model directly (swappable embedders, framework integration), code against the displace/ai-contracts interfaces and wrap the model in a thin adapter.

Semantic search — the building-block underneath this.
Worker pools — running RAG queries under concurrent load.
Choosing a model — picking the right embedding + chat models for your corpus.

Worker pools

LLM inference is slow: tens of milliseconds at best, often seconds. Running it inline in an FPM worker means that worker is unavailable for any other request until the model is done. For any non-trivial deployment, you want a pool of workers — process-based, thread-based, or queue-based — that absorbs the latency without starving the rest of your app.

ext-infer is designed to slot into all three patterns.

Pattern 1 — FPM workers (process-based)

The simplest production setup: PHP-FPM with pm.max_children set high enough to absorb concurrent slow inference requests.

; php-fpm.d/www.conf
pm = dynamic
pm.max_children = 16
pm.start_servers = 4
pm.min_spare_servers = 2
pm.max_spare_servers = 8
pm.process_idle_timeout = 60s

Each FPM worker is its own OS process. They each load their own Model once at warm-up and reuse it for the lifetime of the worker. The model weights are mmap’d, so the OS shares physical memory across workers — 16 workers loading the same 4 GB model use ~4 GB of RAM total, not 64.

// Shared service container — boot once per worker.
$model = Displace\Infer\Model::load($cfg['model_path']);

// In your request handler:
$response = $model->chat($prompt, maxTokens: 512);

The downside: each worker can handle one inference at a time. If you hit pm.max_children concurrent requests, the (max_children + 1)st request waits. Bump max_children if you have the RAM (the model is shared via mmap; only the KV cache scales with concurrency); push inference to a queue if you don’t.

Sizing

A rough sizing heuristic for FPM with ext-infer:

max_children ≈ (RAM_budget - model_size) / per_request_memory

Where per_request_memory is the KV cache footprint plus PHP’s working set — usually 100–500 MB per worker depending on nCtx.
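
The heuristic is easy to turn into a quick calculation. The sketch below is illustrative: estimateMaxChildren() is a hypothetical helper, and the 16 GB budget and 300 MB per-worker figures are assumed example inputs, not measurements.

```php
<?php
declare(strict_types=1);

/**
 * max_children ≈ (RAM_budget - model_size) / per_request_memory, rounded down.
 * All sizes are in megabytes.
 */
function estimateMaxChildren(int $ramBudgetMb, int $modelSizeMb, int $perRequestMb): int
{
    if ($perRequestMb <= 0) {
        throw new InvalidArgumentException('perRequestMb must be positive');
    }

    // The model is counted once because its weights are shared via mmap.
    $headroom = $ramBudgetMb - $modelSizeMb;

    return max(0, intdiv($headroom, $perRequestMb));
}

// Assumed example: 16 GB budget, 4 GB model, 300 MB per worker.
echo estimateMaxChildren(16 * 1024, 4 * 1024, 300), PHP_EOL;  // 40
```

Treat the result as an upper bound and confirm it under real load; per-worker memory grows with nCtx.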

Pattern 2 — Job queue (process-based, decoupled)

For inference that takes long enough that you don’t want it in the request path at all:

// In the request handler — enqueue, return immediately.
$jobId = $queue->push(InferJob::class, [
    'prompt'  => $prompt,
    'options' => ['maxTokens' => 512],
]);
return new JsonResponse(['job_id' => $jobId, 'status' => 'queued']);

// Client polls /jobs/{id} until status = 'done'.

// In your queue worker — long-lived, model loaded once at boot.
final class InferWorker
{
    public function __construct(private \Displace\Infer\Model $model) {}

    public function process(InferJob $job): InferResult
    {
        $r = $this->model->chat($job->prompt, ...$job->options);
        return new InferResult($r->answer(), $r->finishReason());
    }
}

Any queue runner works — Symfony Messenger, Laravel Horizon, ReactPHP’s react/event-loop, a bespoke pcntl_fork + proc_open script. The pattern is the same: one Model::load() per worker process, reuse across many jobs.

This pattern shines when:

Inference latency is unpredictable and you don’t want to hold HTTP connections open.
You want to scale inference workers independently of web workers.
You want to route inference traffic across heterogeneous workers (CPU-only on cheap nodes, GPU-equipped on others).

Pattern 3 — ZTS + `parallel` (thread-based)

For latency-sensitive workloads where the IPC overhead of pattern 2 is too much, ext-infer supports concurrent calls within a single process under ZTS PHP with the parallel extension.

This works because ext-infer is thread-safe by design:

LlamaBackend is a Sync-guarded process-global singleton.
LlamaModel (the weights) is immutable after load; llama.cpp explicitly supports many contexts on one model.
Each chat() / raw() / embed() call builds its own per-call LlamaContext and drops it after.

Two threads calling Model::chat() simultaneously on the same handle is the supported, intended shape.

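use parallel\Runtime;
use Displace\Infer\Model;
use Displace\Infer\Prompt;

// Load the model once in the main thread.
// ($prompts below is assumed to be a list of Prompt instances built elsewhere.)
$model = Model::load('models/qwen3.gguf');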

// Spin up a pool of runtimes.
$runtimes = array_map(fn() => new Runtime(), range(1, 4));

// Dispatch concurrent inferences.
$futures = [];
foreach ($prompts as $i => $prompt) {
    $rt = $runtimes[$i % 4];
    $futures[$i] = $rt->run(function (Model $m, Prompt $p) {
        return $m->chat($p, maxTokens: 512)->answer();
    }, [$model, $prompt]);
}

// Collect.
$answers = array_map(fn($f) => $f->value(), $futures);

$model->close();

Caveats

ZTS PHP is uncommon. Most distros ship NTS by default; you’ll have to build ZTS PHP from source (./configure --enable-zts) or use a ZTS-shipping Docker image. PIE’s pre-built binaries target NTS for v0.1; ZTS binaries are on the roadmap.
parallel itself requires ZTS. Can’t use it on a standard NTS install.
CI doesn’t exercise this yet. ZTS support is enabled in composer.json because the code is thread-safe by construction — but the maintainers have not yet run multi-threaded stress tests in CI. Treat it as “should work, please report bugs” until that changes. See Threading & ZTS for the current state.

Choosing between the patterns

Concern	FPM workers	Job queue	ZTS + parallel
Easy to set up	✅ trivial	⚠️ some IPC	⚠️ ZTS build
Holds HTTP connection during inference	yes	no	yes
Survives PHP being NTS	✅	✅	❌
Shares one model across all concurrency	via mmap	per-worker	within process
Scales to many concurrent inferences	⚠️ workers eat RAM	✅ horizontal	⚠️ one process
Production-tested in `ext-infer`	✅	✅	⚠️ unexercised

For most teams, FPM with a generous max_children is the right starting point. Move to a queue when latency variance gets too high for the request path. Reach for parallel last, when you’ve measured that IPC overhead is the bottleneck.

Threading & ZTS — what makes the parallel story actually work.
Performance tuning — what knobs to pull when each worker is too slow.

API surface

The complete public PHP API in one place. Every method, every argument, every return type. Read this when you know what you’re looking for and just want the signature; read the Guide when you want to understand why.

For an authoritative copy in PHP-stub form (consumed by IDEs and static analyzers like PHPStan), see stubs/infer.stubs.php.

`Displace\Infer\Model`

final class Model
{
    public static function load(
        string $path,
        array  $options = [],
    ): self;

    public function chat(
        \Displace\Infer\Prompt $prompt,
        int   $maxTokens   = 128,
        int   $nCtx        = 2048,
        float $temperature = 0.0,
        int   $seed        = 1234,
        array $options     = [],   // ['grammar' => gbnf] xor ['schema' => jsonSchema]
    ): \Displace\Infer\Response;

    public function raw(
        string $prompt,
        int    $maxTokens   = 128,
        int    $nCtx        = 2048,
        float  $temperature = 0.0,
        int    $seed        = 1234,
        bool   $addBos      = true,
        array  $options     = [],  // same grammar/schema keys as chat()
    ): string;

    public function embed(
        string $text,
    ): \Displace\Infer\Embedding;

    public function close(): void;
}

new Model() throws — use Model::load(). close() is idempotent (safe to call from finally blocks).

See Choosing a model, Chat completions, Raw completions, Structured output, Embeddings, and Options reference.

`Displace\Infer\RerankModel`

final class RerankModel
{
    public static function load(
        string $path,
        array  $options = [],  // n_gpu_layers, use_mmap, use_mlock, n_ctx, instruction
    ): self;

    public function score(string $query, string $document): float;  // (0, 1)

    /**
     * @param list<string> $documents
     * @return list<array{index: int, score: float}>  best-first
     */
    public function rank(string $query, array $documents, ?int $topK = null): array;

    public function close(): void;
}

new RerankModel() throws — use RerankModel::load(). Targets the Qwen3-Reranker GGUF family; rank()’s rows are shaped like Displace\AI\Contracts\Reranker::rerank(). See Reranking.
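
Because rank() returns index/score rows rather than the documents themselves, callers usually map the rows back to their inputs. A minimal sketch follows; attachDocuments() is a hypothetical helper, the rows are hand-written stand-ins for rank() output, and it assumes index is the zero-based position in the list passed to rank().

```php
<?php
declare(strict_types=1);

/**
 * Pair each ranked row with the document it refers to.
 * Assumes 'index' is the zero-based position in the $documents list.
 *
 * @param list<string> $documents
 * @param list<array{index: int, score: float}> $rows best-first
 * @return list<array{document: string, score: float}>
 */
function attachDocuments(array $documents, array $rows): array
{
    return array_map(
        fn (array $row): array => [
            'document' => $documents[$row['index']],
            'score'    => $row['score'],
        ],
        $rows,
    );
}

$documents = ['Paris is in France.', 'Rust is a language.', 'France borders Spain.'];

// Hand-written rows standing in for RerankModel::rank() output.
$rows = [
    ['index' => 0, 'score' => 0.97],
    ['index' => 2, 'score' => 0.61],
];

foreach (attachDocuments($documents, $rows) as $hit) {
    printf("%.2f  %s\n", $hit['score'], $hit['document']);
}
```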

`Displace\Infer\Prompt`

final class Prompt
{
    public static function system(string $content): self;
    public static function user(string $content): self;

    public function withSystem(string $content): self;
    public function withUser(string $content): self;
    public function withAssistant(string $content): self;

    /** @return list<\Displace\Infer\Message> */
    public function messages(): array;

    public function lastRole(): ?string;
    public function count(): int;
    public function isEmpty(): bool;
}

Immutable. new Prompt() throws — use a factory. See Prompts.

`Displace\Infer\Message`

final class Message
{
    public function role(): string;    // 'system' | 'user' | 'assistant'
    public function content(): string;
}

Read-only. Constructed only by Prompt; new Message() throws.

`Displace\Infer\Response`

final class Response
{
    public function text(): string;
    public function reasoning(): ?string;
    public function answer(): string;
    public function hasReasoning(): bool;
    public function finishReason(): string;  // 'eos' | 'length' | 'stop'
    public function tokensGenerated(): int;
}

Read-only. Constructed only by Model::chat(); new Response() throws. See Chat completions.

`Displace\Infer\Embedding`

final class Embedding
{
    /** @return list<float> */
    public function vector(): array;

    public function packed(): string;  // little-endian float32, pack('g*')-identical

    public function dimensions(): int;
    public function norm(): float;
    public function normalize(): self;
    public function cosineSimilarity(\Displace\Infer\Embedding $other): float;
}

Read-only. Constructed only by Model::embed(); new Embedding() throws. See Embeddings.
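
For readers who want the math behind norm(), normalize(), and cosineSimilarity(), here it is in plain PHP. This is a reference sketch of the standard formulas, not the extension's implementation (which is native); the function names are hypothetical, and the zero-vector handling is a choice made here for the example.

```php
<?php
declare(strict_types=1);

/** Euclidean (L2) length of a vector. */
function l2Norm(array $v): float
{
    return sqrt(array_sum(array_map(fn (float $x): float => $x * $x, $v)));
}

/** Scale a vector to unit length. A zero vector is returned unchanged. */
function normalizeVector(array $v): array
{
    $norm = l2Norm($v);
    if ($norm === 0.0) {
        return $v;
    }
    return array_map(fn (float $x): float => $x / $norm, $v);
}

/** Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|). */
function cosineSimilarity(array $a, array $b): float
{
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('dimension mismatch');
    }
    $dot = 0.0;
    foreach ($a as $i => $x) {
        $dot += $x * $b[$i];
    }
    $denominator = l2Norm($a) * l2Norm($b);

    return $denominator === 0.0 ? 0.0 : $dot / $denominator;
}

$a = [3.0, 4.0];
$b = [4.0, 3.0];

printf("norm(a) = %.1f\n", l2Norm($a));
printf("cosine(a, b) = %.2f\n", cosineSimilarity($a, $b));
```

Once both vectors are normalized, the denominator is 1 and cosine similarity reduces to a plain dot product, which is why the recipes pre-normalize before storing.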

Exception hierarchy

\RuntimeException
└── Displace\Infer\InferException
    ├── Displace\Infer\ModelLoadException
    └── Displace\Infer\InferenceException

InferException extends PHP’s built-in \RuntimeException, so any generic catch (\RuntimeException $e) clause sees ext-infer errors. See Exceptions for which methods raise which subclass.

Conventions

Direct construction is refused on Prompt, Message, Response, Embedding, and Model. Each one throws InferException from its __construct with a hint at the right factory. This is so an arbitrary new Embedding() can’t lie about which model produced it.
All with* methods on Prompt return a new instance. They never mutate. Prompt is the only class that exposes the “build by chaining” pattern; Embedding::normalize() follows the same rule and returns a new instance rather than mutating.
Sampling args are named, never positional. Model::chat() and Model::raw() use PHP 8 named arguments (maxTokens: 256, temperature: 0.7). Load options — and the constraint options (grammar/schema) on chat()/raw() — are arrays because they’re rare and compose with config-from-disk patterns.
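
Named arguments also compose with configuration loaded from disk, because PHP 8.1+ spreads string-keyed arrays as named arguments. The sketch below demonstrates the mechanism with a stand-in function so it runs without the extension: chatStandIn() and the config shape are hypothetical, and only the parameter names mirror Model::chat().

```php
<?php
declare(strict_types=1);

// Stand-in with the same sampling parameter names as Model::chat(). It only
// echoes back what it received, so the example runs without the extension.
function chatStandIn(
    string $prompt,
    int    $maxTokens   = 128,
    int    $nCtx        = 2048,
    float  $temperature = 0.0,
    int    $seed        = 1234,
    array  $options     = [],
): array {
    return compact('maxTokens', 'nCtx', 'temperature', 'seed', 'options');
}

// Hypothetical config, e.g. decoded from a JSON file on disk.
$config = json_decode('{"sampling": {"maxTokens": 256, "temperature": 0.7}}', true);

// String keys spread as named arguments; omitted parameters keep their defaults.
$received = chatStandIn('Hello', ...$config['sampling']);

print_r($received);
```

An unknown key in the config raises an Error at the call site, which is a useful early failure for typos such as max_tokens.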

Exceptions

ext-infer raises exceptions for every error condition — no silent false returns, no error codes. The hierarchy is small enough that you can catch precisely or broadly depending on what you’re after.

Hierarchy

\RuntimeException
└── Displace\Infer\InferException
    ├── Displace\Infer\ModelLoadException
    └── Displace\Infer\InferenceException

InferException extends PHP’s \RuntimeException. Catching \RuntimeException in generic top-level handlers (e.g. a PSR-15 middleware) sees every ext-infer error.
ModelLoadException is raised exclusively from Model::load().
InferenceException is raised from Model::chat(), Model::raw(), Model::embed(), and Embedding::cosineSimilarity().
InferException itself (the base class, not just an instance of a subclass) is raised for “this method should never have been called” errors — see Direct construction.

Which method raises what

Method	Class	Common causes
`Model::load()`	`ModelLoadException`	Missing file, malformed GGUF, backend init failure.
`Model::load()`	`InferException`	Invalid option type/value (e.g. `pooling` set to `"weighted"`).
`Model::chat()`	`InferenceException`	Model closed, no chat template, decode failure, prompt over `nCtx`.
`Model::raw()`	`InferenceException`	Model closed, decode failure, prompt over `nCtx`.
`Model::embed()`	`InferenceException`	Model closed, model not loaded with `embedding: true`, decode failure.
`Embedding::cosineSimilarity()`	`InferenceException`	Dimension mismatch between the two embeddings.
`new Model()` / `new Prompt()` / `new Message()` / `new Response()` / `new Embedding()`	`InferException`	Direct construction is refused; use the appropriate factory.

Direct construction

Model, Prompt, Message, Response, and Embedding all refuse direct new. Each throws InferException (the base class) with a hint pointing at the right factory:

new Embedding();
// Displace\Infer\InferException:
//   Displace\Infer\Embedding is produced by Model::embed();
//   do not instantiate directly

This is deliberate: a new Embedding() from PHP code could lie about which model produced it and what pooling strategy was applied — silent mistakes that are hard to debug later. Forcing factory construction keeps the invariants tight.

Catching strategies

Catch broadly at the top

For a request handler that wants to convert any ext-infer failure into a 5xx response:

try {
    $reply = $model->chat($prompt);
} catch (\Displace\Infer\InferException $e) {
    $log->error('inference failed', ['error' => $e->getMessage()]);
    // Response here is your framework's HTTP response class, not Displace\Infer\Response.
    return new Response(500, [], 'Inference temporarily unavailable.');
}

Distinguish load failures from inference failures

For a CLI tool that wants different exit codes:

try {
    $model = Model::load($path);
} catch (\Displace\Infer\ModelLoadException $e) {
    fwrite(STDERR, "model: " . $e->getMessage() . PHP_EOL);
    exit(2);
}

try {
    $r = $model->chat($prompt);
} catch (\Displace\Infer\InferenceException $e) {
    fwrite(STDERR, "inference: " . $e->getMessage() . PHP_EOL);
    exit(3);
}

Retry vs surface

InferenceException covers two flavors of failure:

Transient — out-of-memory under load, e.g. use_mlock + a large prompt. Often resolved by reducing nCtx or splitting the work.
Permanent — model has no chat template, prompt has null bytes, invalid option. Retrying makes no sense.

The message string is the only signal you have today; structured error codes are on the roadmap. For now, a pragmatic split:

try {
    $r = $model->chat($prompt, maxTokens: $budget);
} catch (\Displace\Infer\InferenceException $e) {
    if (str_contains($e->getMessage(), 'n_ctx')) {
        // prompt too long — surface to caller, don't retry
        throw $e;
    }
    // other inference failure — log + maybe retry
    $log->warning('chat failed, retrying once', ['error' => $e->getMessage()]);
    $r = $model->chat($prompt, maxTokens: $budget);
}
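
The same split can be pulled into reusable helpers. This is a sketch under stated assumptions: isRetryable() and callWithRetry() are hypothetical names, the marker substrings are guesses about message wording ('n_ctx' and 'model has been closed' appear in this guide, 'chat template' is an assumption), and it catches \RuntimeException so it runs without the extension loaded.

```php
<?php
declare(strict_types=1);

/** Decide whether a failed inference call is worth retrying. */
function isRetryable(\Throwable $e): bool
{
    // Assumed markers of permanent failures; adjust to the messages you observe.
    foreach (['n_ctx', 'model has been closed', 'chat template'] as $marker) {
        if (str_contains($e->getMessage(), $marker)) {
            return false;  // permanent: retrying cannot change the outcome
        }
    }
    return true;
}

/**
 * Run $call, allowing up to $maxAttempts total attempts for retryable failures.
 * Narrow the catch to \Displace\Infer\InferenceException in real code.
 */
function callWithRetry(callable $call, int $maxAttempts = 2): mixed
{
    for ($attempt = 1; ; $attempt++) {
        try {
            return $call();
        } catch (\RuntimeException $e) {
            if ($attempt >= $maxAttempts || !isRetryable($e)) {
                throw $e;
            }
        }
    }
}

// Demonstration with a stand-in that fails once, then succeeds.
$calls  = 0;
$result = callWithRetry(function () use (&$calls): string {
    if (++$calls === 1) {
        throw new \RuntimeException('decode failed');
    }
    return 'ok';
});

echo $result, ' after ', $calls, ' calls', PHP_EOL;
```

With a real model, the closure would be fn () => $model->chat($prompt, maxTokens: $budget). Matching on message text is brittle, so revisit the markers when structured error codes land.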

Always-safe patterns

Model::close() is idempotent — calling it on an already-closed model is a no-op. Safe inside finally:

$model = Model::load($path);
try {
    return $model->chat($prompt);
} finally {
    $model->close();
}

After close(), every other method on that Model raises InferenceException with "model has been closed".

Environment variables

The extension reads exactly one environment variable today. We’ll add more as they earn their keep; the conservative approach is to keep configuration in PHP (named arguments, load options) rather than sprinkled across the environment.

`EXT_INFER_LOG`

Restores llama.cpp’s verbose stderr logging, which is silenced by default.

Value	Effect
(unset)	llama.cpp logs are silenced. This is the default.
Any value	llama.cpp logs are passed through to stderr verbatim.

EXT_INFER_LOG=1 php hello.php

Why silence by default?

A single Model::load() + chat() pair against a typical GGUF produces several hundred lines of stderr — model metadata, KV-cache sizing, graph reservation, attention layout, sampler config, and more. For a CLI tool drilling into a problem it’s useful; for a PHP extension running inside a request, it’s structured-log poison.

When to enable it

Diagnosing a ModelLoadException. The verbose log dumps the GGUF header before failing, which usually points at the cause (wrong architecture, wrong quant, truncated file).
Diagnosing a slow load. The log shows where the time goes — reading from disk, mmap setup, weight copy.
Reporting an issue. The first thing maintainers will ask for is the verbose log; capture it once with EXT_INFER_LOG=1 and paste.

How it works

The extension hooks llama_log_set at backend init time, replacing llama.cpp’s default callback with a no-op. The hook is process-global — once installed, it covers every subsequent call. EXT_INFER_LOG is checked only at backend init (the first time Model::load() is called); changing the variable mid-process has no effect.

Reserved for future use

These names are not consumed by the extension today but may be in future versions. Avoid using them as application env vars to keep your forward-upgrade path clean:

EXT_INFER_DEFAULT_NCTX
EXT_INFER_DEFAULT_TEMPERATURE
EXT_INFER_BACKEND (CPU / Metal / CUDA selection at runtime)

If you want any of these to land sooner rather than later, open an issue with the use case.

Compatibility matrix

PHP versions

Version	Status	Notes
8.3	✅ supported	Security-only upstream through end of 2027.
8.4	✅ supported	Active support.
8.5	✅ supported	Current release.
8.2 and earlier	❌ not supported	`composer.json` declares `php: ^8.3`.

Every released binary is built against a specific PHP minor. A binary built for PHP 8.4 will not load into PHP 8.5 or 8.3. PIE handles this automatically (it picks the right tarball); manual installs need to match versions explicitly.

Operating systems

Platform	Status	Notes
macOS arm64	✅ supported	Apple Silicon. Tested on macOS 14+.
macOS x86_64	⚠️ not in release matrix	Builds from source. We don’t ship binaries.
Linux x86_64 (glibc)	✅ supported	Ubuntu 22.04+, Debian 12+, RHEL 9+. Most modern distros.
Linux arm64 (glibc)	✅ supported	Ubuntu 24.04 arm64, Debian 12 arm64, AWS Graviton.
Linux musl (Alpine)	⚠️ builds from source	`.cargo/config.toml` has the right `crt-static` opt-out; no released binary.
FreeBSD / OpenBSD	⚠️ builds from source	Untested but should work; the build script handles non-Linux non-macOS as Linux.
Windows	❌ excluded	`os-families-exclude: ["windows"]` in `composer.json`. Out of scope for v0.1.

Threading

ext-infer is thread-safe by design — the LlamaBackend singleton is guarded by a Sync mutex, the underlying LlamaModel’s weights are read-only after load (llama.cpp explicitly supports many contexts on one model), and each chat() / raw() / embed() call builds its own per-call LlamaContext. Two threads calling Model::chat() concurrently on the same handle is the supported, intended shape.

PHP build	Status	Notes
NTS	✅ supported, the default	What every release binary targets today.
ZTS	✅ supported (`support-zts: true` in `composer.json`)	Not yet exercised in CI. See Threading & ZTS.

Acceleration backends

Backend	Status	Notes
CPU (default)	✅ supported, the default	Portable, no hardware requirements.
Apple Metal	⚠️ opt-in via cargo feature	`make release FEATURES=metal`. See Apple Metal.
CUDA (NVIDIA GPU)	❌ not yet	llama-cpp-2 supports it via a cargo feature; we haven’t exposed or tested it.
ROCm / Vulkan	❌ not yet	Same — supported upstream, not surfaced.

If you want CUDA or other GPU acceleration sooner rather than later, open an issue describing your use case — surfacing the feature is small work; testing it across the GPU landscape is the hard part.

Tested model families

What the maintainers have actually exercised end-to-end. Other GGUF-supported families almost certainly work; this is the “we’ve seen it produce sensible output” list.

Family	Used for
Qwen3 (Instruct)	Chat completions, reasoning splitting.
Qwen3-Embedding	Embeddings, cosine similarity.
Llama 3 / 3.1 / 3.2	Chat completions. No reasoning.
Mistral	Chat completions.
BGE / E5 / GTE	Embeddings.

Versioning policy

Pre-1.0 (0.x.y), breaking changes happen between minors (0.1.x → 0.2.x), not patches.
Once v1.0.0 ships, the class / method / argument surface is frozen. New features land additively; behavioral changes that affect existing callers wait for the next major.
See RELEASE.md for the cut-a-release flow.

Reporting compatibility issues

If you hit a “should work but doesn’t” combination on this matrix, the issue template asks for:

PHP version (php --version)
OS / arch (uname -a)
libc (Linux: ldd --version | head -1)
ZTS or NTS (php -i | grep 'Thread Safety')
Whether the extension was installed via PIE, make install, or loaded with -d extension=…

Three of those five are usually enough to triage.
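
Most of that list can be gathered from inside PHP in one go. The sketch below is a convenience, not an official tool: collectEnvironment() is a hypothetical helper, and it assumes the module registers under the name infer (matching the extension=infer line). libc still needs the ldd command, since PHP does not expose it.

```php
<?php
declare(strict_types=1);

/**
 * Collect the environment details the issue template asks for.
 *
 * @return array<string, string|bool>
 */
function collectEnvironment(): array
{
    return [
        'php_version'   => PHP_VERSION,
        // Kernel name, release, and machine type, similar to `uname -srm`.
        'os_arch'       => php_uname('s') . ' ' . php_uname('r') . ' ' . php_uname('m'),
        'thread_safety' => PHP_ZTS ? 'ZTS' : 'NTS',
        // Assumes the extension's module name is "infer".
        'infer_loaded'  => extension_loaded('infer'),
        'ini_file'      => php_ini_loaded_file() ?: '(none)',
    ];
}

foreach (collectEnvironment() as $key => $value) {
    printf("%-14s %s\n", $key, is_bool($value) ? var_export($value, true) : $value);
}
```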

Threading & ZTS

ext-infer is thread-safe by design. This page documents what that actually means: where the synchronization happens, what the runtime expectations are, and where the rough edges still are.

The thread-safety story, top to bottom

1. `LlamaBackend` is a `Sync`-guarded singleton

llama.cpp’s LlamaBackend::init() is process-global state. Initializing it twice is undefined behavior; not initializing it at all means no inference. ext-infer resolves this with:

static BACKEND: OnceLock<LlamaBackend> = OnceLock::new();
static BACKEND_INIT: Mutex<()> = Mutex::new(());

The first Model::load() call (from any thread) acquires the mutex, checks the OnceLock, calls LlamaBackend::init() if needed, and publishes the result. Every subsequent call sees a populated OnceLock and returns immediately without re-acquiring. The mutex is contended only during cold startup.

OnceLock<T> is Sync as long as T: Send + Sync, which LlamaBackend is.

2. `LlamaModel` weights are immutable after load

llama.cpp explicitly supports running multiple contexts in parallel against a single loaded model. The weights are read-only after load_from_file returns; only the per-context state (KV cache, sampler state) mutates during inference.

This is what makes the “load once, use from many threads” pattern work without any locking on the model itself.

3. Per-call `LlamaContext`

Model::chat(), Model::raw(), and Model::embed() each build a fresh LlamaContext for the duration of the call and drop it on the way out. Two threads calling chat() simultaneously get two independent contexts that share the same underlying weights via references.

// Inside run_completion:
let ctx_params = LlamaContextParams::default().with_n_ctx(Some(n_ctx));
let mut ctx = model.new_context(backend, ctx_params)?;
// ... decode, sample, decode, sample ...
// ctx dropped at function exit

No state survives the call. No cleanup is required. No two threads ever touch the same LlamaContext.

4. `Model::close()` is the one `&mut self` method

PHP’s runtime serializes calls into the same object method via its own object lock, so close() from one thread while another calls chat() should be safe by the runtime’s invariants. It is, however, the one place where the Rust code mutates the Model itself (self.inner = None). The worst case is the use-after-close error, which is what close() is supposed to provoke anyway.

When you actually get concurrency

Three deployment shapes use this thread-safety:

PHP-FPM workers (process-based) — each worker is independent; the thread-safety story doesn’t matter, but the mmap-sharing story does. See Worker pools.
ZTS PHP + parallel (thread-based) — one PHP process, multiple OS threads, each calling chat() on a shared Model. This is what the thread-safety story is for.
Swoole / ReactPHP coroutines (single-threaded but context-switching) — not actually concurrent at the OS level, so thread-safety isn’t strictly required; you’ll still benefit from the per-call context pattern because no global state survives.

ZTS-specific notes

ZTS (Zend Thread Safe) is a PHP build mode that keeps engine globals in thread-local storage so multiple PHP interpreters can run in one process. It’s required for pthreads (EOL) and the more modern parallel extension.

Detecting ZTS

php -i | grep 'Thread Safety'
# expected: Thread Safety => enabled

Or from PHP:

if (PHP_ZTS) {
    // ZTS build
}

Installing ZTS PHP

Most distros ship NTS PHP. To get ZTS:

Ubuntu / Debian: build from source with ./configure --enable-zts. Some PPAs (ondrej/php) ship a ZTS variant under php{X}.{Y}-zts but coverage is spotty.
macOS: Homebrew’s php@* formulas are NTS. Use phpbrew install +zts +parallel or build from source.
Docker: official php:*-cli images are NTS. The community silkeh/php images include ZTS variants.

ext-infer v0.1 ships NTS-only release binaries. ZTS users need to build from source. The composer.json declares support-zts: true so a future ZTS release can ship without changing the install story.

Loading `ext-infer` into ZTS PHP

Same extension=infer line in php.ini, plus parallel if you want threading:

extension=infer.so
extension=parallel.so
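
Before running threaded code, it can help to confirm all three prerequisites (a ZTS build, ext-infer, and parallel) in one place. The following is a minimal sketch using only built-in PHP functions; the helper name threadingPreflight() is hypothetical and not part of ext-infer.

```php
<?php

/**
 * Returns a list of human-readable problems that would prevent threaded
 * inference from running. An empty list means the prerequisites are met.
 * (Hypothetical helper for illustration; not part of ext-infer.)
 */
function threadingPreflight(): array
{
    $problems = [];

    // PHP_ZTS is truthy on thread-safe builds and falsy on NTS builds.
    if (!PHP_ZTS) {
        $problems[] = 'PHP is an NTS build; parallel requires ZTS';
    }

    // Both extensions must be loaded in the same process.
    foreach (['infer', 'parallel'] as $ext) {
        if (!extension_loaded($ext)) {
            $problems[] = "extension '{$ext}' is not loaded";
        }
    }

    return $problems;
}

foreach (threadingPreflight() as $problem) {
    error_log("preflight: {$problem}");
}
```

If the list is empty, the parallel test in the next section should at least get as far as loading the model.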

A minimal `parallel` test

<?php
use parallel\Runtime;
use Displace\Infer\Model;
use Displace\Infer\Prompt;

$model = Model::load('models/Qwen3-0.6B-Q8_0.gguf');

$rt1 = new Runtime();
$rt2 = new Runtime();

$f1 = $rt1->run(function (Model $m) {
    return $m->chat(Prompt::user('What is the capital of France?'))->answer();
}, [$model]);
$f2 = $rt2->run(function (Model $m) {
    return $m->chat(Prompt::user('What is the capital of Italy?'))->answer();
}, [$model]);

echo "F: ", $f1->value(), PHP_EOL;
echo "I: ", $f2->value(), PHP_EOL;

$model->close();

If this works, you have concurrent inference. If it crashes — please open an issue with the model name, PHP version, build flags, and the crash output. CI doesn’t exercise this path yet, so user reports are the canary.

Future work

Two threading-related items on the roadmap:

CI exercise for ZTS

Add a parallel-driven stress test to CI. Today the matrix only covers NTS. Adding ZTS will require:

Building a ZTS-PHP runner image (the maintainers haven’t picked one yet).
Adding a ZTS leg to ci.yml and the release matrix in release.yml.

Reusable session contexts

Today, every chat() call rebuilds the LlamaContext from scratch. That drops the KV cache, so multi-turn conversations re-prefill on every turn. A Session abstraction that owns a long-lived context would let users opt into KV-cache reuse for back-to-back turns of the same conversation. Tracked in PLAN.md.

This wouldn’t change the thread-safety story — each Session would be owned by one thread (or guarded by a mutex if shared) — but it would significantly improve multi-turn performance.

Worker pools recipe — practical patterns for concurrency in production.
Performance tuning — knobs that matter once you’ve got the concurrency story right.

Apple Metal

Metal is Apple’s low-level GPU API. On Apple Silicon hardware (M1 / M2 / M3 / M4), llama.cpp uses Metal to offload weight-matrix multiplications to the integrated GPU, which substantially outpaces the CPU for medium-to-large models.

ext-infer exposes Metal as an opt-in cargo feature. It is not enabled by default — the default build is CPU-only and portable to non-Apple platforms.

When Metal helps

Order-of-magnitude rule of thumb on an M-series Mac:

Model size	CPU tokens/sec	Metal tokens/sec	Speedup
0.6B	~80	~120	1.5×
3B	~25	~70	2.8×
7B	~12	~50	4×
13B+	(memory-limited)	~25	dramatic

Numbers are rough — they depend on quant level, M-series generation, prompt length, and what else the machine is doing. The pattern is clear though: Metal’s value grows with model size.

For small models on a fast CPU, Metal can actually be slower on the first few tokens because of the shader compilation overhead. If you’re running 600M-param models in batch mode, the CPU build is likely fine.

Enabling Metal

The cargo feature is named metal:

make release FEATURES=metal
make install  FEATURES=metal

Or via raw cargo:

cargo build --release --features metal

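The release binary is now Metal-enabled. There is no separate runtime flag for the backend: it is available whenever the cargo feature is compiled in. How much work actually runs on the GPU is controlled by n_gpu_layers, which defaults to 0 (see Per-layer offload).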

Per-layer offload

The Model::load() option n_gpu_layers controls how many transformer layers are offloaded to the GPU. It defaults to 0 (CPU only); set it to a high number (the model’s total layer count, or just 999 as an “all of them” shortcut) to offload everything:

$model = Model::load($path, [
    'n_gpu_layers' => 999,   // offload all layers to Metal
]);

For models that fit entirely in unified memory, full offload is almost always what you want. For models that don’t fit, partial offload lets you put the hot lower layers on the GPU and keep the upper layers on CPU. Tune empirically; the upstream llama.cpp Metal docs have more.

Why isn’t it the default?

Three reasons we ship CPU-by-default and Metal-by-opt-in for v0.1:

The release matrix builds on the GitHub macos-14 runner. Its hardware revision and MACOSX_DEPLOYMENT_TARGET are not fully pinned, and we haven’t validated that a Metal-enabled binary built there actually loads on every customer Mac.
CI doesn’t test Metal output for correctness. Different precision behavior on GPU vs CPU could surface as different sampler output, and we haven’t caught that drift end-to-end yet.
Cold-start cost. Metal shader compilation adds ~1s to the first inference. Acceptable for long-running workers, awkward for a CLI tool people run once.

Making Metal the default for macos-arm64 release tarballs is on the roadmap once those three concerns are resolved.

Verifying Metal is actually being used

Enable EXT_INFER_LOG and look for Metal-specific lines:

EXT_INFER_LOG=1 php hello.php 2>&1 | grep -i metal | head

You should see something like:

ggml_metal_init: GPU name:   Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 48318.38 MiB

If you see no Metal lines at all, the cargo feature didn’t get applied — re-check the make release FEATURES=metal invocation.

Memory considerations

Apple Silicon has unified memory — the GPU and CPU share the same physical RAM. There is no “host-to-device” copy step like on discrete GPUs. The trade-off is that GPU memory pressure shows up as overall system memory pressure: a 13B model in Metal mode uses ~8 GB of the same RAM your other apps need.

recommendedMaxWorkingSetSize in the log above is what macOS thinks you should keep the GPU footprint under. Loading a model larger than that will work — Metal pages weights in and out as needed — but performance drops sharply.
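
To turn that log line into a quick check, one option is to parse the figure out of captured log output and compare it with the GGUF size. This is a rough sketch: both function names are hypothetical, the log format is taken from the sample above, and file size is only a proxy for GPU footprint (it ignores the KV cache and scratch buffers).

```php
<?php

/**
 * Extracts recommendedMaxWorkingSetSize (in MiB) from captured log output.
 * Returns null when no such line is present.
 * (Hypothetical helper; not part of ext-infer.)
 */
function recommendedWorkingSetMiB(string $log): ?float
{
    $pattern = '/recommendedMaxWorkingSetSize\s*=\s*([0-9.]+)\s*MiB/';
    if (preg_match($pattern, $log, $m) !== 1) {
        return null;
    }
    return (float) $m[1];
}

/** True when a model of $modelBytes is at or under the budget in MiB. */
function fitsWorkingSet(int $modelBytes, float $budgetMiB): bool
{
    return $modelBytes / (1024 * 1024) <= $budgetMiB;
}

$log = 'ggml_metal_init: recommendedMaxWorkingSetSize  = 48318.38 MiB';
$budget = recommendedWorkingSetMiB($log);

if ($budget !== null) {
    // An 8 GiB model against the budget reported in the sample log.
    var_dump(fitsWorkingSet(8 * 1024 ** 3, $budget)); // bool(true)
}
```

In practice $modelBytes would come from filesize() on the GGUF path.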

Cross-platform note

#[cfg(feature = "metal")] only enables Metal on Apple targets. Building with --features metal on Linux is harmless (the feature is a no-op there), but there’s no reason to do it.

For GPU acceleration on non-Apple hardware (CUDA on NVIDIA, ROCm on AMD, Vulkan as a portable option) — the llama-cpp-2 crate supports all three, but ext-infer hasn’t surfaced them as cargo features yet. If you want one, open an issue.

Performance tuning — once Metal is on, the next bottleneck is usually nCtx or maxTokens.
Choosing a model — Metal opens up larger models you might not have considered.

Performance tuning

A Model::chat() call has three dominant costs:

Loading — one-time on the first call. Dominated by disk I/O (or mmap setup) for the GGUF.
Prompt prefill — tokenize + forward-pass on the prompt. Scales roughly linearly with prompt length.
Token generation — sample, decode, sample, decode, … Scales linearly with maxTokens (or wherever the model chooses to stop).

This page walks through each, what knobs affect it, and what trade-offs each knob carries.

Reducing load time

Load is the slow part — for a 4 GB model from a cold cache, expect 1–3 seconds on SSD, longer on spinning rust.

`use_mmap` (default `true`)

Memory-mapping the GGUF skips the explicit read() syscall and lets the OS page weights in lazily. Always leave this on unless you’re diagnosing a specific mmap issue. Without it, load reads the entire file upfront — slower for large models, identical for small ones once cached.

$model = Model::load($path, ['use_mmap' => true]);   // default

`use_mlock` (default `false`)

mlock pins the model’s pages in physical RAM so the OS can’t page them out. Useful when:

You’re on a memory-constrained machine and would rather OOM than thrash.
You’re serving a large model under unpredictable load and want predictable latency.

The cost: that physical memory is unavailable to anything else on the system. Don’t turn it on unless you know you want it.

$model = Model::load($path, ['use_mlock' => true]);

On Linux, mlock has a per-process limit (RLIMIT_MEMLOCK). For models larger than 64 MB (basically all of them), you’ll need to raise it via /etc/security/limits.conf or ulimit -l unlimited. macOS doesn’t enforce the same limit but may swap aggressively under memory pressure.

If you’re running multiple FPM workers, the OS automatically deduplicates mmap’d pages across them. 16 workers loading the same 4 GB model consume ~4 GB of physical memory total, not 64. This is why use_mmap matters even on machines with abundant RAM.

Reducing prompt prefill cost

Prefill cost scales with the number of prompt tokens. The longest prompts come from RAG pipelines that inject document context — see RAG over markdown.

`nCtx` (default `2048`)

The context window for a single call. The rendered prompt + generated tokens must fit. Lower is faster because llama.cpp allocates the KV cache to nCtx, so a 32k context costs 16× the KV-cache memory of a 2k context even when most of it is unused.

$model->chat($prompt, nCtx: 4096, maxTokens: 1024);

For typical RAG/chat use cases, nCtx = 2048 to 4096 is plenty. Go higher only when the model has been trained for it and you’ve measured a quality benefit.
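
The linear relationship between nCtx and KV-cache memory can be made concrete with a back-of-envelope estimate. The sketch below assumes a standard transformer KV cache (one K and one V vector per layer, position, and KV head) stored as 16-bit floats; the model dimensions are illustrative placeholders, so read the real ones from your model's metadata.

```php
<?php

/**
 * Rough KV-cache size in bytes for a given context length.
 * Assumes an f16 cache by default (2 bytes per element).
 * (Hypothetical helper; not part of ext-infer.)
 */
function kvCacheBytes(
    int $nCtx,
    int $nLayers,
    int $nKvHeads,
    int $headDim,
    int $bytesPerElement = 2
): int {
    // Factor 2: one K tensor and one V tensor per layer.
    return 2 * $nLayers * $nCtx * $nKvHeads * $headDim * $bytesPerElement;
}

// Illustrative dimensions only: 28 layers, 8 KV heads, head size 128.
foreach ([2048, 4096, 32768] as $nCtx) {
    $mib = intdiv(kvCacheBytes($nCtx, 28, 8, 128), 1024 * 1024);
    printf("nCtx=%5d -> %d MiB\n", $nCtx, $mib);
}
// nCtx= 2048 -> 224 MiB
// nCtx= 4096 -> 448 MiB
// nCtx=32768 -> 3584 MiB
```

Under these assumptions the cache grows in direct proportion to nCtx, which is why an oversized window costs memory even when the prompt is short.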

Prompt length

The fastest prompt is a short prompt. Common ways to compress without losing fidelity:

Drop boilerplate from system messages. “You are a helpful assistant. Answer truthfully. Don’t make things up. Be concise. Use markdown formatting. …” is mostly cargo-culted. Test what’s actually load-bearing.
Truncate conversation history. Keep the last N turns rather than every turn since the dawn of the conversation. For most chatbots, N = 6–10 is plenty.
Summarize old turns. Replace turns 1–50 with “Earlier, the user asked about X and you said Y.” This is what production chatbots do above a certain length.
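
The history-truncation idea can be sketched in a few lines. The example assumes turns are held as plain arrays with role and content keys, which is a hypothetical representation chosen for illustration rather than the extension's own Prompt type; the function name is likewise made up.

```php
<?php

/**
 * Keeps a leading system message (if any) plus the last $keep messages.
 * Each turn is assumed to be ['role' => string, 'content' => string].
 * (Hypothetical helper; not part of ext-infer.)
 */
function truncateHistory(array $turns, int $keep): array
{
    $system = [];
    if ($turns !== [] && $turns[0]['role'] === 'system') {
        // Preserve the system message regardless of the window size.
        $system = [array_shift($turns)];
    }

    // array_slice with a negative offset keeps the tail of the list.
    $recent = $keep > 0 ? array_slice($turns, -$keep) : [];

    return array_merge($system, $recent);
}

$history = [
    ['role' => 'system',    'content' => 'Be concise.'],
    ['role' => 'user',      'content' => 'What is GGUF?'],
    ['role' => 'assistant', 'content' => 'A model file format.'],
    ['role' => 'user',      'content' => 'Who reads it?'],
    ['role' => 'assistant', 'content' => 'llama.cpp does.'],
];

$trimmed = truncateHistory($history, 2);
echo count($trimmed), PHP_EOL; // 3: system message + last two messages
```

Summarizing old turns would slot in at the same point: replace the dropped slice with a single synthetic message instead of discarding it.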

Reducing token-generation cost

Once prefill is done, each generated token costs roughly the same. Two knobs.

`maxTokens` (default `128`)

The maximum number of generated tokens. Lower is faster. The default is conservative on purpose — bump it for any non-trivial generation:

$model->chat($prompt, maxTokens: 512);   // ~4× the default budget

Set it high enough that legitimate answers complete, low enough that runaway generations (which happen) don’t wedge the worker for minutes. For reasoning models, you’ll want at least 512 — they spend many tokens thinking.

When finishReason() === 'length', you hit this budget. Surface it to the caller so they can decide whether to bump or live with the truncation.
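
One possible policy for handling a 'length' finish is to retry with a larger budget up to a hard cap, then hand whatever came back to the caller. The sketch below is an illustration of that policy, not something the extension does for you: generateWithBudget() and the shape of the $generate callable are assumptions, and the fake generator stands in for a real model call.

```php
<?php

/**
 * Calls $generate with a doubling maxTokens budget until the result stops
 * for a reason other than 'length', or until $cap is reached.
 *
 * $generate is a hypothetical callable: fn (int $maxTokens): array
 * returning ['answer' => string, 'finishReason' => string].
 */
function generateWithBudget(callable $generate, int $initial = 128, int $cap = 1024): array
{
    if ($initial < 1 || $cap < $initial) {
        throw new InvalidArgumentException('need 1 <= initial <= cap');
    }

    $budget = $initial;
    while (true) {
        $result = $generate($budget);
        $result['maxTokens'] = $budget; // record the budget that was used

        if ($result['finishReason'] !== 'length' || $budget >= $cap) {
            // Either complete, or truncated at the cap: the caller decides.
            return $result;
        }
        $budget = min($budget * 2, $cap);
    }
}

// Stand-in generator: pretends the full answer needs 400 tokens.
$fake = fn (int $maxTokens): array => [
    'answer'       => str_repeat('x', min($maxTokens, 400)),
    'finishReason' => $maxTokens < 400 ? 'length' : 'eos',
];

$result = generateWithBudget($fake);
printf("finish=%s at maxTokens=%d\n", $result['finishReason'], $result['maxTokens']);
// finish=eos at maxTokens=512
```

In real use the callable would wrap $model->chat($prompt, maxTokens: $n) and read answer() and finishReason() from the Response. Because every chat() call builds a fresh context, each retry pays the full prefill and generation cost again, so a sensible starting budget matters more than the retry loop.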

`temperature`

temperature = 0.0 is greedy — sample the single highest-probability token at every step. It’s also the fastest because the sampler is trivial.

temperature > 0.0 enables the random sampler (with optional seed for reproducibility), which is marginally slower per token. The difference is small enough that you should pick based on output quality, not speed.

Hardware-side knobs

Quantization

A Q4_K_M model is roughly 2× faster than Q8_0 of the same model — fewer bits to fetch from memory per matrix multiply. See Choosing a model for the size/quality table.

If Q4_K_M answers are good enough for your use case, prefer it over Q8_0. The space and speed savings are real; the quality drop is usually small for chat workloads.

GPU offload

The biggest single speedup is moving compute off CPU. On Apple Silicon, see Apple Metal — n_gpu_layers: 999 typically gives a 3–4× speedup for medium models.

On Linux + NVIDIA, CUDA support exists in llama-cpp-2 but isn’t surfaced as an ext-infer cargo feature yet. Open an issue if you want it.

Pinning threads to cores

llama.cpp respects the OMP_NUM_THREADS environment variable. Setting it explicitly is sometimes faster than the default (which uses all available cores, including hyperthreads that hurt more than help). For a 4-physical-core box:

OMP_NUM_THREADS=4 php hello.php

Experimentally find the sweet spot for your CPU.

Measuring before tuning

A useful pattern: log latency per call and look for the actual bottleneck before reaching for any of these knobs.

// hrtime(true) returns a monotonic timestamp in nanoseconds.
$start = hrtime(true);
$r = $model->chat($prompt, maxTokens: 512);
$elapsed_ms = (hrtime(true) - $start) / 1_000_000;   // ns -> ms

error_log(sprintf(
    'chat: %.0fms, %d tokens, %.1f tok/s, finish=%s',
    $elapsed_ms,
    $r->tokensGenerated(),
    $r->tokensGenerated() / ($elapsed_ms / 1000),
    $r->finishReason(),
));

If tokens/sec is low (< 20 on a modern CPU), you’re hardware-bound — quantize down or enable GPU offload. If it’s reasonable (50+) but total time is high, you’re generating too many tokens — reduce maxTokens or compress the prompt.

Future work

Two performance items on the roadmap that aren’t shipping in v0.1 but would change the picture significantly:

Reusable session contexts — KV-cache reuse across chat() calls. Multi-turn conversations would skip the prefill cost on every turn after the first.
Continuous batching — process N prompts together so the GPU stays saturated. Necessary for any serious inference-as-a-service workload.

Tracked in PLAN.md.

Apple Metal — usually the largest single improvement on macOS.
Choosing a model — the model you pick caps everything else.
Worker pools recipe — when per-call tuning isn’t enough.

Building from source

The development build is what make build produces — a debug-mode shared library you can load via -d extension=…. The release build is what ships in PIE tarballs.

Prerequisites

PHP 8.3+ with php-config on PATH.
Rust — installed via rustup. The repo pins the toolchain via rust-toolchain.toml; rustup will fetch it on first build.
cmake 3.18+ — llama.cpp’s build system.
A C/C++ compiler — Clang (macOS / Linux) or GCC. The build script honors CC / CXX if you need to override.
libclang (Linux only) — apt install libclang-dev or distro equivalent. Used by bindgen for the PHP header parse.
cargo-php — cargo install cargo-php once.

Verify everything:

php --version
php-config --version
rustup --version
cmake --version
cargo php --version

Cloning

git clone https://github.com/DisplaceTech/ext-infer
cd ext-infer

The repo includes a models/ directory (gitignored) where you can drop GGUFs for testing. The PHPT suite and examples both default to models/Qwen3-0.6B-Q8_0.gguf.

Debug build

make build
# -> target/debug/libinfer.{so,dylib}

Debug builds compile faster but run slower. Use them for iterative development; switch to make release when you’re benchmarking or shipping.

A cold make build takes a few minutes because cargo compiles llama-cpp-sys-2 from source (it vendors all of llama.cpp). Cached incremental rebuilds are sub-minute on a modern laptop.

Release build

make release
# -> target/release/libinfer.{so,dylib}

Use this for installing system-wide via make install, for the performance numbers you’d quote in benchmarks, and for any “production-like” testing.

Optional features

Feature	Effect	When to use
`metal`	Enables Apple Metal GPU offload on macOS-arm64.	When you have an Apple Silicon Mac and want GPU acceleration. See Apple Metal.

make release FEATURES=metal

Loading your build into PHP

Two options.

Without installing

Pass -d extension=… on every PHP invocation:

php -d extension=$PWD/target/debug/libinfer.dylib your-script.php

Substitute .so on Linux. This is what every script in examples/ assumes — you can drop the flag once you make install.

Installing system-wide

make install runs cargo php install --release, which:

Builds release-mode if it hasn’t already.
Drops the binary into PHP’s extension_dir.
Adds extension=infer.so (or .dylib) to a config file in php.ini’s scan directory.

make install
php -m | grep infer
# infer

To revert:

make uninstall

Editor / IDE setup

Rust analyzer

The Rust code lives in src/. Pointing rust-analyzer at Cargo.toml (default) Just Works.

PHP autocomplete

Use the hand-authored stubs at stubs/infer.stubs.php, for example via the autoload-dev section of your composer.json:

{
  "autoload-dev": {
    "files": ["stubs/infer.stubs.php"]
  }
}

Or symlink it into your project. The stubs include full PHPDoc on every method so hovering in your IDE shows the option semantics without flipping to the docs.

Regenerating stubs (rare)

Stubs are hand-authored today because we want richer docblocks than cargo php stubs emits. To regenerate from scratch (e.g. to confirm the stub signatures match what’s actually registered):

make stubs
git diff stubs/infer.stubs.php

Reconcile the generated output with the hand-authored version manually.

Troubleshooting common build failures

Error	Likely fix
`linker 'cc' not found` / `cc: command not found`	Install Xcode CLT (`xcode-select --install`) or `build-essential` (Ubuntu).
`cmake: command not found`	`brew install cmake` or `apt install cmake`.
`libclang.so: cannot open shared object`	`apt install libclang-dev` (Linux). On macOS, libclang comes with the CLT.
`php-config: command not found`	Install PHP CLI; on macOS via Homebrew use `brew link php@<version> --force`.
`cargo install cargo-php` fails	Check your Rust version. `rustup update` may help.
`undefined symbol: _spl_ce_RuntimeException`	The dynamic-lookup link flag didn’t apply. Check `build.rs` ran; usually a stale `target/` — `cargo clean` and rebuild.

Testing — running PHPT and Rust unit tests.
Releasing — cut-a-release process.

Testing

ext-infer has two test layers:

PHPT — integration tests that exercise the extension from PHP. This is where the real correctness coverage lives.
Rust unit tests — for pure-Rust helpers (currently none; see Why no Rust unit tests? below).

Plus formatting and clippy. CI runs all of the above on every push.

Running PHPT locally

The test harness lives in tests/phpt/. make test runs the full suite against a debug build:

make test

What that command actually does:

Build (cargo build).
Sanity-load — confirm the extension actually loaded into PHP.
Fetch run-tests.php from PHP-src matching the current minor (if not already cached).
Run php run-tests.php -q --show-diff tests/phpt/ with TEST_PHP_EXECUTABLE and TEST_PHP_ARGS set so the freshly built .so / .dylib is loaded.

Tests gated on a real model use the INFER_TEST_MODEL environment variable; the reranker tests use INFER_TEST_RERANK_MODEL:

INFER_TEST_MODEL=$PWD/models/Qwen3-0.6B-Q8_0.gguf \
INFER_TEST_RERANK_MODEL=$PWD/models/Qwen3-Reranker-0.6B-Q8_0.gguf \
    make test

Without the variables, model-gated tests skip cleanly. CI runs in this “no model” mode by default; setting both variables runs the full suite. The reranker model is the official llama.cpp conversion: ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF.

Writing a PHPT test

Files in tests/phpt/ follow the standard PHPT format:

--TEST--
Model::chat() returns a Response with the model's answer
--SKIPIF--
<?php
if (!extension_loaded('infer')) {
    echo 'skip ext-infer not loaded';
    exit;
}
$path = getenv('INFER_TEST_MODEL');
if (!$path || !is_file($path)) {
    echo 'skip INFER_TEST_MODEL not set to an existing GGUF file';
}
?>
--FILE--
<?php
$model = \Displace\Infer\Model::load(getenv('INFER_TEST_MODEL'));
$r = $model->chat(\Displace\Infer\Prompt::user('hi'), maxTokens: 32);
echo $r->finishReason() === 'eos' || $r->finishReason() === 'length' ? "ok\n" : "bad\n";
$model->close();
?>
--EXPECT--
ok

Filename convention: NNN-short-description.phpt. NNN ordering is loose — it determines the order run-tests.php runs them in, which doesn’t really matter.

Three sections every model-gated test needs:

--SKIPIF-- — skip if extension_loaded('infer') is false (the harness invocation always passes -d extension=…, so this catches setup mistakes) and skip if INFER_TEST_MODEL is unset.
--FILE-- — the actual PHP under test.
--EXPECT-- or --EXPECTF-- — expected output. Use --EXPECTF-- if you need wildcards (%s, %d).

For tests that DON’T need a model, drop the INFER_TEST_MODEL check from --SKIPIF--. They’ll run in CI’s no-model leg.

Running Rust unit tests

cargo test --lib

…would be the command, but see the next section.

Why no Rust unit tests?

Earlier versions had Rust unit tests in src/response.rs and src/embedding.rs covering pure-Rust helpers. They were dropped because cargo test --lib builds an executable that statically links the crate, which pulls in references to the ext-php-rs runtime symbols (zend_throw_exception, _emalloc, …) — symbols only resolved when loaded into a real PHP host. On a clean checkout, cargo test --lib fails to link.

PHPT covers the same correctness ground end-to-end, so this is a net win for CI simplicity. If a pure-Rust helper grows complex enough to warrant unit tests in isolation, the path forward is to factor it into a sibling crate that has no ext-php-rs dependency.

Linting

make fmt-check       # cargo fmt --all --check
make clippy          # cargo clippy --all-targets -- -D warnings

CI runs both, with clippy set to -D warnings so that any lint warning fails the build. Local lints are pinned to the same Rust toolchain as the build (via rust-toolchain.toml).

CI structure

.github/workflows/ci.yml runs on every push and PR:

rustfmt + clippy on ubuntu-latest with PHP 8.4. Fast (~1 minute warm-cache).
Test matrix — 6 legs: {ubuntu-latest, macos-14} × {8.3, 8.4, 8.5}. Each builds the extension, loads it, runs the no-model PHPT legs. Cache is scoped per-PHP-minor (see the comment in ci.yml about why this matters for ext-php-rs binding regeneration).

What CI does not do:

Run model-gated PHPT tests. Adding a fixture model to CI is on the roadmap; for now, run them locally before tagging.
Exercise ZTS PHP. See Threading & ZTS.

Pre-flight checklist

Before opening a PR, the maintainers run:

cargo fmt --all --check                              # no diff
cargo clippy --all-targets -- -D warnings            # clean
INFER_TEST_MODEL=$PWD/models/Qwen3-0.6B-Q8_0.gguf make test  # all green

If any of those fail, the PR will fail CI for the same reason — fix locally first.

Releasing — what runs in the release workflow (a different beast than CI).
Building from source — getting to the point where make test can even run.

Releasing

The full cut-a-release process lives in RELEASE.md at the repo root. This page is the one-screen version with pointers back into that document.

The five-step shape

# 1. Bump versions
edit Cargo.toml                      # [package].version = "0.1.0"

# 2. Verify locally
cargo fmt --all --check
cargo clippy --all-targets -- -D warnings
INFER_TEST_MODEL=$PWD/models/Qwen3-0.6B-Q8_0.gguf make test
composer validate composer.json

# 3. Land the bump
git commit -am "chore(release): v0.1.0"
git push

# 4. Tag — this is what triggers the release workflow
git tag v0.1.0
git push --tags

# 5. Edit and publish the draft Release on GitHub

Step 4 is the only user-facing action. The release workflow takes it from there.

What the workflow does

For each (PHP minor, OS, arch) in the 9-leg matrix:

Install system deps (cmake, build-essential, …).
Install the matrix PHP via shivammathur/setup-php@v2.
cargo build --release.
Stage infer.so / infer.dylib in the right shape.
Tarball as php_infer-{version}_php{minor}-{arch}-{os}[-{libc}].tar.gz per PIE’s filename convention.
Compute a .sha256 sidecar.
Upload both to a draft GitHub Release.

The first matrix leg creates the draft Release; later legs add files to the same one.
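
When checking a draft Release by hand, it can be useful to generate the expected asset names from the pattern above. The sketch below follows the filename template as written in this list; the function names are hypothetical, the arch / os / libc values are illustrative, and the sidecar layout (digest, two spaces, filename, as sha256sum prints it) is an assumption rather than something the workflow is confirmed to emit.

```php
<?php

/**
 * Builds a tarball name following
 * php_infer-{version}_php{minor}-{arch}-{os}[-{libc}].tar.gz
 * (Hypothetical helper; not part of the release tooling.)
 */
function pieTarballName(
    string $version,
    string $phpMinor,
    string $arch,
    string $os,
    ?string $libc = null
): string {
    $name = sprintf('php_infer-%s_php%s-%s-%s', $version, $phpMinor, $arch, $os);
    if ($libc !== null) {
        $name .= '-' . $libc; // optional libc suffix
    }
    return $name . '.tar.gz';
}

/** Produces a sha256sum-style line for a file (assumed sidecar format). */
function sha256SidecarLine(string $path): string
{
    $digest = is_file($path) ? hash_file('sha256', $path) : false;
    if ($digest === false) {
        throw new RuntimeException("cannot read {$path}");
    }
    return $digest . '  ' . basename($path) . "\n";
}

// Illustrative matrix values only.
echo pieTarballName('0.1.0', '8.3', 'x86_64', 'linux', 'glibc'), PHP_EOL;
echo pieTarballName('0.1.0', '8.4', 'arm64', 'darwin'), PHP_EOL;
```

Looping this over the matrix gives the list of names to tick off against the draft Release before publishing.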

Why “draft”?

Releases ship draft so a maintainer can:

Verify all 18 files (9 tarballs + 9 sidecars) are attached.
Write release notes — the workflow doesn’t auto-generate them.
Spot-check one tarball locally with pie install before exposing it to users.

After the manual review, hit Publish release in the GitHub UI.

Versioning policy

Pre-1.0 (0.x.y), breaking changes happen between minors (0.1 → 0.2), not patches. Once v1.0.0 ships, the class / method / argument surface is frozen.
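
The policy can be expressed as a small predicate, which may be handy in upgrade scripts. This is a sketch under stated assumptions: mayBreak() is a hypothetical name, versions are expected in x.y.z form with an optional leading v, and treating any major-version change as potentially breaking is an assumption borrowed from semantic versioning rather than something stated above.

```php
<?php

/**
 * True when moving from $from to $to may include breaking changes
 * under the versioning policy described above.
 * (Hypothetical helper; expects versions like "0.1.0" or "v0.1.0".)
 */
function mayBreak(string $from, string $to): bool
{
    [$fromMajor, $fromMinor] = array_map('intval', explode('.', ltrim($from, 'v')));
    [$toMajor, $toMinor]     = array_map('intval', explode('.', ltrim($to, 'v')));

    if ($fromMajor !== $toMajor) {
        return true; // assumption: a major change may always break
    }
    if ($fromMajor === 0) {
        return $fromMinor !== $toMinor; // pre-1.0: minors may break
    }
    return false; // 1.x: surface is frozen
}

var_dump(mayBreak('0.1.0', '0.1.3')); // bool(false): patch release
var_dump(mayBreak('0.1.3', '0.2.0')); // bool(true): pre-1.0 minor bump
```

A Composer-style caret constraint such as ^0.1.0 expresses the same boundary, since for 0.x versions it allows >=0.1.0 and <0.2.0.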

composer.json does NOT carry a version key — that would conflict with the tag-derived version Composer infers. The branch-alias under extra exists only so dev-main resolves to 0.1.x-dev for users pinning a dev branch.

What `RELEASE.md` covers in more detail

Pre-flight checklist (the verify-locally step expanded).
Release-notes template.
Post-publish smoke test (install via PIE, run hello-world).
Hotfix / patch process.
Yanking a broken release.
Caveats (Windows excluded, ZTS untested, etc.).
Symptom → first-thing-to-check table for release failures.

If you’re cutting a release, read RELEASE.md first. This page is the index, not the manual.

Caveats

Three things v0.1 explicitly doesn’t ship and that you should know about before cutting one:

No Windows binaries. os-families-exclude: ["windows"] in composer.json makes PIE skip Windows hosts cleanly.
No ZTS binaries. The composer.json declares support-zts: true because the code is thread-safe by construction, but the release matrix doesn’t include a ZTS runner. ZTS users need to build from source for now.
No musl Linux binaries. The release matrix is glibc only. Musl users build from source; the .cargo/config.toml carries the needed crt-static opt-out.

All three are tracked in PLAN.md.

RELEASE.md — the full process document.
PLAN.md — what’s in flight after v0.1.
