Introduction
ext-infer is a PHP 8.3+ extension that loads a GGUF
model and runs LLM inference inside the PHP process via
llama.cpp. PHP-native semantic
search, RAG pipelines, and CLI / worker inference run without shelling
out to Python or hitting a remote API.
It is written in Rust on top of
ext-php-rs and the
llama-cpp-2 bindings. The public
PHP surface is designed to feel native: a fluent, role-aware Prompt
builder; a Response that splits reasoning from answer; an Embedding
that knows how to normalize itself and compute cosine similarity. You
should rarely, if ever, need to think about <|im_start|> tokens.
Why an extension?
Three reasons local inference belongs in PHP rather than next to it:
- Latency. A subprocess fork or HTTP roundtrip is at least milliseconds, often tens. An in-process call is bounded only by decode time.
- Operational surface. No Python sidecar to package, no daemon to supervise, no inference server to scale alongside FPM. The PHP process is the inference server.
- API ergonomics. Calling a local LLM should be as natural in PHP as calling intl or pdo. The extension API is shaped to match that; see Prompts and Chat completions.
What’s here
This guide is split into five layers, navigable from the sidebar:
| Section | What you’ll find |
|---|---|
| Getting Started | Install, run hello-world, verify it loaded. |
| Guide | Conceptual walkthroughs of each public class. Read in order on first pass. |
| Recipes | Copy-paste-ready patterns: multi-turn chat, semantic search, RAG, worker pools. |
| Reference | Complete API listing, exceptions, environment variables, compatibility matrix. |
| Advanced | Threading model, Apple Metal, performance tuning. |
Status
ext-infer is pre-release — the class surface is stable but the
first tagged release (v0.1.0) is still in flight. See
RELEASE.md
for the cut-a-release flow and PLAN.md
for what’s coming next.
Conventions in this guide
- Code blocks are runnable as written, with one exception: PHP code assumes the extension is loaded. Either install it system-wide or prepend -d extension=… to your php command. See Installation.
- Model without a namespace prefix means Displace\Infer\Model; same for Prompt, Response, Embedding. Real code needs the use statement at the top of the file.
- CLI snippets are written for a POSIX shell (bash / zsh). Adjust for fish / PowerShell as needed; differences are usually only quoting.
Installation
Two supported install paths:
- Via PIE: pulls a pre-built binary for your (php-minor, arch, os, libc) combo. No local C/C++ toolchain. Recommended for application developers.
- From source: builds llama.cpp locally via cargo. Needed for contributors, distros without a pre-built artifact, or anyone who wants to enable the metal cargo feature.
Via PIE
Heads up: PIE installation is wired up but the first published release (v0.1.0) is still in flight. Until then, install from source; the pie install flow becomes the recommended path the moment we ship binaries.
PIE (PHP Installer for Extensions) is the official tool for installing PHP extensions from Composer-style metadata. Get it once:
curl -L --output pie.phar \
https://github.com/php/pie/releases/latest/download/pie.phar
chmod +x pie.phar && sudo mv pie.phar /usr/local/bin/pie
Then install ext-infer:
pie install displace/ext-infer
PIE reads composer.json
to learn that ext-infer ships pre-packaged binaries, fetches the
right tarball from the matching GitHub Release,
extracts infer.so (or infer.dylib on macOS) into the PHP extension
directory, and adds it to your php.ini.
Verify the install with php -m:
php -m | grep infer
# infer
From source
Prerequisites
| Tool | Purpose | Minimum |
|---|---|---|
| PHP CLI | host process | 8.3 |
| php-config | tells ext-php-rs where the PHP headers are | (matches PHP) |
| Rust toolchain | compiles the extension | 1.88 |
| cmake | llama.cpp builds via cmake during cargo build | 3.18+ |
| C/C++ toolchain | llama.cpp itself | Clang / GCC |
| cargo-php | wraps make install to drop the artifact in PHP’s extension dir | 0.1+ |
The Rust toolchain is pinned via rust-toolchain.toml, so you don’t
need to install a specific version manually — rustup will fetch it on
first build. On macOS, cmake is a brew install cmake away; on
Debian/Ubuntu, apt install cmake build-essential libclang-dev.
Install cargo-php once:
cargo install cargo-php
Build and install
git clone https://github.com/DisplaceTech/ext-infer
cd ext-infer
make release # builds target/release/libinfer.{so,dylib}
make install # cargo php install --release
php -m | grep infer
A cold build compiles llama.cpp from source, which takes a few minutes on a fresh machine. Subsequent builds reuse cargo’s incremental cache and the already-built llama.cpp object files; expect sub-minute rebuilds after the first one.
Without make install (development)
If you want to load a freshly built binary without committing to installing it system-wide, pass the path on the PHP command line:
make build # debug build (faster compile, slower runtime)
php -d extension=$PWD/target/debug/libinfer.dylib your-script.php
Substitute .so for .dylib on Linux. This is the workflow used
throughout the examples.
Apple Metal acceleration (opt-in)
The default build is CPU-only and portable. For Apple Silicon GPU acceleration:
make release FEATURES=metal
make install FEATURES=metal
See Apple Metal for what this does and what trade-offs it implies.
Uninstalling
Via PIE:
pie uninstall displace/ext-infer
From a source install:
make uninstall # cargo php remove
Either way, confirm with php -m | grep infer (should produce no
output).
Troubleshooting
If php -m | grep infer shows nothing after install, see
Verifying your install for the diagnostic checklist. It
walks through the four most common failure modes
(extension_dir mismatch, PHP minor mismatch, missing
-undefined,dynamic_lookup on macOS, libc mismatch on Linux).
Quick start
This page assumes you’ve already installed the extension. From a cold install to a working answer in under a minute:
1. Grab a model
GGUF files are big. Even the smallest interesting ones are 600 MB quantized. For getting started, Qwen3-0.6B-Q8_0 is a good first model — Apache-2.0 licensed, ~640 MB, fast on CPU, good enough at toy questions:
mkdir -p models
curl -L -o models/Qwen3-0.6B-Q8_0.gguf \
https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q8_0.gguf
See Choosing a model for the broader landscape.
2. Write the script
Save the following as hello.php:
<?php
declare(strict_types=1);
use Displace\Infer\Model;
use Displace\Infer\Prompt;
$model = Model::load('models/Qwen3-0.6B-Q8_0.gguf');
$response = $model->chat(
Prompt::system('You are a helpful, concise assistant.')
->withUser('What is 2+2?'),
maxTokens: 256,
temperature: 0.0,
);
echo $response->answer(), PHP_EOL;
$model->close();
Three things going on:
- Model::load(...) reads the GGUF into memory. Loading is the slow step; for a real app, load once and keep the handle around. See Choosing a model.
- Prompt::system(...)->withUser(...) builds a chat prompt without any template tokens. The Prompt is immutable; each with* returns a new instance. See Prompts.
- $model->chat($prompt, ...) renders the prompt through whatever chat template the GGUF ships, runs inference, and returns a Response. answer() is the model’s reply with any <think>...</think> reasoning stripped.
3. Run it
If you installed via PIE (or make install), just:
php hello.php
If you’re running against a make build artifact instead:
php -d extension=$(pwd)/target/debug/libinfer.dylib hello.php
Substitute .so on Linux. Expected output:
2 + 2 equals 4.
4. What just happened
llama.cpp normally spams several hundred lines to stderr per inference
(model layout, KV-cache sizing, graph reservation). ext-infer
silences that by default — it’s noise inside a PHP request and tends
to poison structured logs. Bring it back when you need to debug:
EXT_INFER_LOG=1 php hello.php
See Environment variables for the complete list.
Next steps
- Verifying your install: the canonical diagnostic checklist if the script above doesn’t work.
- Prompts: multi-turn chat, system messages, immutability semantics.
- Embeddings: Model::embed() plus cosine similarity.
- Multi-turn chat recipe: a ready-to-lift implementation of conversational state.
- examples/chat-interactive/: a Symfony Console standalone app that takes the above further.
Verifying your install
After installing, three things should be true. If any of them isn’t, this page is the checklist.
The fast version
# 1. Is the extension loaded?
php -m | grep infer
# expected: infer
# 2. Are the classes registered?
php -r 'echo class_exists("Displace\\Infer\\Model") ? "yes\n" : "no\n";'
# expected: yes
# 3. Does inference actually work?
php -r '
$m = \Displace\Infer\Model::load("models/Qwen3-0.6B-Q8_0.gguf");
$r = $m->chat(\Displace\Infer\Prompt::user("Say hello."));
echo $r->answer(), PHP_EOL;
$m->close();
'
# expected: a one-line greeting
All three pass → you’re done. Skip to the Guide.
Diagnosis if php -m | grep infer is empty
The extension didn’t load. PHP loads extensions from a specific directory and looks for them by exact filename — usually one of these four things is off.
1. PHP can’t find the binary
Confirm where PHP is looking:
php -i | grep -E '^extension_dir|^Loaded Configuration File'
Then confirm the binary is in that directory:
ls -l $(php -r 'echo ini_get("extension_dir");')/infer.*
If the file is missing:
- After make install, cargo-php should have placed it there. Try re-running with -v to see where it landed: make install (or cargo php install --release -v).
- After pie install, look at PIE’s output for the install path.
If the file is in a different directory than extension_dir, either
move it or update extension_dir in your php.ini.
2. PHP minor mismatch
A binary built against PHP 8.4 will not load into PHP 8.5 (and vice versa). Confirm both:
php --version | head -1
# e.g. PHP 8.4.20 (cli)
# For PIE-installed binaries, the tarball filename encodes the PHP
# minor — check the GitHub Release you installed from:
ls -l $(php -r 'echo ini_get("extension_dir");')/infer.*
Cross-check that the binary’s PHP minor matches your running PHP minor.
If they disagree, re-install with the right tarball (PIE handles this
automatically; manual installs may need pie install --force).
3. macOS: -undefined dynamic_lookup missing from the link
The extension uses dlopen-style undefined-symbol resolution against
the host PHP binary. If you built from source on macOS and skipped the
extension’s own build.rs, the linker errors out at build time with
Undefined symbols for architecture arm64. From-source builds via
make build / make release configure this automatically. If you
invoked cargo build from somewhere unusual (e.g. an IDE), repeat the
build via make to be safe.
4. Linux: libc mismatch
The released binaries target glibc Linux. Alpine (musl) is not in the v0.1 release matrix. Confirm your libc:
ldd --version 2>&1 | head -1
# expected: ldd (GNU libc) 2.x
# if you see musl: rebuild from source — see Installation
Building from source on musl works; .cargo/config.toml carries the
needed crt-static opt-out.
Diagnosis if classes are missing
If php -m shows infer but class_exists("Displace\\Infer\\Model")
returns no, the namespace probably has a typo somewhere upstream of
you. The full list:
Displace\Infer\Model
Displace\Infer\Prompt
Displace\Infer\Message
Displace\Infer\Response
Displace\Infer\Embedding
Displace\Infer\InferException
Displace\Infer\ModelLoadException
Displace\Infer\InferenceException
All eight should exist after a successful load. If only some do, you
likely have an ext-infer install left over from an older API surface.
Uninstall the old version (pie uninstall or make uninstall) and
reinstall.
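The extension and class checks can be folded into one script. This is a minimal sketch that uses only PHP's built-in extension_loaded() and class_exists(); the helper name missingClasses() is made up for illustration, and the class list is the one given on this page.

```php
<?php
declare(strict_types=1);

/**
 * Return the subset of $classes that PHP cannot resolve.
 * missingClasses() is a hypothetical helper, not part of ext-infer.
 *
 * @param list<string> $classes fully qualified class names
 * @return list<string>
 */
function missingClasses(array $classes): array
{
    return array_values(array_filter(
        $classes,
        static fn (string $class): bool => !class_exists($class),
    ));
}

// The eight classes listed on this page.
$expected = [
    'Displace\Infer\Model',
    'Displace\Infer\Prompt',
    'Displace\Infer\Message',
    'Displace\Infer\Response',
    'Displace\Infer\Embedding',
    'Displace\Infer\InferException',
    'Displace\Infer\ModelLoadException',
    'Displace\Infer\InferenceException',
];

if (!extension_loaded('infer')) {
    echo 'extension not loaded: start with the php -m diagnosis', PHP_EOL;
}
foreach (missingClasses($expected) as $class) {
    echo "missing: {$class}", PHP_EOL;
}
```

No output means the extension is loaded and every class is registered. A partial list of "missing" lines is the stale-install case described above.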
Diagnosis if inference fails
If Model::load throws ModelLoadException:
- “no such file”: the GGUF path is wrong. PHP resolves relative paths against the working directory, not the script’s directory.
- “failed to load model: …”: check that the file isn’t truncated (du -h should match what the publisher lists) and that it really is a GGUF (file <path> should mention “data” or similar; if it says “ASCII text” it’s probably an HTML 404 from a failed download).
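Both checks can also be done from PHP before calling Model::load(). A GGUF file begins with the four ASCII bytes GGUF, so reading the first four bytes is enough to catch an HTML error page saved under a .gguf name. The sketch below is an illustration only: looksLikeGguf() is a hypothetical helper, and the path assumes the models/ directory sits next to the script.

```php
<?php
declare(strict_types=1);

/**
 * True when the file starts with the four-byte GGUF magic ("GGUF").
 * looksLikeGguf() is a hypothetical helper, not part of ext-infer.
 */
function looksLikeGguf(string $path): bool
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return false; // missing or unreadable file
    }
    $magic = fread($handle, 4);
    fclose($handle);

    return $magic === 'GGUF';
}

// Build the path from the script's directory so the result does not
// depend on the working directory PHP was started from.
$path = __DIR__ . '/models/Qwen3-0.6B-Q8_0.gguf';
echo looksLikeGguf($path)
    ? "looks like a GGUF\n"
    : "not a GGUF (or not readable)\n";
```

This only inspects the header. A download that was cut off partway still passes, so the size comparison against the publisher's listing remains worth doing.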
If Model::chat throws InferenceException with "model has no embedded chat template", you’ve picked a base model rather than an
instruct/chat variant. See Choosing a model or
use Model::raw() with your own templating.
If the script segfaults rather than throwing — please open an issue at github.com/DisplaceTech/ext-infer/issues with the model name, PHP version, and OS. That’s a bug.
Enabling verbose logging
llama.cpp’s own diagnostic chatter is silenced by default. To see it (model layout, KV cache sizing, graph reservation, …):
EXT_INFER_LOG=1 php hello.php
A noisy log can sometimes point straight at the issue — e.g. “n_ctx exceeds model’s training context” tells you the model is being asked to handle longer input than it was trained for.
Prompts
Displace\Infer\Prompt is the input to Model::chat(). It
represents an ordered list of role-tagged messages — system, user,
assistant — that the extension renders into whatever chat-template
format the underlying model expects. You never write <|im_start|> (or
its Llama 3 / Mistral / Gemma equivalent) by hand.
Two-stage construction
A Prompt starts with a factory — either system() or user() —
and grows via with* calls. Each with* returns a new Prompt;
the receiver is never modified.
use Displace\Infer\Prompt;
// Start with a system message:
$p = Prompt::system('You are a helpful assistant.')
->withUser('What is 2+2?');
// Or start with a user message (no system instruction):
$p = Prompt::user('Hello!');
// Multi-turn replays:
$p = Prompt::system('You are a poet.')
->withUser('Write a haiku about Rust.')
->withAssistant("Code runs cold and fast,\nMemory safe by the borrow,\nNo crashes today.")
->withUser('Now translate it to French.');
Direct new Prompt() is refused at runtime:
new Prompt();
// Displace\Infer\InferException: use Displace\Infer\Prompt::system()
// or Prompt::user() to start a prompt
Why immutable?
The shape mirrors DateTimeImmutable. Two practical consequences:
- A Prompt you’ve built once is safe to share across multiple chat() calls, hand to a queue worker, or stash in a class property. Nothing downstream can mutate it.
- Branching is free. The multi-turn chat recipe keeps a $basePrompt around (system-message-only) so /reset can drop conversation history without re-rendering the system prompt:

$base = Prompt::system($systemMessage);
$conversation = $base;
// … many turns …
if ($userTyped === '/reset') {
    $conversation = $base; // immutable; $base is untouched no
                           // matter how many turns went through it
}
Inspecting a Prompt
$p->messages(); // list<Displace\Infer\Message>
$p->count(); // int — number of messages
$p->isEmpty(); // bool
$p->lastRole(); // ?string — role of the most recent message, or null
Each Message is read-only:
foreach ($p->messages() as $msg) {
printf("[%s] %s\n", $msg->role(), $msg->content());
}
// [system] You are a helpful assistant.
// [user] What is 2+2?
role() is always one of 'system', 'user', or 'assistant'.
Method-name discipline on the construction side (withSystem,
withUser, withAssistant) means a typo fails loudly as an
undefined-method error instead of silently creating a fictional role.
Role ordering
ext-infer does not enforce role ordering at construction time. You
can build:
Prompt::user('hi')->withSystem('be terse'); // legal
Prompt::system('a')->withSystem('b'); // also legal
…and they will be rendered as written. Whether the model accepts the
result is a chat-template decision: most modern chat templates require
exactly one leading system message (or none) followed by alternating
user / assistant turns. Build sequences that match that convention
and the chat template will render them; deviate and you may get an
error from Model::chat() at call time.
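If you would rather catch an unusual sequence before inference, the convention described above (at most one leading system message, then alternating user / assistant turns) is easy to check on a plain list of role strings. This is a sketch of that convention only; followsCommonRoleOrder() is a hypothetical helper, and individual chat templates may be stricter or more lenient.

```php
<?php
declare(strict_types=1);

/**
 * Check a role sequence against the common chat-template convention:
 * an optional single leading 'system', then user, assistant, user, ...
 * followsCommonRoleOrder() is a hypothetical helper, not part of ext-infer.
 *
 * @param list<string> $roles
 */
function followsCommonRoleOrder(array $roles): bool
{
    // Drop a single leading system message, if present.
    if (($roles[0] ?? null) === 'system') {
        array_shift($roles); // also re-indexes the list from 0
    }

    // Everything that remains must alternate, starting with 'user'.
    foreach ($roles as $i => $role) {
        $expected = $i % 2 === 0 ? 'user' : 'assistant';
        if ($role !== $expected) {
            return false;
        }
    }

    return true;
}

var_dump(followsCommonRoleOrder(['system', 'user', 'assistant', 'user'])); // bool(true)
var_dump(followsCommonRoleOrder(['user', 'system']));                      // bool(false)
var_dump(followsCommonRoleOrder(['system', 'system']));                    // bool(false)
```

With the extension loaded, the role list for a real Prompt can be built from the inspection methods shown earlier: array_map(fn ($m) => $m->role(), $prompt->messages()).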
Composition patterns
Pre-baked system prompts
If your application has a few stock personalities, define them once:
final class Personas
{
public static function poet(): Prompt
{
return Prompt::system(
'You are a haiku poet. Respond in three lines. ' .
'Five syllables, then seven, then five.'
);
}
public static function reviewer(): Prompt
{
return Prompt::system(
'You review code. Always cite specific line numbers ' .
'and prefer questions over assertions when uncertain.'
);
}
}
$response = $model->chat(Personas::poet()->withUser('Tell me about autumn.'));
Because Prompt is immutable, returning a Prompt from a helper
method is safe — callers can’t mutate the cached base.
Replaying history
When you have stored history (e.g. fetched from a database), rebuild
the Prompt from scratch each turn:
$prompt = Prompt::system($systemMessage);
foreach ($historyFromDb as $row) {
$prompt = match ($row['role']) {
'user' => $prompt->withUser($row['content']),
'assistant' => $prompt->withAssistant($row['content']),
};
}
$prompt = $prompt->withUser($newUserInput);
This is the canonical multi-turn-chat shape. See the multi-turn chat recipe.
Feeding Response::answer() back, not text()
When you append the assistant’s reply to the prompt for the next turn,
use Response::answer() (reasoning stripped), not
Response::text():
$response = $model->chat($prompt);
$prompt = $prompt->withAssistant($response->answer());
// ^^^^^^^^^
// not ->text(), which includes <think>…</think>
Feeding <think> blocks back as conversation history derails reasoning
models — they see their own thoughts in the transcript and get
confused. See Reasoning models.
Next
- Chat completions: feeding a Prompt to the model.
- Model::raw(): when you want full control over the prompt string instead.
Chat completions
Model::chat() is the main inference entry point. It takes a
Prompt and returns a Response:
public function chat(
\Displace\Infer\Prompt $prompt,
int $maxTokens = 128,
int $nCtx = 2048,
float $temperature = 0.0,
int $seed = 1234,
): \Displace\Infer\Response;
All four optional arguments (maxTokens, nCtx, temperature, seed) are passed as PHP 8 named arguments; there is no options array. See Options reference for what each one does.
What chat() does
Three steps happen between the call and the return value:
- Render. The Prompt’s messages are fed through llama_chat_apply_template, using the chat template embedded in the GGUF. Qwen3, Llama 3, Mistral, Gemma: each ships its own Jinja template inside the model file. ext-infer reads it and uses it verbatim.
- Decode. The rendered prompt is tokenized, decoded through the model in a single batch, then a sampler generates output tokens one by one until either the model emits an end-of-generation token (finishReason = 'eos') or the maxTokens budget is exhausted (finishReason = 'length').
- Split. If the generated text contains <think>...</think> blocks (Qwen3 / DeepSeek R1 / other reasoning models), they’re captured into Response::reasoning() and stripped from Response::answer(). See Reasoning models for the details.
Inspecting a Response
Response is read-only. Six getters:
$response->text(); // string — full output, <think>…</think> + answer
$response->reasoning(); // ?string — captured <think>…</think>, or null
$response->answer(); // string — text() minus reasoning, leading WS trimmed
$response->hasReasoning(); // bool
$response->finishReason(); // string — 'eos' | 'length' | 'stop'
$response->tokensGenerated(); // int — generated tokens only, not prompt
Response is created internally — new Response() throws.
text() vs answer()
For non-reasoning models, the two are byte-identical. For reasoning models invoked through their chat template:
text(): <think>Okay so 2+2…</think>\n\n2 + 2 equals 4.
answer(): 2 + 2 equals 4.
reasoning(): Okay so 2+2…
answer() is what end users want to read; reasoning() is what you’d
log for debugging or display behind a “show thinking” toggle.
finishReason()
Three possible values:
| Value | Meaning |
|---|---|
| 'eos' | Model emitted an end-of-generation token. Output is complete. |
| 'length' | maxTokens was hit before EOS. Output is likely truncated mid-thought. |
| 'stop' | Reserved for future stop-string support. Currently only reachable when the prompt produced zero tokens (a degenerate input). |
When you see 'length', surface it to the user — “hit the token budget,
bump maxTokens to see more”. Silently truncating is a bad UX.
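One way to surface truncation is to turn the finish reason into an optional notice. The sketch below works on the plain values returned by finishReason() and tokensGenerated(), so it runs without a model; truncationNotice() and its wording are hypothetical, not part of the extension.

```php
<?php
declare(strict_types=1);

/**
 * Map a finish reason to a user-facing notice, or null when the reply
 * is complete. truncationNotice() is a hypothetical helper.
 */
function truncationNotice(string $finishReason, int $tokensGenerated): ?string
{
    return match ($finishReason) {
        'length' => sprintf(
            'Reply stopped after %d tokens (token budget reached). Raise maxTokens to see more.',
            $tokensGenerated,
        ),
        'eos', 'stop' => null,
        default => throw new UnexpectedValueException(
            "unknown finish reason: {$finishReason}"
        ),
    };
}

echo truncationNotice('length', 128) ?? 'complete', PHP_EOL;
echo truncationNotice('eos', 42) ?? 'complete', PHP_EOL;
```

In application code the two arguments would come from $response->finishReason() and $response->tokensGenerated().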
tokensGenerated()
Counts generated tokens only, not the prompt’s tokens. Useful for billing-like accounting, latency analysis, or capping conversation length.
Calling chat()
A minimal call uses every default:
$response = $model->chat(Prompt::user('Hello!'));
A fully-specified one:
$response = $model->chat(
Prompt::system('You are a helpful, concise assistant.')
->withUser('What is the capital of Antarctica?'),
maxTokens: 256,
nCtx: 4096,
temperature: 0.7,
seed: 42,
);
Sampling defaults — temperature: 0.0, seed: 1234 — give greedy,
deterministic output: the same prompt always produces the same reply.
Crank temperature up for varied / creative output; the seed only
matters when temperature > 0.
Errors
Model::chat() raises InferenceException
for any failure between “the model exists” and “we got tokens back”.
The most common message strings:
| Substring | Meaning |
|---|---|
| model has been closed | You called chat() after $model->close(). Reload the model. |
| model has no embedded chat template | The GGUF is a base model, not an instruct/chat variant. Either pick a chat-tuned model or use Model::raw(). |
| apply_chat_template failed | The chat template rendered but llama.cpp rejected the result. Usually means the message-role sequence is one the template doesn’t support (e.g. multiple system messages). |
| prompt is N tokens but n_ctx is only M | The rendered prompt is longer than nCtx. Bump nCtx or shorten the prompt. |
| chat message contains a null byte | A Prompt’s content has an embedded \0. Strip it before constructing the prompt. |
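Because the failure modes share one exception class, telling them apart means looking at the message. The sketch below maps the substrings from the table to a short remediation hint. hintForInferenceError() is hypothetical, and matching on 'n_ctx is only' assumes the real message keeps that wording around the N and M placeholders. Substring matching is brittle across releases, so treat this as a logging aid rather than control flow.

```php
<?php
declare(strict_types=1);

/**
 * Map an InferenceException message to a remediation hint.
 * hintForInferenceError() is a hypothetical helper, not part of ext-infer.
 */
function hintForInferenceError(string $message): string
{
    // Substrings taken from the table above; hints are paraphrased.
    $hints = [
        'model has been closed' => 'reload the model before calling chat()',
        'model has no embedded chat template' => 'use a chat-tuned model or Model::raw()',
        'apply_chat_template failed' => 'check the role sequence of the prompt',
        'n_ctx is only' => 'raise nCtx or shorten the prompt',
        'contains a null byte' => 'strip NUL bytes from message content',
    ];

    foreach ($hints as $needle => $hint) {
        if (str_contains($message, $needle)) {
            return $hint;
        }
    }

    return 'unrecognized error; log the full message';
}

echo hintForInferenceError('prompt is 5000 tokens but n_ctx is only 2048'), PHP_EOL;
```

In a real call site the argument would be $e->getMessage() inside a catch (\Displace\Infer\InferenceException $e) block.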
Chat-template errors
If you load a model that doesn’t ship a chat template — typically a “base” or “pretrained” model rather than an instruct variant — you’ll see:
InferenceException: model has no embedded chat template — use
Model::raw() for this model: …
Model::raw() lets you do your own templating. See
Raw completions.
Streaming
chat() is currently synchronous: it returns the complete Response
once decoding finishes. A streaming variant, likely
Model::chatStream(): \Generator, is on the roadmap.
For long generations under a request/response model where blocking is
unacceptable, the workable shortcut today is to set a tight maxTokens
and call chat() repeatedly with the previous turn appended to the
Prompt. That sacrifices KV-cache reuse but works.
Next
- Raw completions: the escape hatch for templates the model didn’t bake in.
- Choosing a model: chat-tuned vs base, quantization, size.
- Multi-turn chat recipe: the immutable-Prompt accumulation pattern.
- Reasoning models recipe: making reasoning() / answer() work for you.
Raw completions
Model::raw() is the escape hatch for callers who want full control
over the prompt string instead of going through
Prompt + Model::chat().
public function raw(
string $prompt,
int $maxTokens = 128,
int $nCtx = 2048,
float $temperature = 0.0,
int $seed = 1234,
bool $addBos = true,
): string;
Returns a plain string — no Response wrapper, no reasoning split. If
you want either of those, use chat().
When to use raw()
Three legitimate use cases:
1. Models without a chat template
“Base” / “pretrained” / “foundation” models — Llama 3 base, Mistral
base, Qwen base — ship GGUFs that haven’t been instruction-tuned and
have no embedded chat template. Model::chat() rejects them with:
InferenceException: model has no embedded chat template — use
Model::raw() for this model
For these models, raw() is the only path. Build prompts in
whatever shape the model expects — typically just free-form
text-continuation:
$text = $model->raw(
"The capital of France is",
maxTokens: 8,
temperature: 0.0,
);
// " Paris."
2. Custom chat templates
Maybe the model’s embedded chat template doesn’t match what you want — e.g. you want to add a tool-result message that the embedded template doesn’t know about, or you’re injecting RAG context in a non-standard slot. Build the prompt string yourself:
$prompt = <<<TXT
<|im_start|>system
You are a calculator. Only emit JSON: {"result": <number>}.
<|im_end|>
<|im_start|>user
What is 2+2?
<|im_end|>
<|im_start|>assistant
TXT;
$text = $model->raw($prompt, maxTokens: 32, temperature: 0.0);
// '{"result": 4}'
The trade-off: you own template correctness. The chat template that
chat() uses is the one the model author tested with;
hand-rolling means hand-checking.
3. Stop-sequence simulation
Stop-string support is on the roadmap but not in v0.1. If you need a
generation to halt at a specific marker, raw() plus post-processing
is the workaround:
$text = $model->raw($promptEndingWithMarker, maxTokens: 256);
$pos = strpos($text, '</answer>'); // 0 is a valid hit, so test against false
$text = $pos === false ? $text : substr($text, 0, $pos);
What raw() does NOT do
- No reasoning split. raw() returns a string, not a Response. If the model emits <think>…</think> blocks, they end up in your string verbatim. You can strip them yourself with a regex if it matters; the canonical case for that is Reasoning models.
- No chat-template rendering. What you pass in is what gets tokenized.
- No finish-reason or token-count metadata. If you need those, use chat().
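For the first point, a regex pass over the raw() output is usually enough. The sketch below shows one way to do it in plain PHP. splitReasoning() is a hypothetical helper and is not claimed to match how the extension splits Response internally. It also does not handle an unterminated <think> block, which can happen when generation stops at maxTokens.

```php
<?php
declare(strict_types=1);

/**
 * Separate <think>…</think> content from the rest of a raw completion.
 * splitReasoning() is a hypothetical helper, not part of ext-infer.
 *
 * @return array{reasoning: ?string, answer: string}
 */
function splitReasoning(string $text): array
{
    $reasoning = null;
    // The s flag lets "." match newlines; ".*?" keeps the match non-greedy.
    if (preg_match('~<think>(.*?)</think>~s', $text, $m) === 1) {
        $reasoning = trim($m[1]);
    }

    // Remove every think block, then trim the leading whitespace left behind.
    $answer = ltrim((string) preg_replace('~<think>.*?</think>~s', '', $text));

    return ['reasoning' => $reasoning, 'answer' => $answer];
}

$parts = splitReasoning("<think>Okay so 2+2…</think>\n\n2 + 2 equals 4.");
echo $parts['answer'], PHP_EOL; // 2 + 2 equals 4.
```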
addBos
The addBos: true default tells the tokenizer to prepend the model’s
beginning-of-sequence token (whatever it is for that family). For most
models that’s right. Set addBos: false when:
- You’re building a prompt that already starts with the BOS token explicitly (rare).
- The model’s tokenizer rejects BOS prepending (also rare).
- You’re feeding raw() mid-conversation and don’t want a new BOS in the middle (very rare and probably a sign you should be using a Prompt).
The other named arguments — maxTokens, nCtx, temperature, seed
— behave the same as in chat(). See Options
reference.
When NOT to use raw()
If the model has a chat template and you’re sending the
“system / user / assistant” shape: use chat(). It’s
shorter, safer, and the Response it returns gives you reasoning
splitting + metadata for free.
raw() exists so escape hatches don’t require dropping the extension
entirely. Treat it as the lower-level layer, not the default.
Next
- Chat completions — the higher-level surface most code should use.
- Options reference — every argument explained side by side.
Embeddings
Model::embed() turns a piece of text into a fixed-length vector of
floats. Cosine similarity between two such vectors approximates
semantic similarity between the texts they came from — that’s the
foundation of every semantic-search / RAG pipeline.
public function embed(string $text): \Displace\Infer\Embedding;
Enable embedding mode at load time
Embedding generation requires a context built with
with_embeddings(true) under the hood. Because that conflicts with
generation mode for a given context, ext-infer makes the choice
explicit at load:
use Displace\Infer\Model;
$model = Model::load('models/embedding-model.gguf', [
'embedding' => true,
]);
With embedding: true, embed() works. Without it, embed() throws:
InferenceException: Model::embed() requires loading with ['embedding' => true]
chat() and raw() still work on an embedding-loaded handle — they
build their own per-call context for generation. So one handle can do
both, but you opt in to embed() explicitly.
Pooling
Sentence embeddings need a way to collapse the per-token hidden states into a single vector. Different model families do this differently:
| Pooling | Used by |
|---|---|
| mean | BGE, GTE, E5: average across tokens |
| cls | original BERT: uses the [CLS] token’s hidden state |
| last | Qwen3-Embedding: uses the last token’s hidden state |
| rank | rerankers: emits a single score, not a vector |
| none | per-token vectors, no pooling |
Modern embedding GGUFs declare their pooling type in metadata.
ext-infer’s default is 'unspecified' (trust the metadata):
$model = Model::load($path, ['embedding' => true]);
// pooling: whatever the GGUF says (almost always correct)
Override if a GGUF ships without the metadata or you want to experiment:
$model = Model::load($path, [
'embedding' => true,
'pooling' => 'mean', // 'unspecified' | 'none' | 'mean' | 'cls' | 'last' | 'rank'
]);
An unknown pooling string is rejected at load time, not at first
embed() call:
InferException: invalid option pooling: expected one of
unspecified/none/mean/cls/last/rank, got "weighted"
Generating embeddings
$emb = $model->embed('The cat sat on the mat.');
$emb->vector(); // list<float> — length matches the model's n_embd
$emb->dimensions(); // int — same as count($emb->vector())
Vectors are returned as PHP arrays of floats (doubles); internally we
hold Vec<f32> and let ext-php-rs convert f32 → f64 at the boundary,
which is lossless.
Vector math, built in
Embedding carries the math you need most of the time so you don’t
have to write a numpy-equivalent in PHP:
$emb->norm(); // float — L2 norm: sqrt(sum_i x_i^2)
$emb->normalize(); // new Embedding scaled to unit length
$a->cosineSimilarity($b); // float in [-1, 1]
normalize() returns a new Embedding — the original is not modified.
This matters for caching: cache the normalized form once, then every
subsequent cosineSimilarity call is just a dot product.
cosineSimilarity() throws on a dimension mismatch:
InferenceException: cannot compare embeddings of different
dimensions: 1024 vs 384
That’s deliberate — comparing across model families is almost always a bug, and silently returning a number would hide it.
Why normalize before comparing?
Cosine similarity ignores magnitude — it compares direction. If
either vector has magnitude zero, the answer is undefined; we return
0.0 rather than NaN. If both are non-zero, cosineSimilarity does
the right thing on un-normalized vectors too. But:
- For a fixed corpus you query against, normalizing once is cheap and makes the inner loop a single dot product: array_sum(array_map(fn($x, $y) => $x * $y, $a, $b)).
- For pgvector / sqlite-vec storage, you usually want normalized vectors stored so the database can use the inner-product operator (<#> in pgvector, which returns the negated inner product) instead of the cosine operator (<=>).
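To see why normalizing first reduces cosine similarity to a dot product, here is the arithmetic on plain PHP arrays such as those returned by vector(). This is a sketch for illustration only: l2Norm(), unitVector() and dot() are hypothetical helpers, not the extension's implementation.

```php
<?php
declare(strict_types=1);

/** @param list<float> $v */
function l2Norm(array $v): float
{
    return sqrt(array_sum(array_map(static fn (float $x): float => $x * $x, $v)));
}

/**
 * Scale a vector to unit length. A zero vector has no direction,
 * so it is returned unchanged.
 *
 * @param list<float> $v
 * @return list<float>
 */
function unitVector(array $v): array
{
    $norm = l2Norm($v);
    if ($norm === 0.0) {
        return $v;
    }

    return array_map(static fn (float $x): float => $x / $norm, $v);
}

/**
 * @param list<float> $a
 * @param list<float> $b
 */
function dot(array $a, array $b): float
{
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('dimension mismatch');
    }

    return (float) array_sum(
        array_map(static fn (float $x, float $y): float => $x * $y, $a, $b)
    );
}

// For unit vectors the dot product is the cosine similarity.
$a = unitVector([3.0, 4.0]);
$b = unitVector([4.0, 3.0]);
printf("cosine similarity: %.2f\n", dot($a, $b));
```

The division by both norms happens once, at normalization time, which is the saving the first bullet refers to.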
A canonical pipeline:
$query = $model->embed($userQuestion)->normalize();
$best = null;
$bestScore = -INF;
foreach ($corpusEmbeddings as $docId => $docEmb) {
// $docEmb is also pre-normalized
$score = $query->cosineSimilarity($docEmb);
if ($score > $bestScore) {
$best = $docId;
$bestScore = $score;
}
}
For real-world indexing — even at a few thousand documents — push the storage into a database. See Semantic search and RAG over markdown.
Choosing an embedding model
The chat-tuned models people download for completions (Qwen3-0.6B,
Llama 3.2 3B, Mistral 7B) can be loaded with embedding: true and will
return a vector — but it’s not what they were trained for, and
similarity numbers are noisier than what a purpose-built embedding
model produces.
| Model family | Dims | Notes |
|---|---|---|
| Qwen3-Embedding (0.6B) | 1024 | Apache-2.0. Same architecture as Qwen3-0.6B, retrained for embeddings. Strong default. |
| BGE-small / BGE-large | 384 / 1024 | Beijing Academy of AI. Widely used, mean pooling. |
| E5-small / E5-large | 384 / 1024 | Microsoft. Trained on text similarity tasks. |
| GTE-small / GTE-large | 384 / 1024 | Alibaba. |
See Choosing a model for more on GGUF quants and what size to start with.
Next
- Semantic search recipe: embed a corpus, query, sort by similarity.
- RAG over markdown: semantic search feeding into Model::chat().
- Choosing a model: chat vs embedding, sizes, formats.
Choosing a model
ext-infer loads any GGUF
file llama.cpp can handle. Picking which GGUF is the most important
choice you’ll make — it dominates inference quality, memory footprint,
and latency. This page is a tour of the landscape.
What is GGUF?
GGUF (GPT-Generated Unified Format) is llama.cpp’s native model
format. A .gguf file packs:
- Weights in a specific quantization.
- Tokenizer (vocabulary + merges).
- Architecture metadata (layer count, hidden size, attention config) so llama.cpp knows how to run the model without a separate config file.
- Chat template (for instruct models) so Model::chat() knows how to render messages.
- Pooling type (for embedding models) so Model::embed() knows how to collapse hidden states.
GGUF files self-contain everything ext-infer needs. There is no
separate config / tokenizer / vocab file to manage.
Model families
There are three broad categories you’ll encounter:
| Category | What ext-infer method to use | Examples |
|---|---|---|
| Base / pretrained | raw() only | Llama 3 base, Mistral 7B base, Qwen base |
| Chat / instruct | chat(), raw() | Qwen3-Instruct, Llama 3.x Instruct, Mistral Instruct |
| Embedding / reranker | embed() | Qwen3-Embedding, BGE, E5, GTE |
A chat model loaded with 'embedding' => true will return a vector,
but it’s not what the model was optimized for — the vectors are
noisier than what a purpose-built embedding model produces. The
reverse (chat() against a pure embedding GGUF) usually fails because
embedding-only models don’t ship a chat template.
Quantization
A 7B-parameter model at 16-bit precision (F16) is ~14 GB on disk. Quantization trades a small amount of quality for a much smaller, faster file. The suffixes you’ll see in GGUF filenames:
| Suffix | Approx. size for a 7B model | Quality | Notes |
|---|---|---|---|
| F16 | ~14 GB | Lossless | Reference. Rarely worth the size unless you have plenty of memory. |
| Q8_0 | ~7 GB | Near-lossless | Good default when you can afford the disk. |
| Q6_K | ~5.5 GB | Excellent | |
| Q5_K_M | ~5 GB | Very good | |
| Q4_K_M | ~4.5 GB | Good | The most popular size/quality compromise. |
| Q4_K_S | ~4 GB | Solid | |
| Q3_K_M | ~3.5 GB | Noticeable degradation | Useful on memory-constrained boxes. |
| Q2_K | ~2.5 GB | Significant degradation | Last resort. |
The K-family (Q4_K_M, etc.) uses k-quants, a smarter scheme than
the legacy non-K variants (Q4_0, Q4_1). Prefer K-quants when both
are offered for the same model.
Picking a quant
Two questions:
- How much memory can you spend? Quants below Q4_K_M save space at increasing quality cost. Above Q4_K_M, the marginal gain per GB shrinks fast.
- Is the model small enough that quantization barely matters? For sub-1B models like Qwen3-0.6B, even Q8_0 is ~640 MB, negligible by 2026 standards. Take the quality bump.
A good default rule: Q4_K_M for models > 3B, Q8_0 for smaller
models.
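The default rule and the 7B column of the table above can be folded into a small helper. This is only a rough sketch: it scales the table's 7B sizes linearly with parameter count, which ignores tokenizer and metadata overhead, and the names pickQuant() and estimateSizeGb() are made up for illustration (they are not part of ext-infer).

```php
<?php
declare(strict_types=1);

// Approximate file sizes (GB) for a 7B model, copied from the table above.
const SIZE_GB_AT_7B = [
    'F16'    => 14.0,
    'Q8_0'   => 7.0,
    'Q6_K'   => 5.5,
    'Q5_K_M' => 5.0,
    'Q4_K_M' => 4.5,
    'Q4_K_S' => 4.0,
    'Q3_K_M' => 3.5,
    'Q2_K'   => 2.5,
];

// Default rule from this page: Q4_K_M above 3B parameters, Q8_0 otherwise.
function pickQuant(float $paramsBillions): string
{
    return $paramsBillions > 3.0 ? 'Q4_K_M' : 'Q8_0';
}

// Hypothetical estimate: linear scaling from the 7B column. Real files differ.
function estimateSizeGb(float $paramsBillions, string $quant): float
{
    if (!array_key_exists($quant, SIZE_GB_AT_7B)) {
        throw new InvalidArgumentException("unknown quant: {$quant}");
    }
    return SIZE_GB_AT_7B[$quant] * $paramsBillions / 7.0;
}

$quant = pickQuant(0.6);
printf("%s, roughly %.2f GB\n", $quant, estimateSizeGb(0.6, $quant));
```

For a 0.6B model the estimate lands close to the ~640 MB quoted above, which is about as much precision as this kind of back-of-envelope arithmetic deserves.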
Recommended starting points
What we’ve actually tested against:
Chat (smallest reasonable)
Qwen/Qwen3-0.6B-GGUF — Apache-2.0, 600M params, Q8 ≈ 640 MB.
Reasoning model: emits <think>…</think> blocks through its chat
template, which Response splits
for you. Great for getting started; not great for production-quality
answers.
curl -L -o models/Qwen3-0.6B-Q8_0.gguf \
https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q8_0.gguf
Chat (production-ish)
bartowski/Qwen3-7B-Instruct-GGUF at Q4_K_M (~4.4 GB) — same
family, much better reasoning quality. Or bartowski/Llama-3.2-3B-Instruct-GGUF
at Q4_K_M (~1.9 GB) for a smaller, non-reasoning option.
Embedding (small, fast)
Qwen/Qwen3-Embedding-0.6B-GGUF — Apache-2.0, 1024-dim
embeddings, last pooling baked into metadata. Same size as the chat
model; quality is competitive with BGE/E5 small variants.
curl -L -o models/Qwen3-Embedding-0.6B-Q8_0.gguf \
https://huggingface.co/Qwen/Qwen3-Embedding-0.6B-GGUF/resolve/main/Qwen3-Embedding-0.6B-Q8_0.gguf
Alternative: CompendiumLabs/bge-small-en-v1.5-gguf, with 384-dim vectors,
CLS pooling, ~130 MB. Lower-quality vectors but tiny.
Where to look for more
- Hugging Face GGUF tag: library=gguf filters to GGUF-format models.
- bartowski: prolific publisher of quantized GGUFs for popular models. Reliable, consistent naming.
- mradermacher: ditto.
- The model’s own official GGUF repo when one exists (e.g. Qwen/Qwen3-7B-Instruct-GGUF), always the most trusted source.
License caveats
GGUF files inherit the underlying model’s license. Some models that are nominally “open” (Llama 3.x, Gemma) ship under custom licenses with use restrictions; others (Qwen, Mistral, several smaller players) are Apache-2.0 / MIT. Check the model card before depending on a model in a commercial deployment.
ext-infer itself is MIT-licensed — the extension doesn’t care which
GGUF you load, but downstream concerns are on you.
Next
- Embeddings: when you’ve picked an embedding model.
- Chat completions: when you’ve picked a chat model.
- Performance tuning: n_gpu_layers, mmap, mlock for the model you ended up with.
Options reference
Every option that any ext-infer method accepts, in one table per
method. For conceptual context on individual options, follow the
links in the rightmost column.
Model::load($path, $options)
The second argument is an associative array. Keys are kept as snake-case strings (like PHP ini settings) because load-time tuning is rare and the array form composes well with config arrays loaded from disk.
| Key | Type | Default | See |
|---|---|---|---|
| n_gpu_layers | int | 0 | Performance tuning |
| use_mmap | bool | true | Performance tuning |
| use_mlock | bool | false | Performance tuning |
| embedding | bool | false | Embeddings |
| pooling | string | 'unspecified' | Embeddings |
Validation rules
- Unknown keys are not rejected; they’re silently ignored. This is deliberate (forward-compatibility for callers loading config from files), but it means typos will be silent. If you suspect a typo, verify with var_dump against the same string before reporting a bug.
- Type mismatches are rejected, with a clear message: invalid option n_gpu_layers: expected integer.
- Negative integers and out-of-range values for n_gpu_layers are rejected: invalid option n_gpu_layers: must be non-negative.
- pooling accepts only the six strings listed in Embeddings → Pooling.
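Because unknown keys are ignored silently, it can help to check a config array against the documented keys before passing it to Model::load(). The helper below is a hypothetical userland sketch, not something ext-infer ships; the key list is copied from the table above and would need updating if the extension adds options.

```php
<?php
declare(strict_types=1);

// Keys documented in the Model::load() table above.
const KNOWN_LOAD_OPTIONS = ['n_gpu_layers', 'use_mmap', 'use_mlock', 'embedding', 'pooling'];

// Hypothetical helper: returns the keys that Model::load() would silently ignore.
function unknownLoadOptions(array $options): array
{
    return array_values(array_diff(array_keys($options), KNOWN_LOAD_OPTIONS));
}

$options = ['n_gpu_layer' => 32, 'embedding' => true]; // note the missing "s"
$unknown = unknownLoadOptions($options);
if ($unknown !== []) {
    error_log('unrecognized load options: ' . implode(', ', $unknown));
}
```

Running a check like this in development keeps the forward-compatible behavior in production while still surfacing typos early.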
Model::chat($prompt, ...)
Named arguments, not an array. PHP 8.0+ named-argument syntax uses the
parameter name verbatim, so you write maxTokens: 256 (camelCase, per PSR-12).
| Argument | Type | Default | See |
|---|---|---|---|
| $prompt | \Displace\Infer\Prompt | required | Prompts |
| maxTokens | int | 128 | Chat completions |
| nCtx | int | 2048 | Chat completions |
| temperature | float | 0.0 | Chat completions |
| seed | int | 1234 | Chat completions |
Behavior
- temperature = 0.0 is greedy (deterministic). Values > 0.0 sample, controlled by seed.
- seed is only consulted when temperature > 0.
- maxTokens caps generation. Hitting it sets Response::finishReason() to 'length'.
- nCtx is the context window for this call. If the rendered prompt exceeds it, InferenceException is raised before generation starts.
Model::raw($prompt, ...)
Same named-argument shape as chat() plus addBos.
| Argument | Type | Default | See |
|---|---|---|---|
| $prompt | string | required | Raw completions |
| maxTokens | int | 128 | Chat completions |
| nCtx | int | 2048 | Chat completions |
| temperature | float | 0.0 | Chat completions |
| seed | int | 1234 | Chat completions |
| addBos | bool | true | Raw completions → addBos |
Model::embed($text)
Just the text. Pooling and embedding-mode are configured at load time
(see Model::load above).
| Argument | Type | Default | See |
|---|---|---|---|
| $text | string | required | Embeddings |
Embedding math
Embedding is read-only; the math methods return new instances rather
than mutating.
| Method | Returns |
|---|---|
| vector() | list<float> |
| dimensions() | int |
| norm() | float |
| normalize() | new Embedding |
| cosineSimilarity(Embedding $other) | float (in [-1, 1]) |
cosineSimilarity throws InferenceException
on a dimension mismatch — see
Embeddings → vector math.
Prompt
Static factories + immutable with* builders.
| Method | Returns |
|---|---|
| Prompt::system($content) | new Prompt |
| Prompt::user($content) | new Prompt |
| withSystem($content) | new Prompt |
| withUser($content) | new Prompt |
| withAssistant($content) | new Prompt |
| messages() | list<Message> |
| lastRole() | ?string |
| count() | int |
| isEmpty() | bool |
See Prompts for the immutability semantics.
Response
Read-only. Six getters.
| Method | Returns |
|---|---|
| text() | string |
| reasoning() | ?string |
| answer() | string |
| hasReasoning() | bool |
| finishReason() | string ('eos' / 'length' / 'stop') |
| tokensGenerated() | int |
See Chat completions → Inspecting a Response.
Environment
Not strictly an option, but bears mentioning here:
| Variable | Effect |
|---|---|
| EXT_INFER_LOG=1 | Restore llama.cpp’s verbose stderr logging (silenced by default). |
Multi-turn chat
The pattern: keep the system message stable, append user/assistant
turns as the conversation grows, regenerate the prompt on each user
input. Lifted directly from examples/chat-interactive/.
The shape
use Displace\Infer\Model;
use Displace\Infer\Prompt;
$model = Model::load('models/Qwen3-0.6B-Q8_0.gguf');
$base = Prompt::system('You are a helpful, concise assistant.');
$conversation = $base;
while (($line = readline('> ')) !== false) {
$line = trim($line);
if ($line === '' || $line === '/exit') {
break;
}
// /reset is trivial because Prompt is immutable.
if ($line === '/reset') {
$conversation = $base;
continue;
}
$conversation = $conversation->withUser($line);
$response = $model->chat(
$conversation,
maxTokens: 512,
temperature: 0.7,
);
echo $response->answer(), PHP_EOL;
// Feed answer() back, NOT text(). See "Reasoning models" below.
$conversation = $conversation->withAssistant($response->answer());
}
$model->close();
Three things this gets right
1. The system message is stable
$base is built once and never mutated. Every /reset re-seats the
conversation at the original system instruction without re-allocating
or re-rendering. If you change the system prompt mid-conversation
elsewhere in your app, the immutable shape means concurrent uses of
$base aren’t affected.
2. Conversation grows by immutable append
$conversation = $conversation->withUser($line);
Every with* returns a new Prompt. The old $conversation is
still valid (and still has the previous turn count); the local just
points at the new one. There’s no shared mutable state, so this code
is safe to put behind a queue worker or run in parallel.
3. Response::answer() goes back, not text()
$conversation = $conversation->withAssistant($response->answer());
This matters for reasoning models. answer() is the reply with
<think>...</think> blocks stripped; text() is the raw output
including the thoughts. Feeding text() back means the model sees its
own internal monologue on the next turn — and reasoning models tend to
treat that as instruction, not history. The output derails fast.
For non-reasoning models, answer() and text() are byte-identical,
so the rule is “use answer() always” rather than “use answer() for
some models”.
Persisting conversations
If you need to save and restore conversations (e.g. per-user chat
history in a database), serialize the message list and rebuild the
Prompt:
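function loadConversation(string $system, array $history): Prompt
{
    $p = Prompt::system($system);
    foreach ($history as $row) {
        $p = match ($row['role']) {
            'user' => $p->withUser($row['content']),
            'assistant' => $p->withAssistant($row['content']),
            // Skip a stored system row, if any: $system above already seeds the prompt.
            'system' => $p,
            // Fail with a readable message instead of an UnhandledMatchError.
            default => throw new UnexpectedValueException("unknown role: {$row['role']}"),
        };
    }
    return $p;
}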
function saveConversation(Prompt $p): array
{
$rows = [];
foreach ($p->messages() as $msg) {
$rows[] = ['role' => $msg->role(), 'content' => $msg->content()];
}
return $rows;
}
Prompt::messages() walks in chronological order, so saving and
re-loading round-trips faithfully.
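If the rows end up in a single TEXT or JSON column, a thin encode/decode pair with validation keeps malformed history out of the prompt builder. The sketch below is plain PHP and makes no calls into ext-infer; encodeHistory() and decodeHistory() are hypothetical names, and the accepted role list is an assumption based on the roles used in this guide.

```php
<?php
declare(strict_types=1);

// Encode rows shaped like ['role' => ..., 'content' => ...] for storage.
function encodeHistory(array $rows): string
{
    return json_encode($rows, JSON_THROW_ON_ERROR | JSON_UNESCAPED_UNICODE);
}

// Decode and validate rows before rebuilding a Prompt from them.
function decodeHistory(string $json): array
{
    $rows = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
    if (!is_array($rows)) {
        throw new UnexpectedValueException('history must be a JSON array');
    }
    foreach ($rows as $i => $row) {
        $ok = is_array($row)
            && isset($row['role'], $row['content'])
            && in_array($row['role'], ['system', 'user', 'assistant'], true)
            && is_string($row['content']);
        if (!$ok) {
            throw new UnexpectedValueException("invalid history row at index {$i}");
        }
    }
    return $rows;
}

$rows = [
    ['role' => 'user', 'content' => 'Hello'],
    ['role' => 'assistant', 'content' => 'Hi there.'],
];
$restored = decodeHistory(encodeHistory($rows));
```

Validating on the way in means a corrupted row fails loudly at load time rather than producing a confusing prompt later.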
Common shape: an HTTP turn
For a request/response API where every HTTP call is one turn:
// Inside your controller — assumes $model is injected and reused.
final class ChatController
{
public function __construct(private Model $model, private HistoryStore $history) {}
public function turn(Request $req): Response
{
$conversationId = $req->session('conversation_id');
$history = $this->history->load($conversationId);
$system = $req->user()->systemPrompt() ?? 'You are helpful.';
$prompt = loadConversation($system, $history)
->withUser($req->json('message'));
$reply = $this->model->chat(
$prompt,
maxTokens: 1024,
temperature: 0.5,
);
$this->history->append($conversationId, 'user', $req->json('message'));
$this->history->append($conversationId, 'assistant', $reply->answer());
return new JsonResponse([
'answer' => $reply->answer(),
'reasoning' => $reply->reasoning(),
'truncated' => $reply->finishReason() === 'length',
'tokens' => $reply->tokensGenerated(),
]);
}
}
The $model is loaded once at FPM-worker boot — not per request — and
chat() is called per request. With current ext-infer (no
KV-cache reuse yet), each turn re-tokenizes and re-decodes the full
history, which is slow for long conversations. A Session object that
reuses the underlying llama.cpp context is on the roadmap.
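Until context reuse exists, one way to bound per-turn cost is to cap how much history is replayed. The following is a minimal sketch under a loud assumption: character count is used as a crude stand-in for token count, which is not how the model measures context. trimHistory() is a hypothetical helper operating on the same row arrays used above.

```php
<?php
declare(strict_types=1);

// Keep the most recent rows whose combined content fits a character budget.
// Assumption: characters approximate tokens well enough for a rough cap.
function trimHistory(array $rows, int $maxChars): array
{
    $kept = [];
    $used = 0;
    // Walk newest to oldest and stop at the first row that would overflow.
    foreach (array_reverse($rows) as $row) {
        $len = strlen($row['content']);
        if ($used + $len > $maxChars) {
            break;
        }
        $kept[] = $row;
        $used += $len;
    }
    // Restore chronological order.
    return array_reverse($kept);
}

$history = [
    ['role' => 'user', 'content' => 'aaaa'],
    ['role' => 'assistant', 'content' => 'bbb'],
    ['role' => 'user', 'content' => 'cc'],
];
$recent = trimHistory($history, 5);
```

Note that a trimmed history can begin with an assistant turn; whether that matters depends on the model's chat template, so test it against the model you deploy.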
When to use Model::raw() instead
If you have a very specific prompt shape — tool calls, RAG context
injected at a non-standard slot, custom format — see
Raw completions. The Prompt builder doesn’t
support tool-call messages today, so tool-aware conversations need
raw() until tool calling
lands.
Semantic search
Embed a corpus once, embed user queries on demand, return the closest matches by cosine similarity. The foundation of every “search by meaning, not keywords” pipeline.
Minimal in-memory version
use Displace\Infer\Model;
$model = Model::load('models/Qwen3-Embedding-0.6B-Q8_0.gguf', [
'embedding' => true,
]);
// Embed the corpus once. In real code, do this offline and cache.
$corpus = [
'doc-1' => 'PHP is a server-side scripting language.',
'doc-2' => 'Cats are popular pets known for their independence.',
'doc-3' => 'Rust provides memory safety without garbage collection.',
'doc-4' => 'Dogs are descendants of wolves, domesticated millennia ago.',
];
$index = [];
foreach ($corpus as $id => $text) {
// Normalize once so the search loop is a plain dot product.
$index[$id] = $model->embed($text)->normalize();
}
// Search.
function search(Model $model, array $index, string $query, int $k = 3): array
{
$q = $model->embed($query)->normalize();
$hits = [];
foreach ($index as $id => $emb) {
$hits[$id] = $q->cosineSimilarity($emb);
}
arsort($hits);
return array_slice($hits, 0, $k, preserve_keys: true);
}
print_r(search($model, $index, 'a typesafe language'));
// Array
// (
// [doc-3] => 0.7421
// [doc-1] => 0.4567
// [doc-2] => 0.1234
// )
$model->close();
Three things to know
Normalize when you index
Embedding::normalize() returns a unit vector. With both sides
normalized, cosine similarity simplifies to a dot product:
cos(a, b) = (a · b) / (||a|| · ||b||)
= a_unit · b_unit // if both are normalized
Normalize once at index time so the per-query work is just the dot
product. Embedding::cosineSimilarity() does the normalization
internally if you skip the explicit step — but you pay for it on
every call, which adds up across thousands of documents.
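The identity is easy to check with plain arrays. The sketch below reimplements the math in userland PHP purely for illustration; it does not reflect how Embedding is implemented inside the extension, and the function names are invented here.

```php
<?php
declare(strict_types=1);

function dot(array $a, array $b): float
{
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('dimension mismatch');
    }
    $sum = 0.0;
    foreach ($a as $i => $x) {
        $sum += $x * $b[$i];
    }
    return $sum;
}

// Scale a vector to unit length.
function unit(array $v): array
{
    $norm = sqrt(dot($v, $v));
    if ($norm === 0.0) {
        throw new InvalidArgumentException('cannot normalize a zero vector');
    }
    return array_map(fn (float $x): float => $x / $norm, $v);
}

// Full cosine similarity: dot product divided by both norms.
function cosine(array $a, array $b): float
{
    return dot($a, $b) / (sqrt(dot($a, $a)) * sqrt(dot($b, $b)));
}

$a = [3.0, 4.0];
$b = [4.0, 3.0];
printf("%.4f %.4f\n", cosine($a, $b), dot(unit($a), unit($b))); // 0.9600 0.9600
```

Both numbers agree, which is why pre-normalized vectors let a database or a tight loop skip the two square roots per comparison.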
Pick an embedding model, not a chat model
A chat-tuned model loaded with 'embedding' => true will return a
vector, but the similarity numbers cluster too tightly to be useful at
scale. Use a purpose-built embedding model — see
Choosing a model.
What “useful” looks like with a real embedding model (Qwen3-Embedding-0.6B):
cat-mat ↔ feline-rug: 0.72 (paraphrase)
cat-mat ↔ grocery-shop: 0.29 (unrelated)
feline-rug ↔ grocery-shop: 0.26 (unrelated)
Same query with the chat-tuned Qwen3-0.6B (loaded in embedding mode):
cat-mat ↔ feline-rug: 0.66
cat-mat ↔ grocery-shop: 0.51
feline-rug ↔ grocery-shop: 0.50
The chat model preserves the ordering — the related pair scores highest — but the gap is much narrower, so the cut-off threshold between “match” and “not match” is harder to draw.
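One way to quantify "harder to draw" is the margin between the weakest related score and the strongest unrelated score. The snippet below just does that arithmetic on the six numbers quoted above; margin() is a hypothetical helper, and a single triple of sentences is far too small a sample to pick a production threshold from.

```php
<?php
declare(strict_types=1);

// Gap between the lowest "related" score and the highest "unrelated" score.
function margin(array $related, array $unrelated): float
{
    return min($related) - max($unrelated);
}

// Scores quoted in this section.
$embedding = margin([0.72], [0.29, 0.26]); // Qwen3-Embedding-0.6B
$chat      = margin([0.66], [0.51, 0.50]); // chat-tuned Qwen3-0.6B

printf("embedding margin: %.2f, chat margin: %.2f\n", $embedding, $chat);
// embedding margin: 0.43, chat margin: 0.15
```

A wider margin leaves more room to place a cut-off without misclassifying borderline pairs.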
Cache the index
In production, the in-memory dictionary in the example above doesn’t scale past a few thousand documents — the search loop is O(corpus size). Two upgrade paths:
- Persist embeddings to disk (a JSON file, a SQLite blob column, or PHP serialize() output). Saves the embed-time cost on subsequent runs.
- Index with a vector database: pgvector (PostgreSQL extension), sqlite-vec, Qdrant, Pinecone. They handle the nearest-neighbor search far more efficiently than a PHP loop.
See RAG over markdown for a worked example using
sqlite-vec.
Re-ranking with a chat model
For higher-quality top-K, embed-rank-then-rerank-with-a-chat-model is the canonical pattern:
// 1. Coarse retrieval — embedding similarity, top 20.
$hits = search($embedModel, $index, $query, k: 20);
// 2. Fine reranking — ask a chat model to score each candidate.
$prompt = Prompt::system(
'You are a relevance judge. Given a query and a document, ' .
'respond with a single number between 0 and 1 indicating ' .
'how relevant the document is to the query.'
);
$rerank = [];
foreach (array_keys($hits) as $docId) {
$r = $chatModel->chat(
$prompt->withUser("Query: {$query}\n\nDocument: {$corpus[$docId]}"),
maxTokens: 8,
temperature: 0.0,
);
$rerank[$docId] = (float) trim($r->answer());
}
arsort($rerank);
That’s two model loads — one embedding, one chat. Reuse handles across requests; loading is the expensive step.
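One fragile spot in the loop above is the (float) cast: a reply such as "Score: 0.8" casts to 0.0 in PHP, which silently pushes a relevant document to the bottom. A more forgiving parse might look like the sketch below; parseScore() is a hypothetical helper, and treating "no number found" as null rather than 0 is a design choice, not something the extension prescribes.

```php
<?php
declare(strict_types=1);

// Extract the first number from a model reply and clamp it to [0, 1].
// Returns null when the reply contains no number at all.
function parseScore(string $answer): ?float
{
    if (preg_match('/\d+(?:\.\d+)?/', $answer, $m) !== 1) {
        return null;
    }
    return min(1.0, max(0.0, (float) $m[0]));
}

var_dump(parseScore('0.8'));          // float(0.8)
var_dump(parseScore('Score: 0.75.')); // float(0.75)
var_dump(parseScore('n/a'));          // NULL
```

Documents whose score comes back null can be dropped, retried, or left at their embedding rank, depending on how much you trust the judge model.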
Next
- RAG over markdown: semantic search feeding into Model::chat().
- Embeddings guide: the underlying API.
- Choosing a model: picking an embedding model.
Reasoning models
Qwen3, DeepSeek R1, and other reasoning-tuned models think out loud
before answering. When invoked through their chat template, they emit
<think>…</think> blocks containing the internal monologue, then the
actual reply. ext-infer understands this convention and exposes the
two streams separately on Response.
The split, in three calls
use Displace\Infer\Model;
use Displace\Infer\Prompt;
$model = Model::load('models/Qwen3-0.6B-Q8_0.gguf');
$response = $model->chat(
Prompt::user('What is 2+2?'),
maxTokens: 512,
);
echo $response->text(), PHP_EOL;
// <think>
// Okay, the user is asking what 2+2 is. This is basic arithmetic.
// I should respond with the correct sum, which is 4. Let me also
// verify there's no trick here — adding two and two definitely
// equals four.
// </think>
//
// 2 + 2 equals 4.
echo $response->answer(), PHP_EOL;
// 2 + 2 equals 4.
echo $response->reasoning(), PHP_EOL;
// Okay, the user is asking what 2+2 is. This is basic arithmetic.
// I should respond with the correct sum, which is 4. ...
echo $response->hasReasoning() ? 'yes' : 'no', PHP_EOL;
// yes
$model->close();
For a non-reasoning model:
- reasoning() returns null
- answer() equals text() byte-for-byte
- hasReasoning() returns false
The split is always on: there’s no flag to disable it. If the model output
doesn’t contain <think>…</think> tags, nothing changes.
When the budget runs out mid-thought
Reasoning chains can be long. If maxTokens exhausts inside a
<think> block — before the closing </think> — the split fails
gracefully:
- text() contains the partial reasoning verbatim, with the open <think> tag and no closing tag.
- reasoning() returns any previous completed reasoning blocks, or null if none.
- answer() is the raw output with completed blocks removed and the partial thought left in place. The partial thought is intentionally left in answer(); silently swallowing it would hide the budget problem.
- finishReason() returns 'length'.
The fix is always “bump maxTokens”. A useful pattern is to surface
the truncation explicitly:
$response = $model->chat($prompt, maxTokens: 256);
if ($response->finishReason() === 'length') {
error_log(sprintf(
'truncated: model wanted more than 256 tokens for "%s..."',
substr($prompt->messages()[0]->content(), 0, 40),
));
}
The interactive chat example uses a softer hint: “(truncated — bump --max-tokens to see more)”.
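The rules in this section (completed blocks are removed, an unterminated block is left in place) can be mimicked in a few lines of userland PHP. This is an illustration of the documented behavior only; the extension's actual implementation may differ in whitespace handling and other details, and splitThink() is a made-up name.

```php
<?php
declare(strict_types=1);

// Split completed <think>...</think> blocks from the rest of the text.
// An unterminated <think> does not match and therefore stays in 'answer'.
function splitThink(string $text): array
{
    $blocks = [];
    $answer = preg_replace_callback(
        '#<think>(.*?)</think>#s',
        function (array $m) use (&$blocks): string {
            $blocks[] = trim($m[1]);
            return '';
        },
        $text
    );

    return [
        'reasoning' => $blocks === [] ? null : implode("\n\n", $blocks),
        'answer'    => trim((string) $answer),
    ];
}

$complete  = splitThink("<think>\nBasic arithmetic.\n</think>\n\n2 + 2 equals 4.");
$truncated = splitThink("<think>\nOkay, the user is asking");
```

In the truncated case 'reasoning' is null and 'answer' still begins with the open tag, which is the visible signal that the token budget ran out.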
When you DON’T want reasoning at all
Two strategies, depending on what “don’t want” means.
Strategy A — hide it in the UI, keep it under the hood
Default everywhere. Show $response->answer() to the end user. Log
$response->reasoning() for debugging or display behind a “show
thinking” toggle. No model-level change.
Strategy B — tell the model to skip thinking
Qwen3 has a /no_think directive that, when included as a system-message
suffix, suppresses the monologue. The model
still emits an empty <think></think> block (which the split handles:
reasoning() ends up being an empty string), but skips the actual
thinking:
$prompt = Prompt::system('You are helpful. /no_think')
->withUser('What is 2+2?');
$response = $model->chat($prompt);
$response->hasReasoning(); // true (empty block)
$response->reasoning(); // "" (empty string)
$response->answer(); // "2 + 2 equals 4."
This is Qwen3-specific. DeepSeek R1 has a similar concept (/no-cot
in some prompts). Other reasoning models vary. Check the model card.
Feeding history back
When building multi-turn conversations against a reasoning model, feed
Response::answer() back as the assistant’s reply, not
Response::text():
$conversation = $conversation->withAssistant($response->answer());
// ^^^^^^^^^^^^^^^^^^^
// not ->text()
text() includes the <think>…</think> block. Adding it to the
conversation means the model sees its own reasoning on the next turn
and tends to treat it as instruction rather than history — output
quality drops fast.
This is the single most-common mistake when wiring up reasoning models
in ext-infer. See Multi-turn chat for the
full pattern.
Performance note
Reasoning models spend many tokens on their internal monologue. A typical Qwen3-0.6B answer to “what is 2+2?” generates ~150 tokens of thinking before the 5-token answer. That’s an order of magnitude more work than a non-reasoning model would do for the same question.
If latency matters more than the highest-quality answer:
- Use /no_think (Strategy B above) to skip the monologue.
- Pick a non-reasoning model: Llama 3.x Instruct, Mistral Instruct, Qwen 2.5 Instruct (not Qwen3) all chat without thinking out loud.
See Performance tuning for more knobs.
RAG over markdown
Retrieval-Augmented Generation: instead of asking the model what it knows (and getting whatever its training data captured, possibly incorrectly), embed your own documents into a vector store, retrieve the most relevant ones at query time, and feed them to the model as context. The model answers from your data.
This recipe walks through the smallest practical version: a folder of
markdown files, indexed once into sqlite-vec, queried on demand.
Prerequisites
- An embedding model — Qwen3-Embedding-0.6B works well.
- A chat model — Qwen3-7B-Instruct or similar.
- The sqlite-vec extension loaded into PHP’s PDO SQLite (or use the sqlite3 CLI tools).
# macOS — Homebrew has it
brew install asg017/sqlite-vec/sqlite-vec
# Linux — see the sqlite-vec README for distro packages
Schema
A single table holds documents and their embeddings:
CREATE TABLE IF NOT EXISTS docs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
path TEXT UNIQUE NOT NULL,
body TEXT NOT NULL
);
-- sqlite-vec virtual table for k-nearest-neighbor search.
-- 1024 dimensions matches Qwen3-Embedding-0.6B.
CREATE VIRTUAL TABLE IF NOT EXISTS doc_vecs USING vec0(
id INTEGER PRIMARY KEY,
embed FLOAT[1024]
);
Indexing
Walk a directory, embed each file, persist:
declare(strict_types=1);
use Displace\Infer\Model;
$embedder = Model::load('models/Qwen3-Embedding-0.6B-Q8_0.gguf', [
'embedding' => true,
]);
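// sqlite-vec is a native SQLite extension; PDO::sqliteCreateFunction() only registers PHP callbacks and cannot load it.
// Assumption: PHP 8.4+, where Pdo\Sqlite::loadExtension() is available. The library path is a placeholder.
$pdo = new Pdo\Sqlite('sqlite:rag.db');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->loadExtension('/path/to/vec0.so'); // see the sqlite-vec docs for the file your platform ships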
$pdo->exec(file_get_contents('schema.sql'));
$insertDoc = $pdo->prepare(
'INSERT INTO docs (path, body) VALUES (:path, :body)
ON CONFLICT(path) DO UPDATE SET body = excluded.body
RETURNING id'
);
$insertVec = $pdo->prepare(
'INSERT OR REPLACE INTO doc_vecs (id, embed) VALUES (:id, :embed)'
);
$root = $argv[1] ?? './notes';
foreach (new RecursiveIteratorIterator(new RecursiveDirectoryIterator($root)) as $f) {
if ($f->getExtension() !== 'md') {
continue;
}
$body = file_get_contents($f->getPathname());
$insertDoc->execute([':path' => $f->getPathname(), ':body' => $body]);
$id = (int) $insertDoc->fetchColumn();
// Pre-normalize so search is a dot product.
$vector = $embedder->embed($body)->normalize()->vector();
$insertVec->execute([
':id' => $id,
':embed' => pack('f*', ...$vector), // sqlite-vec wants float32 bytes
]);
echo "indexed: {$f->getPathname()} ({$id})\n";
}
$embedder->close();
Run once to build the index, again whenever your notes change. For larger corpora, chunk each file into ~500-token sections and embed each chunk separately; chunk-level vectors give better retrieval than whole-file vectors.
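A minimal chunker might group paragraphs until a size budget is reached. The sketch below counts words rather than tokens, which is only an approximation (the true token count depends on the model's tokenizer), and it does not try to respect code blocks or headings. chunkMarkdown() and its default budget are hypothetical.

```php
<?php
declare(strict_types=1);

// Group blank-line-separated paragraphs into chunks of at most $maxWords words.
// A single paragraph larger than the budget becomes its own oversized chunk.
function chunkMarkdown(string $body, int $maxWords = 350): array
{
    $paragraphs = preg_split('/\R{2,}/', trim($body), -1, PREG_SPLIT_NO_EMPTY);
    $chunks = [];
    $current = [];
    $count = 0;

    foreach ($paragraphs as $p) {
        $words = str_word_count($p);
        // Flush the running chunk before it would exceed the budget.
        if ($current !== [] && $count + $words > $maxWords) {
            $chunks[] = implode("\n\n", $current);
            $current = [];
            $count = 0;
        }
        $current[] = $p;
        $count += $words;
    }
    if ($current !== []) {
        $chunks[] = implode("\n\n", $current);
    }
    return $chunks;
}

$chunks = chunkMarkdown("one two three\n\nfour five six\n\nseven eight", 6);
// Two chunks: the first two paragraphs together, then the third.
```

Each chunk would then get its own row in docs and doc_vecs, with the source path stored alongside so answers can still cite the file.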
Retrieval + generation
declare(strict_types=1);
use Displace\Infer\Model;
use Displace\Infer\Prompt;
$embedder = Model::load('models/Qwen3-Embedding-0.6B-Q8_0.gguf', [
'embedding' => true,
]);
$chat = Model::load('models/Qwen3-7B-Instruct-Q4_K_M.gguf');
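// Assumption: PHP 8.4+. vec_distance_cosine() only exists once sqlite-vec is loaded; the path is a placeholder.
$pdo = new Pdo\Sqlite('sqlite:rag.db');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->loadExtension('/path/to/vec0.so');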
$query = $argv[1] ?? 'What did I decide about the migration?';
// 1. Embed the query.
$qvec = $embedder->embed($query)->normalize()->vector();
// 2. Top-k retrieval via sqlite-vec.
$stmt = $pdo->prepare(<<<SQL
SELECT
docs.path,
docs.body,
vec_distance_cosine(doc_vecs.embed, :qvec) AS distance
FROM doc_vecs
JOIN docs ON docs.id = doc_vecs.id
ORDER BY distance ASC
LIMIT 4
SQL);
$stmt->execute([':qvec' => pack('f*', ...$qvec)]);
$hits = $stmt->fetchAll(PDO::FETCH_ASSOC);
// 3. Build a context-injected prompt.
$context = '';
foreach ($hits as $i => $hit) {
$context .= sprintf("--- Document %d (%s) ---\n%s\n\n", $i + 1, $hit['path'], $hit['body']);
}
$prompt = Prompt::system(<<<SYS
You answer questions strictly using the provided documents.
If the documents don't contain the answer, say so — do not invent.
Cite the document number when you quote.
SYS)
->withUser("Documents:\n\n{$context}\n\nQuestion: {$query}");
// 4. Ask.
$response = $chat->chat($prompt, maxTokens: 1024, temperature: 0.3);
echo $response->answer(), PHP_EOL;
$embedder->close();
$chat->close();
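Both scripts serialize vectors with pack('f*', ...), which writes float32 values in the machine's native byte order. On little-endian hosts (x86-64, Apple Silicon) that is the same as the explicit little-endian code 'g'. If an index might be built on one machine and read on another, pinning the byte order is a cheap safeguard. The helper names below are invented for this sketch.

```php
<?php
declare(strict_types=1);

// 'g' is PHP's little-endian float32 pack code; 'f' follows the host byte order.
function toFloat32Blob(array $vector): string
{
    return pack('g*', ...$vector);
}

function fromFloat32Blob(string $blob): array
{
    // unpack() returns a 1-indexed array; re-index it to a plain list.
    return array_values(unpack('g*', $blob));
}

$blob = toFloat32Blob([1.0, 0.5, -0.25]);
echo strlen($blob), PHP_EOL; // 12 (3 floats x 4 bytes)
```

A 1024-dimension embedding therefore occupies 4096 bytes per row, which is useful when estimating index size.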
What good output looks like
For a corpus of personal notes, a query like “what did I decide about the migration?” returns:
Based on Document 2 (notes/migration.md), you decided to defer the
schema change to Q3 in favor of shipping the redirect layer first.
The reasoning cited there was that the redirect layer was lower-risk
and would surface the migration's actual hot paths before you
committed to the column rename.
If the corpus doesn’t contain the answer:
The provided documents don't address the migration decision directly.
Document 1 mentions a planned schema change but doesn't record what
was decided. I'd need more context to answer.
That “I don’t know” behavior is what the system prompt enforces. Models
will happily make up plausible answers without it.
Knobs worth tuning
| Knob | Effect |
|---|---|
| Top-k (the LIMIT 4 above) | More context = better answers but slower + risks the model conflating unrelated documents. 3–5 is a good default. |
| Chunk size at index time | Whole-file is simple but coarse. 500-token chunks give finer retrieval at the cost of ~10x more vectors. |
| temperature on the chat model | Set low (0.0–0.3) for factual answers; the model should be quoting, not improvising. |
| System prompt strictness | “Cite documents” + “say so if unknown” is the difference between RAG and a model that just sometimes incorporates your context. |
What this recipe doesn’t cover
- Reranking: top-k by embedding similarity is fast but coarse. See the Semantic search recipe for the chat-model-as-reranker pattern.
- Streaming responses: Model::chat() is currently synchronous. See the roadmap.
- Production-grade chunking: markdown-aware splitting that respects code blocks, headers, lists. Worth a library; not in scope for ext-infer itself.
Next
- Semantic search — the building-block underneath this.
- Worker pools — running RAG queries under concurrent load.
- Choosing a model — picking the right embedding + chat models for your corpus.
Worker pools
LLM inference is slow: tens of milliseconds at best, often seconds. Running it inline in an FPM worker means that worker is unavailable for any other request until the model is done. For any non-trivial deployment, you want a pool of workers — process-based, thread-based, or queue-based — that absorbs the latency without starving the rest of your app.
ext-infer is designed to slot into all three patterns.
Pattern 1 — FPM workers (process-based)
The simplest production setup: PHP-FPM with pm.max_children set
high enough to absorb concurrent slow inference requests.
; php-fpm.d/www.conf
pm = dynamic
pm.max_children = 16
pm.start_servers = 4
pm.min_spare_servers = 2
pm.max_spare_servers = 8
pm.process_idle_timeout = 60s
Each FPM worker is its own OS process. They each load their own
Model once at warm-up and reuse it for the lifetime of the worker.
The model weights are mmap’d, so the OS shares physical memory across
workers: 16 workers loading the same 4 GB model use ~4 GB of RAM for
the weights in total, not 64 GB.
// Shared service container — boot once per worker.
$model = Displace\Infer\Model::load($cfg['model_path']);
// In your request handler:
$response = $model->chat($prompt, maxTokens: 512);
The downside: each worker can handle one inference at a time. If you
hit pm.max_children concurrent requests, the (max_children + 1)st
request waits. Bump max_children if you have the RAM (the model is
shared via mmap; only the KV cache scales with concurrency); push
inference to a queue if you don’t.
Sizing
A rough sizing heuristic for FPM with ext-infer:
max_children ≈ (RAM_budget - model_size) / per_request_memory
Where per_request_memory is the KV cache footprint plus PHP’s
working set — usually 100–500 MB per worker depending on nCtx.
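As a worked example of the heuristic, the helper below does the division in whole megabytes. The input numbers are illustrative placeholders rather than measurements; measure the real per-worker footprint on your own hardware before committing to a value. maxChildren() is a hypothetical name.

```php
<?php
declare(strict_types=1);

// max_children ≈ (RAM_budget - model_size) / per_request_memory, all in MB.
function maxChildren(int $ramBudgetMb, int $modelMb, int $perRequestMb): int
{
    if ($perRequestMb <= 0) {
        throw new InvalidArgumentException('per-request memory must be positive');
    }
    // Never go negative when the model alone exceeds the budget.
    return intdiv(max(0, $ramBudgetMb - $modelMb), $perRequestMb);
}

// 32 GB budget, 4.4 GB model, 512 MB per worker (all illustrative).
echo maxChildren(32768, 4400, 512), PHP_EOL; // 55
```

Leave headroom below the computed figure for the OS page cache and anything else running on the box.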
Pattern 2 — Job queue (process-based, decoupled)
For inference that takes long enough that you don’t want it in the request path at all:
// In the request handler — enqueue, return immediately.
$jobId = $queue->push(InferJob::class, [
'prompt' => $prompt,
'options' => ['maxTokens' => 512],
]);
return new JsonResponse(['job_id' => $jobId, 'status' => 'queued']);
// Client polls /jobs/{id} until status = 'done'.
// In your queue worker — long-lived, model loaded once at boot.
final class InferWorker
{
public function __construct(private \Displace\Infer\Model $model) {}
public function process(InferJob $job): InferResult
{
$r = $this->model->chat($job->prompt, ...$job->options);
return new InferResult($r->answer(), $r->finishReason());
}
}
Any queue runner works — Symfony Messenger, Laravel Horizon,
ReactPHP’s react/event-loop, a bespoke
pcntl_fork + proc_open script. The pattern is the same: one
Model::load() per worker process, reuse across many jobs.
This pattern shines when:
- Inference latency is unpredictable and you don’t want to hold HTTP connections open.
- You want to scale inference workers independently of web workers.
- You want to route inference traffic across heterogeneous workers (CPU-only on cheap nodes, GPU-equipped on others).
Pattern 3 — ZTS + parallel (thread-based)
For latency-sensitive workloads where the IPC overhead of pattern 2 is
too much, ext-infer supports concurrent calls within a single
process under ZTS PHP with the parallel
extension.
This works because ext-infer is thread-safe by design:
- LlamaBackend is a Sync-guarded process-global singleton.
- LlamaModel (the weights) is immutable after load; llama.cpp explicitly supports many contexts on one model.
- Each chat() / raw() / embed() call builds its own per-call LlamaContext and drops it after.
Two threads calling Model::chat() simultaneously on the same handle
is the supported, intended shape.
use Displace\Infer\Model;
use Displace\Infer\Prompt;
use parallel\Runtime;
// Load the model once in the main thread.
$model = Displace\Infer\Model::load('models/qwen3.gguf');
// Spin up a pool of runtimes.
$runtimes = array_map(fn() => new Runtime(), range(1, 4));
// Dispatch concurrent inferences.
$futures = [];
foreach ($prompts as $i => $prompt) {
$rt = $runtimes[$i % 4];
$futures[$i] = $rt->run(function (Model $m, Prompt $p) {
return $m->chat($p, maxTokens: 512)->answer();
}, [$model, $prompt]);
}
// Collect.
$answers = array_map(fn($f) => $f->value(), $futures);
$model->close();
Caveats
- ZTS PHP is uncommon. Most distros ship NTS by default; you’ll have to build ZTS PHP from source (./configure --enable-zts) or use a ZTS-shipping Docker image. PIE’s pre-built binaries target NTS for v0.1; ZTS binaries are on the roadmap.
- parallel itself requires ZTS. Can’t use it on a standard NTS install.
- CI doesn’t exercise this yet. ZTS support is enabled in composer.json because the code is thread-safe by construction, but the maintainers have not yet run multi-threaded stress tests in CI. Treat it as “should work, please report bugs” until that changes. See Threading & ZTS for the current state.
Choosing between the patterns
| Concern | FPM workers | Job queue | ZTS + parallel |
|---|---|---|---|
| Easy to set up | ✅ trivial | ⚠️ some IPC | ⚠️ ZTS build |
| Holds HTTP connection during inference | yes | no | yes |
| Survives PHP being NTS | ✅ | ✅ | ❌ |
| Shares one model across all concurrency | via mmap | per-worker | within process |
| Scales to many concurrent inferences | ⚠️ workers eat RAM | ✅ horizontal | ⚠️ one process |
| Production-tested in ext-infer | ✅ | ✅ | ⚠️ unexercised |
For most teams, FPM with a generous max_children is the right
starting point. Move to a queue when latency variance gets too high
for the request path. Reach for parallel last, when you’ve measured
that IPC overhead is the bottleneck.
Next
- Threading & ZTS: what makes the parallel story actually work.
- Performance tuning: what knobs to pull when each worker is too slow.
API surface
The complete public PHP API in one place. Every method, every argument, every return type. Read this when you know what you’re looking for and just want the signature; read the Guide when you want to understand why.
For an authoritative copy in PHP-stub form (consumed by IDEs and
static analyzers like PHPStan), see
stubs/infer.stubs.php.
Displace\Infer\Model
final class Model
{
public static function load(
string $path,
array $options = [],
): self;
public function chat(
\Displace\Infer\Prompt $prompt,
int $maxTokens = 128,
int $nCtx = 2048,
float $temperature = 0.0,
int $seed = 1234,
): \Displace\Infer\Response;
public function raw(
string $prompt,
int $maxTokens = 128,
int $nCtx = 2048,
float $temperature = 0.0,
int $seed = 1234,
bool $addBos = true,
): string;
public function embed(
string $text,
): \Displace\Infer\Embedding;
public function close(): void;
}
new Model() throws — use Model::load(). close() is idempotent
(safe to call from finally blocks).
See Choosing a model, Chat completions, Raw completions, Embeddings, and Options reference.
Displace\Infer\Prompt
final class Prompt
{
public static function system(string $content): self;
public static function user(string $content): self;
public function withSystem(string $content): self;
public function withUser(string $content): self;
public function withAssistant(string $content): self;
/** @return list<\Displace\Infer\Message> */
public function messages(): array;
public function lastRole(): ?string;
public function count(): int;
public function isEmpty(): bool;
}
Immutable. new Prompt() throws — use a factory. See
Prompts.
Displace\Infer\Message
final class Message
{
public function role(): string; // 'system' | 'user' | 'assistant'
public function content(): string;
}
Read-only. Constructed only by Prompt; new Message() throws.
Displace\Infer\Response
final class Response
{
public function text(): string;
public function reasoning(): ?string;
public function answer(): string;
public function hasReasoning(): bool;
public function finishReason(): string; // 'eos' | 'length' | 'stop'
public function tokensGenerated(): int;
}
Read-only. Constructed only by Model::chat(); new Response()
throws. See Chat completions.
Displace\Infer\Embedding
final class Embedding
{
/** @return list<float> */
public function vector(): array;
public function dimensions(): int;
public function norm(): float;
public function normalize(): self;
public function cosineSimilarity(\Displace\Infer\Embedding $other): float;
}
Read-only. Constructed only by Model::embed(); new Embedding()
throws. See Embeddings.
Exception hierarchy
\RuntimeException
└── Displace\Infer\InferException
├── Displace\Infer\ModelLoadException
└── Displace\Infer\InferenceException
InferException extends PHP’s built-in \RuntimeException, so any
generic catch (\RuntimeException $e) clause sees ext-infer errors.
See Exceptions for which methods raise which
subclass.
Conventions
- Direct construction is refused on Prompt, Message, Response, Embedding, and Model. Each one throws InferException from its __construct with a hint at the right factory. This is so an arbitrary new Embedding() can’t lie about which model produced it.
- All with* methods on Prompt return a new instance. They never mutate. This is the only place the API exposes the “build by chaining” pattern; Embedding::normalize() also returns a new instance.
- Sampling args are named, never positional. Model::chat() and Model::raw() use PHP 8 named arguments (maxTokens: 256, temperature: 0.7), not an options array. Load options are an array because they’re rare and compose with config-from-disk patterns.
Exceptions
ext-infer raises exceptions for every error condition — no silent
false returns, no error codes. The hierarchy is small enough that
you can catch precisely or broadly depending on what you’re after.
Hierarchy
\RuntimeException
└── Displace\Infer\InferException
├── Displace\Infer\ModelLoadException
└── Displace\Infer\InferenceException
- InferException extends PHP’s \RuntimeException. Catching \RuntimeException in generic top-level handlers (e.g. a PSR-15 middleware) sees every ext-infer error.
- ModelLoadException is raised exclusively from Model::load().
- InferenceException is raised from Model::chat(), Model::raw(), Model::embed(), and Embedding::cosineSimilarity().
- InferException itself (the base class, not just an instance of a subclass) is raised for “this method should never have been called” errors; see Direct construction.
Which method raises what
| Method | Class | Common causes |
|---|---|---|
| Model::load() | ModelLoadException | Missing file, malformed GGUF, backend init failure. |
| Model::load() | InferException | Invalid option type/value (e.g. pooling set to "weighted"). |
| Model::chat() | InferenceException | Model closed, no chat template, decode failure, prompt over nCtx. |
| Model::raw() | InferenceException | Model closed, decode failure, prompt over nCtx. |
| Model::embed() | InferenceException | Model closed, model not loaded with embedding: true, decode failure. |
| Embedding::cosineSimilarity() | InferenceException | Dimension mismatch between the two embeddings. |
| new Model() / new Prompt() / new Message() / new Response() / new Embedding() | InferException | Direct construction is refused; use the appropriate factory. |
Direct construction
Model, Prompt, Message, Response, and Embedding all refuse
direct new. Each throws InferException (the base class) with a
hint pointing at the right factory:
new Embedding();
// Displace\Infer\InferException:
// Displace\Infer\Embedding is produced by Model::embed();
// do not instantiate directly
This is deliberate: a new Embedding() from PHP code could lie about
which model produced it and what pooling strategy was applied — silent
mistakes that are hard to debug later. Forcing factory construction
keeps the invariants tight.
Catching strategies
Catch broadly at the top
For a request handler that wants to convert any ext-infer failure
into a 5xx response:
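try {
    $reply = $model->chat($prompt);
} catch (\Displace\Infer\InferException $e) {
    $log->error('inference failed', ['error' => $e->getMessage()]);
    // Response here is your HTTP framework's response class (for example a
    // PSR-7 implementation), not Displace\Infer\Response.
    return new Response(500, [], 'Inference temporarily unavailable.');
}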
Distinguish load failures from inference failures
For a CLI tool that wants different exit codes:
try {
$model = Model::load($path);
} catch (\Displace\Infer\ModelLoadException $e) {
fwrite(STDERR, "model: " . $e->getMessage() . PHP_EOL);
exit(2);
}
try {
$r = $model->chat($prompt);
} catch (\Displace\Infer\InferenceException $e) {
fwrite(STDERR, "inference: " . $e->getMessage() . PHP_EOL);
exit(3);
}
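If the same mapping is needed in several commands, it can be pulled into one function. The sketch below is an assumption-laden illustration: `exitCodeFor()` is a hypothetical helper, codes 2 and 3 mirror the example above, and codes 1 and 4 are invented for this sketch. Class names are compared as strings so the file runs even when the extension is not loaded.

```php
<?php
declare(strict_types=1);

/**
 * Map a throwable to a CLI exit code.
 * 2 and 3 follow the example in this guide; 1 (anything else) and
 * 4 (base InferException) are assumptions of this sketch.
 */
function exitCodeFor(\Throwable $e): int
{
    // Most specific classes first; the base class is checked last.
    $map = [
        'Displace\Infer\ModelLoadException' => 2,
        'Displace\Infer\InferenceException' => 3,
        'Displace\Infer\InferException'     => 4, // hypothetical code for misuse
    ];
    foreach ($map as $class => $code) {
        if (is_a($e, $class)) {
            return $code;
        }
    }
    return 1;
}

echo exitCodeFor(new \RuntimeException('unrelated')), PHP_EOL; // 1
```

Ordering matters here: because both subclasses extend `InferException`, checking the base class first would swallow the more specific codes.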
Retry vs surface
InferenceException covers two flavors of failure:
- Transient: out-of-memory under load, e.g. use_mlock plus a large prompt. Often resolved by reducing nCtx or splitting the work.
- Permanent: model has no chat template, prompt has null bytes, invalid option. Retrying makes no sense.
The message string is the only signal you have today; structured error codes are on the roadmap. For now, a pragmatic split:
try {
$r = $model->chat($prompt, maxTokens: $budget);
} catch (\Displace\Infer\InferenceException $e) {
if (str_contains($e->getMessage(), 'n_ctx')) {
// prompt too long — surface to caller, don't retry
throw $e;
}
// other inference failure — log + maybe retry
$log->warning('chat failed, retrying once', ['error' => $e->getMessage()]);
$r = $model->chat($prompt, maxTokens: $budget);
}
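The same split can be factored into a small reusable helper. This is a sketch under stated assumptions: `looksPermanent()` and `withRetry()` are hypothetical names; the substrings `n_ctx` and `model has been closed` come from this guide, while `chat template` is a guess at the real message text. It catches `\RuntimeException` so it runs without the extension; real code would catch `InferenceException`.

```php
<?php
declare(strict_types=1);

/**
 * Guess whether a failure message describes a permanent error.
 * 'chat template' is an assumed substring and may not match real messages.
 */
function looksPermanent(string $message): bool
{
    foreach (['n_ctx', 'model has been closed', 'chat template'] as $needle) {
        if (str_contains($message, $needle)) {
            return true;
        }
    }
    return false;
}

/**
 * Run $op, retrying up to $retries more times unless the failure looks permanent.
 *
 * @param callable(): mixed $op would wrap $model->chat(...)
 */
function withRetry(callable $op, int $retries = 1): mixed
{
    for ($attempt = 0; ; $attempt++) {
        try {
            return $op();
        } catch (\RuntimeException $e) {
            if ($attempt >= $retries || looksPermanent($e->getMessage())) {
                throw $e; // surface to the caller
            }
        }
    }
}

// Stand-in operation: fails once with a transient-looking error, then succeeds.
$calls = 0;
$result = withRetry(function () use (&$calls): string {
    if (++$calls === 1) {
        throw new \RuntimeException('decode failed');
    }
    return 'ok';
});
echo $result, ' after ', $calls, ' calls', PHP_EOL; // ok after 2 calls
```

Matching on message text is brittle by nature; keeping the substrings in one function makes them easy to replace once structured error codes exist.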
Always-safe patterns
Model::close() is idempotent — calling it on an already-closed
model is a no-op. Safe inside finally:
$model = Model::load($path);
try {
return $model->chat($prompt);
} finally {
$model->close();
}
After close(), every other method on that Model raises
InferenceException with "model has been closed".
Environment variables
The extension reads exactly one environment variable today. We’ll add more as they earn their keep; the conservative approach is to keep configuration in PHP (named arguments, load options) rather than sprinkled across the environment.
EXT_INFER_LOG
Restores llama.cpp’s verbose stderr logging, which is silenced by default.
| Value | Effect |
|---|---|
| (unset) | llama.cpp logs are silenced. This is the default. |
| Any value | llama.cpp logs are passed through to stderr verbatim. |
EXT_INFER_LOG=1 php hello.php
Why silence by default?
A single Model::load() + chat() pair against a typical GGUF
produces several hundred lines of stderr — model metadata, KV-cache
sizing, graph reservation, attention layout, sampler config, and
more. For a CLI tool drilling into a problem it’s useful; for a PHP
extension running inside a request, it’s structured-log poison.
When to enable it
- Diagnosing a ModelLoadException. The verbose log dumps the GGUF header before failing, which usually points at the cause (wrong architecture, wrong quant, truncated file).
- Diagnosing a slow load. The log shows where the time goes: reading from disk, mmap setup, weight copy.
- Reporting an issue. The first thing maintainers will ask for is the verbose log; capture it once with EXT_INFER_LOG=1 and paste.
How it works
The extension hooks llama_log_set at backend init time, replacing
llama.cpp’s default callback with a no-op. The hook is process-global —
once installed, it covers every subsequent call. EXT_INFER_LOG is
checked only at backend init (the first time Model::load() is
called); changing the variable mid-process has no effect.
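Because the variable is only consulted at backend init, a script that wants verbose logs has to set it before its first `Model::load()`. The sketch below only shows the PHP side of that ordering. `inferLogRequested()` is a hypothetical helper, and two things are assumptions: that "any value" means "set at all" (the handling of an empty string is not documented here), and that the extension reads the process environment that `putenv()` modifies.

```php
<?php
declare(strict_types=1);

/**
 * Report whether verbose llama.cpp logging has been requested for this process.
 * "Any value" is read here as "the variable is set at all".
 */
function inferLogRequested(): bool
{
    return getenv('EXT_INFER_LOG') !== false;
}

putenv('EXT_INFER_LOG');       // make sure it is unset
var_dump(inferLogRequested()); // bool(false)

putenv('EXT_INFER_LOG=1');     // must happen before the first Model::load()
var_dump(inferLogRequested()); // bool(true)
```

Setting the variable in the shell, as in the command above, avoids the ordering question entirely and is the safer default.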
Reserved for future use
These names are not consumed by the extension today but may be in future versions. Avoid using them as application env vars to keep your forward-upgrade path clean:
- EXT_INFER_DEFAULT_NCTX
- EXT_INFER_DEFAULT_TEMPERATURE
- EXT_INFER_BACKEND (CPU / Metal / CUDA selection at runtime)
If you want any of these to land sooner rather than later, open an issue with the use case.
Compatibility matrix
PHP versions
| Version | Status | Notes |
|---|---|---|
| 8.3 | ✅ supported | Security-only upstream through end of 2027. |
| 8.4 | ✅ supported | Active support. |
| 8.5 | ✅ supported | Current release. |
| 8.2 and earlier | ❌ not supported | composer.json declares php: ^8.3. |
Every released binary is built against a specific PHP minor. A binary built for PHP 8.4 will not load into PHP 8.5 or 8.3. PIE handles this automatically (it picks the right tarball); manual installs need to match versions explicitly.
Operating systems
| Platform | Status | Notes |
|---|---|---|
| macOS arm64 | ✅ supported | Apple Silicon. Tested on macOS 14+. |
| macOS x86_64 | ⚠️ not in release matrix | Builds from source. We don’t ship binaries. |
| Linux x86_64 (glibc) | ✅ supported | Ubuntu 22.04+, Debian 12+, RHEL 9+. Most modern distros. |
| Linux arm64 (glibc) | ✅ supported | Ubuntu 24.04 arm64, Debian 12 arm64, AWS Graviton. |
| Linux musl (Alpine) | ⚠️ builds from source | .cargo/config.toml has the right crt-static opt-out; no released binary. |
| FreeBSD / OpenBSD | ⚠️ builds from source | Untested but should work; the build script handles non-Linux non-macOS as Linux. |
| Windows | ❌ excluded | os-families-exclude: ["windows"] in composer.json. Out of scope for v0.1. |
Threading
ext-infer is thread-safe by design — the LlamaBackend
singleton is guarded by a Sync mutex, the underlying LlamaModel’s
weights are read-only after load (llama.cpp explicitly supports many
contexts on one model), and each chat() / raw() / embed() call
builds its own per-call LlamaContext. Two threads calling
Model::chat() concurrently on the same handle is the supported,
intended shape.
| PHP build | Status | Notes |
|---|---|---|
| NTS | ✅ supported, the default | What every release binary targets today. |
| ZTS | ✅ supported (support-zts: true in composer.json) | Not yet exercised in CI. See Threading & ZTS. |
Acceleration backends
| Backend | Status | Notes |
|---|---|---|
| CPU (default) | ✅ supported, the default | Portable, no hardware requirements. |
| Apple Metal | ⚠️ opt-in via cargo feature | make release FEATURES=metal. See Apple Metal. |
| CUDA (NVIDIA GPU) | ❌ not yet | llama-cpp-2 supports it via a cargo feature; we haven’t exposed or tested it. |
| ROCm / Vulkan | ❌ not yet | Same — supported upstream, not surfaced. |
If you want CUDA or other GPU acceleration sooner rather than later, open an issue describing your use case — surfacing the feature is small work; testing it across the GPU landscape is the hard part.
Tested model families
What the maintainers have actually exercised end-to-end. Other GGUF-supported families almost certainly work; this is the “we’ve seen it produce sensible output” list.
| Family | Used for |
|---|---|
| Qwen3 (Instruct) | Chat completions, reasoning splitting. |
| Qwen3-Embedding | Embeddings, cosine similarity. |
| Llama 3 / 3.1 / 3.2 | Chat completions. No reasoning. |
| Mistral | Chat completions. |
| BGE / E5 / GTE | Embeddings. |
Versioning policy
- Pre-1.0 (
0.x.y), breaking changes happen between minors (0.1.x → 0.2.x), not patches.
- Once v1.0.0 ships, the class / method / argument surface is frozen. New features land additively; behavioral changes that affect existing callers wait for the next major.
- See RELEASE.md for the cut-a-release flow.
Reporting compatibility issues
If you hit a “should work but doesn’t” combination on this matrix, the issue template asks for:
- PHP version (php --version)
- OS / arch (uname -a)
- libc (Linux: ldd --version | head -1)
- ZTS or NTS (php -i | grep 'Thread Safety')
- Whether the extension was installed via PIE, make install, or loaded with -d extension=…
Three of those five are usually enough to triage.
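Most of these facts are visible from inside PHP, so a short script can gather them in one go. This is a convenience sketch rather than part of the issue template: `compatibilityReport()` is a hypothetical helper, and the extension name `infer` is taken from the `extension=infer.so` line used elsewhere in this guide. libc and the install method still have to be filled in by hand.

```php
<?php
declare(strict_types=1);

/**
 * Collect the compatibility facts that PHP can see for itself.
 *
 * @return array<string, string>
 */
function compatibilityReport(): array
{
    return [
        'php'           => PHP_VERSION,
        'os'            => PHP_OS_FAMILY,
        'arch'          => php_uname('m'),
        'thread_safety' => PHP_ZTS ? 'ZTS' : 'NTS',
        'infer_loaded'  => extension_loaded('infer') ? 'yes' : 'no',
    ];
}

foreach (compatibilityReport() as $key => $value) {
    printf("%-15s %s\n", $key . ':', $value);
}
```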
Threading & ZTS
ext-infer is thread-safe by design. This page documents what
that actually means: where the synchronization happens, what the
runtime expectations are, and where the rough edges still are.
The thread-safety story, top to bottom
1. LlamaBackend is a Sync-guarded singleton
llama.cpp’s LlamaBackend::init() is process-global state.
Initializing it twice is undefined behavior; not initializing it at
all means no inference. ext-infer resolves this with:
use std::sync::{Mutex, OnceLock};

static BACKEND: OnceLock<LlamaBackend> = OnceLock::new();
static BACKEND_INIT: Mutex<()> = Mutex::new(());
The first Model::load() call (from any thread) acquires the mutex,
checks the OnceLock, calls LlamaBackend::init() if needed, and
publishes the result. Every subsequent call sees a populated
OnceLock and returns immediately without re-acquiring. The mutex is
contended only during cold startup.
OnceLock<T> is Sync as long as T: Send + Sync, which
LlamaBackend is.
2. LlamaModel weights are immutable after load
llama.cpp explicitly supports running multiple contexts in parallel
against a single loaded model. The weights are read-only after load_from_file
returns; only the per-context state (KV cache, sampler state) mutates
during inference.
This is what makes the “load once, use from many threads” pattern work without any locking on the model itself.
3. Per-call LlamaContext
Model::chat(), Model::raw(), and Model::embed() each build a
fresh LlamaContext for the duration of the call and drop it on
the way out. Two threads calling chat() simultaneously get two
independent contexts that share the same underlying weights via
references.
// Inside run_completion:
let ctx_params = LlamaContextParams::default().with_n_ctx(Some(n_ctx));
let mut ctx = model.new_context(backend, ctx_params)?;
// ... decode, sample, decode, sample ...
// ctx dropped at function exit
No state survives the call. No cleanup is required. No two threads
ever touch the same LlamaContext.
4. Model::close() is the one &mut self method
PHP’s runtime serializes calls into the same object method via its
own object lock, so close() from one thread while another calls
chat() should be safe by the runtime’s invariants — but it’s the
one place where the Rust code mutates the Model itself
(self.inner = None). The worst case is the use-after-close error,
which is what close() is supposed to provoke anyway.
When you actually get concurrency
Three deployment shapes use this thread-safety:
- PHP-FPM workers (process-based): each worker is independent; the thread-safety story doesn’t matter, but the mmap-sharing story does. See Worker pools.
- ZTS PHP + parallel (thread-based): one PHP process, multiple OS threads, each calling chat() on a shared Model. This is what the thread-safety story is for.
- Swoole / ReactPHP coroutines (single-threaded but context-switching): not actually concurrent at the OS level, so thread-safety isn’t strictly required; you’ll still benefit from the per-call context pattern because no global state survives.
ZTS-specific notes
ZTS (Zend Thread Safe) is a PHP build mode that adds thread-local storage (TLS)
around engine globals so multiple PHP interpreters can run in one
process. It’s required for pthreads
(EOL) and the more modern parallel
extension.
Detecting ZTS
php -i | grep 'Thread Safety'
# expected: Thread Safety => enabled
Or from PHP:
if (PHP_ZTS) {
// ZTS build
}
Installing ZTS PHP
Most distros ship NTS PHP. To get ZTS:
- Ubuntu / Debian: build from source with ./configure --enable-zts. Some PPAs (ondrej/php) ship a ZTS variant under php{X}.{Y}-zts but coverage is spotty.
- macOS: Homebrew’s php@* formulas are NTS. Use phpbrew install +zts +parallel or build from source.
- Docker: official php:*-cli images are NTS. The community silkeh/php images include ZTS variants.
ext-infer v0.1 ships NTS-only release binaries. ZTS users need
to build from source. The composer.json declares
support-zts: true so a future ZTS release can ship without changing
the install story.
Loading ext-infer into ZTS PHP
Same extension=infer line in php.ini, plus parallel if you want
threading:
extension=infer.so
extension=parallel.so
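Code that should run on both NTS and ZTS installs can check both preconditions up front and fall back to a plain loop. A minimal sketch, assuming a hypothetical `concurrencyMode()` helper; the labels `threads` and `sequential` are this sketch's own.

```php
<?php
declare(strict_types=1);

/**
 * Decide whether the thread-based pattern is available in this process.
 * Returns 'threads' only when PHP is ZTS and parallel is loaded;
 * otherwise 'sequential', so callers can fall back to a plain loop.
 */
function concurrencyMode(bool $zts, bool $hasParallel): string
{
    return ($zts && $hasParallel) ? 'threads' : 'sequential';
}

$mode = concurrencyMode((bool) PHP_ZTS, extension_loaded('parallel'));
echo "concurrency mode: {$mode}", PHP_EOL;
```

Taking the two facts as parameters keeps the decision testable on any build.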
A minimal parallel test
<?php
use parallel\Runtime;
use Displace\Infer\Model;
use Displace\Infer\Prompt;
$model = Model::load('models/Qwen3-0.6B-Q8_0.gguf');
$rt1 = new Runtime();
$rt2 = new Runtime();
$f1 = $rt1->run(function (Model $m) {
return $m->chat(Prompt::user('What is the capital of France?'))->answer();
}, [$model]);
$f2 = $rt2->run(function (Model $m) {
return $m->chat(Prompt::user('What is the capital of Italy?'))->answer();
}, [$model]);
echo "F: ", $f1->value(), PHP_EOL;
echo "I: ", $f2->value(), PHP_EOL;
$model->close();
If this works, you have concurrent inference. If it crashes — please open an issue with the model name, PHP version, build flags, and the crash output. CI doesn’t exercise this path yet, so user reports are the canary.
Future work
Two threading-related items on the roadmap:
CI exercise for ZTS
Add a parallel-driven stress test to CI. Today the matrix only
covers NTS. Adding ZTS will require:
- Building a ZTS-PHP runner image (the maintainers haven’t picked one yet).
- Adding a ZTS leg to ci.yml and the release matrix in release.yml.
Reusable session contexts
Today, every chat() call rebuilds the LlamaContext from scratch.
That drops the KV cache, so multi-turn conversations re-prefill on
every turn. A Session abstraction that owns a long-lived context
would let users opt into KV-cache reuse for back-to-back turns of the
same conversation. Tracked in PLAN.md.
This wouldn’t change the thread-safety story — each Session would
be owned by one thread (or guarded by a mutex if shared) — but it
would significantly improve multi-turn performance.
Next
- Worker pools recipe — practical patterns for concurrency in production.
- Performance tuning — knobs that matter once you’ve got the concurrency story right.
Apple Metal
Metal is Apple’s low-level GPU API. On Apple Silicon hardware (M1 / M2 / M3 / M4), llama.cpp uses Metal to offload weight-matrix multiplications to the integrated GPU, which substantially outpaces the CPU for medium-to-large models.
ext-infer exposes Metal as an opt-in cargo feature. It is not
enabled by default — the default build is CPU-only and portable to
non-Apple platforms.
When Metal helps
Order-of-magnitude rule of thumb on an M-series Mac:
| Model size | CPU tokens/sec | Metal tokens/sec | Speedup |
|---|---|---|---|
| 0.6B | ~80 | ~120 | 1.5× |
| 3B | ~25 | ~70 | 2.8× |
| 7B | ~12 | ~50 | 4× |
| 13B+ | (memory-limited) | ~25 | dramatic |
Numbers are rough — they depend on quant level, M-series generation, prompt length, and what else the machine is doing. The pattern is clear though: Metal’s value grows with model size.
For small models on a fast CPU, Metal can actually be slower on the first few tokens because of the shader compilation overhead. If you’re running 600M-param models in batch mode, the CPU build is likely fine.
Enabling Metal
The cargo feature is named metal:
make release FEATURES=metal
make install FEATURES=metal
Or via raw cargo:
cargo build --release --features metal
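The release binary is now Metal-enabled. There is no separate runtime flag to switch Metal on: the backend is available whenever the cargo feature is compiled in, and the n_gpu_layers load option decides how many layers actually run on it.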
Per-layer offload
The Model::load() option n_gpu_layers controls how many
transformer layers are offloaded to the GPU. Defaults to 0 (CPU
only); set to a high number (the model’s total layer count, or just
999 as an “all of them” shortcut) to offload everything:
$model = Model::load($path, [
'n_gpu_layers' => 999, // offload all layers to Metal
]);
For models that fit entirely in unified memory, full offload is almost always what you want. For models that don’t fit, partial offload lets you put the hot lower layers on the GPU and keep the upper layers on CPU. Tune empirically; the upstream llama.cpp Metal docs have more.
Why isn’t it the default?
Three reasons we ship CPU-by-default and Metal-by-opt-in for v0.1:
- The release matrix builds on the GitHub macos-14 runner. Its hardware revision and MACOSX_DEPLOYMENT_TARGET are not fully pinned, and we haven’t validated that a Metal-enabled binary built there actually loads on every customer Mac.
- CI doesn’t test Metal output for correctness. Different precision behavior on GPU vs CPU could surface as different sampler output, and we haven’t caught that drift end-to-end yet.
- Cold-start cost. Metal shader compilation adds ~1s to the first inference. Acceptable for long-running workers, awkward for a CLI tool people run once.
Making Metal the default for macos-arm64 release tarballs is on the roadmap once those three concerns are resolved.
Verifying Metal is actually being used
Enable EXT_INFER_LOG and look for
Metal-specific lines:
EXT_INFER_LOG=1 php hello.php 2>&1 | grep -i metal | head
You should see something like:
ggml_metal_init: GPU name: Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 48318.38 MiB
If you see no Metal lines at all, the cargo feature didn’t get
applied — re-check the make release FEATURES=metal invocation.
Memory considerations
Apple Silicon has unified memory — the GPU and CPU share the same physical RAM. There is no “host-to-device” copy step like on discrete GPUs. The trade-off is that GPU memory pressure shows up as overall system memory pressure: a 13B model in Metal mode uses ~8 GB of the same RAM your other apps need.
recommendedMaxWorkingSetSize in the log above is what macOS thinks
you should keep the GPU footprint under. Loading a model larger than
that will work — Metal pages weights in and out as needed — but
performance drops sharply.
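To turn that log line into a quick check, the value can be parsed out of captured stderr and compared with the model file size. This is a rough sketch: `recommendedWorkingSetMiB()` and `fitsWorkingSet()` are hypothetical helpers, the line format is taken from the sample output above and may differ between llama.cpp versions, and file size understates the real footprint because the KV cache is not included.

```php
<?php
declare(strict_types=1);

/**
 * Pull recommendedMaxWorkingSetSize (in MiB) out of captured llama.cpp log
 * text. Returns null when no such line is present.
 */
function recommendedWorkingSetMiB(string $log): ?float
{
    if (preg_match('/recommendedMaxWorkingSetSize\s*=\s*([0-9.]+)\s*MiB/', $log, $m) === 1) {
        return (float) $m[1];
    }
    return null;
}

/** True when a model file of $modelBytes fits under the recommended limit. */
function fitsWorkingSet(int $modelBytes, float $limitMiB): bool
{
    return $modelBytes / (1024 * 1024) <= $limitMiB;
}

$log = "ggml_metal_init: recommendedMaxWorkingSetSize = 48318.38 MiB\n";
$limit = recommendedWorkingSetMiB($log);
var_dump($limit);                                // float(48318.38)
var_dump(fitsWorkingSet(8 * 1024 ** 3, $limit)); // bool(true): 8 GiB is 8192 MiB
```

In practice `$modelBytes` would come from `filesize($path)` on the GGUF.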
Cross-platform note
#[cfg(feature = "metal")] only enables Metal on Apple targets.
Building with --features metal on Linux is harmless (the feature is
a no-op there), but there’s no reason to do it.
For GPU acceleration on non-Apple hardware (CUDA on NVIDIA, ROCm on
AMD, Vulkan as a portable option) — the llama-cpp-2 crate supports
all three, but ext-infer hasn’t surfaced them as cargo features yet.
If you want one, open an issue.
Next
- Performance tuning: once Metal is on, the next bottleneck is usually nCtx or maxTokens.
- Choosing a model: Metal opens up larger models you might not have considered.
Performance tuning
A Model::chat() call has three dominant costs:
- Loading: one-time on the first call. Dominated by disk I/O (or mmap setup) for the GGUF.
- Prompt prefill: tokenize + forward-pass on the prompt. Scales roughly linearly with prompt length.
- Token generation: sample, decode, sample, decode, … Scales linearly with maxTokens (or wherever the model chooses to stop).
This page walks through each, what knobs affect it, and what trade-offs each knob carries.
Reducing load time
Load is the slow part — for a 4 GB model from a cold cache, expect 1–3 seconds on SSD, longer on spinning rust.
use_mmap (default true)
Memory-mapping the GGUF skips the explicit read() syscall and lets
the OS page weights in lazily. Always leave this on unless you’re
diagnosing a specific mmap issue. Without it, load reads the entire
file upfront — slower for large models, identical for small ones once
cached.
$model = Model::load($path, ['use_mmap' => true]); // default
use_mlock (default false)
mlock pins the model’s pages in physical RAM so the OS can’t page
them out. Useful when:
- You’re on a memory-constrained machine and would rather OOM than thrash.
- You’re serving a large model under unpredictable load and want predictable latency.
The cost: that physical memory is unavailable to anything else on the system. Don’t turn it on unless you know you want it.
$model = Model::load($path, ['use_mlock' => true]);
On Linux, mlock is subject to a per-process limit (RLIMIT_MEMLOCK). The
default is far smaller than any model (commonly between 64 KiB and 8 MiB,
depending on kernel and distro), so you’ll need to raise
it via /etc/security/limits.conf or ulimit -l unlimited. macOS
doesn’t enforce the same limit but may swap aggressively under
memory pressure.
Sharing one load across many workers
If you’re running multiple FPM workers, the OS automatically
deduplicates mmap’d pages across them. 16 workers loading the same
4 GB model consume ~4 GB of physical memory total, not 64. This is
why use_mmap matters even on machines with abundant RAM.
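The arithmetic behind that claim, as a small capacity-planning sketch. `estimateRamGb()` is a hypothetical helper, and `$perWorkerGb` (KV cache, PHP heap, and so on) is a placeholder you would have to measure yourself.

```php
<?php
declare(strict_types=1);

/**
 * Rough physical-memory estimate for N workers serving one model.
 * With mmap the weights are counted once; without it, once per worker.
 */
function estimateRamGb(int $workers, float $modelGb, bool $mmap, float $perWorkerGb = 0.0): float
{
    $weights = $mmap ? $modelGb : $workers * $modelGb;
    return $weights + $workers * $perWorkerGb;
}

echo estimateRamGb(16, 4.0, true), PHP_EOL;  // 4
echo estimateRamGb(16, 4.0, false), PHP_EOL; // 64
```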
Reducing prompt prefill cost
Prefill cost scales with the number of prompt tokens. The longest prompts come from RAG pipelines that inject document context — see RAG over markdown.
nCtx (default 2048)
The context window for a single call. The rendered prompt + generated
tokens must fit. Lower is faster because llama.cpp allocates the
KV cache to nCtx, so a 32k context costs 16× more memory than a 2k
context even when most of it is unused.
$model->chat($prompt, nCtx: 4096, maxTokens: 1024);
For typical RAG/chat use cases, nCtx = 2048 to 4096 is plenty.
Go higher only when the model has been trained for it and you’ve
measured a quality benefit.
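The 16× figure follows from the KV cache growing linearly with nCtx. The sketch below uses the usual back-of-envelope formula (two tensors, K and V, per layer, per position); `kvCacheBytes()` is a hypothetical helper, the model dimensions are illustrative rather than those of any specific model, and an f16 cache (2 bytes per element) is assumed.

```php
<?php
declare(strict_types=1);

/**
 * Approximate KV-cache size in bytes for one context.
 * The leading factor 2 covers the separate K and V tensors.
 */
function kvCacheBytes(int $nCtx, int $layers, int $kvHeads, int $headDim, int $bytesPerElement = 2): int
{
    return 2 * $layers * $nCtx * $kvHeads * $headDim * $bytesPerElement;
}

$small = kvCacheBytes(2048, layers: 28, kvHeads: 8, headDim: 128);
$large = kvCacheBytes(32768, layers: 28, kvHeads: 8, headDim: 128);

echo intdiv($small, 1024 * 1024), ' MiB', PHP_EOL; // 224 MiB
echo intdiv($large, 1024 * 1024), ' MiB', PHP_EOL; // 3584 MiB
echo $large / $small, PHP_EOL;                     // 16
```

Because each call builds its own context, this cost is paid per concurrent call, not once per model.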
Prompt length
The fastest prompt is a short prompt. Common ways to compress without losing fidelity:
- Drop boilerplate from system messages. “You are a helpful assistant. Answer truthfully. Don’t make things up. Be concise. Use markdown formatting. …” is mostly cargo-culted. Test what’s actually load-bearing.
- Truncate conversation history. Keep the last N turns rather than every turn since the dawn of the conversation. For most chatbots, N = 6–10 is plenty.
- Summarize old turns. Replace turns 1–50 with “Earlier, the user asked about X and you said Y.” This is what production chatbots do above a certain length.
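Truncating history is the easiest of these to automate. The sketch below keeps system messages plus the last N other messages. It works on plain arrays rather than `Prompt` objects so it runs without the extension; `keepLastTurns()` is a hypothetical helper, and rebuilding a `Prompt` from the result with `Prompt::system()`, `withUser()`, and `withAssistant()` is left to the caller.

```php
<?php
declare(strict_types=1);

/**
 * Keep every system message plus the last $n non-system messages.
 *
 * @param list<array{role: string, content: string}> $messages
 * @return list<array{role: string, content: string}>
 */
function keepLastTurns(array $messages, int $n): array
{
    $system = array_values(array_filter($messages, fn (array $m) => $m['role'] === 'system'));
    $rest   = array_values(array_filter($messages, fn (array $m) => $m['role'] !== 'system'));
    $tail   = $n > 0 ? array_slice($rest, -$n) : [];
    return array_merge($system, $tail);
}

$history = [
    ['role' => 'system',    'content' => 'Be concise.'],
    ['role' => 'user',      'content' => 'q1'],
    ['role' => 'assistant', 'content' => 'a1'],
    ['role' => 'user',      'content' => 'q2'],
    ['role' => 'assistant', 'content' => 'a2'],
    ['role' => 'user',      'content' => 'q3'],
];

$kept = keepLastTurns($history, 3);
echo implode(' ', array_column($kept, 'content')), PHP_EOL; // Be concise. q2 a2 q3
```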
Reducing token-generation cost
Once prefill is done, each generated token costs roughly the same. Two knobs.
maxTokens (default 128)
The maximum number of generated tokens. Lower is faster. The default is conservative on purpose — bump it for any non-trivial generation:
$model->chat($prompt, maxTokens: 512); // ~4× the default budget
Set it high enough that legitimate answers complete, low enough that runaway generations (which happen) don’t wedge the worker for minutes. For reasoning models, you’ll want at least 512 — they spend many tokens thinking.
When finishReason() === 'length', you hit this budget. Surface it
to the caller so they can decide whether to bump or live with the
truncation.
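One way to act on a `'length'` finish is to retry with a larger budget up to a ceiling. The policy below (double, then cap) is an assumption of this sketch, not something the extension does for you; `nextBudget()` is a hypothetical helper that takes the value of `finishReason()` as a plain string.

```php
<?php
declare(strict_types=1);

/**
 * Suggest the next maxTokens budget after a call, or null to stop.
 * Doubles the budget when the previous call was cut off ('length'),
 * never exceeding $cap.
 */
function nextBudget(string $finishReason, int $current, int $cap): ?int
{
    if ($finishReason !== 'length' || $current >= $cap) {
        return null; // finished naturally, or already at the ceiling
    }
    return min($current * 2, $cap);
}

var_dump(nextBudget('length', 128, 1024));  // int(256)
var_dump(nextBudget('eos', 128, 1024));     // NULL
var_dump(nextBudget('length', 1024, 1024)); // NULL
```

Note that each retry re-runs prefill from scratch, so a generous first budget is usually cheaper than several doublings.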
temperature
temperature = 0.0 is greedy — sample the single highest-probability
token at every step. It’s also the fastest because the sampler is
trivial.
temperature > 0.0 enables the random sampler (with optional seed
for reproducibility), which is marginally slower per token. The
difference is small enough that you should pick based on output
quality, not speed.
Hardware-side knobs
Quantization
A Q4_K_M model is roughly 2× faster than Q8_0 of the same model —
fewer bits to fetch from memory per matrix multiply. See
Choosing a model for the
size/quality table.
If Q4_K_M answers are good enough for your use case, prefer it
over Q8_0. The space and speed savings are real; the quality drop
is usually small for chat workloads.
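A back-of-envelope way to see where the "roughly 2×" comes from: token generation is largely bound by how many bytes of weights must be read per token, so the size ratio approximates the speed ratio. The bits-per-weight figures below are approximations supplied for this sketch (8.5 for Q8_0, about 4.85 for Q4_K_M), and `approxSizeGb()` is a hypothetical helper.

```php
<?php
declare(strict_types=1);

/**
 * Approximate model size in GB (10^9 bytes) from parameter count and
 * average bits per weight. Ignores metadata and non-quantized tensors.
 */
function approxSizeGb(float $params, float $bitsPerWeight): float
{
    return $params * $bitsPerWeight / 8 / 1e9;
}

$q8 = approxSizeGb(7e9, 8.5);  // Q8_0, approximate bits per weight
$q4 = approxSizeGb(7e9, 4.85); // Q4_K_M, approximate bits per weight

echo number_format($q8, 1), ' GB', PHP_EOL;      // 7.4 GB
echo number_format($q4, 1), ' GB', PHP_EOL;      // 4.2 GB
echo number_format($q8 / $q4, 2), 'x', PHP_EOL;  // 1.75x
```

Measured speedups vary with hardware and prompt shape; treat the ratio as an upper-level sanity check, not a prediction.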
GPU offload
The biggest single speedup is moving compute off CPU. On Apple
Silicon, see Apple Metal — n_gpu_layers: 999
typically gives a 3–4× speedup for medium models.
On Linux + NVIDIA, CUDA support exists in llama-cpp-2 but isn’t
surfaced as an ext-infer cargo feature yet. Open an issue
if you want it.
Pinning threads to cores
llama.cpp respects the OMP_NUM_THREADS environment variable.
Setting it explicitly is sometimes faster than the default (which
uses all available cores, including hyperthreads that hurt more than
help). For a 4-physical-core box:
OMP_NUM_THREADS=4 php hello.php
Experimentally find the sweet spot for your CPU.
Measuring before tuning
A useful pattern: log latency per call and look for the actual bottleneck before reaching for any of these knobs.
$start = hrtime(as_number: true);
$r = $model->chat($prompt, maxTokens: 512);
$elapsed_ms = (hrtime(true) - $start) / 1_000_000;
error_log(sprintf(
'chat: %.0fms, %d tokens, %.1f tok/s, finish=%s',
$elapsed_ms,
$r->tokensGenerated(),
$r->tokensGenerated() / ($elapsed_ms / 1000),
$r->finishReason(),
));
If tokens/sec is low (< 20 on a modern CPU), you’re hardware-bound —
quantize down or enable GPU offload. If it’s reasonable (50+) but
total time is high, you’re generating too many tokens — reduce
maxTokens or compress the prompt.
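Those two rules of thumb can be encoded so that log lines carry a first-pass diagnosis. A sketch using the thresholds from the text; `diagnose()` is a hypothetical helper, and the labels (including `inconclusive` for the band between 20 and 50 tok/s) are this sketch's own.

```php
<?php
declare(strict_types=1);

/**
 * Classify a single chat() measurement:
 * below 20 tok/s is treated as hardware-bound, 50+ as token-count-bound.
 */
function diagnose(float $elapsedMs, int $tokens): string
{
    if ($elapsedMs <= 0.0 || $tokens <= 0) {
        return 'no-data';
    }
    $tokPerSec = $tokens / ($elapsedMs / 1000);
    if ($tokPerSec < 20.0) {
        return 'hardware-bound';
    }
    if ($tokPerSec >= 50.0) {
        return 'token-count-bound';
    }
    return 'inconclusive';
}

echo diagnose(10_000.0, 120), PHP_EOL; // hardware-bound (12 tok/s)
echo diagnose(8_000.0, 480), PHP_EOL;  // token-count-bound (60 tok/s)
```

The elapsed time here includes prefill, so long prompts with short answers will read as slower than the generation rate alone.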
Future work
Two performance items on the roadmap that aren’t shipping in v0.1 but would change the picture significantly:
- Reusable session contexts: KV-cache reuse across chat() calls. Multi-turn conversations would skip the prefill cost on every turn after the first.
- Continuous batching: process N prompts together so the GPU stays saturated. Necessary for any serious inference-as-a-service workload.
Tracked in PLAN.md.
Next
- Apple Metal — usually the largest single improvement on macOS.
- Choosing a model — the model you pick caps everything else.
- Worker pools recipe — when per-call tuning isn’t enough.
Building from source
The development build is what make build produces — a debug-mode
shared library you can load via -d extension=…. The release build
is what ships in PIE tarballs.
Prerequisites
- PHP 8.3+ with php-config on PATH.
- Rust, installed via rustup. The repo pins the toolchain via rust-toolchain.toml; rustup will fetch it on first build.
- cmake 3.18+: llama.cpp’s build system.
- A C/C++ compiler: Clang (macOS / Linux) or GCC. The build script honors CC / CXX if you need to override.
- libclang (Linux only): apt install libclang-dev or distro equivalent. Used by bindgen for the PHP header parse.
- cargo-php: cargo install cargo-php once.
Verify everything:
php --version
php-config --version
rustup --version
cmake --version
cargo php --version
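If you would rather see every missing tool at once instead of stopping at the first failure, a small loop over command -v does it. This is a sketch with a hypothetical function name (check_tools); it only checks that each binary is on PATH, not that its version is new enough.

```shell
# Hypothetical helper: report which of the given tools are on PATH.
# Returns non-zero if any are missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
# check_tools php php-config rustup cmake cargo-php
```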
Cloning
git clone https://github.com/DisplaceTech/ext-infer
cd ext-infer
The repo includes a models/ directory (gitignored) where you can
drop GGUFs for testing. The PHPT suite and examples both default to
models/Qwen3-0.6B-Q8_0.gguf.
Debug build
make build
# -> target/debug/libinfer.{so,dylib}
Debug builds compile faster but run slower. Use them for iterative
development; switch to make release when you’re benchmarking or
shipping.
A cold make build takes a few minutes because cargo compiles
llama-cpp-sys-2 from source (it vendors all of llama.cpp). Cached
incremental rebuilds are sub-minute on a modern laptop.
Release build
make release
# -> target/release/libinfer.{so,dylib}
Use this for installing system-wide via make install, for the
performance numbers you’d quote in benchmarks, and for any
“production-like” testing.
Optional features
| Feature | Effect | When to use |
|---|---|---|
| metal | Enables Apple Metal GPU offload on macOS-arm64. | When you have an Apple Silicon Mac and want GPU acceleration. See Apple Metal. |
make release FEATURES=metal
Loading your build into PHP
Two options.
Without installing
Pass -d extension=… on every PHP invocation:
php -d extension=$PWD/target/debug/libinfer.dylib your-script.php
Substitute .so on Linux. This is what every script in
examples/
assumes — you can drop the flag once you make install.
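To avoid editing the suffix by hand when moving between macOS and Linux, you can derive it from uname. A minimal sketch, assuming you run it from the repo root; infer_lib_path is a hypothetical name, not a Makefile target.

```shell
# Hypothetical helper: print the path to the built extension for the
# current OS. $1 is the cargo profile directory (debug or release).
infer_lib_path() {
    profile=${1:-debug}
    case "$(uname -s)" in
        Darwin) ext=dylib ;;
        *)      ext=so ;;
    esac
    echo "$PWD/target/$profile/libinfer.$ext"
}
# php -d extension="$(infer_lib_path debug)" your-script.php
```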
Installing system-wide
make install runs cargo php install --release, which:
- Builds release-mode if it hasn’t already.
- Drops the binary into PHP’s extension_dir.
- Adds extension=infer.so (or .dylib) to a config file in php.ini’s scan directory.
make install
php -m | grep infer
# infer
To revert:
make uninstall
Editor / IDE setup
Rust analyzer
The Rust code lives in src/. Pointing rust-analyzer at
Cargo.toml (default) Just Works.
PHP autocomplete
Use the hand-authored stubs at stubs/infer.stubs.php:
composer.json autoload config:
{
    "autoload-dev": {
        "files": ["stubs/infer.stubs.php"]
    }
}
Or symlink it into your project. The stubs include full PHPDoc on every method so hovering in your IDE shows the option semantics without flipping to the docs.
Regenerating stubs (rare)
Stubs are hand-authored today because we want richer docblocks than
cargo php stubs emits. To regenerate from scratch (e.g. to confirm
the stub signatures match what’s actually registered):
make stubs
git diff stubs/infer.stubs.php
Reconcile the generated output with the hand-authored version manually.
Troubleshooting common build failures
| Error | Likely fix |
|---|---|
| linker 'cc' not found / cc: command not found | Install Xcode CLT (xcode-select --install) or build-essential (Ubuntu). |
| cmake: command not found | brew install cmake or apt install cmake. |
| libclang.so: cannot open shared object | apt install libclang-dev (Linux). On macOS, libclang comes with the CLT. |
| php-config: command not found | Install PHP CLI; on macOS via Homebrew use brew link php@<version> --force. |
| cargo install cargo-php fails | Check your Rust version. rustup update may help. |
| undefined symbol: _spl_ce_RuntimeException | The dynamic-lookup link flag didn’t apply. Check build.rs ran; usually a stale target/, so cargo clean and rebuild. |
Testing
ext-infer has two test layers:
- PHPT — integration tests that exercise the extension from PHP. This is where the real correctness coverage lives.
- Rust unit tests — for pure-Rust helpers (currently none; see Why no Rust unit tests? below).
Plus formatting and clippy. CI runs all of the above on every push.
Running PHPT locally
The test harness lives in tests/phpt/.
make test runs the full suite against a debug build:
make test
What that command actually does:
- Build (cargo build).
- Sanity-load: confirm the extension actually loaded into PHP.
- Fetch run-tests.php from PHP-src matching the current minor (if not already cached).
- Run php run-tests.php -q --show-diff tests/phpt/ with TEST_PHP_EXECUTABLE and TEST_PHP_ARGS set so the freshly built .so/.dylib is loaded.
Tests gated on a real model use the INFER_TEST_MODEL environment
variable:
INFER_TEST_MODEL=$PWD/models/Qwen3-0.6B-Q8_0.gguf make test
Without the variable, model-gated tests skip cleanly. CI runs in this
“no model” mode by default; setting INFER_TEST_MODEL runs the full
suite.
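Before a long run it can help to confirm which mode you are about to get. The sketch below mirrors the check the model-gated tests make in their --SKIPIF-- sections; model_gate is a hypothetical name and is not part of the Makefile.

```shell
# Hypothetical pre-check: will model-gated tests run or skip?
# Mirrors the SKIPIF logic: the variable must point at an existing file.
model_gate() {
    if [ -n "${INFER_TEST_MODEL:-}" ] && [ -f "$INFER_TEST_MODEL" ]; then
        echo "run: model-gated tests will use $INFER_TEST_MODEL"
    else
        echo "skip: INFER_TEST_MODEL is not set to an existing file"
        return 1
    fi
}
# model_gate && make test
```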
Writing a PHPT test
Files in tests/phpt/ follow the standard PHPT format:
--TEST--
Model::chat() returns a Response with the model's answer
--SKIPIF--
<?php
if (!extension_loaded('infer')) {
    echo 'skip ext-infer not loaded';
    exit;
}
$path = getenv('INFER_TEST_MODEL');
if (!$path || !is_file($path)) {
    echo 'skip INFER_TEST_MODEL not set to an existing GGUF file';
}
?>
--FILE--
<?php
$model = \Displace\Infer\Model::load(getenv('INFER_TEST_MODEL'));
$r = $model->chat(\Displace\Infer\Prompt::user('hi'), maxTokens: 32);
echo $r->finishReason() === 'eos' || $r->finishReason() === 'length' ? "ok\n" : "bad\n";
$model->close();
?>
--EXPECT--
ok
Filename convention: NNN-short-description.phpt. NNN ordering is
loose — it determines the order run-tests.php runs them in, which
doesn’t really matter.
Three sections every model-gated test needs:
- --SKIPIF--: skip if extension_loaded('infer') is false (the harness invocation always passes -d extension=…, so this catches setup mistakes) and skip if INFER_TEST_MODEL is unset.
- --FILE--: the actual PHP under test.
- --EXPECT-- or --EXPECTF--: expected output. Use --EXPECTF-- if you need wildcards (%s, %d).
For tests that DON’T need a model, drop the INFER_TEST_MODEL check
from --SKIPIF--. They’ll run in CI’s no-model leg.
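A no-model test can be scaffolded from the shell. This is a sketch only: new_phpt is a hypothetical helper, and the generated test (which just asserts that the extension is loaded) is a placeholder to edit, not a test from the repo.

```shell
# Hypothetical scaffold: write a no-model PHPT file that follows the
# NNN-short-description.phpt convention and print its path.
# $1 = NNN prefix, $2 = short description, $3 = target directory.
new_phpt() {
    num=$1
    slug=$2
    dir=${3:-tests/phpt}
    file="$dir/$num-$slug.phpt"
    cat > "$file" <<'EOF'
--TEST--
Extension is loaded (placeholder title, edit me)
--SKIPIF--
<?php
if (!extension_loaded('infer')) {
    echo 'skip ext-infer not loaded';
}
?>
--FILE--
<?php
var_dump(extension_loaded('infer'));
?>
--EXPECT--
bool(true)
EOF
    echo "$file"
}
# new_phpt 010 extension-loaded
```

The --SKIPIF-- section keeps only the extension check, so the generated file runs in the no-model leg.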
Running Rust unit tests
cargo test --lib
…would be the command, but see the next section.
Why no Rust unit tests?
Earlier versions had Rust unit tests in src/response.rs and
src/embedding.rs covering pure-Rust helpers. They were dropped
because cargo test --lib builds an executable that statically links
the crate, which pulls in references to the ext-php-rs runtime
symbols (zend_throw_exception, _emalloc, …) — symbols only
resolved when loaded into a real PHP host. On a clean checkout,
cargo test --lib fails to link.
PHPT covers the same correctness ground end-to-end, so this is a net
win for CI simplicity. If a pure-Rust helper grows complex enough to
warrant unit tests in isolation, the path forward is to factor it
into a sibling crate that has no ext-php-rs dependency.
Linting
make fmt-check # cargo fmt --all --check
make clippy # cargo clippy --all-targets -- -D warnings
CI runs both with -D warnings. Local lints are pinned to the
same Rust toolchain as the build (via rust-toolchain.toml).
CI structure
.github/workflows/ci.yml
runs on every push and PR:
- rustfmt + clippy on ubuntu-latest with PHP 8.4. Fast (~1 minute warm-cache).
- Test matrix, 6 legs: {ubuntu-latest, macos-14} × {8.3, 8.4, 8.5}. Each builds the extension, loads it, runs the no-model PHPT legs. Cache is scoped per-PHP-minor (see the comment in ci.yml about why this matters for ext-php-rs binding regeneration).
What CI does not do:
- Run model-gated PHPT tests. Adding a fixture model to CI is on the roadmap; for now, run them locally before tagging.
- Exercise ZTS PHP. See Threading & ZTS.
Pre-flight checklist
Before opening a PR, the maintainers run:
cargo fmt --all --check # no diff
cargo clippy --all-targets -- -D warnings # clean
INFER_TEST_MODEL=$PWD/models/Qwen3-0.6B-Q8_0.gguf make test # all green
If any of those fail, the PR will fail CI for the same reason — fix locally first.
Next
- Releasing — what runs in the release workflow (a different beast than CI).
- Building from source: getting to the point where make test can even run.
Releasing
The full cut-a-release process lives in
RELEASE.md
at the repo root. This page is the one-screen version with pointers
back into that document.
The five-step shape
# 1. Bump versions
edit Cargo.toml # [package].version = "0.1.0"
# 2. Verify locally
cargo fmt --all --check
cargo clippy --all-targets -- -D warnings
INFER_TEST_MODEL=$PWD/models/Qwen3-0.6B-Q8_0.gguf make test
composer validate composer.json
# 3. Land the bump
git commit -am "chore(release): v0.1.0"
git push
# 4. Tag — this is what triggers the release workflow
git tag v0.1.0
git push --tags
# 5. Edit and publish the draft Release on GitHub
Step 4 is the only user-facing action. The release workflow takes it from there.
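A tag that disagrees with Cargo.toml is an easy mistake between steps 1 and 4. The sketch below is one way to guard against it before pushing the tag. Both function names are hypothetical, and it assumes the version sits on its own line in the [package] section.

```shell
# Hypothetical helper: print the version from the [package] section of a
# Cargo.toml (defaults to ./Cargo.toml).
cargo_version() {
    awk '
        /^\[/ { in_pkg = ($0 == "[package]") }
        in_pkg && /^version[ \t]*=/ { split($0, parts, "\""); print parts[2]; exit }
    ' "${1:-Cargo.toml}"
}

# Succeed only when the tag (for example v0.1.0) matches Cargo.toml.
tag_matches_cargo() {
    [ "$1" = "v$(cargo_version "${2:-Cargo.toml}")" ]
}
# tag_matches_cargo v0.1.0 && git tag v0.1.0
```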
What the workflow does
For each (PHP minor, OS, arch) in the 9-leg matrix:
- Install system deps (cmake, build-essential, …).
- Install the matrix PHP via shivammathur/setup-php@v2.
- cargo build --release.
- Stage infer.so/infer.dylib in the right shape.
- Tarball as php_infer-{version}_php{minor}-{arch}-{os}[-{libc}].tar.gz per PIE’s filename convention.
- Compute a .sha256 sidecar.
- Upload both to a draft GitHub Release.
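The naming and checksum steps are simple enough to sketch in a few lines of shell. This is an illustration, not the workflow's actual code: the function names are hypothetical, the arch and os tokens in the usage comment are placeholders, and naming the sidecar as the tarball name plus .sha256 is an assumption.

```shell
# Hypothetical helper: compose a tarball name from the template above.
# $5 (libc) is optional, matching the [-{libc}] part of the pattern.
tarball_name() {
    version=$1
    minor=$2
    arch=$3
    os=$4
    libc=${5:-}
    name="php_infer-${version}_php${minor}-${arch}-${os}"
    if [ -n "$libc" ]; then
        name="$name-$libc"
    fi
    echo "$name.tar.gz"
}

# Hypothetical helper: write <file>.sha256 next to the file, using
# whichever checksum tool the host provides.
write_sha256() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" > "$1.sha256"
    else
        shasum -a 256 "$1" > "$1.sha256"
    fi
}
# write_sha256 "$(tarball_name 0.1.0 8.4 x86_64 linux glibc)"
```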
The first matrix leg creates the draft Release; later legs add files to the same one.
Why “draft”?
Releases ship draft so a maintainer can:
- Verify all 18 files (9 tarballs + 9 sidecars) are attached.
- Write release notes — the workflow doesn’t auto-generate them.
- Spot-check one tarball locally with pie install before exposing it to users.
After the manual review, hit Publish release in the GitHub UI.
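The file-count check can be scripted once the draft's assets are downloaded into a directory. A minimal sketch under stated assumptions: check_assets is a hypothetical helper, and it assumes each sidecar is named after its tarball with .sha256 appended.

```shell
# Hypothetical pre-publish check over a directory of downloaded assets.
# $1 = directory, $2 = expected number of tarballs (9 for this matrix).
check_assets() {
    dir=$1
    expected=${2:-9}
    count=0
    for tarball in "$dir"/*.tar.gz; do
        [ -e "$tarball" ] || continue   # the glob matched nothing
        count=$((count + 1))
        if [ ! -f "$tarball.sha256" ]; then
            echo "missing sidecar: $tarball.sha256"
            return 1
        fi
    done
    if [ "$count" -ne "$expected" ]; then
        echo "expected $expected tarballs, found $count"
        return 1
    fi
    echo "ok: $count tarballs, $count sidecars"
}
# check_assets ./release-assets 9
```

This only confirms that the files are present. It does not verify the checksums or replace the pie install spot-check.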
Versioning policy
Pre-1.0 (0.x.y), breaking changes happen between minors (0.1 →
0.2), not patches. Once v1.0.0 ships, the class / method / argument
surface is frozen.
composer.json does NOT carry a version key — that would conflict
with the tag-derived version Composer infers. The branch-alias under
extra exists only so dev-main resolves to 0.1.x-dev for users
pinning a dev branch.
What RELEASE.md covers in more detail
- Pre-flight checklist (the verify-locally step expanded).
- Release-notes template.
- Post-publish smoke test (install via PIE, run hello-world).
- Hotfix / patch process.
- Yanking a broken release.
- Caveats (Windows excluded, ZTS untested, etc.).
- Symptom → first-thing-to-check table for release failures.
If you’re cutting a release, read RELEASE.md first. This page is
the index, not the manual.
Caveats
Three things v0.1 explicitly doesn’t ship and that you should know about before cutting one:
- No Windows binaries. os-families-exclude: ["windows"] in composer.json makes PIE skip Windows hosts cleanly.
- No ZTS binaries. The composer.json declares support-zts: true because the code is thread-safe by construction, but the release matrix doesn’t include a ZTS runner. ZTS users need to build from source for now.
- No musl Linux binaries. The release matrix is glibc only. Musl users build from source; the .cargo/config.toml carries the needed crt-static opt-out.
All three are tracked in
PLAN.md.
Next
- RELEASE.md: the full process document.
- PLAN.md: what’s in flight after v0.1.