Multi-turn chat

The pattern: keep the system message stable, append user and assistant turns as the conversation grows, and regenerate the prompt on each user input. The code is lifted directly from examples/chat-interactive/.

The shape

use Displace\Infer\Model;
use Displace\Infer\Prompt;

$model = Model::load('models/Qwen3-0.6B-Q8_0.gguf');

$base         = Prompt::system('You are a helpful, concise assistant.');
$conversation = $base;

while (($line = readline('> ')) !== false) {
    $line = trim($line);
    if ($line === '' || $line === '/exit') {
        break;
    }

    // /reset is trivial because Prompt is immutable.
    if ($line === '/reset') {
        $conversation = $base;
        continue;
    }

    $conversation = $conversation->withUser($line);

    $response = $model->chat(
        $conversation,
        maxTokens: 512,
        temperature: 0.7,
    );

    echo $response->answer(), PHP_EOL;

    // Feed answer() back, NOT text(). See point 3 below.
    $conversation = $conversation->withAssistant($response->answer());
}

$model->close();

1. The system prompt is built once

$base is built once and never mutated. Every /reset re-seats the conversation at the original system instruction without re-allocating or re-rendering. If another part of your app builds a different system prompt mid-conversation, it gets a new Prompt of its own, so anything still holding $base is unaffected.

2. Conversation grows by immutable append

$conversation = $conversation->withUser($line);

Every with* returns a new Prompt. The old $conversation is still valid (and still has the previous turn count); the local just points at the new one. There’s no shared mutable state, so this code is safe to put behind a queue worker or run in parallel.
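
To see why the old value stays valid, here is a minimal stand-in for an immutable builder. SketchPrompt is a hypothetical class written for this illustration; it is not ext-infer's Prompt, only the same pattern of with* methods returning a copy.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for illustration only; not the real Prompt class.
final class SketchPrompt
{
    /** @param list<array{role: string, content: string}> $messages */
    private function __construct(private array $messages) {}

    public static function system(string $text): self
    {
        return new self([['role' => 'system', 'content' => $text]]);
    }

    public function withUser(string $text): self
    {
        return $this->with('user', $text);
    }

    public function withAssistant(string $text): self
    {
        return $this->with('assistant', $text);
    }

    public function count(): int
    {
        return count($this->messages);
    }

    // Copy-on-append: PHP arrays have value semantics, so $copy is independent.
    private function with(string $role, string $text): self
    {
        $copy   = $this->messages;
        $copy[] = ['role' => $role, 'content' => $text];
        return new self($copy);
    }
}

$base = SketchPrompt::system('You are a helpful, concise assistant.');
$next = $base->withUser('Hello');

echo $base->count(), ' ', $next->count(), PHP_EOL; // prints "1 2"
```

Because $base still holds one message after the append, resetting is just reassigning the local to it.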

3. `Response::answer()` goes back, not `text()`

$conversation = $conversation->withAssistant($response->answer());

This matters for reasoning models. answer() is the reply with <think>...</think> blocks stripped; text() is the raw output including the thoughts. Feeding text() back means the model sees its own internal monologue on the next turn — and reasoning models tend to treat that as instruction, not history. The output derails fast.

For non-reasoning models, answer() and text() are byte-identical, so the rule is “use answer() always” rather than “use answer() for some models”.
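
As a rough illustration of the difference, the sketch below strips <think> blocks from a raw string with a regular expression. stripThinkBlocks() is a hypothetical helper, and this is only an assumption about what answer() conceptually returns, not a description of how ext-infer implements it.

```php
<?php
declare(strict_types=1);

// Hypothetical helper for illustration; ext-infer's own parsing may differ.
function stripThinkBlocks(string $raw): string
{
    // /s lets "." span newlines; the lazy ".*?" stops at the first closing tag.
    $stripped = preg_replace('/<think>.*?<\/think>/s', '', $raw);

    // preg_replace() returns null on a PCRE error; fall back to the raw text.
    return trim($stripped ?? $raw);
}

$raw = "<think>The user greeted me. Keep it short.</think>\nHello! How can I help?";

echo stripThinkBlocks($raw), PHP_EOL;                 // prints "Hello! How can I help?"
echo stripThinkBlocks('No reasoning here.'), PHP_EOL; // prints "No reasoning here."
```

In application code, prefer the library's answer() over a hand-rolled filter like this one.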

Persisting conversations

If you need to save and restore conversations (e.g. per-user chat history in a database), serialize the message list and rebuild the Prompt:

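function loadConversation(string $system, array $history): Prompt
{
    $p = Prompt::system($system);
    foreach ($history as $row) {
        $p = match ($row['role']) {
            'user'      => $p->withUser($row['content']),
            'assistant' => $p->withAssistant($row['content']),
            // The system text is passed in separately, so skip a stored copy
            // (relevant if messages() also yields the system message).
            'system'    => $p,
            // Fail with a clear message instead of an UnhandledMatchError.
            default     => throw new \InvalidArgumentException("Unknown role: {$row['role']}"),
        };
    }
    return $p;
}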

function saveConversation(Prompt $p): array
{
    $rows = [];
    foreach ($p->messages() as $msg) {
        $rows[] = ['role' => $msg->role(), 'content' => $msg->content()];
    }
    return $rows;
}

Prompt::messages() walks in chronological order, so saving and re-loading round-trips faithfully.
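
If the rows end up in a single text column, one option is to JSON-encode them. This is a minimal sketch, assuming rows shaped like the ones saveConversation() returns; encodeHistory() and decodeHistory() are hypothetical helper names, and the storage layer itself is left out.

```php
<?php
declare(strict_types=1);

/** @param list<array{role: string, content: string}> $rows */
function encodeHistory(array $rows): string
{
    // JSON_THROW_ON_ERROR turns encoding failures (e.g. invalid UTF-8) into exceptions.
    return json_encode($rows, JSON_THROW_ON_ERROR | JSON_UNESCAPED_UNICODE);
}

/** @return list<array{role: string, content: string}> */
function decodeHistory(string $json): array
{
    $rows = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
    if (!is_array($rows)) {
        throw new UnexpectedValueException('History must be a JSON array.');
    }
    foreach ($rows as $row) {
        // Reject rows that would break the rebuild step later.
        if (!isset($row['role'], $row['content'])) {
            throw new UnexpectedValueException('Malformed history row.');
        }
    }
    return $rows;
}

$rows = [
    ['role' => 'user', 'content' => 'Hi'],
    ['role' => 'assistant', 'content' => 'Hello!'],
];

var_dump(decodeHistory(encodeHistory($rows)) === $rows); // bool(true)
```

Validating on decode keeps a corrupted row from surfacing later as a confusing error inside the rebuild loop.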

Common shape: an HTTP turn

For a request/response API where every HTTP call is one turn:

// Inside your controller — assumes $model is injected and reused.
final class ChatController
{
    public function __construct(private Model $model, private HistoryStore $history) {}

    public function turn(Request $req): Response
    {
        $conversationId = $req->session('conversation_id');
        $history        = $this->history->load($conversationId);
        $system         = $req->user()->systemPrompt() ?? 'You are helpful.';
        $message        = $req->json('message'); // read the user message once

        $prompt = loadConversation($system, $history)->withUser($message);

        $reply = $this->model->chat(
            $prompt,
            maxTokens: 1024,
            temperature: 0.5,
        );

        // Persist only after chat() succeeds, and store answer(), not text().
        $this->history->append($conversationId, 'user', $message);
        $this->history->append($conversationId, 'assistant', $reply->answer());

        return new JsonResponse([
            'answer'    => $reply->answer(),
            'reasoning' => $reply->reasoning(),
            'truncated' => $reply->finishReason() === 'length',
            'tokens'    => $reply->tokensGenerated(),
        ]);
    }
}

The $model is loaded once at FPM-worker boot, not per request; only chat() runs per request. Current ext-infer has no KV-cache reuse yet, so each turn re-tokenizes and re-decodes the full history, and per-turn latency grows with the length of the conversation. A Session object that reuses the underlying llama.cpp context is on the roadmap.
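
Until then, one way to bound the per-turn cost is to cap how much history is replayed. This is a sketch of one possible mitigation, not something the library provides; trimHistory() and its $maxRows parameter are hypothetical, and dropping old turns means the model loses that context.

```php
<?php
declare(strict_types=1);

/**
 * Keep at most $maxRows of the most recent rows, starting on a user turn.
 *
 * @param list<array{role: string, content: string}> $history
 * @return list<array{role: string, content: string}>
 */
function trimHistory(array $history, int $maxRows): array
{
    if ($maxRows <= 0) {
        return [];
    }

    // A negative offset counts from the end; short histories come back whole.
    $recent = array_slice($history, -$maxRows);

    // Drop leading non-user rows so the window does not open mid-exchange.
    while ($recent !== [] && $recent[0]['role'] !== 'user') {
        array_shift($recent);
    }

    return $recent;
}

$history = [
    ['role' => 'user', 'content' => 'u1'],
    ['role' => 'assistant', 'content' => 'a1'],
    ['role' => 'user', 'content' => 'u2'],
    ['role' => 'assistant', 'content' => 'a2'],
];

// A window of 3 would start on "a1", so that row is dropped and 2 remain.
echo count(trimHistory($history, 3)), PHP_EOL; // prints "2"
```

To use it, pass trimHistory($history, 20) to loadConversation() in place of $history; the cap of 20 is an arbitrary example value.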

When to use `Model::raw()` instead

If you have a very specific prompt shape — tool calls, RAG context injected at a non-standard slot, custom format — see Raw completions. The Prompt builder doesn’t support tool-call messages today, so tool-aware conversations need raw() until tool calling lands.
