Reasoning models
Qwen3, DeepSeek R1, and other reasoning-tuned models think out loud
before answering. When invoked through their chat template, they emit
<think>…</think> blocks containing the internal monologue, then the
actual reply. ext-infer understands this convention and exposes the
two streams separately on Response.
The split, in three calls
use Displace\Infer\Model;
use Displace\Infer\Prompt;
$model = Model::load('models/Qwen3-0.6B-Q8_0.gguf');
$response = $model->chat(
    Prompt::user('What is 2+2?'),
    maxTokens: 512,
);
echo $response->text(), PHP_EOL;
// <think>
// Okay, the user is asking what 2+2 is. This is basic arithmetic.
// I should respond with the correct sum, which is 4. Let me also
// verify there's no trick here — adding two and two definitely
// equals four.
// </think>
//
// 2 + 2 equals 4.
echo $response->answer(), PHP_EOL;
// 2 + 2 equals 4.
echo $response->reasoning(), PHP_EOL;
// Okay, the user is asking what 2+2 is. This is basic arithmetic.
// I should respond with the correct sum, which is 4. ...
echo $response->hasReasoning() ? 'yes' : 'no', PHP_EOL;
// yes
$model->close();
For a non-reasoning model:
- reasoning() returns null
- answer() equals text() byte-for-byte
- hasReasoning() returns false
The split is always on: there’s no flag to disable it. If the model’s
output doesn’t contain <think>…</think> tags, nothing changes.
When the budget runs out mid-thought
Reasoning chains can be long. If maxTokens exhausts inside a
<think> block — before the closing </think> — the split fails
gracefully:
- text() contains the partial reasoning verbatim, with the open <think> tag and no closing tag.
- reasoning() returns any previous completed reasoning blocks, or null if none.
- answer() is the input with completed blocks removed and the partial thought left in place. The partial thought is intentionally left in answer(): silently swallowing it would hide the budget problem.
- finishReason() returns 'length'.
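The rules above can be sketched in plain PHP. This is only an illustration of the convention, not ext-infer's actual implementation: the function name splitThinking(), the trimming of whitespace, and the joining of multiple blocks with a blank line are all assumptions made for the example.

```php
<?php
/**
 * Hypothetical helper (not part of ext-infer): split raw model output
 * into an answer and its reasoning, following the rules described above.
 *
 * @return array{answer: string, reasoning: ?string}
 */
function splitThinking(string $text): array
{
    $blocks = [];

    // Remove every *completed* <think>...</think> block and keep its body.
    // The "s" flag lets "." match newlines; ".*?" keeps each match minimal.
    $stripped = preg_replace_callback(
        '~<think>(.*?)</think>~s',
        function (array $m) use (&$blocks): string {
            $blocks[] = trim($m[1]);
            return '';
        },
        $text
    ) ?? $text; // preg_* returns null on failure; fall back to the raw text

    return [
        // No completed block: the text is returned untouched, so an
        // unclosed <think> (truncated thought) stays visible in the answer.
        'answer'    => $blocks === [] ? $text : trim($stripped),
        'reasoning' => $blocks === [] ? null : implode("\n\n", $blocks),
    ];
}
```

Because an unclosed <think> never matches the pattern, a truncated thought survives in the answer, which mirrors the "don't hide the budget problem" behaviour described above.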
The fix is always “bump maxTokens”. A useful pattern is to surface
the truncation explicitly:
$response = $model->chat($prompt, maxTokens: 256);
if ($response->finishReason() === 'length') {
    error_log(sprintf(
        'truncated: model wanted more than 256 tokens for "%s..."',
        substr($prompt->messages()[0]->content(), 0, 40),
    ));
}
The interactive chat example uses a softer hint: “(truncated — bump --max-tokens to see more)”.
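If bumping the budget by hand gets tedious, one option is to retry with a larger one automatically. The sketch below is a hypothetical helper, not something ext-infer ships: chatWithGrowingBudget(), its defaults, and the doubling policy are assumptions. It takes a callable so it can wrap any generation call, for example fn (int $max) => $model->chat($prompt, maxTokens: $max).

```php
<?php
/**
 * Hypothetical helper: call $generate with a growing token budget until the
 * reply is no longer cut off, or until the ceiling is reached.
 *
 * $generate receives the budget and must return an object exposing
 * finishReason(), as Response does in the examples above.
 */
function chatWithGrowingBudget(callable $generate, int $start = 256, int $ceiling = 4096): object
{
    $budget = max(1, $start); // guard against a zero or negative start

    while (true) {
        $response = $generate($budget);

        // Stop when the model finished on its own, or when we may not grow further.
        if ($response->finishReason() !== 'length' || $budget >= $ceiling) {
            return $response;
        }

        $budget = min($budget * 2, $ceiling); // double, but never exceed the ceiling
    }
}
```

Each retry regenerates the reply from scratch, so this trades latency for completeness. A reply that is still truncated at the ceiling is returned as-is, and the caller should still check finishReason().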
When you DON’T want reasoning at all
Two strategies, depending on what “don’t want” means.
Strategy A — hide it in the UI, keep it under the hood
Default everywhere. Show $response->answer() to the end user. Log
$response->reasoning() for debugging or display behind a “show
thinking” toggle. No model-level change.
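A "show thinking" toggle can be as small as the sketch below. renderReply() and the [thinking] markers are hypothetical and exist only for illustration; the two string arguments stand in for $response->answer() and $response->reasoning().

```php
<?php
/**
 * Hypothetical view helper: build the text shown to the end user.
 *
 * $answer and $reasoning stand in for $response->answer() and
 * $response->reasoning(); $showThinking is the state of the UI toggle.
 */
function renderReply(string $answer, ?string $reasoning, bool $showThinking = false): string
{
    // null means the model produced no reasoning; '' means an empty block.
    // Either way there is nothing worth showing.
    if (!$showThinking || $reasoning === null || $reasoning === '') {
        return $answer;
    }

    return "[thinking]\n" . $reasoning . "\n[/thinking]\n\n" . $answer;
}
```

With the toggle off, the user sees only the answer, and the reasoning can still be sent to a log separately.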
Strategy B — tell the model to skip thinking
Qwen3 has a /no_think directive that, when included as a system-message
suffix, suppresses the monologue. The model still emits an empty
<think></think> block (which the split handles: reasoning() ends up
being an empty string), but the block has no content:
$prompt = Prompt::system('You are helpful. /no_think')
    ->withUser('What is 2+2?');
$response = $model->chat($prompt);
$response->hasReasoning(); // true (empty block)
$response->reasoning(); // "" (empty string)
$response->answer(); // "2 + 2 equals 4."
This is Qwen3-specific. DeepSeek R1 has a similar concept (/no-cot
in some prompts). Other reasoning models vary. Check the model card.
Feeding history back
When building multi-turn conversations against a reasoning model, feed
Response::answer() back as the assistant’s reply, not
Response::text():
$conversation = $conversation->withAssistant($response->answer());
//                                           ^^^^^^^^^^^^^^^^^^^
//                                           not ->text()
text() includes the <think>…</think> block. Adding it to the
conversation means the model sees its own reasoning on the next turn
and tends to treat it as instruction rather than history — output
quality drops fast.
This is the single most common mistake when wiring up reasoning models
in ext-infer. See Multi-turn chat for the
full pattern.
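When history is stored outside ext-infer (in a database or session, say), a defensive pass before replaying it can catch turns that were saved with text() by mistake. The sketch below assumes a plain array-of-turns representation with role and content keys, which is an assumption for illustration and not ext-infer's conversation type. stripReasoningFromHistory() is likewise hypothetical.

```php
<?php
/**
 * Hypothetical guard: strip completed <think>...</think> blocks from
 * assistant turns before a stored history is replayed to the model.
 *
 * @param list<array{role: string, content: string}> $history
 * @return list<array{role: string, content: string}>
 */
function stripReasoningFromHistory(array $history): array
{
    return array_map(function (array $turn): array {
        // Only assistant turns can carry model reasoning; user and system
        // text is left untouched even if it happens to contain the tags.
        if ($turn['role'] === 'assistant') {
            $clean = preg_replace('~<think>.*?</think>\s*~s', '', $turn['content']);
            $turn['content'] = trim($clean ?? $turn['content']);
        }
        return $turn;
    }, $history);
}
```

This is a safety net, not a replacement for storing answer() in the first place.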
Performance note
Reasoning models spend many tokens on their internal monologue. A typical Qwen3-0.6B answer to “what is 2+2?” generates ~150 tokens of thinking before the 5-token answer. That’s an order of magnitude more work than a non-reasoning model would do for the same question.
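To make the arithmetic concrete, the sketch below works through the figures quoted above. generationOverhead() is a hypothetical helper, and the throughput figure is invented purely for illustration, so measure your own hardware before drawing conclusions.

```php
<?php
/**
 * Hypothetical helper: how many times more tokens are generated in total
 * than the answer alone needs.
 */
function generationOverhead(int $thinkingTokens, int $answerTokens): float
{
    return ($thinkingTokens + $answerTokens) / $answerTokens;
}

// Figures from the paragraph above: ~150 thinking tokens, 5 answer tokens.
$thinking = 150;
$answer   = 5;

printf(
    "total tokens: %d (%.0fx the answer alone)\n",
    $thinking + $answer,
    generationOverhead($thinking, $answer)
);

// Assumed throughput, NOT a measurement of any real model or machine.
$tokensPerSecond = 50.0;
printf(
    "with thinking: %.1f s, without: %.1f s\n",
    ($thinking + $answer) / $tokensPerSecond,
    $answer / $tokensPerSecond
);
```

Whatever the real throughput is, the ratio stays the same, because generation time grows roughly in proportion to the number of tokens produced.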
If latency matters more than the highest-quality answer:
- Use /no_think (Strategy B above) to skip the monologue.
- Pick a non-reasoning model: Llama 3.x Instruct, Mistral Instruct, and Qwen 2.5 Instruct (not Qwen3) all chat without thinking out loud.
See Performance tuning for more knobs.