Apple Metal
Metal is Apple’s low-level GPU API. On Apple Silicon hardware (M1 / M2 / M3 / M4), llama.cpp uses Metal to offload weight-matrix multiplications to the integrated GPU, which substantially outpaces the CPU for medium-to-large models.
ext-infer exposes Metal as an opt-in cargo feature. It is not
enabled by default — the default build is CPU-only and portable to
non-Apple platforms.
When Metal helps
Order-of-magnitude rule of thumb on an M-series Mac:
| Model size | CPU tokens/sec | Metal tokens/sec | Speedup |
|---|---|---|---|
| 0.6B | ~80 | ~120 | 1.5× |
| 3B | ~25 | ~70 | 2.8× |
| 7B | ~12 | ~50 | 4× |
| 13B+ | (memory-limited) | ~25 | dramatic |
Numbers are rough — they depend on quant level, M-series generation, prompt length, and what else the machine is doing. The pattern is clear though: Metal’s value grows with model size.
For small models on a fast CPU, Metal can actually be slower on the first few tokens because of the shader compilation overhead. If you’re running 600M-param models in batch mode, the CPU build is likely fine.
Enabling Metal
The cargo feature is named metal:
make release FEATURES=metal
make install FEATURES=metal
Or via raw cargo:
cargo build --release --features metal
The release binary is now Metal-enabled. No runtime flag — Metal is used automatically when the cargo feature is on.
Per-layer offload
The Model::load() option n_gpu_layers controls how many
transformer layers are offloaded to the GPU. Defaults to 0 (CPU
only); set to a high number (the model’s total layer count, or just
999 as a “all of them” shortcut) to offload everything:
$model = Model::load($path, [
'n_gpu_layers' => 999, // offload all layers to Metal
]);
For models that fit entirely in unified memory, full offload is almost always what you want. For models that don’t fit, partial offload lets you put the hot lower layers on the GPU and keep the upper layers on CPU. Tune empirically; the upstream llama.cpp Metal docs have more.
Why isn’t it the default?
Three reasons we ship CPU-by-default and Metal-by-opt-in for v0.1:
- The release matrix builds on the GitHub
macos-14runner. Its hardware revision andMACOSX_DEPLOYMENT_TARGETare not-fully-pinned — we haven’t validated that a Metal-enabled binary built there actually loads on every customer Mac. - CI doesn’t test Metal output for correctness. Different precision behavior on GPU vs CPU could surface as different sampler output, and we haven’t caught that drift end-to-end yet.
- Cold-start cost. Metal shader compilation adds ~1s to the first inference. Acceptable for long-running workers, awkward for a CLI tool people run once.
Making Metal the default for macos-arm64 release tarballs is on the roadmap once those three concerns are resolved.
Verifying Metal is actually being used
Enable EXT_INFER_LOG and look for
Metal-specific lines:
EXT_INFER_LOG=1 php hello.php 2>&1 | grep -i metal | head
You should see something like:
ggml_metal_init: GPU name: Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 48318.38 MiB
If you see no Metal lines at all, the cargo feature didn’t get
applied — re-check the make release FEATURES=metal invocation.
Memory considerations
Apple Silicon has unified memory — the GPU and CPU share the same physical RAM. There is no “host-to-device” copy step like on discrete GPUs. The trade-off is that GPU memory pressure shows up as overall system memory pressure: a 13B model in Metal mode uses ~8 GB of the same RAM your other apps need.
recommendedMaxWorkingSetSize in the log above is what macOS thinks
you should keep the GPU footprint under. Loading a model larger than
that will work — Metal pages weights in and out as needed — but
performance drops sharply.
Cross-platform note
#[cfg(feature = "metal")] only enables Metal on Apple targets.
Building with --features metal on Linux is harmless (the feature is
a no-op there), but there’s no reason to do it.
For GPU acceleration on non-Apple hardware (CUDA on NVIDIA, ROCm on
AMD, Vulkan as a portable option) — the llama-cpp-2 crate supports
all three, but ext-infer hasn’t surfaced them as cargo features yet.
If you want one, open an issue.
Next
- Performance tuning — once Metal is on, the
next bottleneck is usually
nCtxormaxTokens. - Choosing a model — Metal opens up larger models you might not have considered.