
Built-in slots

hal0 always ships five slots out of the box. They live in BUILTIN_SLOTS (src/hal0/slots/manager.py) and cannot be deleted from the dashboard. You can swap their model, unload them, or leave them offline, but the slot itself is always present.

| Slot | What it serves | Default backend |
|---|---|---|
| primary | Chat and general LLM (/v1/chat/completions, /v1/completions) | llama.cpp (Vulkan) |
| embed | Embeddings (/v1/embeddings) and rerank (/v1/rerankings) | llama.cpp (Vulkan) |
| stt | Speech-to-text (/v1/audio/transcriptions) | Moonshine |
| tts | Text-to-speech (/v1/audio/speech) | Kokoro |
| img | Image generation (/v1/images/generations) | ComfyUI (ROCm) |
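
The table can also be read as a lookup from endpoint to slot. The following is a minimal sketch of that reading, assuming a plain dictionary; the names `ENDPOINT_TO_SLOT` and `slot_for_endpoint` are illustrative and are not taken from BUILTIN_SLOTS in src/hal0/slots/manager.py.

```python
# Illustrative only: mirrors the table of built-in slots, not the real
# BUILTIN_SLOTS structure (whose shape is not shown in this page).
ENDPOINT_TO_SLOT = {
    "/v1/chat/completions": "primary",
    "/v1/completions": "primary",
    "/v1/embeddings": "embed",
    "/v1/rerankings": "embed",
    "/v1/audio/transcriptions": "stt",
    "/v1/audio/speech": "tts",
    "/v1/images/generations": "img",
}


def slot_for_endpoint(path: str) -> str:
    """Return the built-in slot that serves an OpenAI-style endpoint path."""
    try:
        return ENDPOINT_TO_SLOT[path]
    except KeyError:
        raise ValueError(f"no built-in slot serves {path!r}") from None
```

For example, `slot_for_endpoint("/v1/audio/speech")` returns `"tts"`, and an unknown path raises `ValueError`.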

They map directly to the modalities OpenAI exposes through /v1/*. Any client written against the OpenAI SDK can hit hal0 unmodified and reach chat, embeddings, transcription, speech, and image generation. Rerank piggybacks on the embed slot because it uses the same backend process.

The dashboard groups these slots into capability cards so an operator picks “embed” or “voice” without thinking about systemd templates. The bridge is fixed in src/hal0/capabilities/orchestrator.py:

  • embed → embed + rerank (auto-managed as the embed-rerank slot)
  • voice → stt + tts
  • img → img
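
The capability grouping in the list above can be pictured as a small mapping. This is a sketch under assumptions: `CAPABILITY_SLOTS` and `capability_for` are hypothetical names, and the real structure in src/hal0/capabilities/orchestrator.py may look different.

```python
# Hypothetical representation of the capability-card grouping described above.
CAPABILITY_SLOTS = {
    "embed": ("embed", "rerank"),
    "voice": ("stt", "tts"),
    "img": ("img",),
}


def capability_for(child: str) -> str:
    """Return the capability card that owns a given child (e.g. 'tts' -> 'voice')."""
    for capability, children in CAPABILITY_SLOTS.items():
        if child in children:
            return capability
    raise KeyError(f"{child!r} does not belong to any capability card")
```

A reverse lookup like this is what lets an operator think in terms of "voice" while the system manages the stt and tts slots underneath.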

On first enable of the rerank child, the capability orchestrator synthesises an embed-rerank slot TOML, picks a free port in the slot range (avoiding 8081, which primary owns; the deployed default is 8086), and sets defaults.extra_args = "--reranking" so llama-server exposes /v1/rerankings instead of the chat surface. You don’t have to author the file yourself. The reranker default is bge-reranker-v2-m3-q4_k_m.
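
The port selection and TOML synthesis might look roughly like the sketch below. Only the port range, the `defaults.extra_args = "--reranking"` setting, the `[model]` default entry, and the reranker model name come from this page; the function names and the `name` and `port` keys are assumptions made for illustration, not the orchestrator's actual code or schema.

```python
# Slot port range 8081..8099 inclusive, as described on this page.
SLOT_PORTS = range(8081, 8100)


def pick_free_port(used: set) -> int:
    """Return the lowest port in the slot range that is not already taken."""
    for port in SLOT_PORTS:
        if port not in used:
            return port
    raise RuntimeError("no free port left in the slot range")


def render_rerank_toml(port: int) -> str:
    """Render a hypothetical embed-rerank slot TOML.

    The 'name' and 'port' keys are assumed for illustration; the real
    schema may differ.
    """
    return (
        'name = "embed-rerank"\n'
        f"port = {port}\n"
        "\n"
        "[model]\n"
        'default = "bge-reranker-v2-m3-q4_k_m"\n'
        "\n"
        "[defaults]\n"
        'extra_args = "--reranking"\n'
    )
```

With `used` containing 8081 (primary) and the other ports already assigned on a given install, `pick_free_port` returns the first gap, which is how a deployment can end up on 8086.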

Figure: the embed capability card, showing the EMBED and RERANK rows, model picker, backend selector, and live metrics.

The NPU backend card rolls up every NPU-capable model across the chat and embed slots in one disclosure, and is only advertised when AMD XDNA hardware is present and the FLM toolbox image is locally available.

All five slots bind to 127.0.0.1 on a port in the slot range (8081–8099). Only the API (:8080) and OpenWebUI (:3001) bind public interfaces, which makes the whole thing trivial to put behind a Traefik or Caddy vhost on your homelab gateway. Clients should always talk to the API, never to a slot directly; the API does authentication, single-flight, and structured-error wrapping.
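
The binding rule can be expressed as a small check. This is a sketch, not hal0 code: `is_valid_slot_bind` is a hypothetical helper, and it accepts any loopback address, which is slightly looser than the literal 127.0.0.1 that slots use.

```python
import ipaddress

# Slot port range as described on this page.
SLOT_PORT_MIN, SLOT_PORT_MAX = 8081, 8099


def is_valid_slot_bind(host: str, port: int) -> bool:
    """True if host is a loopback address and port lies in the slot range."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal (e.g. a hostname); treat it as invalid here.
        return False
    return addr.is_loopback and SLOT_PORT_MIN <= port <= SLOT_PORT_MAX
```

Under this check, `("127.0.0.1", 8086)` passes, while `("0.0.0.0", 8086)` and `("127.0.0.1", 8080)` do not: the first would expose a slot publicly, and the second is the API port, not a slot port.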

You address a slot by its name in the OpenAI model field:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "primary",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

The dispatcher resolves "primary" to whichever model is currently loaded in the primary slot. See Slot as model for the full convention.
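
The same request can be built from Python using only the standard library. This is a minimal sketch: `build_chat_request` and `send` are illustrative helper names, and the example omits any authentication headers your deployment may require.

```python
import json
import urllib.request

API_BASE = "http://localhost:8080"  # hal0 API address from the curl example


def build_chat_request(slot: str, prompt: str) -> urllib.request.Request:
    """Build a chat completion request that addresses a slot by name."""
    body = {
        "model": slot,  # the slot name goes in the OpenAI "model" field
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{API_BASE}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (needs a running API)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `send(build_chat_request("primary", "Hello!"))` against a running instance should be equivalent to the curl command. Building the request separately from sending it keeps the payload easy to inspect.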

Every slot has a default model picked at install time by the hardware probe. You can swap it at any time:

hal0 slot swap primary --model qwen3-30b-a3b-instruct-2507-q4_k_m

The slot transitions through unloading → warming → ready without dropping the API socket. In-flight requests on other slots keep flowing.
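
The swap lifecycle can be modelled as a tiny state machine. In this sketch only the three state names come from this page; the `SlotState` enum, the `NEXT_STATE` table, and `swap_path` are hypothetical, and the real slot manager may track additional states such as offline.

```python
from enum import Enum


class SlotState(Enum):
    UNLOADING = "unloading"
    WARMING = "warming"
    READY = "ready"


# Allowed forward transitions during a swap; READY is terminal here.
NEXT_STATE = {
    SlotState.UNLOADING: SlotState.WARMING,
    SlotState.WARMING: SlotState.READY,
}


def swap_path(start: SlotState = SlotState.UNLOADING) -> list:
    """Return the sequence of states a slot passes through from `start`."""
    path = [start]
    while path[-1] in NEXT_STATE:
        path.append(NEXT_STATE[path[-1]])
    return path
```

Each slot would own its own state, which is consistent with requests on other slots continuing to flow while one slot is mid-swap.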

The [model] default entry in primary.toml is the install-time seed only — it is not consulted at runtime. Swap writes an env override; the TOML stays stale by design. Don’t try to change the live model by editing the TOML.

Beyond the five built-ins, you can add custom slots: for example, a second chat model held hot in primary-fast, or an npu slot for the FLM provider on AMD XDNA hardware. See Custom slots.