Built-in slots

hal0 always ships five slots out of the box. They live in BUILTIN_SLOTS (src/hal0/slots/manager.py) and cannot be deleted from the dashboard. You can swap their model, unload them, or leave them offline, but the slot itself is always present.

The five slots

Slot	What it serves	Default backend
`primary`	Chat and general LLM (`/v1/chat/completions`, `/v1/completions`)	llama.cpp (Vulkan)
`embed`	Embeddings (`/v1/embeddings`) and rerank (`/v1/rerankings`)	llama.cpp (Vulkan)
`stt`	Speech-to-text (`/v1/audio/transcriptions`)	Moonshine
`tts`	Text-to-speech (`/v1/audio/speech`)	Kokoro
`img`	Image generation (`/v1/images/generations`)	ComfyUI (ROCm)
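
The table can also be read as a lookup from request path to slot. The following is a minimal sketch in Python, assuming a plain dictionary. The real layout of BUILTIN_SLOTS in src/hal0/slots/manager.py is not shown on this page, and the names SLOT_ROUTES and slot_for_path are hypothetical.

```python
from typing import Optional

# Hypothetical lookup table mirroring the table above. This is NOT the
# actual BUILTIN_SLOTS structure from src/hal0/slots/manager.py.
SLOT_ROUTES = {
    "primary": ("/v1/chat/completions", "/v1/completions"),
    "embed": ("/v1/embeddings", "/v1/rerankings"),
    "stt": ("/v1/audio/transcriptions",),
    "tts": ("/v1/audio/speech",),
    "img": ("/v1/images/generations",),
}


def slot_for_path(path: str) -> Optional[str]:
    """Return the built-in slot that serves an OpenAI-style path, if any."""
    for slot, paths in SLOT_ROUTES.items():
        if path in paths:
            return slot
    return None


print(slot_for_path("/v1/audio/speech"))  # tts
```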

Why these five

They map directly to the modalities OpenAI exposes through /v1/*. Any client written against the OpenAI SDK can hit hal0 unmodified and reach chat, embeddings, transcription, speech, and image generation. Rerank piggybacks on the embed slot because it uses the same backend process.

Capability cards

The dashboard groups these slots into capability cards so an operator picks “embed” or “voice” without thinking about systemd templates. The mapping from cards to slots is hard-coded in src/hal0/capabilities/orchestrator.py:

embed → embed + rerank (auto-managed as the embed-rerank slot)
voice → stt + tts
img → img

The first time the rerank child is enabled, the capability orchestrator synthesises an embed-rerank slot TOML for you, so you don’t have to author it. It picks a free port in the slot range (skipping 8081, which primary owns; the deployed default is 8086) and sets defaults.extra_args = "--reranking" so that llama-server exposes /v1/rerankings instead of the chat surface. The default reranker model is bge-reranker-v2-m3-q4_k_m.
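
The port selection and slot synthesis described above might look roughly like the sketch below. It is an illustration only, not the code in src/hal0/capabilities/orchestrator.py: the function names, the "name" and "port" keys, and the idea of passing in a set of occupied ports are all assumptions. Only defaults.extra_args, the [model] default key, the port range, and the reranker model name come from this page.

```python
from __future__ import annotations

# Slot range 8081-8099 inclusive, as stated in the Addressing section.
SLOT_PORT_RANGE = range(8081, 8100)


def pick_free_port(ports_in_use: set[int]) -> int:
    """Return the lowest port in the slot range that is not already taken."""
    for port in SLOT_PORT_RANGE:
        if port not in ports_in_use:
            return port
    raise RuntimeError("no free port left in the slot range")


def synthesise_rerank_slot(ports_in_use: set[int]) -> dict:
    """Build a dict shaped like a slot TOML (key names partly hypothetical)."""
    return {
        "name": "embed-rerank",                        # hypothetical key
        "port": pick_free_port(ports_in_use),          # hypothetical key
        "model": {"default": "bge-reranker-v2-m3-q4_k_m"},
        "defaults": {"extra_args": "--reranking"},
    }


# Made-up occupancy: primary on 8081 plus four other slots.
occupied = {8081, 8082, 8083, 8084, 8085}
print(synthesise_rerank_slot(occupied)["port"])  # 8086
```

Because primary's port is in the occupied set, the sketch never hands out 8081. A real implementation would likely also check that the port is actually bindable.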

Figure: The embed capability card, showing the EMBED and RERANK rows, model picker, backend selector, and live metrics.

The NPU backend card rolls up every NPU-capable model across the chat and embed slots in one disclosure, and is only advertised when AMD XDNA hardware is present and the FLM toolbox image is locally available.

Addressing

All five slots bind to 127.0.0.1 on a port in the slot range (8081–8099). Only the API (:8080) and OpenWebUI (:3001) bind to public interfaces, so the whole stack is easy to put behind a Traefik or Caddy vhost on your homelab gateway. Clients should always talk to the API, never to a slot directly: the API handles authentication, single-flight, and structured-error wrapping.

You address a slot by its name in the OpenAI model field:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "primary",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

The dispatcher resolves "primary" to whichever model is currently loaded in the primary slot. See Slot as model for the full convention.
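
The same request can be made from Python with only the standard library. This is a sketch that mirrors the curl example above: the helper names build_chat_request and send_chat are hypothetical, and the response is assumed to follow the OpenAI chat-completions shape (choices[0].message.content).

```python
import json
import urllib.request

API_BASE = "http://localhost:8080/v1"  # same address as the curl example


def build_chat_request(slot: str, prompt: str) -> urllib.request.Request:
    """Build a chat request that addresses a slot through the model field."""
    body = {"model": slot, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(slot: str, prompt: str) -> str:
    """Send the request and return the assistant text (needs a running hal0)."""
    with urllib.request.urlopen(build_chat_request(slot, prompt)) as resp:
        reply = json.load(resp)
    # Assumes an OpenAI-style chat-completions response body.
    return reply["choices"][0]["message"]["content"]
```

Calling send_chat("primary", "Hello!") against a running instance should return the model's reply. The slot name, not the model file name, goes in the model field.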

Swapping the default model

Every slot has a default model picked at install time by the hardware probe. You can swap it at any time:

hal0 slot swap primary --model qwen3-30b-a3b-instruct-2507-q4_k_m

The slot transitions through unloading → warming → ready without dropping the API socket. In-flight requests on other slots keep flowing.
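
A script that swaps a model and then sends traffic may want to wait for the slot to reach ready first. The sketch below shows one way to do that. How a client observes slot state is not described on this page, so get_state is a caller-supplied callable and the helper name wait_until_ready is hypothetical. Only the state names come from the text above.

```python
import time
from typing import Callable


def wait_until_ready(
    get_state: Callable[[], str],
    timeout_s: float = 120.0,
    poll_s: float = 1.0,
) -> bool:
    """Poll get_state() until it reports "ready" or the timeout expires.

    get_state is supplied by the caller; this sketch does not assume any
    particular hal0 status endpoint or CLI command.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_state() == "ready":
            return True
        time.sleep(poll_s)
    return False


# Simulated swap: unloading, then warming, then ready.
states = iter(["unloading", "warming", "ready"])
print(wait_until_ready(lambda: next(states), timeout_s=5.0, poll_s=0.0))  # True
```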

The [model] default entry in primary.toml is only the install-time seed and is not consulted at runtime. A swap writes an env override and leaves the TOML untouched, so the file goes stale by design. Don’t try to change the live model by editing the TOML; use hal0 slot swap instead.

User-defined slots

Beyond the five built-ins, you can add custom slots: for example, a second chat model held hot in primary-fast, or an npu slot for the FLM provider on AMD XDNA hardware. See Custom slots.