What is a slot?

A slot is one inference workload running under hal0. Each slot owns exactly one model, one backend process, one port on 127.0.0.1, and one entry in the lifecycle state machine. Routing happens at the API edge: clients send OpenAI-shaped requests, the dispatcher picks the slot that owns the model, and the slot answers.

Concretely, each slot is a real systemd unit (e.g. hal0-slot@primary.service, an instance of the hal0-slot@.service template) running inside your LXC. systemctl status hal0-slot@primary works the way you expect it to. The slot shares the LXC’s unified memory pool with any other Proxmox tenants on the same node.
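Because every slot is an instance of the same template unit, the unit name can be derived from the slot name. The sketch below is illustrative only: `slot_unit` and `slot_unit_state` are hypothetical helpers, not part of hal0, and they rely solely on the standard `systemctl is-active` command. Keep in mind that systemd reporting `active` only means the process is up; it does not mean the model is ready for inference.

```python
import subprocess


def slot_unit(name: str) -> str:
    """Map a slot name to its template instance, e.g. 'primary' -> 'hal0-slot@primary.service'."""
    return f"hal0-slot@{name}.service"


def slot_unit_state(name: str) -> str:
    """Return systemd's activity state for the slot's unit ('active', 'inactive', 'failed', ...).

    This is the unit's state as systemd sees it, not the slot's lifecycle state.
    """
    result = subprocess.run(
        ["systemctl", "is-active", slot_unit(name)],
        capture_output=True,
        text=True,
        check=False,  # is-active exits non-zero for inactive units; that is not an error here
    )
    return result.stdout.strip()
```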

Running an LLM in your homelab isn’t really an inference problem; llama.cpp and friends already solve that. The hard part is everything around it:

  • Knowing when a model is actually ready for inference, separate from systemd reporting the unit is up.
  • Handling cold-boot grace so the first request doesn’t time out while VRAM/GTT fills.
  • Surviving a hal0-api restart without dropping the model.
  • Coalescing a thundering herd of identical prefetches into one upstream call.
  • Reporting structured errors when a model can’t load, with enough detail that the dashboard can show why.

Slots are the abstraction that owns all of that. The API process is stateless; the slot owns the model.

Each slot has:

  • A name (primary, embed, stt, tts, img, or a user-defined name).
  • A model assignment (a registry ref like qwen2.5-0.5b-instruct-q4_k_m).
  • A provider (llama.cpp, flm, moonshine, kokoro, comfyui) that knows how to build the env, start the process, and run a health probe.
  • A systemd unit, an instance of the hal0-slot@.service template (e.g. hal0-slot@primary.service).
  • A port in the range 8081–8099, bound to 127.0.0.1 only.
  • A state file at /var/lib/hal0/slots/<name>/state.json, updated atomically on every transition and streamed to clients over SSE.
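One common way to get an atomic update of a file like state.json is to write a temporary file in the same directory and rename it over the target, so a reader sees either the old state or the new one, never a half-written file. The following is a minimal sketch of that technique, not hal0's actual implementation; the function name and the JSON fields (`slot`, `state`) are assumptions made for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def write_state_atomically(path: Path, state: dict) -> None:
    """Replace `path` with the JSON encoding of `state` in one atomic rename."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, path)  # atomic on POSIX when source and target share a filesystem
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

A caller would invoke it once per transition, for example `write_state_atomically(Path("/var/lib/hal0/slots/primary/state.json"), {"slot": "primary", "state": "ready"})`, with the field names again being hypothetical.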

A slot moves through a fixed state machine:

offline → pulling → warming → ready → serving ↔ idle → unloading
  • offline — defined but not running.
  • pulling — fetching weights into the registry; emits byte-level SSE progress.
  • warming — process started, health probe not yet green.
  • ready — health probe is green, no in-flight requests.
  • serving — handling at least one request.
  • idle — ready, but the idle-timeout has tripped; eligible for unload if memory pressure rises.
  • unloading — stopping the process, releasing GTT/VRAM, then back to offline.
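The lifecycle above can be modelled as a small transition table. This sketch encodes only the arrows shown in the diagram plus the return from unloading to offline; the class and function names are hypothetical, and the real state machine may permit transitions not listed here (for example on load failure).

```python
from enum import Enum


class SlotState(str, Enum):
    OFFLINE = "offline"
    PULLING = "pulling"
    WARMING = "warming"
    READY = "ready"
    SERVING = "serving"
    IDLE = "idle"
    UNLOADING = "unloading"


# Allowed next states, taken from the diagram; serving <-> idle is bidirectional.
ALLOWED = {
    SlotState.OFFLINE: {SlotState.PULLING},
    SlotState.PULLING: {SlotState.WARMING},
    SlotState.WARMING: {SlotState.READY},
    SlotState.READY: {SlotState.SERVING},
    SlotState.SERVING: {SlotState.IDLE},
    SlotState.IDLE: {SlotState.SERVING, SlotState.UNLOADING},
    SlotState.UNLOADING: {SlotState.OFFLINE},
}


def transition(current: SlotState, target: SlotState) -> SlotState:
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```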

The dashboard’s /slots view renders the live state for every slot, plus the per-slot T/S (throughput), ACT (active count), MEM (memory), and UP (uptime) metrics.

Dashboard /slots view showing the five built-in slots with their lifecycle state, throughput, active count, memory, uptime, and a load sparkline.

Clients hit http://127.0.0.1:8080/v1/*. The dispatcher reads the model field, looks up which slot owns it, then proxies the request to that slot’s local port.
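The lookup step can be pictured as a map from model ref to slot. The sketch below is a simplified illustration, not the dispatcher's real code: the `Slot` dataclass, the function names, and the example port are hypothetical, and it assumes the request path is forwarded unchanged to the slot's port.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    name: str
    model: str  # registry ref the slot owns
    port: int   # local port on 127.0.0.1


def build_routes(slots: list[Slot]) -> dict[str, Slot]:
    """Index slots by the model they own (each slot owns exactly one model)."""
    return {slot.model: slot for slot in slots}


def upstream_url(routes: dict[str, Slot], body: dict, path: str) -> str:
    """Pick the slot that owns the request's `model` field and build its local URL."""
    model = body.get("model")
    slot = routes.get(model)
    if slot is None:
        raise LookupError(f"no slot owns model {model!r}")
    return f"http://127.0.0.1:{slot.port}{path}"


# Example with a hypothetical port assignment:
routes = build_routes([Slot("primary", "qwen2.5-0.5b-instruct-q4_k_m", 8081)])
url = upstream_url(routes, {"model": "qwen2.5-0.5b-instruct-q4_k_m"}, "/v1/chat/completions")
```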

  • Single-flight prefetch. If N concurrent requests trigger the same cold load, the slot fires one upstream call and fans the response out to all N waiters.
  • Adaptive cold-boot. Health probes back off intelligently while the model is warming, so the API doesn’t 503 a request that’s about to succeed.
  • Decision logging. Every routing choice is recorded with the registry refs considered, the slot picked, and the reason. The dashboard’s Logs view tails this stream over SSE.

A slot is also defined by what it is not:

  • Not a container manager. Slots use plain systemd template units, not Docker Compose or Kubernetes. Containerised backends (toolbox images for FLM, ROCm, ComfyUI, etc.) are an implementation detail of each provider.
  • Not a model cache. Weights live under /mnt/ai-models/local with the index at /var/lib/hal0/registry/registry.toml (see the model registry); slots only reference registry entries.
  • Not multi-tenant inside hal0. Slot names are global to the install. There’s no per-user partitioning in the v0.1 alpha line; agent / multi-tenant work is on the v0.2 roadmap. Multi-tenancy between hal0 and other Proxmox guests on the same node is handled by the unified memory pool, which the dashboard surfaces.
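
The single-flight prefetch behaviour described in the routing notes above can be sketched with asyncio: the first caller for a key starts the work, and every concurrent caller for the same key awaits that same task. This is a generic illustration of the pattern, not hal0's implementation; `SingleFlight` and its `do` method are hypothetical names.

```python
import asyncio
from typing import Any, Awaitable, Callable


class SingleFlight:
    """Coalesce concurrent calls that share a key into one underlying call."""

    def __init__(self) -> None:
        self._inflight: dict[str, asyncio.Task] = {}

    async def do(self, key: str, fn: Callable[[], Awaitable[Any]]) -> Any:
        task = self._inflight.get(key)
        if task is None:
            # First caller for this key: start the work and register it.
            task = asyncio.create_task(fn())
            self._inflight[key] = task
            # Forget the key once the work finishes so later calls start fresh.
            task.add_done_callback(lambda _: self._inflight.pop(key, None))
        # shield() keeps one cancelled waiter from cancelling the shared task.
        return await asyncio.shield(task)
```

With this shape, N requests that arrive while a cold load is in progress all receive the result of the one load, and a request that arrives after it completes triggers a new call.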