chat · completions · embeddings · rerank · transcriptions · speech · images · models. Drop-in for any OpenAI SDK — point your client at :8080/v1 and go.
Your Strix Halo box,
running real /v1/*
inference.
hal0 turns a Linux box — ideally a Ryzen AI Max+ 395 with 128 GB of unified memory and an XDNA NPU — into a polished, OpenAI-compatible inference appliance. Slots, dispatcher, hardware probe, prewired chat UI, signed self-update. One command installs the lot.
Not another
llama-server wrapper.
Every workload is a real systemd-managed slot with a typed lifecycle. The API surface covers chat and embed and rerank and STT and TTS and image gen. The dashboard is for operating the box — not for chatting with it.
offline → pulling → starting → warming → ready → serving ↔ idle → unloading. Atomic transitions, persisted to state.json, streamed over SSE.
Registry-aware across local slots and external upstreams (OpenRouter, Anthropic, OpenAI, custom). Cold-cache prefetch coalesces a thundering herd into one HTTP call.
UMA pool on Strix Halo, real CPU/RAM/GPU on WSL2 / Proxmox / bare-metal. Slot-fit warnings size against the real unified pool, not a BAR carve-out.
hal0 update --channel stable|nightly. Verified tarballs swap a /usr/lib/hal0/current symlink. --rollback reverts atomically. Stable + nightly channels.
Dark-by-default operator console. SSE-backed status and log tail. Capability cards group flat slots into embed / voice / image picks for one-click ops.
Five providers,
one /v1/* surface.
Each provider is stateless — build_env() / start_cmd() / health() / infer().
No global state, no shared connection pool. The picker only advertises
a backend the slot can actually honour on your hardware.
Strix Halo native.
Not Strix-Halo-only.
The probe is UMA-aware on Strix Halo and falls back to portable parsers on every other host. The dashboard only labels memory "unified" when it actually is. Linux + systemd is the only hard requirement.
Curated starting points.
Tweak from there.
Sizes are published GGUF file sizes. The slot system takes a different model per slot whenever you change your mind. Picks refreshed to the latest open-weight releases — Qwen3, Llama-4 Scout, Hermes-4, gemma-3, Kokoro v1.
hal0 isn't an inference engine —
it's the orchestration around one.
Slots survive hal0-api restarts. Embeddings, rerank,
STT, TTS, and image gen all sit behind the same
/v1/* surface. UMA-aware hardware probe and
slot-fit warnings are first-class — not a slash command in a chat window.
One service for the
whole local AI stack.
No dates. Each themed row reads left-to-right: shipped,
in flight, exploring — the closer to the left, the closer it is to
running on your box. Tagged versions ship to
releases.hal0.dev within ~60s.
Live dashboard metrics, locked install + version flow,
MCP tool annotations.
Stop running models
from a chat tab.
Idempotent, non-interactive. Drops a venv at /usr/lib/hal0,
wires systemd units, lands hal0 on PATH, prints a QR.