v0.2.0-alpha.3 — shipping on every supported host

Your Strix Halo box,
running real /v1/*
inference.

hal0 turns a Linux box — ideally a Ryzen AI Max+ 395 with 128 GB of unified memory and an XDNA NPU — into a polished, OpenAI-compatible inference appliance. Slots, dispatcher, hardware probe, prewired chat UI, signed self-update. One command installs the lot.

~/hal0 — bash 5 slots · 0 errors
$ curl -fsSL https://hal0.dev/install.sh | bash
Read the docs → ★ Star on GitHub Apache-2.0 · Linux + systemd
hal0-api :8080 ready · 5 slots up · GTT 9.2 / 96 GB · probe strix-halo
host strix-halo-01 · 128 GB UMA · iGPU + XDNA SSE connected
primary
Qwen3-30B-A3B-Instruct-2507-Q4_K_M
serving
142 tok/s
embed
nomic-embed-text-v2-moe-Q4_K_M
serving
116 tok/s
embed-rerank
bge-reranker-v2-m3-q4_k_m
ready
warm
stt
moonshine-base
idle
:8083
tts
kokoro-82M-v1.0
idle
:8084
img
sdxl-turbo
ready
warm
Unified memory (GTT) 9.2 / 96 GB carveout · 128 GB DIMM
primary 6.5G embed 2.3G rerank 0.4G kernel + ZFS ARC 3.8G
dispatch p50
174ms
258tok/s
primary + embed concurrent
Strix Halo iGPU · ~9 GB GTT
<200ms
dispatch latency
both slots hot, single-flight
5slots
chat · embed · stt · tts · img
systemd-managed lifecycle
~10s
Phi-3 Mini pull → first token
2.39 GB · 71 tok/s · 280 ms RTT
/ what's in the box

Not another
llama-server wrapper.

Every workload is a real systemd-managed slot with a typed lifecycle. The API surface covers chat and embed and rerank and STT and TTS and image gen. The dashboard is for operating the box — not for chatting with it.

/v1/*
OpenAI-compatible API

chat · completions · embeddings · rerank · transcriptions · speech · images · models. Drop-in for any OpenAI SDK — point your client at :8080/v1 and go.

lifecycle
Typed slot state machine

offline → pulling → starting → warming → ready → serving ↔ idle → unloading. Atomic transitions, persisted to state.json, streamed over SSE.

dispatcher
Single-flight routing

Registry-aware across local slots and external upstreams (OpenRouter, Anthropic, OpenAI, custom). Cold-cache prefetch coalesces a thundering herd into one HTTP call.

probe
Hardware-aware everything

UMA pool on Strix Halo, real CPU/RAM/GPU on WSL2 / Proxmox / bare-metal. Slot-fit warnings size against the real unified pool, not a BAR carve-out.

update
Cosign-keyless self-update

hal0 update --channel stable|nightly. Verified tarballs swap a /usr/lib/hal0/current symlink. --rollback reverts atomically. Stable + nightly channels.

dashboard
Vue 3 + Tailwind 4 admin UI

Dark-by-default operator console. SSE-backed status and log tail. Capability cards group flat slots into embed / voice / image picks for one-click ops.

SLOT LIFECYCLE — ENFORCED IN _transition()
offline pulling starting warming ready serving idle unloading offline error
/ provider stack

Five providers,
one /v1/* surface.

Each provider is stateless — build_env() / start_cmd() / health() / infer(). No global state, no shared connection pool. The picker only advertises a backend the slot can actually honour on your hardware.

Provider
Workloads
Hardware
Endpoints
llama.cppb9279
chat · embed · rerank · vision
Vulkan (default) · ROCm · CUDA
/v1/chat · /v1/embeddings · /v1/rerankings
FLMv1
chat · embed (ASR multiplex)
AMD XDNA NPU (opt-in)
/v1/chat · /v1/embeddings
Moonshinebase
speech-to-text
CPU only
/v1/audio/transcriptions
Kokoro82M
text-to-speech · 54 voices
CPU · Vulkan
/v1/audio/speech
ComfyUIv1
image gen · SDXL / SD 1.5 / Flux
ROCm
/v1/images/generations
/ hardware

Strix Halo native.
Not Strix-Halo-only.

The probe is UMA-aware on Strix Halo and falls back to portable parsers on every other host. The dashboard only labels memory "unified" when it actually is. Linux + systemd is the only hard requirement.

reference deployment
Ryzen AI Max+ 395
"Strix Halo" iGPU + XDNA NPU + 128 GB LPDDR5X-8000 unified memory
128GB
unified · BIOS-tunable to ~96 GB GPU
258tok/s
primary + embed concurrent
All published perf numbers come from this box. Q4 70B fits with massive headroom; Q4 MoE 100B+ with a 17–22B active path becomes feasible.
first-class
Ryzen AI Max 385 / 390
Strix Halo with 64 GB unified
64GB
unified memory
~70B
Q4 ceiling, shorter context
Same install path. Every small + mid tier fits; 70B Q4 works with tighter context windows.
supported
NVIDIA RTX 30 / 40 / 50
10–32 GB dedicated VRAM · CUDA llama.cpp
32GB
RTX 5090 VRAM ceiling
~30B
Q4 comfortable
Same slot lifecycle, dedicated VRAM instead of UMA — higher tok/s on small models, lower ceiling on the big ones.
supported
AMD Radeon RX 7000
16–24 GB discrete · Vulkan today, ROCm in build queue
24GB
7900 XTX VRAM
Vulkan
now · ROCm soon
Discrete AMD path. ROCm toolbox image is on the build list — Vulkan path works today for chat + embed.
fallback
CPU-only x86_64
Vulkan-CPU · usable for tiny models
0.5–4B
practical model size
Qwen0.5B
the CI smoke model
CI runs Qwen 0.5B here. Usable for tiny models and smoke tests, not the headline experience.
supported
Proxmox LXC
privileged container · iGPU + XDNA passthrough
0600
PVE token stored sealed
segmented
host-pressure overlay
Drop a read-only PVEAuditor token into Settings and the memory bar shows physical DIMM total + a muted "Proxmox host" segment for other-tenant + ZFS ARC pressure.
/ recommended loadouts

Curated starting points.
Tweak from there.

Sizes are published GGUF file sizes. The slot system takes a different model per slot whenever you change your mind. Picks refreshed to the latest open-weight releases — Qwen3, Llama-4 Scout, Hermes-4, gemma-3, Kokoro v1.

~5 GB · daily 8 GB+
Small dedicated coder
primary
Qwen2.5-Coder-7B-InstructQ4_K_M · 4.7 GB · best small dedicated coder
~19 GB · pro 24 GB+
MoE that reasons like 30B
primary
Qwen3-Coder-30B-A3BQ4_K_M · MoE 3B active · runs near 3B speeds, reasons like 30B
embed
nomic-embed-text-v2-moeQ4_K_M · 140 MB · repo-aware search
~47 GB · max 64 GB+ UMA
Qwen3-Coder-Next 80B-A3B
primary
Qwen3-Coder-Next-80B-A3BQ4_K_M · MoE 3B active · Qwen3-Coder lineup ceiling
alt
Hermes-4-70BQ4_K_M · hybrid reasoning + tool-friendly coding
alt
Qwen3.6-27B-A3B-MTPQ4_K_M · 18.8 GB · MoE with MTP · fast reasoning fallback
/ vs. the alternatives

hal0 isn't an inference engine —
it's the orchestration around one.

Slots survive hal0-api restarts. Embeddings, rerank, STT, TTS, and image gen all sit behind the same /v1/* surface. UMA-aware hardware probe and slot-fit warnings are first-class — not a slash command in a chat window.

hal0 (this)
ollama
LM Studio
OpenAI cloud
OpenAI-compatible /v1/*
chat · embed · rerank · STT · TTS · img
chat · embed
chat · embed
chat · embed · STT · TTS · img
Concurrent slots
5 systemd-managed
one at a time
one at a time
fully concurrent
Slot lifecycle state machine
typed, atomic, SSE-streamed
no
no
UMA-aware hardware probe
Strix Halo · NPU · platform-aware
CPU / GPU only
CPU / GPU only
XDNA NPU support
first-class via FLM
no
no
Dispatcher with upstreams
OpenRouter · Anthropic · OpenAI · custom
no
no
Cosign-signed updates
keyless OIDC + rollback
no
auto-update
Headless / Linux-first
systemd-required, headless
cross-platform
GUI required
cloud
Your data, your hardware
yes
yes
yes
no
Cost per million tokens
$0 + electricity
$0 + electricity
$0 + electricity
$0.50 – $60
/ roadmap

One service for the
whole local AI stack.

No dates. Each themed row reads left-to-right: shipped, in flight, exploring — the closer to the left, the closer it is to running on your box. Tagged versions ship to releases.hal0.dev within ~60s.

shipping now v0.2.0-alpha.3 released ~60s after tag · cosign-keyless verified · releases.hal0.dev/stable.json

Live dashboard metrics, locked install + version flow,
MCP tool annotations.

01
live metrics
Live TTFT dashboard
Per-slot time-to-first-token sampled in the dispatcher, surfaced as a sparkline-pinned HUD on SlotCard and aggregated fleet-wide. 60-second window; the stats row drops to KV / ACT / MEM / UP because T/S now lives on the sparkline itself.
02
kv-cache
KV-cache % gauge populates
Vulkan toolbox rebuilt against llama.cpp b9279. Scrape falls back to /slots when the legacy Prometheus gauge is missing — synthesises kv_cache_usage = max(n_prompt_tokens) / n_ctx, clamped at 1.0. The previously-empty cache tile finally has a number.
03
mcp
MCP tool annotations
All 22 admin tools and 4 standalone memory tools ship with MCP hint annotations (readOnly, destructive, idempotent, openWorld) so SDK clients can render confirmation UIs and reason about side-effects.
04
install
Install + version flow locked down
__version__ reads from importlib.metadata instead of a hard-coded string. pyproject.toml is the single source of truth. installer/install.sh copies the source tree into PREFIX so editable installs work from a temp checkout under any HAL0_PREFIX.
01
shipped
running on your box today
02
in flight
next release cuts
03
exploring
bets, not promises
Inference + providers
The /v1/* surface and the engines behind it.
4 / 0 / 2
OpenAI-compatible /v1/* API
Chat, completions, embeddings, rerank, transcriptions, speech, images. Every OpenAI SDK works unchanged against the local box.
Five-provider stack
llama.cpp (Vulkan / ROCm / CUDA) for chat and embed, FLM for the XDNA NPU, Moonshine for STT, Kokoro for TTS, ComfyUI for image generation.
Image generation
POST /v1/images/generations served by a ComfyUI provider on ROCm. Curated SDXL Turbo / SD 1.5 / Flux Schnell with license badges.
FLM NPU provider live
Self-contained ghcr.io/hal0ai/hal0-toolbox-flm:v1, pinned by sha256. Chat + embed surfaced in the picker only when XDNA hardware is present.
nothing queued
Fine-tune & LoRA hot-swap
Attach and rotate LoRAs against a warm base model without unloading the underlying weights.
Per-model rate limits & budgets
Cost-style accounting for local inference so a chatty agent can be capped without taking the whole box down.
Slot lifecycle
Atomic transitions, real metrics, no snapshots.
5 / 1 / 0
Slot lifecycle state machine
Atomic transitions (offline → pulling → starting → warming → ready → serving), persisted and SSE-streamed.
Capability slots overlay
Embed / Voice / Img capability cards plus an NPU backend rollup, backed by /etc/hal0/capabilities.toml.
embed-rerank built-in slot
Auto-created on first enable: bge-reranker-v2-m3-q4_k_m on :8086 with --reranking. /v1/rerankings stays separate from chat.
Per-slot live metrics
GET /api/slots/metrics reads docker cgroup memory + ActiveEnterTimestampMonotonic + scraped /metrics.
Orchestrator drift reconcile
apply() reconciles capabilities.toml ↔ slots/*.toml on every call, not only on selection diff.
Benchmarks & presets UI
In-dashboard tok/s + latency runs, plus curated loadout presets you can flash onto a fresh install.
nothing queued
Install + distribution
One command. Signed. Resumable.
5 / 2 / 0
Signed release pipeline live
releases.hal0.dev/stable.json via Cloudflare middleware proxy. A v* tag publishes the GH release; manifest reflects it in ~60s.
Cosign-signed self-update
Atomic version swap with rollback. Stable + nightly channels, GitHub OIDC-verified release tarballs.
Installer overhaul
ASCII banner + step counter, extracted preflight.sh, hal0 doctor, hardware cards + auto-populated slots/primary.toml, live-hello + QR.
--models-dir install flag
install.sh --models-dir=PATH (or HAL0_MODELS_DIR, or interactive tty prompt) points a fresh install at an existing model store.
Install + version flow locked down
__version__ reads from importlib.metadata. pyproject.toml is the single source of truth; installer copies the source tree into PREFIX.
Extensions framework
Third-party apps packaged as hal0 extensions: systemd unit + dashboard tile + healthcheck. Bring-your-own-app, lifecycle-managed.
AUR PKGBUILD & Ubuntu PPA
Native distro packages on top of the install script: pacman and apt as first-class install paths.
nothing queued
Hardware + observability
Probe before you load. Memory bar that tells the truth.
4 / 0 / 1
UMA-aware probe
Detects iGPU, XDNA NPU, and unified memory pool; surfaces fit warnings inline before you load a model that won't fit.
Proxmox host-pressure segment
Read-only PVEAuditor token → dashboard memory bar shows physical DIMM total plus a muted host segment for other-tenant + ZFS ARC + kernel pressure.
KV-cache % gauge populates
Vulkan toolbox rebuilt against llama.cpp b9279; synthesises kv_cache_usage = max(n_prompt_tokens) / n_ctx, clamped at 1.0.
Live TTFT dashboard
Per-slot time-to-first-token sampled in the dispatcher, surfaced as a sparkline-pinned HUD readout on SlotCard.
nothing queued
Multi-host federation
A slot mesh across LAN boxes: primary on the Strix Halo, embed on the workstation, all behind one /v1/* surface.
Agents + memory + MCP
A place for agents to live, not only to chat with.
2 / 3 / 1
MCP tool annotations
All 22 admin tools and 4 standalone memory tools ship with MCP hint annotations so SDK clients can render confirmation UIs and reason about side-effects.
Approval inbox + bell
Capital-D destructive tool calls gate through a header bell and modal inbox in the dashboard, with full CLI parity via hal0 agent approvals.
Bundled agent app
Hermes-Agent installable from the dashboard, prewired to the local OpenAI API + MCP servers.
Cognee-backed memory
Embedded SQLite + LanceDB + Kuzu. Shared by default; X-hal0-Private: 1 promotes a client's writes to its own namespace.
MCP host + server
hal0 speaks Model Context Protocol both directions. Compose tools across local slots and external MCP services; discover them from the dashboard.
ChatOps adapters
Slack and Matrix bridges as extensions, so you can talk to hal0 from the rooms you already live in.
Security + UX
Off by default. Opt in when you're ready to expose the box.
3 / 1 / 1
Bundled OpenWebUI
Prewired chat on :3001 pointed at the local API. Zero config; the installer wires it for you.
Caddy + auth + HTTPS + mDNS
install.sh --auth=basic brings up Caddy with basic_auth, Bearer tokens for the API, automatic HTTPS, and a best-effort Avahi service file.
First-run wizard rewrite
Eight linear steps from password through hardware, primary model, capabilities, HF token, license, install, done. Conditional HF-token step.
STT / TTS curated picks
v0.1.1 seeded embed/rerank with real GGUF picks. STT (moonshine) and TTS (kokoro) need a pull-layer extension because their weights aren't GGUF / safetensors.
Voice mode end-to-end
Moonshine + agent loop + Kokoro stitched into a hands-free streaming conversation, gated through the slot lifecycle.
Exploring items are bets we believe in, not promises. Scope and order will shift as v0.2 lands and we learn what people actually do with a local platform. open an issue ↗
/ one command

Stop running models
from a chat tab.

Idempotent, non-interactive. Drops a venv at /usr/lib/hal0, wires systemd units, lands hal0 on PATH, prints a QR.

~/hal0 — bash 5 slots · 0 errors
$ curl -fsSL https://hal0.dev/install.sh | bash
Apache-2.0 Linux + systemd no telemetry by default cosign-signed releases