0.9.7 shipping now · v0.9 in development

Your Strix Halo box,
running real /v1/*
inference.

hal0 turns a Linux box — ideally a Ryzen AI Max+ 395 — into a private, OpenAI-compatible AI appliance: one /v1/* API across every modality, with concurrent workloads the box manages for you. One command installs the lot.

install.sh

Linux x86_64 · Python ≥3.12

$ curl -fsSL https://hal0.dev/install.sh | bash

Read the docs → ★ Star on GitHub Apache-2.0 · Linux + systemd

hal0-api :8080 ready · all slots up · GTT 9.2 / 96 GB · probe strix-halo

host strix-halo-01 · 128 GB UMA · iGPU + XDNA SSE connected

agent

Qwen3-30B-A3B-Instruct-2507-Q4_K_M

serving

142 tok/s

embed

nomic-embed-text-v2-moe-Q4_K_M

serving

116 tok/s

rerank

bge-reranker-v2-m3-q4_k_m

ready

warm

stt

whisper-v3:turbo (NPU)

idle

:8083

tts

kokoro-82M-v1.0

idle

:8084

img

sdxl-turbo

ready

warm

Unified memory (GTT) 9.2 / 96 GB carveout · 128 GB DIMM

agent 6.5G embed 2.3G rerank 0.4G kernel + ZFS ARC 3.8G

dispatch p50

174ms

/ local coding stack

Point your editor at your own box

Aim any OpenAI-compatible client at :8080/v1 — chat, completions, and a dedicated coder slot. Your code never leaves the LAN, and there's no per-token bill.

/ private knowledge & RAG

Retrieval grounded on your data

Embeddings, reranking, and a bundled agent with opt-in graph memory and MCP tools — a full RAG stack that runs on your hardware, not someone else's server.

/ voice & image lab

Speech and images, one control plane

Transcription, text-to-speech, and local image generation behind the same API — STT on the XDNA NPU, TTS switchable between Kokoro (CPU) and Qwen3-TTS (GPU), ComfyUI on the iGPU, switched cleanly so they share the box.

258tok/s

primary + embed concurrent

Strix Halo iGPU · ~9 GB GTT

1call

thundering herd, coalesced

single-flight dispatch + cold-cache prefetch

6workloads

chat · embed · rerank · stt · tts · img

as many slots as memory fits

100tok/s

35B model · MTP speculative decode

ace-saber 35B-A3B · 19 GB · single stream

from the roster benchmark →

/ what's in the box

Not another
llama-server wrapper.

Every workload is a real systemd-managed slot with a typed lifecycle. The API surface covers chat and embed and rerank and STT and TTS and image gen. The dashboard is for operating the box — not for chatting with it.

/v1/*

OpenAI-compatible API

chat · completions · embeddings · rerank · transcriptions · speech · images · models. Drop-in for any OpenAI SDK — point your client at :8080/v1 and go.

lifecycle

Typed slot state machine

offline → pulling → starting → warming → ready → serving ↔ idle → unloading. Atomic transitions, persisted to state.json, streamed over SSE.

dispatcher

Single-flight routing

Registry-aware across local slots and external upstreams (OpenRouter, Anthropic, OpenAI, custom). Cold-cache prefetch coalesces a thundering herd into one HTTP call.

probe

Hardware-aware everything

UMA pool on Strix Halo, real CPU/RAM/GPU on WSL2 / Proxmox / bare-metal. Slot-fit warnings size against the real unified pool, not a BAR carve-out.

update

Cosign-keyless self-update

hal0 update --channel stable|nightly. Verified tarballs swap a /usr/lib/hal0/current symlink. --rollback reverts atomically. Stable + nightly channels.

stacks

Declarative Stacks

Plan a slot/model layout into a change set, apply it atomically with rollback, and converge the live slots against it — content-hash drift detection, plus export/import via a checksummed .hal0stack.json envelope.

SLOT LIFECYCLE — ENFORCED IN _transition()

offline → pulling → starting → warming → ready ⇄ serving ⇄ idle → unloading → offline ↳ error

/ provider stack

Five providers,
one /v1/* surface.

Each provider is stateless — build_env() / start_cmd() / health() / infer(). No global state, no shared connection pool. The picker only advertises a backend the slot can actually honour on your hardware.

Provider

Workloads

Hardware

Endpoints

llama.cppb9279

chat · embed · rerank · vision

Vulkan (default) · ROCm · CUDA

/v1/chat · /v1/embeddings · /v1/rerankings

FLMv1

chat · embed (ASR multiplex)

AMD XDNA NPU (opt-in)

/v1/chat · /v1/embeddings

FLM / Whisperv3 turbo

speech-to-text (co-loaded with chat)

AMD XDNA NPU

/v1/audio/transcriptions

Kokoro / Qwen3-TTS82M · GPU

text-to-speech · one-switch CPU⇄GPU swap

CPU (Kokoro) · ROCm (Qwen3-TTS)

/v1/audio/speech

ComfyUIv1

image gen · SDXL / SD 1.5 / Flux

ROCm

/v1/images/generations

/ hardware

Strix Halo native.
Not Strix-Halo-only.

The probe is UMA-aware on Strix Halo and falls back to portable parsers on every other host. The dashboard only labels memory "unified" when it actually is. Linux + systemd is the only hard requirement.

reference deployment

Ryzen AI Max+ 395

"Strix Halo" iGPU + XDNA NPU + 128 GB LPDDR5X-8000 unified memory

128GB

unified · BIOS-tunable to ~96 GB GPU

258tok/s

primary + embed concurrent

All published perf numbers come from this box. Q4 70B fits with massive headroom; Q4 MoE 100B+ with a 17–22B active path becomes feasible.

first-class

Ryzen AI Max 385 / 390

Strix Halo with 64 GB unified

64GB

unified memory

~70B

Q4 ceiling, shorter context

Same install path. Every small + mid tier fits; 70B Q4 works with tighter context windows.

supported

NVIDIA RTX 30 / 40 / 50

10–32 GB dedicated VRAM · CUDA llama.cpp

32GB

RTX 5090 VRAM ceiling

~30B

Q4 comfortable

Same slot lifecycle, dedicated VRAM instead of UMA — higher tok/s on small models, lower ceiling on the big ones.

supported

AMD Radeon RX 7000

16–24 GB discrete · ROCm or Vulkan container profiles

24GB

7900 XTX VRAM

ROCm

· Vulkan

Discrete AMD path, same hal0-slot@<name> lifecycle as Strix Halo — both ROCm and Vulkan container profiles run today.

fallback

CPU-only x86_64

Vulkan-CPU · usable for tiny models

0.5–4B

practical model size

Qwen0.5B

the CI smoke model

CI runs Qwen 0.5B here. Usable for tiny models and smoke tests, not the headline experience.

supported

Proxmox LXC

privileged container · iGPU + XDNA passthrough

0600

PVE token stored sealed

segmented

host-pressure overlay

Drop a read-only PVEAuditor token into Settings and the memory bar shows physical DIMM total + a muted "Proxmox host" segment for other-tenant + ZFS ARC pressure.

/ recommended loadouts

Curated starting points.
Tweak from there.

Sizes are published file sizes; the slot system takes a different model per slot whenever you change your mind. These are tuned for Strix Halo: A3B MoE models give 30–80B quality at ~3B token-gen speed, and an MTP head on MTP-enabled llama.cpp delivers a measured 1.4–2.4× decode speedup. FP4 is what makes them fit the unified pool — not what makes them fast.

~22 GB · daily-pro 28 GB UMA

Your top coder, MTP-fast

primary

Qwopus3.6-27B-Coder-MTPQ6_K · 22.4 GB · MTP head speculates on ROCm — fastest local coder on the box

embed

qwen3-embedding-0.6b-q8Q8 · 0.6 GB · repo-aware code search

~8 GB · lean 10 GB UMA

Snappy completion

primary

qwopus3-5-9b-coder-mtp-q6-kQ6_K · 7.6 GB · MTP · fits beside chat + embed

alt

qwopus3-5-4b-coder-mtp-q6-kQ6_K · 3.6 GB · low-VRAM / CPU fallback

~50 GB · max 64 GB+ UMA

Coder-Next ceiling

primary

qwen3-coder-next-q4kxlQ4_K_XL · 49.6 GB · dedicated Qwen3-Coder-Next — the quality ceiling

alt

qwen3-coder-reap-25b-a3b-q5kmQ5_K_M · 17.7 GB · validated REAP prune · A3B speed, most of the quality

alt

qwen3-coder-next-reap-40b-a3b-q4kxlQ4_K_XL · 28.5 GB · community 50% prune — unvalidated, some math loss

Model ids are real, pull-able registry ids. Considering Apache-clean upgrades? Qwen-Image (beats FLUX on prompt + text rendering), Qwen3-Reranker-0.6B (pairs with the Qwen3 embedder), and Chatterbox-Turbo (zero-shot voice cloning) are easy pulls when you want them.

/ vs. the alternatives

hal0 isn't an inference engine —
it's the orchestration around one.

Slots survive hal0-api restarts. Embeddings, rerank, STT, TTS, and image gen all sit behind the same /v1/* surface. UMA-aware hardware probe and slot-fit warnings are first-class — not a slash command in a chat window.

hal0 (this)

ollama

LM Studio

OpenAI cloud

OpenAI-compatible /v1/*

chat · embed · rerank · STT · TTS · img

chat · embed

chat · embed · STT · TTS · img

Concurrent slots

as memory fits

one at a time

fully concurrent

Slot lifecycle state machine

typed, atomic, SSE-streamed

—

UMA-aware hardware probe

Strix Halo · NPU · platform-aware

CPU / GPU only

—

XDNA NPU support

first-class via FLM

—

Dispatcher with upstreams

OpenRouter · Anthropic · OpenAI · custom

—

Cosign-signed updates

keyless OIDC + rollback

auto-update

—

Headless / Linux-first

systemd-required, headless

cross-platform

GUI required

cloud

Your data, your hardware

yes

Cost per million tokens

$0 + electricity

$0.50 – $60

Competitor capabilities reflect each project's published docs as of June 2026 and move fast — treat this as a snapshot, not a live scorecard.

/ the console

Chat, image gen, agents, and memory —
one console.

Slots, models, local image generation, graph memory, an agent task board, MCP, and logs — every surface in one React operator console. SSE-backed, dark by default. Real screenshots from a live hal0 instance.

hal0 dashboard overview — slots, throughput, and live service health

Slots view — per-slot state and the typed inference lifecycle — **Slots** — one engine, every workload, a typed lifecycle you can watch live.

ComfyUI image generation with the iGPU in exclusive image mode while inference slots are paused — **Image gen on the iGPU** — generating flips the accelerator into exclusive image mode; inference pauses, then resumes. One GPU, shared cleanly.

Agent memory rendered as a navigable semantic and temporal knowledge graph — **Graph memory** (opt-in) — facts become a navigable semantic + temporal graph, namespaced per agent.

Plus a Hermes agent that lives on the box, an Operator Board kanban wired to it, an MCP server + client, and a live XDNA NPU view — see the roadmap for everything shipped.

/ the bundled agent

An agent that
lives on the box.

Hermes installs and bootstraps itself on first run — sandboxed under its own user, prewired to the local /v1 API and your MCP servers, with opt-in graph memory. Reach her from Telegram or Discord; she chains tools for hours fully AFK and folds every run back into memory.

Self-bootstraps: env probe → model wiring → MCP memory → persona.
Sandboxed hal0-agent@hermes.service — own user, no new privileges.
Gated tools clear an approval bell; every call is audited.

Agents & memory → hover to tilt · click to flip

Hermes

chadrock-35b-ace-saber

ctx0/164K

remote control · self-improving · orchestration

READY tap · abilities

Hermes · abilities

Ghost Relay 40pwr

Summon her from any Telegram or Discord thread.

Engram 60pwr

Folds every run back into memory — never relearns.

Deep Run 90pwr

Chains tools for hours, fully AFK.

Skills

voice · ttsspeech · sttimage-genvisionembeddings

logs persona

Run agents →

/ roadmap

One service for the
whole local AI stack.

No dates. Each themed row reads left-to-right: shipped, in flight, exploring — the closer to the left, the closer it is to running on your box. Tagged versions ship to releases.hal0.dev within ~60s.

Declarative Stacks, GPU voice, and audited agent memory —
the v0.9 line is next.

config

Stacks — declarative config SSOT

A StackConfig plans into a change set, applies atomically with rollback, and converges the live slot set against it. Content-hash drift detection, an active-stack pointer, and export/import via a checksummed .hal0stack.json envelope.

voice

GPU voice: Qwen3-TTS + NPU STT

A voice.tts switch swaps the tts slot between Kokoro (CPU, 54 voices) and Qwen3-TTS (GPU) with no reconfiguration, while whisper-v3:turbo handles transcription on the XDNA NPU alongside chat + embed.

memory

Hindsight memory, now audited

Every destructive memory op — bank delete, memory/config/document/directive/operation/mental-model deletes — now records a durable, attributable audit row, and the cognition consoles validate the upstream response shape instead of failing silently.

reliability

72-finding platform-review remediation

A large reliability pass: honest disable-a-capability semantics, resumable pulls, disk-space preflight, cross-process file locks, and Power/Thermal + per-slot throughput cards on the dashboard by default.

shipped

running on your box today

in flight

targeting v0.9

exploring

bets, not promises

Inference + providers

The /v1/* surface and the engines behind it.

4 / 0 / 2

✓

OpenAI-compatible /v1/* API

Chat, completions, embeddings, rerank, transcriptions, speech, images. Every OpenAI SDK works unchanged against the local box.

✓

Five-provider stack

llama.cpp (Vulkan / ROCm) for chat and embed, FLM for the XDNA NPU (chat + whisper-v3:turbo STT + embed, one process), Kokoro (CPU) or Qwen3-TTS (GPU) behind a one-switch tts slot, ComfyUI for image generation.

✓

Image generation + iGPU switchover

POST /v1/images/generations served by a ComfyUI engine; generating flips the GPU into exclusive image mode and back, so chat and image share one accelerator.

✓

FLM NPU provider

Self-contained NPU toolbox pinned by digest. The chat + STT + embed trio is surfaced only when XDNA hardware is present.

— nothing queued

◦

Fine-tune & LoRA hot-swap

Attach and rotate LoRAs against a warm base model without unloading the underlying weights.

◦

Per-model rate limits & budgets

Cost-style accounting for local inference so a chatty agent can be capped without taking the whole box down.

Slot lifecycle

Atomic transitions, honest health, no snapshots.

3 / 1 / 0

✓

Slot lifecycle state machine

Atomic transitions (offline → pulling → starting → warming → ready → serving), persisted and SSE-streamed.

✓

Honest slot health + derived context

A slot is marked ready only once its real /health passes — never on a systemd snapshot. Context size is derived per slot, never silently inheriting the 4096 default.

✓

Capability slots + profiles

Embed / Voice / Image capability cards and an NPU rollup over flat slots; per-device profiles unify the launch flags, all from one Slots tab.

◇

Benchmarks & presets UI

In-dashboard tok/s + latency runs, plus curated loadout presets you can flash onto a fresh install.

— nothing queued

Install + distribution

One command. Signed. Resumable.

3 / 1 / 0

✓

hal0 setup TUI

A terminal first-run wizard: pick a hardware-anchored tier, storage, extensions, and models, then it provisions a coherent set of slots — or run --auto for recommended defaults.

✓

Cosign-signed self-update

Atomic version swap with one-flag rollback. Stable + nightly channels, GitHub OIDC-verified release tarballs, manifest proxied at releases.hal0.dev.

✓

Extensions framework

Apps and agents (Open WebUI, ComfyUI, Hermes, Pi) packaged as auto-wired extensions: lifecycle-managed, selectable at setup.

◇

AUR PKGBUILD & Ubuntu PPA

Native distro packages on top of the install script: pacman and apt as first-class install paths.

— nothing queued

Hardware + observability

Probe before you load. A memory bar that tells the truth.

3 / 0 / 1

✓

UMA-aware probe

Detects iGPU, XDNA NPU, and the unified memory pool; surfaces fit warnings inline before you load a model that won't fit.

✓

Live GTT total + honest memory bar

The memory bar reports the live GTT total from the driver, not a stale cached probe — so the unified pool you see is the pool you have.

✓

NPU occupancy grid

A living per-slot occupancy view of the XDNA NPU that breathes with real activity instead of a static picker.

— nothing queued

◦

Multi-host federation

A slot mesh across LAN boxes: chat on the Strix Halo, embed on the workstation, all behind one /v1/* surface.

Agents + memory + MCP

A place for agents to live, not only to chat with.

4 / 0 / 1

✓

Bundled Hermes agent

Hermes installs and bootstraps on first run — sandboxed under its own user, prewired to the local /v1 API and MCP servers, with an agent-card library in the dashboard.

✓

Operator Board

A hal0-skinned kanban wired to Hermes (/api/board/*), with a live agent-chat drawer and working task creation.

✓

Hindsight graph memory

Opt-in memory engine (HAL0_MEMORY_ENABLED). Shared by default; the X-hal0-Agent header scopes a client's writes to its own namespace. Every destructive op — bank delete, memory/document/directive wipes — now records a durable, attributable audit row.

✓

MCP host + server

hal0 speaks Model Context Protocol both directions: an admin + memory MCP server, plus an allow-listed client that composes external MCP tools. Destructive calls gate through an approval bell + CLI.

— nothing queued

◦

ChatOps adapters

Slack and Matrix bridges as extensions, so you can talk to hal0 from the rooms you already live in.

Security posture

No bundled auth — open on the LAN by default; front it with your own proxy.

3 / 0 / 0

✓

Open by default, proxy at the edge

No bundled auth or TLS (ADR-0012): the API binds 0.0.0.0:8080 for the LAN. Expose it by putting your own reverse proxy — Traefik, nginx, or a Cloudflare Tunnel — in front.

✓

Bundled chat UI + voice

Open WebUI on :3001 prewired to the local API, whisper-v3:turbo STT (NPU, via FLM), and a one-switch Kokoro (CPU) ⇄ Qwen3-TTS (GPU) TTS engine served through the capability API.

✓

Hands-free voice via Open WebUI Call mode

Open WebUI's built-in Call mode wired to hal0's /v1/audio/transcriptions and /v1/audio/speech endpoints — whisper-v3:turbo for listen, Kokoro for reply. No native hal0 streaming orchestrator; the loop runs inside Open WebUI.

— nothing queued

※ Exploring items are bets we believe in, not promises. Scope and order will shift release to release — v0.9 is next — as we learn what people actually do with a local platform. open an issue ↗

/ get hal0

Stop running models
from a chat tab.

One command on a fresh Linux box. hal0 is pre-1.0 and moving fast — v0.9 is next — but it installs and runs the whole stack today.

install.sh

Linux x86_64 · Python ≥3.12

$ curl -fsSL https://hal0.dev/install.sh | bash

Read the docs → View on GitHub ↗

Apache-2.0 Linux + systemd no telemetry by default cosign-signed releases

Point your editor at your own box

Retrieval grounded on your data

Speech and images, one control plane

Not anotherllama-server wrapper.

Five providers,one /v1/* surface.

Strix Halo native.Not Strix-Halo-only.

Curated starting points.Tweak from there.

hal0 isn't an inference engine —it's the orchestration around one.

Chat, image gen, agents, and memory —one console.

An agent thatlives on the box.

One service for thewhole local AI stack.

Declarative Stacks, GPU voice, and audited agent memory —the v0.9 line is next.

Stop running modelsfrom a chat tab.

Not another
llama-server wrapper.

Five providers,
one /v1/* surface.

Strix Halo native.
Not Strix-Halo-only.

Curated starting points.
Tweak from there.

hal0 isn't an inference engine —
it's the orchestration around one.

Chat, image gen, agents, and memory —
one console.

An agent that
lives on the box.

One service for the
whole local AI stack.

Declarative Stacks, GPU voice, and audited agent memory —
the v0.9 line is next.

Stop running models
from a chat tab.