
Strix Halo

Strix Halo is the reference platform for hal0. It’s the box the core team develops against, the box the CI smoke tests target, and the box every performance number on this site was measured on. If you’re deciding whether a Ryzen AI Max-class APU is the right homelab AI machine, this page is the deepest answer hal0 has to give.


The /hardware page in the dashboard renders what hal0 probe wrote to /etc/hal0/hardware.json: iGPU, NPU, the unified pool, and the toolboxes the installer wired in.

AMD Strix Halo is the codename for AMD’s Ryzen AI Max family, high-end mobile/SFF APUs that combine a Zen 5 CPU, a large RDNA 3.5 integrated GPU branded Radeon 8060S, an XDNA-2 NPU, and a wide LPDDR5X memory controller on a single package. The headline target SKU is the Ryzen AI Max+ 395, paired with 128 GB of LPDDR5X-8000 unified system memory. Lower-tier 385 and 390 parts ship in 64 GB configurations and stay first-class for hal0. Most loadouts on this page work on a 64 GB box with tighter context budgets.

What makes Strix Halo interesting for inference isn’t any single component. It’s the combination:

  • One pool of memory. The CPU, the iGPU, and the NPU all address the same LPDDR5X. There is no PCIe transfer for weights, no host-to-device copy, no “is this tensor on the right device” cliff. When the iGPU loads a 42 GB Q4 70B, it’s reading from the same bytes the CPU sees.
  • The iGPU is real. Radeon 8060S is roughly RX 7600 class on RDNA 3.5. There’s enough compute to push small Q4 models into hundreds of tokens per second (the 0.5B CI smoke model on this page measures 217–413 tok/s), and enough memory bandwidth (LPDDR5X-8000 on a 256-bit bus, ~256 GB/s) to keep it fed.
  • The NPU is useful. The XDNA-2 block targets low-precision matmul at low idle power. It isn’t a primary chat engine. It’s good for always-on small models, wake-word detection, and audio pipelines, which frees the iGPU for the heavy lifting.
  • The carveout is configurable. Depending on BIOS, you can dedicate up to ~96 GB of the 128 GB pool to the iGPU as VRAM. Some configurations report ~110 GB addressable through GTT (the Graphics Translation Table, which maps system memory pages into the GPU’s address space) when the model exceeds the fixed carveout. Either way, the headroom dwarfs every consumer discrete card on the market.
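
The bandwidth behind the second bullet is plain arithmetic: bus width times transfer rate. The sketch below works it through and derives a rough decode ceiling from it, assuming a dense model streams every weight byte once per generated token. That ceiling ignores KV cache reads, compute limits, and driver overhead, so treat it as an upper bound for sanity checks, not a prediction. The function names are illustrative and are not part of hal0.

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_mts: int) -> float:
    """Theoretical peak memory bandwidth in GB/s (decimal gigabytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfer_rate_mts * 1e6 / 1e9


def decode_ceiling_tok_s(bandwidth_gbs: float, model_size_gb: float) -> float:
    """Upper bound on decode speed if every weight byte is read once per token.

    Assumption: dense model, weights only. Real throughput will be lower.
    """
    return bandwidth_gbs / model_size_gb


# LPDDR5X-8000 on a 256-bit bus, as described above.
bandwidth = peak_bandwidth_gbs(bus_width_bits=256, transfer_rate_mts=8000)
print(f"peak bandwidth: {bandwidth:.0f} GB/s")

# The 42 GB Q4 70B mentioned in the first bullet.
print(f"decode ceiling, 42 GB model: {decode_ceiling_tok_s(bandwidth, 42.0):.1f} tok/s")
```

The same arithmetic explains why MoE models with few active parameters are attractive on this platform: only the active experts’ weights are read per token, so the effective size in the denominator shrinks.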

The closest equivalent is a high-end Apple Silicon machine. The difference is that Strix Halo is x86_64, runs Linux natively, supports Vulkan and ROCm, and has an NPU you can program independently of the GPU.

hal0 is opinionated about hardware. The slot lifecycle, the dispatcher, the systemd template unit, and the hardware-aware probe were all designed with Strix Halo’s behaviour in mind:

  • Slot fit warnings size to the unified pool. When you load a 42 GB Q4 70B on a 128 GB box, the dashboard tells you exactly how much GTT headroom is left for KV cache, embed, and audio slots. The probe at /etc/hal0/hardware.json knows about UMA. It doesn’t pretend the iGPU has 8 GB “dedicated” VRAM and 120 GB of disqualified system RAM.
  • The memory bar accounts for other PVE tenants. Drop a read-only PVEAuditor token plus endpoint into /etc/hal0/proxmox.json and the dashboard adds a Proxmox host segment showing what the rest of the node is eating. No other local inference platform draws this picture, because no other local inference platform expects to share a node. The API redacts the token on read and in logs.
  • Three slot-providing backends on Strix Halo. llama.cpp Vulkan (the default for the iGPU, no ROCm headers required), llama.cpp ROCm (opt-in via the rocm toolbox), and FastFlowLM on the XDNA NPU (via the flm toolbox). NPU, iGPU, and CPU can each own different slots in the same hal0 instance.
  • primary and embed are co-resident by default. On a discrete GPU you fight for VRAM. On Strix Halo, the embed slot’s ~600 MB bge-m3 lives alongside a 19 GB Q4 30B-A3B coder and a 330 MB Kokoro TTS without anything paging out.

The unified pool, in one bar: System RAM in the LXC, GTT slice in use by hal0’s slots, Proxmox host pressure from other tenants, and free.
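
To make the four segments concrete, here is a minimal sketch of the arithmetic behind a bar like this one. It is not hal0’s implementation: `memory_bar` and its parameters are hypothetical names, the slot sizes are the ones quoted above, and the system RAM and Proxmox host figures are made-up example values. Only model weights are counted; KV cache grows on top of this.

```python
def memory_bar(total_gb: float,
               system_ram_gb: float,
               slot_sizes_gb: dict[str, float],
               proxmox_host_gb: float = 0.0) -> dict[str, float]:
    """Split one unified pool into the four segments the bar shows.

    Hypothetical helper: a negative "free" value means the pool is over-committed.
    """
    gtt_gb = sum(slot_sizes_gb.values())
    free_gb = total_gb - system_ram_gb - gtt_gb - proxmox_host_gb
    return {
        "system_ram": system_ram_gb,
        "gtt": gtt_gb,
        "proxmox_host": proxmox_host_gb,
        "free": free_gb,
    }


# Slot sizes from this page: 19 GB coder, ~600 MB bge-m3, ~330 MB Kokoro.
slots = {"primary": 19.0, "embed": 0.6, "tts": 0.33}

# system_ram_gb and proxmox_host_gb are illustrative values, not measurements.
bar = memory_bar(128.0, system_ram_gb=8.0, slot_sizes_gb=slots, proxmox_host_gb=16.0)
for segment, size_gb in bar.items():
    print(f"{segment:>12}: {size_gb:6.2f} GB")
```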

The numbers below are real measurements on the reference Strix Halo deployment. They are deliberately conservative; nothing is extrapolated. Models and configurations not listed here were not measured, so no numbers are given for them.

| Model | Quant | Throughput / latency | Notes |
| --- | --- | --- | --- |
| Phi-3 Mini | Q4 | 71 tok/s, 280 ms round-trip first message after warm | 2.39 GB HF download, ~10 s pull. |
| Qwen2.5-0.5B | Q4_K_M | 217–413 tok/s | The CI smoke model. The range reflects context length and prompt size. |
| primary + embed (concurrent) | | ~258 tok/s chat, <200 ms dispatch | Both slots warm. iGPU at ~9 GB GTT. Same single-flight dispatcher path as a single-slot test. |

All three rows verified on Ryzen AI Max iGPU + Vulkan.

Curated starting points sized to a 128 GB Strix Halo envelope. The 64 GB SKU runs every small and mid tier here; for large tiers, shorten context windows or drop to a Q4 30B-A3B instead of a Q4 70B.

All sizes are published GGUF file sizes verified on Hugging Face, May 2026. Mix and match. The slot system takes a different model per slot whenever you change your mind. The companion loadouts reference carries the same picks broken out per hardware tier.

Coding

  • Small (~5 GB) — primary: Qwen2.5-Coder-7B-Instruct-Q4_K_M. The best small dedicated coder until a Qwen3-Coder small variant ships.
  • Mid (~19 GB) — primary: Qwen3-Coder-30B-A3B-Instruct-Q4_K_M (~18.6 GB, MoE with only 3B active params, runs near 3B speeds and reasons like a 30B); embed: nomic-embed-text-v2-moe-Q4_K_M (~140 MB) for repo-aware search.
  • Large (~42 GB) — primary: Hermes-4-70B-Q4_K_M (~42.5 GB) for hybrid reasoning + tool-friendly coding. Alt: Llama-4-Scout-17B-16E-Instruct-Q4_K_M (~50 GB, MoE 17B active, 10M context). On 128 GB you can keep the 30B-A3B coder and the 70B reasoner hot in separate slots.

General chat

  • Small (~2.5 GB) — Qwen3-4B-Instruct-2507-Q4_K_M (Aug 2025 release, 256K native context). Snappy on any modern box.
  • Mid (~19 GB) — Qwen3-30B-A3B-Instruct-2507-Q4_K_M (MoE 3B active). Lighter alt: gemma-3-12b-it-Q4_K_M (~6.6 GB).
  • Large (~50 GB) — Llama-4-Scout-17B-16E-Instruct-Q4_K_M (MoE 17B active, 10M context). With this hot, 128 GB still leaves embed + STT/TTS room. 64 GB SKUs won’t fit this comfortably alongside audio.

Voice assistant

  • primary: Qwen3-4B-Instruct-2507-Q4_K_M (~2.5 GB) for low-latency replies.
  • stt: Moonshine base (~190 MB) via the moonshine toolbox. Built for edge real-time. Higher-accuracy alt: whisper-large-v3-turbo (~1.6 GB) or Canary-Qwen-2.5B (Open ASR Leaderboard SOTA, 5.63% WER).
  • tts: Kokoro-82M v1.0 (~330 MB, 8 languages, 54 voices) via the kokoro toolbox. Voice-cloning alt: F5-TTS.

The entire pipeline is ~3 GB. On 128 GB you leave the rest of the budget free for a big embed or a second chat model warm in another slot.

Creative writing

  • primary: Hermes-4-70B-Q4_K_M (~42.5 GB, Aug 2025, hybrid-mode reasoning + creative strength). Lighter alt: Hermes-4-14B-Q4_K_M (~9 GB, Qwen-3-14B base).

Privacy-first / minimal footprint (under 1 GB)

  • primary: gemma-3-1b-it-Q4_K_M (~0.7 GB). Text-only, March 2025.
  • embed: nomic-embed-text-v2-moe-Q4_K_M (~140 MB, multilingual MoE, 475M total params with 305M active).
  • Runs on CPU-only fallback boxes too. The smallest viable hal0 install.

Retrieval-augmented generation (RAG)

  • primary: Qwen3-30B-A3B-Instruct-2507-Q4_K_M (~18.6 GB) for synthesis.
  • embed: bge-m3 (~600 MB Q8): multilingual, multi-vector, 8192-token context, top retrieval R@1 in 2026 benchmarks.
  • The embed slot also serves rerank via /v1/rerankings. On 128 GB there is ample room left for KV cache, which allows long-context retrieval (64k+) without paging.

Agents and tool use

  • primary: Hermes-4-70B-Q4_K_M (~42.5 GB). Nous’s hybrid-reasoning model, explicitly tuned for tool-call faithfulness and format adherence. Lighter alt: Hermes-4-14B-Q4_K_M (~9 GB).
  • embed: bge-m3 or nomic-embed-text-v2-moe for retrieval-augmented routing.
  • Lines up with the v0.2 agents / MCP roadmap.

The biggest realistic single-model loadout that still fits a 128 GB Strix Halo with room to breathe:

  • Llama-4-Scout-17B-16E-Instruct-Q4_K_M (~50 GB, MoE 17B active, 10M context). The current best balance of size and capability.
  • Hermes-4-70B-Q8_0 (~75 GB). 70B at Q8 instead of Q4, trading size for quant headroom.
  • Mistral-Large-Instruct-2411-Q4_K_M (123B, ~73 GB). Older but still excellent for raw single-model quality.

Hard ceiling. Qwen3-235B-A22B-Instruct-2507-Q4_K_M (~142 GB) does not fit even on a 128 GB SKU. Llama-4-Maverick-Q4_K_M (~230 GB) and Mistral-Large-3-Q4 (675B / 41B active, ~340 GB) are well over the line. That’s where you start needing a multi-GPU rig or a bigger box.
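
The sizes in this section follow closely from parameter count and quantization, so you can estimate whether an unlisted model is worth downloading. The sketch below is a rough estimator, not a substitute for the published file size. Q8_0 stores 8.5 bits per weight by construction (32 eight-bit values plus one 16-bit scale per block); the 4.85 figure for Q4_K_M is an approximate average and varies by model. The parameter counts are the commonly published ones for these model families, and “fits” here only means the weights are smaller than the pool, before KV cache and everything else on the box.

```python
# Approximate bits per weight. Q8_0 is exact from its block layout;
# the Q4_K_M value is an assumed average and differs slightly per model.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5}


def estimate_gguf_gb(params_billion: float, quant: str) -> float:
    """Estimate GGUF weight size in GB from parameter count and quant type."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(params_billion: float, quant: str, pool_gb: float = 128.0) -> bool:
    """True if the weights alone are smaller than the pool (no KV cache counted)."""
    return estimate_gguf_gb(params_billion, quant) < pool_gb


candidates = [
    ("70B-class dense", 70.6, "Q4_K_M"),
    ("70B-class dense", 70.6, "Q8_0"),
    ("Mistral-Large-Instruct-2411", 123.0, "Q4_K_M"),
    ("Qwen3-235B-A22B", 235.0, "Q4_K_M"),
]

for name, params_b, quant in candidates:
    size_gb = estimate_gguf_gb(params_b, quant)
    verdict = "fits" if fits(params_b, quant) else "does not fit"
    print(f"{name} {quant}: ~{size_gb:.0f} GB, {verdict} in 128 GB")
```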

The standard one-liner takes care of most of this. The hardware probe picks Vulkan as the default backend on Strix Halo automatically. A few things to check if you’re tuning:

  1. Kernel. A current 6.x kernel. Strix Halo is recent enough that stale distro kernels miss platform fixes; CachyOS, Arch, Fedora 40+, and Ubuntu 24.04+ are known-good baselines.

  2. Mesa. A recent Mesa with RADV Vulkan is what the iGPU actually uses. RADV in Mesa is the supported path, not AMDVLK and not the proprietary stack.

  3. BIOS carveout. Set the UMA carveout in BIOS as high as your workload needs. ~32 GB is a good default if you’re not loading a 70B; bump to 96 GB if you are. GTT will page beyond the carveout when needed, but a generous carveout reduces fragmentation.

  4. Re-run probe after BIOS changes. hal0 probe rewrites /etc/hal0/hardware.json and the dashboard’s fit warnings re-read it on the next page load.
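
If you want to confirm what the kernel sees after a BIOS change, independently of hal0, the amdgpu driver exposes the carveout and GTT totals in sysfs as byte counts. The sketch below reads them. The `card0` path is an assumption (the card index can differ on your system), and the helper names are illustrative, not part of hal0.

```python
from pathlib import Path

GIB = 1024 ** 3


def bytes_to_gib(raw: str) -> float:
    """Convert a sysfs byte-count string such as '103079215104\\n' to GiB."""
    return int(raw.strip()) / GIB


def read_amdgpu_memory(device_dir: Path) -> dict[str, float]:
    """Read VRAM (carveout) and GTT totals in GiB; skips files that are absent."""
    totals = {}
    for name in ("mem_info_vram_total", "mem_info_gtt_total"):
        path = device_dir / name
        if path.is_file():
            totals[name] = bytes_to_gib(path.read_text())
    return totals


if __name__ == "__main__":
    # Assumption: the iGPU is card0. Check /sys/class/drm/ if nothing prints.
    for name, gib in read_amdgpu_memory(Path("/sys/class/drm/card0/device")).items():
        print(f"{name}: {gib:.1f} GiB")
```

A 96 GB carveout should show up as roughly 96 GiB under `mem_info_vram_total`; if the value has not changed after a reboot, the BIOS setting did not take.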

If your Strix Halo box is a Proxmox node and hal0 is one tenant among several (the canonical homelab layout), passthrough needs a privileged LXC with AppArmor unconfined, plus dev0 through dev3 device entries and cgroup allow entries for /dev/dri/*, /dev/kfd, and /dev/accel/accel0 (the NPU). The hal0 service user goes into the render and video groups inside the container. From there the installer behaves identically to a bare-metal run.

Unprivileged LXC won’t work for iGPU or NPU passthrough on Strix Halo — that shape is CPU-only territory.
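
Before running the installer inside the container, it can help to check that the passthrough actually landed. The sketch below looks for the device nodes named above and checks supplementary group membership. It assumes the service user is called `hal0` (the name used in the troubleshooting commands on this page) and that render nodes follow the usual `renderD*` naming; the helper names are illustrative.

```python
import glob
import grp

# Device nodes the passthrough is expected to expose inside the container.
DEVICE_PATTERNS = ("/dev/dri/renderD*", "/dev/kfd", "/dev/accel/accel0")


def missing_devices(patterns=DEVICE_PATTERNS) -> list[str]:
    """Return the patterns that match no path inside this container."""
    return [pattern for pattern in patterns if not glob.glob(pattern)]


def missing_groups(user: str, groups=("render", "video")) -> list[str]:
    """Return groups that do not exist or do not list `user` as a member.

    Note: gr_mem holds supplementary members only, so a user whose primary
    group is one of these would be reported as missing here.
    """
    missing = []
    for name in groups:
        try:
            members = grp.getgrnam(name).gr_mem
        except KeyError:
            missing.append(name)
            continue
        if user not in members:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("missing devices:", missing_devices())
    print("missing groups for hal0:", missing_groups("hal0"))
```

Two empty lists mean the container can see the iGPU, the compute interface, and the NPU, and the service user has the group access described above.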

Optional: drop a read-only PVEAuditor token plus endpoint into /etc/hal0/proxmox.json and the dashboard’s memory bar adds the Proxmox host segment so you can see what other tenants are doing to the unified pool. The token is redacted on read and in logs. Skip the file and the bar collapses to the LXC view.

The probe detects the XDNA NPU and writes it to /etc/hal0/hardware.json. The flm toolbox is published at ghcr.io/hal0ai/hal0-toolbox-flm with pinned digests, so the NPU is a live slot backend via the FastFlowLM provider. FLM has its own model tag namespace (see flm list -j for what’s available); arbitrary GGUFs don’t apply here. The dashboard’s model picker groups FLM-eligible models behind the FLM provider so the mismatch is hard to hit by accident.

hal0 probe should list radeonsi / RADV under the GPU section. If it doesn’t:

vulkaninfo --summary

Verify a Vulkan device shows up. If vulkaninfo is empty:

  • Install Mesa Vulkan drivers from your distro (mesa-vulkan-drivers on Debian/Ubuntu, vulkan-radeon on Arch).
  • Add your service user to the render group: usermod -aG render hal0 then systemctl restart hal0-api.
  • Check dmesg | grep amdgpu. If the kernel didn’t load amdgpu, inference won’t work either.
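
If you are checking several boxes, the same test can be scripted. The sketch below parses the `key = value` lines that `vulkaninfo --summary` prints under each `GPUn:` heading. The sample text is a trimmed, made-up illustration of that layout, not captured output from a Strix Halo machine, and the function names are not part of hal0.

```python
import re
import subprocess


def parse_vulkan_devices(summary: str) -> list[dict[str, str]]:
    """Collect the key = value fields listed under each 'GPUn:' heading."""
    devices: list[dict[str, str]] = []
    current = None
    for line in summary.splitlines():
        stripped = line.strip()
        if re.fullmatch(r"GPU\d+:", stripped):
            current = {}
            devices.append(current)
        elif current is not None and "=" in stripped:
            key, _, value = stripped.partition("=")
            current[key.strip()] = value.strip()
    return devices


def run_vulkaninfo() -> str:
    """Run the real tool; raises FileNotFoundError if vulkaninfo is not installed."""
    result = subprocess.run(["vulkaninfo", "--summary"],
                            capture_output=True, text=True, check=False)
    return result.stdout


# Illustrative sample only; real output carries more fields per device.
SAMPLE = """\
Devices:
========
GPU0:
    deviceType = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
    deviceName = AMD Radeon Graphics (RADV)
    driverName = radv
"""

for device in parse_vulkan_devices(SAMPLE):
    print(device.get("deviceName"), "via", device.get("driverName"))
```

On a live system, pass `run_vulkaninfo()` instead of `SAMPLE`. An empty list corresponds to the “vulkaninfo is empty” case above.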

Confirm the probe sees the NPU in /etc/hal0/hardware.json (an npu block with an accel/accel0 device). If it’s missing on a privileged LXC, double-check that the AppArmor profile is unconfined and that the container’s dev0 through dev3 lines cover /dev/accel/accel0 with cgroup allows. FLM’s own runtime probe (flm list -j inside the toolbox) will fail with a clear error if libxrt_coreutil.so.2 is missing from the image. If you’ve been pinned to a stale tag, pull the latest ghcr.io/hal0ai/hal0-toolbox-flm digest.

Memory fragmentation under heavy multi-slot use


GTT can fragment after long-running multi-slot sessions, especially if you’ve been swapping large models in and out. Symptoms: slot start times grow, throughput drops below baseline. Mitigations:

  • Restart the slot, not the host: hal0 slot restart primary.
  • If that doesn’t help, restart the API: systemctl restart hal0-api. Active slots stay up; the API process doesn’t own the model.
  • For chronic fragmentation, raise the BIOS carveout. A larger fixed pool fragments less than a small carveout paged through GTT.

A few common culprits:

  • Power state. If the iGPU is held in a low power state because the system is running a power-saver profile, throughput will be a fraction of what’s possible.
  • Wrong toolbox. The hardware probe should have picked the Vulkan toolbox automatically; double-check with hal0 slot list --json and look for provider.
  • Context length blow-up. A 64k context KV cache is dramatic on any platform. The dashboard logs cache size when a request goes long; use it.
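
To put a number on the context-length point, the KV cache for a standard transformer grows linearly with context: two tensors (K and V) per layer, each sized by the number of KV heads and the head dimension. The sketch below computes it for an unquantized fp16 cache. The shape values (80 layers, 8 KV heads, head dimension 128) are the ones commonly cited for Llama-3.1-70B-class models and are used here as an assumption; check your model’s metadata for the real figures.

```python
def kv_cache_bytes(n_layers: int,
                   n_kv_heads: int,
                   head_dim: int,
                   context_tokens: int,
                   bytes_per_element: int = 2) -> int:
    """KV cache size for one sequence: K and V for every layer and token.

    bytes_per_element=2 assumes an fp16 cache; quantized caches are smaller.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_element


GIB = 1024 ** 3

# Assumed 70B-class shapes with grouped-query attention.
for context in (8_192, 65_536):
    size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                          context_tokens=context)
    print(f"{context:>6} tokens: {size / GIB:.1f} GiB")
```

Under these assumptions a 64k context costs about 20 GiB on top of the weights, which is why a long request can change slot behaviour even when the model itself fits comfortably.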

The honest answer: it depends on what you’re running.

  • Big models. Strix Halo wins. A 4090 has 24 GB of VRAM; a 5090 has 32 GB. A 128 GB Strix Halo runs a 42 GB Q4 70B in iGPU memory natively, with room left over for embed and STT/TTS. The discrete cards run that same model only with painful CPU offload.
  • Small models, raw tok/s. Discrete GPUs win. A 4090 outpaces the 8060S on a 4B chat model: the whole model fits in VRAM, and the discrete card has more compute and substantially more memory bandwidth (~1 TB/s GDDR6X vs ~256 GB/s LPDDR5X).
  • Mid-range. Roughly a wash, with discrete cards edging ahead on raw throughput and Strix Halo winning on what’s possible in one box without juggling.
  • Concurrent slots. Strix Halo wins easily. Chat, embed, STT, and TTS hot at the same time, no VRAM contention, is the headline experience. Discrete GPUs in v1 typically run one slot at a time.
  • Total cost and power. Strix Halo wins. A single APU in a SFF chassis beats a 4090 desktop on idle power, fan noise, and whole-system cost.
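
The first two bullets can be reproduced with a simple two-tier bandwidth model: weights that fit in fast memory are read at that memory’s bandwidth, and whatever spills is read at the slower tier’s bandwidth. This is a back-of-envelope sketch, not a benchmark. It ignores compute, PCIe transfers, and KV cache. The 90 GB/s host figure is an assumption (roughly dual-channel DDR5-5600), and the function name is illustrative.

```python
from typing import Optional


def two_tier_decode_ceiling(model_gb: float,
                            fast_capacity_gb: float,
                            fast_bw_gbs: float,
                            slow_bw_gbs: Optional[float] = None) -> float:
    """Upper bound on tok/s when weights are split across two memory tiers."""
    in_fast = min(model_gb, fast_capacity_gb)
    spilled = model_gb - in_fast
    if spilled > 0 and slow_bw_gbs is None:
        raise ValueError("model does not fit and no slower tier was given")
    seconds_per_token = in_fast / fast_bw_gbs
    if spilled > 0:
        seconds_per_token += spilled / slow_bw_gbs
    return 1.0 / seconds_per_token


# 42 GB Q4 70B: unified memory vs a 24 GB card with CPU offload.
strix_big = two_tier_decode_ceiling(42.0, fast_capacity_gb=96.0, fast_bw_gbs=256.0)
gpu_big = two_tier_decode_ceiling(42.0, fast_capacity_gb=24.0, fast_bw_gbs=1000.0,
                                  slow_bw_gbs=90.0)  # assumed host RAM bandwidth

# ~2.5 GB 4B chat model: fits in fast memory on both.
strix_small = two_tier_decode_ceiling(2.5, fast_capacity_gb=96.0, fast_bw_gbs=256.0)
gpu_small = two_tier_decode_ceiling(2.5, fast_capacity_gb=24.0, fast_bw_gbs=1000.0)

print(f"42 GB model : Strix Halo {strix_big:.1f} vs 24 GB card {gpu_big:.1f} tok/s ceiling")
print(f"2.5 GB model: Strix Halo {strix_small:.0f} vs 24 GB card {gpu_small:.0f} tok/s ceiling")
```

The model is crude, but it shows the crossover: once a large share of the weights spills to host RAM, the slow tier dominates the time per token and the discrete card’s bandwidth advantage disappears.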

The conclusion most people land on: discrete GPUs are better inference accelerators; Strix Halo is the better homelab inference machine. hal0 is built around the latter.

If you do have a discrete GPU, hal0 runs there too. The NVIDIA and AMD discrete pages cover the install specifics.