Streaming

hal0 streams /v1/chat/completions and /v1/completions responses as Server-Sent Events (SSE), matching the OpenAI streaming protocol. Any OpenAI SDK that handles streaming today works against hal0 unmodified.

Enable streaming

Add "stream": true to the request body:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "primary",
    "stream": true,
    "messages": [
      {"role": "user", "content": "Count to five."}
    ]
  }'

Wire format

Each chunk is a single data: … line carrying a JSON object, followed by a blank line. The stream ends with the sentinel data: [DONE], which is not JSON. This is the same shape OpenAI ships.

data: {"id":"...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}

data: {"id":"...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"One"}}]}

data: {"id":"...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":", two"}}]}

data: [DONE]
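
To make the framing concrete, here is a minimal sketch of a parser for this format, assuming each event carries exactly one data: line as in the sample above. The function name iter_sse_chunks is illustrative and is not part of hal0 or the OpenAI SDK.

```python
import json
from typing import Iterable, Iterator


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON chunks from SSE lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        # Blank lines separate events; lines without a data: prefix are skipped.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            # The sentinel is not JSON, so stop before trying to decode it.
            return
        yield json.loads(payload)


# Sample lines mirroring the wire format shown above (ids omitted for brevity).
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"One"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":", two"}}]}',
    "",
    "data: [DONE]",
]

text = "".join(
    chunk["choices"][0]["delta"].get("content", "")
    for chunk in iter_sse_chunks(sample)
)
print(text)  # prints: One, two
```

In practice an OpenAI SDK does this parsing for you; a hand-rolled parser like this is mainly useful for debugging raw responses.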

Python SDK

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

stream = client.chat.completions.create(
    model="primary",
    stream=True,
    messages=[{"role": "user", "content": "Count to five."}],
)

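for chunk in stream:
    # Guard against chunks that carry no choices before reading the delta.
    if chunk.choices:
        # delta.content can be None (for example on the role-only chunk).
        print(chunk.choices[0].delta.content or "", end="", flush=True)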

What hal0 adds on top

Streaming flows through the same dispatcher that handles non-streaming requests, so you get:

Single-flight prefetch. If two clients open identical streams on a cold slot, the slot fires one upstream call and fans the token stream to both.
Adaptive cold-boot. The first request after a slot reaches ready keeps the connection open while the model finishes warming; you don’t get a 503 on a request that’s about to work.
Structured errors mid-stream. If the slot transitions to error part-way through, the stream emits one final SSE event with the structured error envelope before closing.
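
A client that reads the raw stream may want to surface the mid-stream error instead of silently ending. The sketch below shows one way to do that. It rests on an assumption this page does not confirm: that the final error event is a data: line whose JSON has a top-level "error" key, as in OpenAI-style error envelopes. StreamError, iter_content, and the sample error message are all hypothetical; check the actual envelope hal0 emits before relying on this.

```python
import json
from typing import Iterable, Iterator


class StreamError(RuntimeError):
    """Raised when the stream delivers an error envelope (hypothetical helper)."""

    def __init__(self, envelope):
        super().__init__(str(envelope))
        self.envelope = envelope


def iter_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield content pieces; raise StreamError if an error event arrives."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        event = json.loads(payload)
        # ASSUMPTION: the error envelope sits under a top-level "error" key.
        if "error" in event:
            raise StreamError(event["error"])
        for choice in event.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                yield piece


# Simulated stream: one normal chunk, then a made-up error event.
sample = [
    'data: {"choices":[{"index":0,"delta":{"content":"One"}}]}',
    "",
    'data: {"error":{"message":"slot entered error state"}}',
]

received = []
try:
    for piece in iter_content(sample):
        received.append(piece)
except StreamError as exc:
    print("partial output:", "".join(received))
    print("stream failed:", exc.envelope)
```

Raising an exception keeps whatever text arrived before the failure available to the caller, which can then decide whether to retry or show the partial output.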

The same SSE wire format works across the LAN. Point an OpenWebUI instance on your laptop or an MCP server on another box at http://hal0.lan:8080/v1, and streaming behaves identically through your Traefik vhost.