API Server

OpenAI-compatible server endpoints, streaming, and voice setup.

Trillim exposes a small FastAPI server for local HTTP access.

If you want the deeper operational semantics behind the server surface, see Advanced SDK and Server Notes.

There are two ways to run it:

  • trillim serve Trillim/<name> for the built-in demo server
  • Server(...) in Python when you need more control

CLI Server vs Python Server

The distinction matters:

  Need                         trillim serve    Server(...)
  quick local server           yes              yes
  fixed 127.0.0.1:8000 bind    yes              configurable
  custom host/port             no               yes
  /v1/models/swap              no               yes, with allow_hot_swap=True
  custom search setup          no               yes

CLI

trillim serve Trillim/BitNet-TRNQ

With voice routes:

trillim serve Trillim/BitNet-TRNQ --voice

Python

from trillim import LLM, Server

# Compose the server around a loaded model.
# allow_hot_swap=True enables the /v1/models/swap route.
server = Server(
    LLM("Trillim/BitNet-TRNQ"),
    allow_hot_swap=True,
)
server.run(host="127.0.0.1", port=8000)

Endpoints

  Route                       Method    Purpose
  /healthz                    GET       readiness and component health
  /v1/models                  GET       active model metadata
  /v1/chat/completions        POST      OpenAI-compatible chat completions
  /v1/models/swap             POST      optional hot-swap route
  /v1/audio/transcriptions    POST      optional STT route
  /v1/audio/speech            POST      optional TTS route
  /v1/voices                  GET       optional voice list
  /v1/voices                  POST      optional custom voice upload
  /v1/voices/{voice_name}     DELETE    optional custom voice deletion

There is no /v1/completions route in this implementation.

GET /healthz

Returns 200 when all composed components are healthy:

{"status": "ok"}

If an LLM component is not in the running state, the server returns 503 and includes the component state:

{
  "status": "degraded",
  "components": {
    "llm": {
      "state": "swapping"
    }
  }
}
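A readiness probe can be scripted against this payload with the standard library alone. A minimal sketch; is_healthy and check_server are illustrative helper names, not part of the Trillim API:

```python
import json
import urllib.request


def is_healthy(payload: dict) -> bool:
    """Interpret a /healthz body: only an explicit "ok" status counts as healthy."""
    return payload.get("status") == "ok"


def check_server(base_url: str = "http://127.0.0.1:8000") -> bool:
    """Return True if the server answers /healthz with status "ok"."""
    try:
        with urllib.request.urlopen(base_url + "/healthz", timeout=5) as response:
            return is_healthy(json.loads(response.read().decode("utf-8")))
    except OSError:
        # Covers connection refused, timeouts, and HTTP errors such as 503,
        # since urllib's URLError and HTTPError both subclass OSError.
        return False
```

Treating any non-200 answer as unhealthy matches the degraded example above, where the body still arrives but the status is 503.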

GET /v1/models

Returns truthful metadata for the active runtime:

{
  "object": "list",
  "state": "running",
  "data": [
    {
      "id": "BitNet-TRNQ",
      "object": "model",
      "path": "/Users/you/.trillim/models/Trillim/BitNet-TRNQ",
      "max_context_tokens": 4096,
      "trust_remote_code": false,
      "adapter_path": null,
      "init_config": {
        "num_threads": 0,
        "lora_quant": null,
        "unembed_quant": null
      }
    }
  ]
}
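Clients can use this metadata to size their requests, for example deriving a max_tokens value from the reported context window. A sketch assuming the payload shape shown above; pick_max_tokens is an illustrative name, and it ignores prompt length for simplicity:

```python
def pick_max_tokens(models_payload: dict, request_cap: int = 8192) -> int:
    """Choose a max_tokens value that fits both the chat route's documented
    cap and the active model's context window, given a /v1/models body."""
    model = models_payload["data"][0]
    return min(request_cap, model["max_context_tokens"])
```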

POST /v1/chat/completions

This is the OpenAI-compatible chat route.

Minimal request:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "content-type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Give me one sentence about local inference."}
    ]
  }'

Minimal Python client example:

import json
import urllib.request

# Build an OpenAI-style chat request body.
body = json.dumps(
    {
        "model": "BitNet-TRNQ",
        "messages": [{"role": "user", "content": "Say hello."}],
    }
).encode("utf-8")
request = urllib.request.Request(
    "http://127.0.0.1:8000/v1/chat/completions",
    data=body,
    headers={"content-type": "application/json"},
    method="POST",
)
# Local generation can take a while, so give the call a generous timeout.
with urllib.request.urlopen(request, timeout=60) as response:
    payload = json.loads(response.read().decode("utf-8"))

print(payload["choices"][0]["message"]["content"])

Supported request fields:

  Field                   Type      Meaning
  messages                array     required message list
  model                   string    optional; if present, it must match the active model name
  stream                  bool      enable SSE streaming
  temperature             float     0.0 to 2.0
  top_k                   int       1 to 200
  top_p                   float     > 0.0 and <= 1.0
  repetition_penalty      float     > 0.0 and <= 2.0
  rep_penalty_lookback    int       >= 0
  max_tokens              int       1 to 8192

Notes:

  • Typical clients should send only system, user, and assistant roles.
  • When the LLM is using the search harness, the OpenAI route still streams assistant text only. Internal search progress is not exposed on this endpoint.
  • Requests larger than the JSON body cap are rejected before processing.
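The documented bounds can be checked client-side before a request is sent, turning a 400 into a local error message. A sketch; validate_params is a hypothetical helper that simply mirrors the table:

```python
def validate_params(params: dict) -> list:
    """Return human-readable violations of the documented sampling bounds.

    An empty list means every field that is present is in range; absent
    fields are left to the server's defaults.
    """
    bounds = {
        "temperature": (lambda v: 0.0 <= v <= 2.0, "0.0 to 2.0"),
        "top_k": (lambda v: 1 <= v <= 200, "1 to 200"),
        "top_p": (lambda v: 0.0 < v <= 1.0, "> 0.0 and <= 1.0"),
        "repetition_penalty": (lambda v: 0.0 < v <= 2.0, "> 0.0 and <= 2.0"),
        "rep_penalty_lookback": (lambda v: v >= 0, ">= 0"),
        "max_tokens": (lambda v: 1 <= v <= 8192, "1 to 8192"),
    }
    errors = []
    for field, (in_range, rule) in bounds.items():
        if field in params and not in_range(params[field]):
            errors.append(f"{field} must be {rule}")
    return errors
```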

Streaming

Set "stream": true to receive server-sent events:

curl -N http://127.0.0.1:8000/v1/chat/completions \
  -H "content-type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hi."}],
    "stream": true
  }'

The stream follows OpenAI-style chat chunks and ends with:

data: [DONE]
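The event stream can be consumed without an SSE library, since each chunk arrives as a data: line. A sketch that parses OpenAI-style chunks; iter_deltas is an illustrative name, not part of the Trillim API:

```python
import json


def iter_deltas(lines):
    """Yield assistant text from OpenAI-style SSE lines, stopping at [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]
```

With urllib, the same parser applies to a streaming response by feeding it decoded lines, e.g. `iter_deltas(line.decode("utf-8") for line in response)`.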

POST /v1/models/swap

This route exists only when the server was created with allow_hot_swap=True.

Example:

curl http://127.0.0.1:8000/v1/models/swap \
  -H "content-type: application/json" \
  -d '{
    "model_dir": "Trillim/BitNet-TRNQ",
    "lora_dir": "Trillim/BitNet-GenZ-LoRA-TRNQ"
  }'

Supported request fields:

  Field                  Type      Meaning
  model_dir              string    required store ID for the next base model
  num_threads            int       optional runtime worker thread count
  lora_dir               string    optional adapter store ID
  lora_quant             string    optional LoRA runtime quantization
  unembed_quant          string    optional unembedding quantization
  harness_name           string    default or search
  search_provider        string    ddgs or brave
  search_token_budget    int       requested search-context budget

Important behavior:

  • Omitted init-time fields reset to Trillim defaults. They do not inherit the previous runtime’s values.
  • The effective search token budget is clamped to one quarter of the active model context window.
  • Existing chat sessions become stale once swap handoff begins.
  • search_provider: "brave" requires SEARCH_API_KEY in the server environment.
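The budget clamp is simple arithmetic and can be anticipated client-side. A sketch; effective_search_budget is an illustrative name, not a Trillim API:

```python
def effective_search_budget(requested: int, max_context_tokens: int) -> int:
    """Clamp a requested search token budget to one quarter of the active
    model's context window, as described for the swap route."""
    return min(requested, max_context_tokens // 4)
```

For example, requesting 2048 search tokens against a 4096-token context yields an effective budget of 1024.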

Voice Routes

Voice routes exist only when the server includes STT() and TTS(). The CLI adds them with --voice.

Install the extra first:

uv add "trillim[voice]"

POST /v1/audio/transcriptions

This is a raw-body route. Send audio bytes directly and set content-type to audio/* or application/octet-stream.

curl "http://127.0.0.1:8000/v1/audio/transcriptions?language=en" \
  -H "content-type: audio/wav" \
  --data-binary @recording.wav

Response:

{"text": "transcribed text"}

Key facts:

  • max upload size: 64 MiB
  • language is optional
  • only one STT request is processed at a time
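A raw-body upload needs no multipart handling, so the standard library suffices. A sketch that also enforces the 64 MiB cap locally; transcribe is an illustrative helper, not part of the Trillim API:

```python
import json
import urllib.request
from typing import Optional

MAX_UPLOAD_BYTES = 64 * 1024 * 1024  # documented 64 MiB cap


def transcribe(path: str, language: Optional[str] = None,
               base_url: str = "http://127.0.0.1:8000") -> str:
    """POST raw audio bytes to /v1/audio/transcriptions and return the text."""
    with open(path, "rb") as f:
        audio = f.read()
    if len(audio) > MAX_UPLOAD_BYTES:
        raise ValueError("audio exceeds the 64 MiB upload cap")
    url = base_url + "/v1/audio/transcriptions"
    if language:
        url += f"?language={language}"
    request = urllib.request.Request(
        url,
        data=audio,
        headers={"content-type": "audio/wav"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read().decode("utf-8"))["text"]
```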

POST /v1/audio/speech

This is also a raw-body route. Send UTF-8 text as the body.

curl -N http://127.0.0.1:8000/v1/audio/speech \
  -H "voice: alba" \
  -H "speed: 1.0" \
  --data-binary "Hello from Trillim."

Response format:

  • event: audio with base64-encoded PCM chunks
  • event: done when synthesis is complete
  • event: error if the session fails after streaming has started

Important facts:

  • the HTTP response is SSE, not WAV
  • PCM is 24 kHz, mono, 16-bit
  • max text body size: 6 MiB
  • only one live TTS session is allowed at a time

If you want a ready-to-write WAV payload in Python, prefer await tts.synthesize_wav(...) from the SDK.
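If you do consume the SSE route directly, the decoded PCM chunks can be wrapped into a WAV container with the standard library. A sketch matching the documented 24 kHz, mono, 16-bit format; function names are illustrative:

```python
import base64
import io
import wave


def decode_audio_event(data: str) -> bytes:
    """Decode one base64 payload carried by an `event: audio` SSE message."""
    return base64.b64decode(data)


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()
```

Concatenate the decoded chunks in arrival order, then wrap the whole buffer once synthesis reports `event: done`.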

GET /v1/voices

Lists built-in and custom voice names:

curl http://127.0.0.1:8000/v1/voices

POST /v1/voices

Register a custom voice by sending raw audio bytes and a name header:

curl http://127.0.0.1:8000/v1/voices \
  -H "name: myvoice" \
  --data-binary @voice-sample.wav

Response:

{"name": "myvoice", "status": "created"}

Important facts:

  • max upload size: 10 MiB
  • max serialized custom voice state: 64 MiB
  • custom voice names and voice selectors accept only ASCII letters and digits
  • custom voice storage lives under ~/.trillim/voices
  • voice cloning support requires accepting the kyutai/pocket-tts terms and authenticating with Hugging Face
  • if a reference sample exceeds the serialized voice-state cap, Trillim rejects it and you should retry with a shorter sample
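The name constraint is easy to check before uploading. A sketch; is_valid_voice_name is an illustrative helper mirroring the ASCII-letters-and-digits rule:

```python
def is_valid_voice_name(name: str) -> bool:
    """Accept only non-empty names made of ASCII letters and digits.

    str.isalnum() alone is not enough, because it also accepts
    non-ASCII alphanumerics, so the ASCII check comes first.
    """
    return name != "" and name.isascii() and name.isalnum()
```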

One-time setup for custom voice cloning:

hf auth login

If hf is not installed globally, uvx hf auth login works as well.

DELETE /v1/voices/{voice_name}

Delete a custom voice:

curl -X DELETE http://127.0.0.1:8000/v1/voices/myvoice

Error Codes

The server maps the public failure modes consistently:

  Status    Meaning
  400       invalid input, bad JSON, model mismatch, or validation failure
  409       session conflict such as closed, stale, or exhausted chat state
  429       component busy or not admitting more work
  503       startup failure, worker failure, or other service-side error
  504       progress timeout
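HTTP clients can branch on these codes to decide whether a retry makes sense. A sketch; treating 429, 503, and 504 as transient is a judgment call, and should_retry is an illustrative name:

```python
import urllib.error

# Statuses worth retrying: busy (429), service-side failure (503),
# progress timeout (504). 400 and 409 signal a bad request or stale
# session and will not succeed on retry.
RETRYABLE = {429, 503, 504}


def should_retry(error: urllib.error.HTTPError) -> bool:
    """Return True for statuses the error table marks as transient."""
    return error.code in RETRYABLE
```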

Use the SDK if you want the native Python exception types instead of HTTP status codes.