# Python Components
Compose LLM, Whisper, TTS, and Server components in Python.
Use the Python SDK when you want to embed Trillim in an application instead of shelling out to the CLI.
If you want the deeper operational details behind the public surface, see Advanced SDK and Server Notes.
## Choose Sync or Async
Trillim exposes both patterns:

- `Runtime(...)` is the synchronous facade. It starts components on a background event loop and lets you call them from normal blocking Python.
- `LLM`, `STT`, and `TTS` can also be used directly as async components.
## Use Runtime for Synchronous Code
Runtime is the easiest entry point for scripts, workers, and sync web backends.
```python
from trillim import LLM, Runtime

with Runtime(LLM("Trillim/BitNet-TRNQ")) as runtime:
    reply = runtime.llm.chat(
        [{"role": "user", "content": "Explain local inference in one sentence."}]
    )
    print(reply)
```
`Runtime` exposes composed components by name:

- `runtime.llm`
- `runtime.stt`
- `runtime.tts`
It also syncifies async iterators and async context-managed session objects, so these patterns are valid:
```python
from trillim import LLM, Runtime
from trillim.components.llm import ChatTokenEvent, ChatDoneEvent

with Runtime(LLM("Trillim/BitNet-TRNQ")) as runtime:
    for event in runtime.llm.stream_chat(
        [{"role": "user", "content": "Write five words about CPUs."}]
    ):
        if isinstance(event, ChatTokenEvent):
            print(event.text, end="", flush=True)
        elif isinstance(event, ChatDoneEvent):
            print(f"\nused {event.usage.total_tokens} tokens")
```
## Use LLM Directly for Async Code
```python
import asyncio

from trillim import LLM

async def main():
    llm = LLM("Trillim/BitNet-TRNQ")
    await llm.start()
    try:
        reply = await llm.chat(
            [{"role": "user", "content": "Name two benefits of local models."}]
        )
        print(reply)
    finally:
        await llm.stop()

asyncio.run(main())
```
## LLM Constructor Rules
`LLM(...)` loads from a managed store ID, not a raw path:
```python
from trillim import LLM

llm = LLM("Trillim/BitNet-TRNQ")

adapter_llm = LLM(
    "Trillim/BitNet-TRNQ",
    lora_dir="Trillim/BitNet-GenZ-LoRA-TRNQ",
)
```
Useful constructor options:
| Option | Meaning |
|---|---|
| `num_threads` | Worker thread count; `0` uses Trillim defaults |
| `lora_dir` | Optional adapter store ID |
| `lora_quant` | Runtime quantization for LoRA layers |
| `unembed_quant` | Runtime quantization for the unembedding layer |
| `trust_remote_code` | Opt in to custom bundle tokenizer/config code |
| `harness_name` | `default` or `search` |
| `search_provider` | `ddgs` or `brave` |
| `search_token_budget` | Search-context budget; clamped at runtime to one quarter of the active model context window |
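The `search_token_budget` clamp amounts to simple arithmetic. This is an illustrative sketch of the documented rule; `clamp_search_budget` is a hypothetical helper name, not Trillim API:

```python
def clamp_search_budget(requested: int, max_context_tokens: int) -> int:
    """Clamp a requested search-context budget to a quarter of the context window."""
    return min(requested, max_context_tokens // 4)

# A 4096-token context window caps the search budget at 1024 tokens,
# while requests already under the cap pass through unchanged.
print(clamp_search_budget(10_000, 4096))  # → 1024
print(clamp_search_budget(500, 4096))     # → 500
```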
## One-Turn Calls vs Sessions
Use the one-turn helpers when you already have the full message list:
- `await llm.chat(messages)`
- `async for event in llm.stream_chat(messages): ...`
Use `open_session()` when you want multi-turn state:
```python
import asyncio

from trillim import LLM

async def main():
    llm = LLM("Trillim/BitNet-TRNQ")
    await llm.start()
    try:
        async with llm.open_session(
            [{"role": "system", "content": "Be concise."}]
        ) as session:
            session.add_user("Give me three uses for local AI.")
            print(await session.chat())
            session.add_user("Now shorten that to one sentence.")
            print(await session.chat())
    finally:
        await llm.stop()

asyncio.run(main())
```
Session rules that matter in real code:
- `ChatSession` is created by `LLM.open_session()`. You do not construct it yourself.
- A session is single-consumer. Do not iterate and mutate it concurrently.
- When a model swap begins, existing chat sessions become stale and raise `SessionStaleError`.
- A closed session raises `SessionClosedError` if reused.
- Very long-lived sessions can hit the lifetime token cap and raise `SessionExhaustedError`.
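A practical consequence of the staleness rule is rebuilding the session and retrying when a swap invalidates it mid-conversation. This is a minimal sketch of that retry shape using stand-ins; the `SessionStaleError` class and `FakeSession` here are defined locally for illustration and are not the SDK's objects:

```python
class SessionStaleError(Exception):
    """Stand-in for the error raised when a model swap invalidates a session."""

def chat_with_rebuild(open_session, prompt, max_attempts=2):
    """Retry a turn on a freshly opened session if the current one went stale."""
    for attempt in range(max_attempts):
        session = open_session()
        try:
            return session.chat(prompt)
        except SessionStaleError:
            if attempt == max_attempts - 1:
                raise

# Simulated session factory: the first session is stale (a swap just happened),
# the rebuilt one succeeds.
class FakeSession:
    instances = 0
    def __init__(self):
        FakeSession.instances += 1
        self.stale = FakeSession.instances == 1
    def chat(self, prompt):
        if self.stale:
            raise SessionStaleError()
        return f"ok: {prompt}"

print(chat_with_rebuild(FakeSession, "hello"))  # → ok: hello
```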
## Inspect the Active Runtime
`model_info()` is the truthful snapshot of the active runtime:
```python
import asyncio

from trillim import LLM

async def main():
    llm = LLM("Trillim/BitNet-TRNQ", lora_dir="Trillim/BitNet-GenZ-LoRA-TRNQ")
    await llm.start()
    try:
        info = llm.model_info()
        print(info.state)
        print(info.name)
        print(info.max_context_tokens)
        print(info.adapter_path)
        print(info.init_config)
    finally:
        await llm.stop()

asyncio.run(main())
```
## Enable Search
The search harness is for models that emit `<search>...</search>` tags.
```python
from trillim import LLM

llm = LLM(
    "Trillim/BitNet-TRNQ",
    lora_dir="Trillim/BitNet-Search-LoRA-TRNQ",
    harness_name="search",
    search_provider="ddgs",
)
```
Notes:
- Use `brave` only if `SEARCH_API_KEY` is set in the environment.
- Search is configured from the SDK or Python server composition, not the CLI.
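Picking a provider based on the environment rule above can be sketched as a one-line check; `pick_search_provider` is an illustrative helper, not part of the SDK:

```python
import os

def pick_search_provider(env=None):
    """Fall back to ddgs unless a Brave API key is actually configured."""
    env = os.environ if env is None else env
    return "brave" if env.get("SEARCH_API_KEY") else "ddgs"

print(pick_search_provider({"SEARCH_API_KEY": "sk-..."}))  # → brave
print(pick_search_provider({}))                            # → ddgs
```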
## Hot-Swap a Running Model
```python
import asyncio

from trillim import LLM

async def main():
    llm = LLM("Trillim/BitNet-TRNQ")
    await llm.start()
    try:
        info = await llm.swap_model(
            "Trillim/BitNet-TRNQ",
            lora_dir="Trillim/BitNet-GenZ-LoRA-TRNQ",
        )
        print(info.name)
        print(info.adapter_path)
    finally:
        await llm.stop()

asyncio.run(main())
```
Important hot-swap behavior:
- The component must already be running.
- Concurrent swap requests fail fast instead of queueing behind an in-flight preflight or handoff.
- Existing sessions become stale once swap handoff begins.
- `stop()` wins over in-flight startup, hot swap, and recovery restart work; if shutdown races with replacement-model preflight or handoff, Trillim discards that work and leaves the component `unavailable`.
- Omitted init-time options reset to Trillim defaults instead of inheriting the previous runtime.
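The fail-fast rule for concurrent swaps can be sketched with a non-blocking lock. This stand-in `Swapper` only illustrates the shape of the behavior; it is not Trillim internals, and `SwapInProgressError` is a name invented for this sketch:

```python
import asyncio

class SwapInProgressError(Exception):
    """Illustrative error for a swap rejected because another is in flight."""

class Swapper:
    def __init__(self):
        self._lock = asyncio.Lock()

    async def swap(self, model_id, delay=0.05):
        # Fail fast instead of queueing behind an in-flight preflight/handoff.
        if self._lock.locked():
            raise SwapInProgressError(model_id)
        async with self._lock:
            await asyncio.sleep(delay)  # stands in for preflight + handoff work
            return model_id

async def main():
    swapper = Swapper()
    first = asyncio.create_task(swapper.swap("Trillim/BitNet-TRNQ"))
    await asyncio.sleep(0)  # let the first swap take the lock
    try:
        await swapper.swap("Trillim/other")
    except SwapInProgressError:
        print("second swap rejected")
    print(await first)

asyncio.run(main())
```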
## Use STT

STT requires the `voice` extra.
```python
import asyncio

from trillim import STT

async def main():
    stt = STT()
    await stt.start()
    try:
        text = await stt.transcribe_file("sample.wav", language="en")
        print(text)
    finally:
        await stt.stop()

asyncio.run(main())
```
Public helpers:
- `await stt.transcribe_bytes(audio_bytes, language=None)`
- `await stt.transcribe_file(path, language=None)`
Practical notes:
- `language` is optional and must contain only letters and hyphens.
- `STT` processes one transcription at a time.
- Source files are copied into Trillim-owned temp storage before transcription.
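The `language` rule above amounts to a letters-and-hyphens check that you can apply before calling the SDK; `is_valid_language` is an illustrative helper, not SDK API:

```python
import re

_LANGUAGE_RE = re.compile(r"^[A-Za-z-]+$")

def is_valid_language(language: str) -> bool:
    """True if the tag contains only letters and hyphens, per the STT rule."""
    return bool(_LANGUAGE_RE.match(language))

print(is_valid_language("en"))     # → True
print(is_valid_language("pt-BR"))  # → True
print(is_valid_language("en_US"))  # → False (underscore)
```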
## Use TTS

TTS also requires the `voice` extra.
```python
import asyncio
from pathlib import Path

from trillim import TTS

async def main():
    tts = TTS(default_voice="alba", speed=1.0)
    await tts.start()
    try:
        print(await tts.list_voices())

        wav_bytes = await tts.synthesize_wav("Hello from Trillim.")
        Path("speech.wav").write_bytes(wav_bytes)

        async with await tts.speak("Streaming speech") as session:
            pcm = await session.collect()
            print(len(pcm))
    finally:
        await tts.stop()

asyncio.run(main())
```
Public helpers:
- `await tts.list_voices()`
- `await tts.register_voice(name, audio)`
- `await tts.delete_voice(name)`
- `await tts.synthesize_stream(text, voice=None, speed=None)`
- `await tts.synthesize_wav(text, voice=None, speed=None)`
- `await tts.speak(text, voice=None, speed=None)`
`audio` for `register_voice()` can be:

- `bytes`
- `str`
- `Path`
Custom voice names and voice selectors accept only ASCII letters and digits.
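That naming rule can be checked up front before registering a voice; `is_valid_voice_name` is a hypothetical helper shown for illustration, not SDK API:

```python
def is_valid_voice_name(name: str) -> bool:
    """True if the name is non-empty and contains only ASCII letters and digits."""
    return bool(name) and name.isascii() and name.isalnum()

print(is_valid_voice_name("alba2"))     # → True
print(is_valid_voice_name("my-voice"))  # → False (hyphen)
print(is_valid_voice_name("voz-ñ"))     # → False (non-ASCII)
```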
Session rules that matter:
- `TTSSession` is created by `TTS.speak()`. You do not construct it directly.
- A session is single-consumer. `await session.collect()` and `async for chunk in session` are mutually exclusive.
- Only one live TTS session is allowed at a time. A second request raises `AdmissionRejectedError`.
- You can `pause()`, `resume()`, `cancel()`, and `set_speed()` on a live session.
## Public Error Types You Will See First
- `InvalidRequestError`: input validation failed before work started
- `AdmissionRejectedError`: the component is busy or draining
- `ContextOverflowError`: the rendered prompt exceeded the active model context window
- `ProgressTimeoutError`: an operation stopped making required progress
- `SessionBusyError`: a session already has an active consumer
- `SessionClosedError`: a closed session was reused
- `SessionStaleError`: an LLM session was invalidated by model swap
- `SessionExhaustedError`: an LLM session exceeded its lifetime token quota
## When to Switch to the Server API
Use the SDK when you want direct control from Python. Switch to the server docs when you need:
- HTTP access from another process or machine
- OpenAI client compatibility
- a health endpoint
- raw-body voice routes
- more operational detail in one place via Advanced SDK and Server Notes