Extensions
Request pipeline hooks — caching, rate limiting, usage tracking, budget enforcement, and logging.
Extensions add functionality to the request pipeline via hooks. They run in-handler (not as middleware), giving direct access to typed request and response data.
Available extensions
Cache
Caches non-streaming chat completion responses. Cache key is a SHA-256 hash of the serialized request body.
[extensions.cache]
ttl_seconds = 3600 # default: 300 (5 minutes)Admin route: DELETE /v1/cache — clears all cached entries.
Rate limit
Enforces per-key request and token rate limits using a per-minute sliding window.
[extensions.rate_limit]
requests_per_minute = 60 # required
tokens_per_minute = 100000 # optionalReturns HTTP 429 when limits are exceeded. Token counting uses actual usage from provider responses (both streaming and non-streaming).
Usage tracker
Accumulates prompt and completion token counts per key and model.
[extensions.usage]No configuration needed. Admin route: GET /v1/usage — returns JSON array of usage entries with key, model, prompt_tokens, and completion_tokens.
Budget
Enforces per-key spend limits. Requires pricing to be configured for the models in use.
[extensions.budget]
default_budget = 10.00 # USD, required
[extensions.budget.keys.team-a]
budget = 50.00 # USD override for this keyReturns HTTP 429 when a key's spend exceeds its budget. Admin route: GET /v1/budget — returns JSON array with key, spent_usd, budget_usd, and remaining_usd.
Logging
Structured request logging via the tracing framework.
[extensions.logging]
level = "info"Logs completed requests (model, provider, key, latency, token counts) and errors.
Hook pipeline
Extensions run in config order at these points:
- on_request — before provider dispatch. Can short-circuit (rate limit, budget).
- on_cache_lookup — before provider dispatch for non-streaming. Returns cached response if available.
- on_response — after successful non-streaming response.
- on_chunk — for each SSE chunk during streaming.
- on_error — when a provider call fails.
Combining extensions
Multiple extensions can be enabled simultaneously:
[extensions.logging]
level = "info"
[extensions.rate_limit]
requests_per_minute = 100
[extensions.usage]
[extensions.cache]
ttl_seconds = 600
[extensions.budget]
default_budget = 100.00All extensions share the same storage backend.