Extensions

Request pipeline hooks — caching, rate limiting, usage tracking, budget enforcement, and logging.

Extensions add functionality to the request pipeline via hooks. They run in-handler (not as middleware), giving direct access to typed request and response data.

Available extensions

Cache

Caches non-streaming chat completion responses. Cache key is a SHA-256 hash of the serialized request body.

[extensions.cache]
ttl_seconds = 3600           # default: 300 (5 minutes)

Admin route: DELETE /v1/cache — clears all cached entries.

Rate limit

Enforces per-key request and token rate limits using a per-minute sliding window.

[extensions.rate_limit]
requests_per_minute = 60      # required
tokens_per_minute = 100000    # optional

Returns HTTP 429 when limits are exceeded. Token counting uses actual usage from provider responses (both streaming and non-streaming).

Usage tracker

Accumulates prompt and completion token counts per key and model.

[extensions.usage]

No configuration needed. Admin route: GET /v1/usage — returns JSON array of usage entries with key, model, prompt_tokens, and completion_tokens.

Budget

Enforces per-key spend limits. Requires pricing to be configured for the models in use.

[extensions.budget]
default_budget = 10.00        # USD, required

[extensions.budget.keys.team-a]
budget = 50.00                # USD override for this key

Returns HTTP 429 when a key's spend exceeds its budget. Admin route: GET /v1/budget — returns JSON array with key, spent_usd, budget_usd, and remaining_usd.

Logging

Structured request logging via the tracing framework.

[extensions.logging]
level = "info"

Logs completed requests (model, provider, key, latency, token counts) and errors.

Hook pipeline

Extensions run in config order at these points:

on_request — before provider dispatch. Can short-circuit (rate limit, budget).
on_cache_lookup — before provider dispatch for non-streaming. Returns cached response if available.
on_response — after successful non-streaming response.
on_chunk — for each SSE chunk during streaming.
on_error — when a provider call fails.

Combining extensions

Multiple extensions can be enabled simultaneously:

[extensions.logging]
level = "info"

[extensions.rate_limit]
requests_per_minute = 100

[extensions.usage]

[extensions.cache]
ttl_seconds = 600

[extensions.budget]
default_budget = 100.00

All extensions share the same storage backend.

Extensions

On this page