- Resources: Define what you’re limiting (like model inferences or tokens) and the time window (per second, hour, day, week, or month). For example, “1000 model inferences per day” or “500,000 tokens per hour”.
- Priority: Control which rules take precedence when multiple rules could apply to the same request. Higher priority numbers override lower ones.
- Scope: Determine which requests the rule applies to. You can set global limits for all requests, or targeted limits using custom tags like user IDs.
## Learn rate limiting concepts
Let’s start with a brief tutorial on the concepts behind custom rate limits in TensorZero. You can define custom rate limiting rules in your TensorZero configuration using `[[rate_limiting.rules]]`. Your configuration can have multiple rules. Rate limit state is stored in Postgres, so restarting the gateway preserves existing limits, and multiple gateway instances automatically share the same limits. Tracking begins when a rate limit rule is first applied to a request: requests made before a rule was configured do not count towards its limit. Modifying a rate limit rule resets its usage.
### Resources
Each rate limiting rule can have one or more resource limits. A resource limit is defined using the `RESOURCE_per_WINDOW` syntax. For example:

tensorzero.toml
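The original example is elided in the source; the following sketch applies the `RESOURCE_per_WINDOW` syntax to the resource limits mentioned earlier (the specific values are illustrative):

```toml
# One rule with two resource limits, using the RESOURCE_per_WINDOW syntax
[[rate_limiting.rules]]
priority = 1
model_inferences_per_day = 1000
tokens_per_hour = 500_000
```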
Windows are tracked from the first time a limit is used: if a `RESOURCE_per_minute` limit is first used at 10:30:15, it’ll be refilled at 10:31:15, 10:32:15, and so on. You must specify `max_tokens` for a request if a token limit applies to it. The gateway makes a reasonably conservative estimate of token usage up front and later records the actual usage.

### Scope
Each rate limiting rule can optionally have a scope. The scope restricts the rule to certain requests only. If you don’t specify a scope, the rule applies to all requests.

#### By tags
At the moment, only scoping by user-defined tags is supported. You can limit the scope to a specific value, to each individual value (`tensorzero::each`), or to every value collectively (`tensorzero::total`). For example, the following rule would only apply to inference requests with the tag `user_id` set to `intern`:

tensorzero.toml
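The original snippet is elided; a sketch of such a rule, assuming scope entries are expressed as `tag_key`/`tag_value` pairs (the exact field layout is an assumption based on the terminology in this guide):

```toml
# Only applies to requests tagged user_id = "intern"
[[rate_limiting.rules]]
priority = 1
model_inferences_per_minute = 10
scope = [
  { tag_key = "user_id", tag_value = "intern" },
]
```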
Similarly, the following rule would only apply to inference requests with the tag `user_id` set to `intern` and the tag `env` set to `production`:

tensorzero.toml
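As a sketch (again assuming `tag_key`/`tag_value` scope entries), a rule scoped to both tags might look like:

```toml
# Only applies to requests tagged user_id = "intern" AND env = "production"
[[rate_limiting.rules]]
priority = 1
model_inferences_per_minute = 10
scope = [
  { tag_key = "user_id", tag_value = "intern" },
  { tag_key = "env", tag_value = "production" },
]
```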
Tags support two special strings for `tag_value`:
- `tensorzero::each`: The rule applies independently to each value of `tag_key`.
- `tensorzero::total`: The limits are summed across all values of the tag.
For example, the following rule applies to every `user_id` tag value individually (i.e. each user gets their own limit):

tensorzero.toml
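The original example is elided; a sketch of a per-user rule using the `tensorzero::each` special string (scope field layout is an assumption):

```toml
# Each distinct user_id value gets its own independent limit
[[rate_limiting.rules]]
priority = 1
model_inferences_per_minute = 10
scope = [
  { tag_key = "user_id", tag_value = "tensorzero::each" },
]
```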
The following rule instead applies to all `user_id` values collectively (i.e. the limit is shared across all users):

tensorzero.toml
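A sketch of a shared limit using the `tensorzero::total` special string (scope field layout is an assumption):

```toml
# All user_id values share one collective limit
[[rate_limiting.rules]]
priority = 1
model_inferences_per_minute = 10
scope = [
  { tag_key = "user_id", tag_value = "tensorzero::total" },
]
```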
The rule above does not apply to requests that do not specify any `user_id` value.

### Priority
Each rate limiting rule must have a priority (e.g. `priority = 1`). The gateway iterates through the rules in order of priority, starting with the highest priority, until it finds a matching rate limit; once it does, it enforces all rules with that priority number and disregards any rules with lower priority. For example, the configuration below would enforce the first rule for requests with `user_id = "intern"` and the second rule for all other `user_id` values:

tensorzero.toml
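The original configuration is elided; the sketch below is one that matches the described behavior (limits, and the scope field layout, are illustrative assumptions):

```toml
# Higher priority: a stricter limit that applies only to the intern
[[rate_limiting.rules]]
priority = 2
model_inferences_per_minute = 1
scope = [{ tag_key = "user_id", tag_value = "intern" }]

# Lower priority: a per-user limit for all other user_id values
[[rate_limiting.rules]]
priority = 1
model_inferences_per_minute = 100
scope = [{ tag_key = "user_id", tag_value = "tensorzero::each" }]
```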
You can also set `always = true` to enforce a rule regardless of other rules; rules with `always = true` do not affect the priority calculation above.

## Set up rate limits
Let’s set up rate limits for an application to restrict usage based on a user-defined tag for user IDs. You can find a complete runnable example of this guide on GitHub.
### 1. Set up Postgres

You must set up Postgres to use TensorZero’s rate limiting features. See the Deploy Postgres guide for instructions.
### 2. Configure rate limiting rules

Add the following to your TensorZero configuration, then make sure to reload your gateway:

config/tensorzero.toml
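The actual configuration is elided in the source; the sketch below is one set of rules [A], [B], and [C] consistent with the behavior described in step 3 (all limits and the scope field layout are illustrative assumptions):

```toml
# [A] Global safety limit across all requests, enforced regardless of priority
[[rate_limiting.rules]]
priority = 0
always = true
model_inferences_per_minute = 1000

# [B] Each user_id value gets 1 model inference per minute
[[rate_limiting.rules]]
priority = 1
model_inferences_per_minute = 1
scope = [{ tag_key = "user_id", tag_value = "tensorzero::each" }]

# [C] Higher-priority exception: the CEO gets a much larger limit
[[rate_limiting.rules]]
priority = 2
model_inferences_per_minute = 100
scope = [{ tag_key = "user_id", tag_value = "ceo" }]
```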
### 3. Make inference requests

If we make two consecutive inference requests with `user_id = "intern"`, the second one should fail because of rule [B]. However, if we make two consecutive inference requests with `user_id = "ceo"`, both should succeed because rule [C] will override rule [B].
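The original tabbed code examples (TensorZero SDK and OpenAI SDK) are elided here. A minimal sketch using the TensorZero Python SDK; the gateway URL and model name are illustrative assumptions:

```python
# Sketch using the TensorZero Python SDK (pip install tensorzero).
# Assumes the gateway is running at http://localhost:3000 and that a model
# such as "openai::gpt-4o-mini" is configured; both are illustrative.
from tensorzero import TensorZeroGateway


def main():
    with TensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
        # Two requests per user: the second "intern" request should be
        # rejected by rule [B], while both "ceo" requests should succeed
        # because rule [C] overrides rule [B].
        for user_id in ["intern", "intern", "ceo", "ceo"]:
            try:
                client.inference(
                    model_name="openai::gpt-4o-mini",
                    input={"messages": [{"role": "user", "content": "Hello!"}]},
                    tags={"user_id": user_id},
                )
                print(f"{user_id}: request succeeded")
            except Exception as e:
                print(f"{user_id}: request failed ({e})")


if __name__ == "__main__":
    main()
```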
## Advanced
### Customize capacity and refill rate
By default, rate limits use a simple bucket model where the entire capacity refills at the start of each time window. For example, `tokens_per_minute = 100_000` allows 100,000 tokens every minute, with the full allowance resetting at the top of each minute. However, you can customize this behavior using the `capacity` and `refill_rate` parameters to create a token bucket that refills continuously. The `capacity` parameter sets the maximum number of tokens that can be stored in the bucket, while `refill_rate` determines how many tokens are added to the bucket per time window (e.g. 10,000 per minute). This creates smoother rate limiting behavior: instead of getting your full allowance at the start of each minute, you get 10,000 tokens added every minute, up to a maximum of 100,000 tokens stored at any time. To achieve these benefits, you’ll typically want to use a low time granularity with a capacity much larger than the refill rate. This approach is particularly useful for burst protection (users can’t consume their entire daily allowance in the first few seconds), smoother traffic distribution (requests are naturally spread out over time rather than clustering at window boundaries), and a better user experience (users get a steady trickle of quota rather than having to wait for the next time window).
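A sketch of such a rule, matching the numbers above (the inline-table syntax for combining `capacity` and `refill_rate` is an assumption; only the parameter names are confirmed by this guide):

```toml
# Token bucket: up to 100,000 tokens stored, refilled at 10,000 per minute
[[rate_limiting.rules]]
priority = 1
tokens_per_minute = { capacity = 100_000, refill_rate = 10_000 }
```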