Price Limiting vs. Quota Reservations: when to make use of every
You might have a single gpt-oss-20b deployment. Six groups need to use it. Advertising is operating batch summarization jobs at 3am. The fraud group wants sub-second responses 24/7. An intern’s Jupyter pocket book is by accident hammering the endpoint in a decent loop. And your GPU invoice is already eye-watering.
Sound acquainted? DataRobot provides you two instruments to resolve this: Price Limiting and Quota Reservations. This submit explains when to achieve for every, backed by an actual load check instance on a staging deployment.
Price Limits and Quota Reservations, in plain English
Price Limits – Out there in DataRobot v11.4
Price limits units per-consumer caps throughout a number of dimensions: requests per minute, token rely per hour, concurrent requests, and enter sequence size. A default coverage applies to all customers, with per-entity exceptions out there for particular overrides.
What it protects towards: Any single client overconsuming — whether or not by means of excessive request quantity, massive inputs, or extreme concurrency.
Quota Reservations – out there in DataRobot v11.9
Quota reservations outline the deployment’s complete doable throughput (worth per minute) and a utilization threshold that triggers enforcement. Inside that finances, particular entities may be allotted a reserved proportion — guaranteeing them a minimal slice of capability that different customers can’t take away.
What it protects towards: Precedence hunger. With out reservations, a loud neighbor can eat all the capability finances, leaving your important workloads with nothing.
How Price Limits and Quota Reservations work collectively (and aside)
Used alone, every device solves a selected drawback:
- Price limiting alone caps complete throughput. Underneath saturation, all customers compete equally — first come, first served.
- Quota reservations alone assure minimal throughput for particular customers, no matter what others are doing.
Collectively, they offer you each management surfaces: a ceiling that protects the mannequin and assured flooring for the customers that matter most.
Load testing a multi-tenant deployment
To guage these options below strain, we load-tested a gpt-oss-20b deployment in our staging setting. The setup simulates an actual multi-tenant situation: 4 customers sharing one mannequin, every with completely different precedence ranges.
Instance configuration
| Setting | Worth |
|---|---|
| Mannequin | gpt-oss-20b (NVIDIA NIM) |
| Capability | 1000 RPM |
| Utilization Threshold – | 80% (enforcement begins at 800 RPM) |
| Shopper | Kind | Reserved Capacity | Efficient Assure |
|---|---|---|---|
| Manufacturing Agent A | Deployment | 30% | 300 RPM |
| Manufacturing Agent B | Deployment | 20% | 200 RPM |
| Manufacturing Agent C | Deployment | 30% | 300 RPM |
| Dev Person (unreserved) | Person | – | None — shares the 20% unreserved pool |
This left a 20% unreserved pool (200 RPM) for the dev consumer and any overflow.
Instance load profile
We ran six escalating eventualities over 17 minutes to look at behaviour at completely different saturation ranges:
| State of affairs | What Occurs | Mixed Load |
|---|---|---|
| Regular site visitors | All 4 customers at reasonable, throttled charges |
~600 RPM (under utilization threshold) |
| Slight overload | All 4 customers ramp as much as simply over capability |
~1,200 RPM (1.2× capability) |
| Heavy overload | All 4 customers fireplace as quick as doable |
~7,200 RPM (7× capability) |
| Excessive overload | Most concurrent employees per client |
~12,000 RPM (12× capability) |
| Late joiner | Three brokers flood first, dev consumer joins 60s later |
~9,000 RPM |
| Reserved-only | Three brokers compete, dev consumer silent |
~7,200 RPM |
When to make use of Price Limiting alone
Price limiting by itself is the correct selection when:
- All customers are equally necessary. If no group’s site visitors is extra important than one other’s, there’s no want for reservations. Equal competitors below saturation is honest sufficient.
- You simply want to guard the GPU. Your main concern is {that a} spike in site visitors doesn’t degrade mannequin latency or trigger OOM errors. You need a security valve, not a site visitors coverage.
- You might have a single client. If there’s just one utility hitting the deployment, reservations are meaningless — there’s nobody to order towards.
What the instance confirmed
Throughout the regular site visitors situation (~600 RPM mixed, nicely under the 800 RPM utilization threshold), the speed limiter was invisible and all 4 customers achieved 100% success charges with zero rejected requests.
| State of affairs | Mixed RPM | Success Price | 429s |
|---|---|---|---|
| Regular site visitors | ~600 | 100% | 0 |
Measurement your reservations based mostly on absolutely the minimal throughput every client requires throughout peak competition. That is by design, so that you’re not penalizing regular site visitors.
And it protects the mannequin even below excessive abuse. Throughout the excessive overload situation (20,000+ RPM towards 1,000 RPM capability, which is a a 20× overload), the speed limiter rejected 95% of requests. However the mannequin itself stayed completely wholesome:
| NIM Metric | Underneath 20× Overload |
|---|---|
| GPU Utilization | 91–95% (steady) |
| E2E Latency | 1.25s → 2.09s (temporary spike, then steady) |
| Time to First Token | 35ms (unchanged) |
| Inter-Token Latency | 18ms (unchanged) |
| KV Cache | <3% (not careworn) |
The speed limiter acted as a firewall between chaotic shopper demand and steady mannequin inference. With out it, these 20,000 requests per minute would have queued up contained in the NIM, latency would have ballooned, and the mannequin would have successfully turn out to be unusable for everybody.
Takeaway: In case your solely purpose is “don’t let site visitors spikes kill the mannequin,” price limiting alone is enough and zero-config past setting the capability quantity.
When so as to add Quota Reservations
Quota reservations turn out to be important when:
- Some customers are extra necessary than others. Your fraud detection system can’t afford to be starved out by a batch analytics job. Your manufacturing agent wants assured throughput {that a} developer’s check harness can’t steal.
- You might have a multi-tenant deployment. A number of groups, functions, or downstream deployments share the identical mannequin. With out reservations, the loudest client wins.
- You need predictable SLAs. For those who’ve promised a group “your utility will get at the very least 300 RPM,” reservations are the way you implement that promise on the infrastructure degree.
- You might have a mixture of interactive and batch workloads. Batch jobs are bursty and can fortunately eat all out there capability. Reservations guarantee interactive workloads nonetheless get their share throughout batch spikes.
Learn how to measurement reservations
Measurement your reservations based mostly on absolutely the minimal throughput every client requires throughout peak competition.
Guidelines of thumb:
- Don’t reserve 100%. Go away an unreserved pool (10–20%) for ad-hoc site visitors, new customers, and overflow. For those who reserve all the pieces, any new utility will get zero throughput till you reconfigure.
- Measurement reservations to minimal wants, not peak wants. Reservations assure a ground, not a ceiling. An entity with 30% reserved can nonetheless use greater than 30% when capability is out there.
- Match reservation measurement to enterprise criticality, not group measurement. Your fraud detection system might need fewer requests than your analytics pipeline, however it wants assured entry extra.
In our instance, three manufacturing brokers obtained 30%/20%/30% reservations, leaving a 20% unreserved pool for the dev consumer. This meant the dev consumer might nonetheless use the deployment — they only wouldn’t get assured entry throughout competition.
Do reservations work below actual load?
At slight overload (1.2× capability): The system degrades gracefully
Throughout the slight overload situation (~1,200 RPM towards 1,000 RPM capability), all 4 customers achieved 100% success — the token bucket’s burst capability absorbed the slight overage. That is the “sleek degradation” zone the place reservations aren’t but wanted, however the system is proving it may possibly deal with bursts.
At heavy-to-extreme overload (7–12× capability): reservations preserve a assured ground
When all 4 customers fired as quick as doable (7,000–12,000 RPM towards a 1,000 RPM capability), the system was overwhelmed. Right here’s what every client skilled throughout the total check:
| Shopper | Reserved | Success Price | Profitable Requests |
|---|---|---|---|
| Manufacturing Agent A | 30% | 29.0% | 4,172 |
| Manufacturing Agent B | 20% | 30.2% | 4,332 |
| Manufacturing Agent C | 30% | 28.9% | 4,176 |
| Dev Person (unreserved) | – | 28.9% | 2,828 |
Why the success charges look comparable: At 12× overload, even a 300 RPM reservation is simply ~2.5% of what every client is making an attempt to ship (~3,000 RPM per client vs. a 300 RPM assure). The reservation works by guaranteeing every client receives its assured 200–300 RPM. Nonetheless, as a result of 97% of complete site visitors is rejected throughout excessive overloads, the relative proportion variations compress.
The extra revealing metric is absolute throughput. Reserved customers accomplished 4,172–4,332 profitable requests. The unreserved dev consumer accomplished 2,828 — about 34% fewer. Even accounting for the dev consumer’s shorter energetic time, reserved customers constantly acquired extra requests by means of throughout shared eventualities.
At saturation with a late joiner: reservations defend incumbents
Within the late joiner situation, the three manufacturing brokers had been already flooding the system when the dev consumer joined 60 seconds later. With all reserved capability spoken for, the dev consumer was confined to the 20% unreserved pool (~200 RPM). The manufacturing brokers continued drawing from their assured buckets, unaffected by the brand new arrival.
That is the situation that issues most in manufacturing. A batch job kicks off, or a brand new utility goes reside, and immediately there’s extra demand than provide. With out reservations, the brand new load pushes everybody’s throughput down equally. With reservations, your important customers are shielded.
Reserved customers compete pretty amongst themselves
Within the reserved-only situation, the dev consumer went silent and solely the three manufacturing brokers competed. Their success charges had been almost similar (28.9%–30.2%) — the system divided throughput proportionally throughout their reservations.
What the server sees: OTEL metrics inform the story
Shopper-side metrics (success charges, 429 counts) let you know what your customers skilled. Server-side OTEL metrics let you know what the platform skilled. Right here’s what our instance deployment appeared like from the within.
The speed limiter protects mannequin well being
Throughout peak load (20,596 requests/minute hitting the endpoint), the NIM was serving solely the ~1,000 RPM that the speed limiter let by means of:
| What the endpoint noticed | What the NIM noticed |
|---|---|
| 20,596 requests/min | ~1,000 requests/min (served) |
| 19,603 rate-limited/min | 18–22 concurrent requests |
| — | 1.25s E2E latency (steady) |
| — | 91–95% GPU utilization (wholesome) |
With out price limiting, these 20,000 RPM would have queued contained in the NIM. The GPU wouldn’t have gotten extra productive — it’s already at 91–95% — however latency would have spiraled as requests stacked up. As an alternative, the speed limiter rejected extra requests instantly (at 429-response speeds, not inference speeds), retaining the mannequin responsive for the site visitors it did settle for.


Token throughput follows profitable requests
Peak token throughput was ~199,350 tokens/min (complete), with ~115,939 enter and ~83,411 output. These numbers observe immediately with the speed limiter’s allowed throughput — not with the tried request quantity. One other manner of seeing that the speed limiter is appropriately shaping site visitors.


Deciding between Price Limits and Quota Reservations
Use this flowchart to determine what to configure:
Step 1: Do you may have a shared deployment with a number of customers?
- No → Price limiting alone is enough. Set capability to guard the GPU and transfer on.
- Sure → Proceed to Step 2.
Step 2: Are all customers equally necessary?
- Sure → Price limiting alone could also be sufficient. Underneath saturation, all customers compete equally — first come, first served. If that’s acceptable, cease right here.
- No → Proceed to Step 3.
Step 3: Do any customers want assured minimal throughput?
- Sure → Add quota reservations. Measurement them to the minimal RPM every important client wants throughout peak competition.
- No, however some customers have to be deprioritized → Use per-entity exceptions as a substitute of reservations. Cap the noisy neighbors reasonably than guaranteeing the important ones.
Step 4: Configure the unreserved pool.
- Don’t reserve 100% of capability. Go away 10–20% unreserved for ad-hoc site visitors, overflow, and new functions that haven’t been assigned reservations but.
Sensible configuration suggestions
Begin with price limiting solely. Monitor your deployment’s site visitors patterns for per week. Have a look at peak RPM, who’s sending what, and whether or not anybody is constantly overconsuming. Then add reservations the place the info tells you they’re wanted.
Set utilization threshold at 70–80%. This provides the token bucket burst room to soak up brief spikes with out triggering price limiting on each minor fluctuation. In our instance, we used 80% and the system dealt with 1.2× capability gracefully earlier than enforcement kicked in.
Monitor with OTEL metrics. After configuring price limiting, examine these server-side metrics to substantiate issues are working:
- deployment.requests vs deployment.requests.rate_limited — are you rejecting the correct quantity?
- nvidia_gpu_utilization — is the mannequin nonetheless saturated or did price limiting create headroom?
- nvidia_vllm:e2e_request_latency_seconds — is latency steady below load?
- deployment.concurrent_requests — are requests queuing up or flowing easily?
Reservation sizing components:
Reserved RPM = Capability × Reserved %
Instance: 1000 RPM × 30% = 300 RPM assured
Don’t confuse this with a price restrict. A 30% reservation means “you’ll at all times get at the very least 300 RPM, even when the system is saturated.” The entity can nonetheless use extra when capability is out there.
Abstract
| Function | Protects In opposition to | Use When |
|---|---|---|
| Price Limiting | GPU overload, runaway customers, latency spikes | At all times — it’s your security internet |
| Quota Reservations | Precedence hunger, noisy neighbors, SLA violations | A number of customers with completely different significance ranges |
| Per-entity exceptions | A selected client overconsuming | You need to cap a loud neighbor with out reserving capability for others |
When contemplating Price Limiting vs. Quota Reservations: use every device the place it suits. Layer them the place the issue calls for it.
