LLM cost optimization is fundamentally a token economics problem. This tutorial covers four distinct techniques (prompt compression, semantic caching, chain-of-thought pruning, and output length constraints) that, when combined, can reduce LLM API costs by as much as 63%.
Reduce LLM API Costs
- Instrument token logging on every API call to establish a cost baseline before optimizing.
- Compress system prompts by eliminating hedge language, consolidating instructions into structured formats, and using tools like LLMLingua.
- Constrain output length with max_completion_tokens or max_tokens and enforce structured JSON schemas.
- Prune chain-of-thought reasoning in production by instructing the model to return only the final answer.
- Implement semantic caching using embedding similarity to skip redundant API calls entirely.
- Leverage provider-native prompt caching from OpenAI, Anthropic, or Google for automatic input token discounts.
- Validate output quality against your evaluation set after each optimization to ensure accuracy holds.
Why Standard Prompting Is Burning Your Budget
LLM cost optimization is fundamentally a token economics problem. Every API call to OpenAI, Anthropic, or Google Gemini bills by the token, and most production systems send far more tokens than the task actually requires. Verbose system prompts padded with hedge language, repeated context across conversation turns, unconstrained output lengths, and chain-of-thought reasoning left enabled in production all contribute to bills that run two to three times higher than necessary.
This tutorial covers four distinct techniques for reducing that waste: prompt compression, semantic caching, chain-of-thought pruning, and output length constraints. When combined, these techniques can reduce LLM API costs by as much as 63%, though the exact figure depends on use case, model selection, and traffic patterns. The techniques are not theoretical. Each section includes working code examples in Python and Node.js that target the OpenAI and Anthropic APIs directly, with measured token counts showing the before and after.
The audience here is developers already calling LLM APIs in production or at scale, not those experimenting with chat completions for the first time.
Understanding Token Economics Across Providers
How OpenAI, Anthropic, and Google Gemini Price Tokens
All three major providers split billing into input tokens and output tokens, but the ratio between them varies. Output tokens cost more than input tokens, by a factor of 2x to 5x depending on the model. For GPT-4o, OpenAI charges $2.50 per million input tokens and $10.00 per million output tokens, a 4x ratio. Anthropic's Claude 3.5 Sonnet is priced at $3.00 per million input and $15.00 per million output, a 5x ratio. Google's Gemini 1.5 Flash costs roughly 33x less than GPT-4o on both input ($0.075 per million) and output ($0.30 per million) for prompts under 128K tokens.
Note: All pricing figures in this article are as of the time of writing. Verify current pricing at openai.com/pricing, anthropic.com/pricing, and Google's Generative AI pricing page before running cost projections.
This asymmetry has a direct consequence for optimization priority: reducing output tokens yields disproportionately larger cost savings per token eliminated.
Each provider also offers cached token discounts. OpenAI's automatic prompt caching gives a 50% discount on cached input tokens. Anthropic's explicit prompt caching offers a 90% discount on cache reads (though cache writes cost 25% more than base input). Google Gemini's context caching charges about 25% of the standard input price for cached content.
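To make the input/output asymmetry concrete, here is a minimal sketch of the per-request arithmetic. The function name `request_cost` and its parameters are illustrative, not part of any provider SDK, and the token counts are example values; the prices are the GPT-4o figures quoted above.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price,
                 cached_tokens=0, cached_discount=0.0):
    """Dollar cost of one request. Prices are USD per million tokens.

    cached_tokens is the part of the input billed at a discount
    (for example 0.5 for a 50% discount); both names are assumptions
    made for this sketch.
    """
    uncached = input_tokens - cached_tokens
    billable_input = uncached + cached_tokens * (1.0 - cached_discount)
    input_cost = billable_input / 1_000_000 * input_price
    output_cost = output_tokens / 1_000_000 * output_price
    return input_cost + output_cost


GPT4O_IN, GPT4O_OUT = 2.50, 10.00  # per million tokens, as quoted above

base = request_cost(580, 350, GPT4O_IN, GPT4O_OUT)
minus_100_input = request_cost(480, 350, GPT4O_IN, GPT4O_OUT)
minus_100_output = request_cost(580, 250, GPT4O_IN, GPT4O_OUT)

print(f"Baseline cost per call:      ${base:.6f}")
print(f"Saved by 100 fewer input:    ${base - minus_100_input:.6f}")
print(f"Saved by 100 fewer output:   ${base - minus_100_output:.6f}")
```

At a 4x price ratio, removing 100 output tokens saves four times as much as removing 100 input tokens, which is why output reduction is worth prioritizing.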
Where Tokens Are Wasted in a Typical API Call
Four categories account for the bulk of unnecessary token spend:
- System prompt bloat. Instructions contain filler phrases, excessive examples, and redundant guardrails that often double the prompt length without improving output quality.
- Repeated context across conversation turns. Multi-turn flows resend the same background information with every request.
- Uncontrolled output verbosity. When output length is not capped, models generate explanations, caveats, and preambles that the consuming application immediately discards.
- Chain-of-thought reasoning left active in production. Lengthy intermediate reasoning steps that served their purpose during development add no value in a deployed pipeline.
Technique 1: Prompt Compression
What Prompt Compression Means in Practice
Prompt compression reduces the token count of a prompt while preserving the information the model needs to produce an accurate response. There are two categories. Lossy compression removes content entirely, such as dropping optional examples or eliminating edge case instructions that apply to a small fraction of requests. Lossless compression rephrases the same content more concisely, such as converting prose instructions into structured YAML or JSON format, or replacing multi-sentence explanations with terse directives.
Compression hurts quality when it removes disambiguation that the model genuinely needs. For tasks with narrow, well-defined outputs like entity extraction or classification, aggressive compression is usually safe. For tasks requiring nuanced judgment, such as open-ended writing or complex reasoning, over-compression can degrade results. Track output quality metrics (F1 score for extraction, human evaluation scores for generation) alongside token counts; if quality drops more than 2-3% on your eval set, you have compressed too far.
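The quality check can be sketched as a small gate that compares F1 before and after compression. This is an illustration under stated assumptions: entities are compared as exact (name, type) pairs, the 2-3% tolerance is read as an absolute drop of 0.03 in F1, and the function names are hypothetical.

```python
def f1_score(predicted: set, gold: set) -> float:
    """F1 over exact-match items; returns 0.0 when there is no overlap."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def compression_is_acceptable(baseline_f1: float, compressed_f1: float,
                              max_drop: float = 0.03) -> bool:
    """Assumption: max_drop is an absolute F1 difference, not a relative one."""
    return (baseline_f1 - compressed_f1) <= max_drop


# Hand-written example labels, not measured results.
gold = {("Sony WH-1000XM5", "product"), ("noise cancellation", "feature"),
        ("battery life", "feature")}
baseline_pred = set(gold)  # the verbose prompt found everything
compressed_pred = {("Sony WH-1000XM5", "product"),
                   ("noise cancellation", "feature")}  # one entity missed

baseline_f1 = f1_score(baseline_pred, gold)
compressed_f1 = f1_score(compressed_pred, gold)
print(f"baseline F1={baseline_f1:.2f}, compressed F1={compressed_f1:.2f}")
print("keep compression:", compression_is_acceptable(baseline_f1, compressed_f1))
```

In practice the scores would be averaged over the whole evaluation set rather than computed on a single review.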
Manual Prompt Compression Techniques
Three manual techniques yield the largest gains with the least risk:
- Eliminate hedge language and politeness tokens. Phrases like "Please kindly make sure that you carefully consider" become "Ensure."
- Consolidate multi-sentence instructions into structured formats. A five-sentence paragraph explaining a desired JSON output shape becomes the JSON schema itself, which is both shorter and more precise.
- Use reference tokens instead of repeating context. Rather than restating a product description in both the system prompt and the user message, define it once and refer to it by label.
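The first technique in the list above can be partly automated. The sketch below strips a hypothetical list of filler phrases with regular expressions; the pattern list is an assumption to adapt to your own prompts, and the word count is only a rough stand-in for real tokenizer counts.

```python
import re

# Hypothetical filler patterns; extend or trim this list for your prompts.
FILLER_PATTERNS = [
    r"\bplease\b",
    r"\bkindly\b",
    r"\bcarefully\b",
    r"\bmake sure (that )?(you )?",
    r"\bbe sure to\b",
]


def strip_filler(prompt: str) -> str:
    """Remove filler phrases, then collapse the whitespace left behind."""
    out = prompt
    for pattern in FILLER_PATTERNS:
        out = re.sub(pattern, "", out, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", out).strip()


def rough_word_count(text: str) -> int:
    """Whitespace word count: a crude proxy, NOT a tokenizer."""
    return len(text.split())


verbose = "Please kindly make sure that you carefully read the review."
terse = strip_filler(verbose)
print(terse)
print(rough_word_count(verbose), "->", rough_word_count(terse), "words")
```

A regex pass like this is blunt, so the stripped prompt should be reviewed by hand and re-validated against the evaluation set before it replaces the original.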
Programmatic Prompt Compression with LLMLingua
Microsoft Research's LLMLingua approach uses a small language model to identify and remove the tokens in a prompt that contribute least to the model's ability to produce correct outputs. The original LLMLingua scores tokens by perplexity and prunes low-information ones; LLMLingua-2, used below, instead relies on a small encoder trained to classify which tokens to keep. Both aim to preserve semantic integrity.
Install the required dependencies first:
pip install openai "llmlingua>=0.2.2" numpy
Note: The first run will download a transformer model checkpoint (~500MB) from Hugging Face. Ensure sufficient disk space and allow several minutes for the download.
Note: The checkpoint microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank used below is optimized for meeting transcripts (MeetingBank dataset). Validate compressed output quality on your domain before production use. For other text types, evaluate other LLMLingua-2 checkpoints and compare entity extraction accuracy before and after compression.
import time

from llmlingua import PromptCompressor
from openai import OpenAI, RateLimitError, APIError

client = OpenAI()

original_prompt = """You are an expert product review analyst. Your job is to carefully
read product reviews submitted by customers and extract structured information from them.
You should identify the key entities mentioned in the review, including product names,
brand names, and specific features that the reviewer discusses. Please make sure to
consider both positive and negative sentiments expressed about each entity. When you
find an entity, classify it into one of the following categories: product, brand, or
feature. Also determine the sentiment as positive, negative, or neutral. Return your
analysis as a JSON object with an array called 'entities', where each entity has the
fields 'name', 'type', and 'sentiment'. Be thorough but concise in your extraction.
Do not include entities that are only mentioned in passing without any opinion expressed.
Focus on entities where the reviewer has expressed a clear opinion or evaluation.
Make sure that your JSON is valid and properly formatted. Do not include any explanation
or commentary outside the JSON object. Only return the JSON.
You should handle reviews in English. If the review contains multiple products being
compared, extract entities for all of them. If a feature is mentioned for multiple
products, create separate entity entries for each product-feature combination.
Make sure that entity names are normalized; for example, use the full brand name rather
than abbreviations when possible. If the reviewer uses slang or informal language,
interpret it to the best of your ability and use standard terminology in your output."""

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True
)

# rate=0.4 asks the compressor to keep roughly 40% of the tokens;
# force_tokens lists terms that must survive compression.
compressed = compressor.compress_prompt(
    original_prompt,
    rate=0.4,
    force_tokens=["JSON", "entities", "name", "type", "sentiment"]
)
compressed_prompt = compressed["compressed_prompt"]

# The result keys can differ between library versions, so read them defensively.
origin_tokens = compressed.get("origin_tokens", "UNVERIFIED")
compressed_tokens = compressed.get("compressed_tokens", "UNVERIFIED")
ratio = compressed.get("compressed_tokens_ratio", "UNVERIFIED")
print(f"Available keys: {list(compressed.keys())}")
print(f"Original tokens: {origin_tokens}")
print(f"Compressed tokens: {compressed_tokens}")
print(f"Compression ratio: {ratio}")

# Call the API with the compressed prompt, retrying with exponential backoff.
max_retries = 3
response = None
for attempt in range(max_retries):
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": compressed_prompt},
                {"role": "user", "content": "The new Sony WH-1000XM5 headphones have amazing noise cancellation but the build quality feels cheaper than the XM4. Battery life is stellar though."}
            ]
        )
        break
    except RateLimitError:
        wait = 2 ** attempt
        print(f"Rate limited. Retrying in {wait}s (attempt {attempt + 1}/{max_retries})")
        time.sleep(wait)
    except APIError as e:
        print(f"API error on attempt {attempt + 1}: {e}")
        if attempt == max_retries - 1:
            raise

if response is None:
    raise RuntimeError("Exceeded max retries for OpenAI API call")
if response.usage is None:
    raise ValueError("response.usage is None: streaming mode is not supported here")

print(f"Prompt tokens used: {response.usage.prompt_tokens}")
print(f"Completion tokens used: {response.usage.completion_tokens}")
print(response.choices[0].message.content)
The force_tokens parameter ensures that critical terms survive the compression pass. With a rate of 0.4, the compressor targets roughly 40% of the original length, so a ~500-token prompt shrinks to about 200 tokens while preserving the extraction instructions and output format requirements.
Measuring Compression Impact
Systematic measurement requires logging token usage on every call and comparing against a known baseline.
Note: These JavaScript examples use top-level await and require Node.js 14.8+ with ES modules. Add "type": "module" to your package.json or wrap the code in (async () => { ... })();.
npm install openai @anthropic-ai/sdk
import OpenAI from "openai";

const openai = new OpenAI();

// USD per million tokens; verify against current provider pricing.
const PRICING = {
  "gpt-4o": { input: 2.5, output: 10.0 },
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
};

// Wraps a chat completion call, logs token usage, and returns the computed cost.
async function trackedCompletion(model, messages, label = "default") {
  const pricing = PRICING[model];
  if (!pricing) {
    throw new Error(
      `Model "${model}" not found in PRICING table. ` +
        `Add it or verify the model name. Known models: ${Object.keys(PRICING).join(", ")}`
    );
  }

  let response;
  const MAX_RETRIES = 3;
  for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
    try {
      response = await openai.chat.completions.create({ model, messages });
      break;
    } catch (err) {
      // Retry only on rate limits (HTTP 429), with exponential backoff.
      if (err?.status === 429 && attempt < MAX_RETRIES - 1) {
        const wait = Math.pow(2, attempt) * 1000;
        console.warn(`[${label}] Rate limited. Retrying in ${wait}ms`);
        await new Promise(r => setTimeout(r, wait));
      } else {
        throw err;
      }
    }
  }

  if (!response?.usage) {
    throw new Error(`[${label}] response.usage is null: check for streaming mode`);
  }

  const { prompt_tokens, completion_tokens } = response.usage;
  const inputCost = (prompt_tokens / 1_000_000) * pricing.input;
  const outputCost = (completion_tokens / 1_000_000) * pricing.output;
  const totalCost = inputCost + outputCost;

  console.log(`[${label}] Model: ${model}`);
  console.log(`  Prompt tokens: ${prompt_tokens}`);
  console.log(`  Completion tokens: ${completion_tokens}`);
  console.log(`  Input cost: $${inputCost.toFixed(6)}`);
  console.log(`  Output cost: $${outputCost.toFixed(6)}`);
  console.log(`  Total cost: $${totalCost.toFixed(6)}`);

  return { response, prompt_tokens, completion_tokens, totalCost };
}

const baseline = await trackedCompletion(
  "gpt-4o",
  [
    { role: "system", content: "Your original 500-token system prompt here..." },
    { role: "user", content: "Review text here..." },
  ],
  "baseline"
);

const compressed = await trackedCompletion(
  "gpt-4o",
  [
    { role: "system", content: "Your compressed 200-token prompt here..." },
    { role: "user", content: "Review text here..." },
  ],
  "compressed"
);

const savings = ((baseline.totalCost - compressed.totalCost) / baseline.totalCost) * 100;
console.log(`\nCost reduction: ${savings.toFixed(1)}%`);
You can drop this wrapper into any production pipeline to continuously monitor token spend and validate that compression delivers the expected savings.
Technique 2: Semantic Caching
What Semantic Caching Is and How It Differs from Exact-Match Caching
Exact-match caching only returns a stored result when the incoming request is identical, character for character, to a previously seen request. Semantic caching uses embedding-based similarity to recognize that "What's the capital of France?" and "Tell me France's capital city" should return the same cached response. This increases cache hit rates significantly for applications where users phrase similar questions in different ways.
Provider-native caching and application-layer semantic caching solve different problems. OpenAI's and Anthropic's prompt caching discounts the cost of resending identical prompt prefixes. Application-layer semantic caching avoids the API call entirely when a sufficiently similar query has already been answered.
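For contrast, an exact-match cache can be sketched in a few lines. This is a minimal illustration, not a production design: the helper names are hypothetical, the key is a SHA-256 hash of the system prompt plus a lightly normalized query (lowercased, whitespace collapsed), and the store is a plain in-process dictionary.

```python
import hashlib
from typing import Optional

_exact_cache: dict = {}


def cache_key(system_prompt: str, user_query: str) -> str:
    """Hash the system prompt plus a lowercased, whitespace-collapsed query."""
    normalized = " ".join(user_query.lower().split())
    payload = system_prompt + "\x00" + normalized
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def lookup(system_prompt: str, user_query: str) -> Optional[str]:
    return _exact_cache.get(cache_key(system_prompt, user_query))


def store(system_prompt: str, user_query: str, response: str) -> None:
    _exact_cache[cache_key(system_prompt, user_query)] = response


SYSTEM = "Answer concisely."
store(SYSTEM, "What's the capital of France?", "Paris")

same_wording = lookup(SYSTEM, "what's the   capital of france?")
paraphrase = lookup(SYSTEM, "Tell me France's capital city")
print(same_wording)  # hit: only case and spacing differ
print(paraphrase)    # miss: a paraphrase hashes to a different key
```

The paraphrase miss is the gap that embedding similarity is meant to close.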
Implementing Application-Layer Semantic Caching
Note: The in-memory cache below is for demonstration only and is not production-safe. It has no TTL and uses a simple size cap for eviction, so it does not handle expiration or more sophisticated eviction strategies. For production use, replace it with Redis (using RediSearch for vector similarity) or a dedicated vector database with TTL and eviction configured.
import threading
import time

import numpy as np
from openai import OpenAI, RateLimitError, APIError

client = OpenAI()

_cache_lock = threading.Lock()
_cache: list[dict] = []
CACHE_MAX_SIZE = 10_000
SIMILARITY_THRESHOLD = 0.95


def get_embedding(text: str) -> np.ndarray:
    result = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return np.array(result.data[0].embedding)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return float(np.dot(a, b) / (norm_a * norm_b))


def cached_completion(user_query: str, system_prompt: str, model: str = "gpt-4o") -> str:
    query_embedding = get_embedding(user_query)

    # Linear scan over cached entries; fine for a demo, too slow at scale.
    with _cache_lock:
        for entry in _cache:
            # Only reuse answers produced with the same system prompt and model.
            if entry["system_prompt"] != system_prompt or entry["model"] != model:
                continue
            similarity = cosine_similarity(query_embedding, entry["embedding"])
            if similarity >= SIMILARITY_THRESHOLD:
                print(f"Cache HIT (similarity: {similarity:.4f})")
                return entry["response"]

    print("Cache MISS: calling API")
    response = None
    max_retries = 3
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_query},
                ]
            )
            break
        except RateLimitError:
            wait = 2 ** attempt
            print(f"Rate limited. Retrying in {wait}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(wait)
        except APIError as e:
            print(f"API error on attempt {attempt + 1}: {e}")
            if attempt == max_retries - 1:
                raise

    if response is None:
        raise RuntimeError("Exceeded max retries for API call")
    if response.usage is None:
        raise ValueError("response.usage is None: streaming mode is not supported here")

    result = response.choices[0].message.content

    with _cache_lock:
        # Evict the oldest entry once the size cap is reached (FIFO).
        if len(_cache) >= CACHE_MAX_SIZE:
            _cache.pop(0)
        _cache.append({
            "embedding": query_embedding,
            "query": user_query,
            "system_prompt": system_prompt,
            "model": model,
            "response": result
        })
    return result


result1 = cached_completion(
    "What are the main features of the iPhone 15 Pro?",
    "You are a product expert. Answer concisely."
)
result2 = cached_completion(
    "Tell me the key features of Apple's iPhone 15 Pro",
    "You are a product expert. Answer concisely."
)
For production use, replacing the in-memory list with Redis and its vector search capability (RediSearch) or a dedicated vector database provides persistence and scalability. The embedding call itself is very cheap: OpenAI's text-embedding-3-small costs $0.02 per million tokens (as of the time of writing; verify current pricing at openai.com/pricing before projecting costs).
Using Provider-Native Prompt Caching
OpenAI's prompt caching is automatic. When the first 1,024 or more tokens of a prompt match a previous request exactly, cached tokens are billed at a 50% discount. No code changes are required, but structuring prompts so that the static system instructions appear first and variable content appears last maximizes cache hit rates.
Note: OpenAI's automatic prompt caching only activates when the matching prompt prefix is at least 1,024 tokens. Prompts shorter than this threshold will not benefit from caching.
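The effect of prompt ordering on prefix matching can be illustrated without calling any API. The sketch below is an approximation under stated assumptions: it compares characters rather than tokens (providers match on tokens), the helper names are hypothetical, and the short strings stand in for prompts that would need to exceed the 1,024-token threshold in practice.

```python
def render_prompt(static_part: str, variable_part: str, static_first: bool = True) -> str:
    """Join the two parts in either order so the orderings can be compared."""
    if static_first:
        return static_part + "\n" + variable_part
    return variable_part + "\n" + static_part


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the common leading run of characters in a and b."""
    count = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        count += 1
    return count


STATIC = "You are a product review analyst. Return JSON only."

good_a = render_prompt(STATIC, "request_id=1", static_first=True)
good_b = render_prompt(STATIC, "request_id=2", static_first=True)
bad_a = render_prompt(STATIC, "request_id=1", static_first=False)
bad_b = render_prompt(STATIC, "request_id=2", static_first=False)

print("static first:  ", shared_prefix_chars(good_a, good_b), "shared chars")
print("variable first:", shared_prefix_chars(bad_a, bad_b), "shared chars")
```

With the variable content first, the two prompts diverge almost immediately, so nothing after that point can be served from a prefix cache.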
Anthropic's prompt caching is explicit and offers steeper discounts. Cache reads cost 90% less than base input pricing. Cache writes cost 25% more, which matters for low-traffic deployments where cache writes may outnumber reads. The developer places cache_control breakpoints to mark which prompt segments should be cached.
Note: Anthropic requires the cached segment to be at least 1,024 tokens for cache_control to take effect. The example below uses a shortened prompt for readability; in practice, expand or combine segments to meet the ≥1,024 token threshold. Confirm that caching activated by checking cache_creation_input_tokens > 0 in the response.
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

const systemPrompt = `You are an expert product review analyst. Extract entities
from reviews as JSON with fields: name, type (product/brand/feature), sentiment
(positive/negative/neutral). Return only valid JSON. Handle comparisons by creating
separate entries. Normalize entity names to full brand names.`;

async function analyzeReview(reviewText) {
  let response;
  try {
    response = await anthropic.messages.create({
      model: "claude-3-5-sonnet-20241022",
      max_tokens: 1024,
      system: [
        {
          type: "text",
          text: systemPrompt,
          // Marks the end of the prompt segment that should be cached.
          cache_control: { type: "ephemeral" },
        },
      ],
      messages: [{ role: "user", content: reviewText }],
    });
  } catch (err) {
    if (err?.status === 429) {
      console.warn("Rate limited by Anthropic. Implement retry logic for production use.");
    }
    throw err;
  }

  console.log("Input tokens:", response.usage.input_tokens);
  console.log("Cache creation tokens:", response.usage.cache_creation_input_tokens || 0);
  console.log("Cache read tokens:", response.usage.cache_read_input_tokens || 0);

  if (!response.content || response.content.length === 0 || response.content[0].type !== "text") {
    throw new Error("Unexpected response content format from Anthropic API");
  }
  return response.content[0].text;
}

await analyzeReview("The Sony WH-1000XM5 has great ANC but feels flimsy.");
await analyzeReview("Samsung Galaxy S24 Ultra camera is incredible, battery is mediocre.");
await analyzeReview("MacBook Pro M3 performance is excellent but it runs hot.");
Anthropic's cached prompt content has a minimum length requirement of 1,024 tokens and a default time-to-live of 5 minutes; each cache hit refreshes that lifetime at no extra charge. For high-throughput applications making multiple calls per minute with the same system prompt, the 90% read discount accumulates quickly. In low-traffic scenarios, keep in mind that cache writes cost 25% more than standard input pricing, so infrequent usage patterns may not see net savings from caching.
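The write premium versus read discount trade-off comes down to simple arithmetic. The sketch below expresses the cost of the cached segment in units of "one normal send", using the 1.25x write and 0.10x read multipliers quoted above. It assumes every call after the first lands while the cache entry is still alive; the function names are illustrative.

```python
def uncached_cost(calls: int) -> float:
    """Every call pays full input price for the segment (1.0 unit each)."""
    return float(calls)


def cached_cost(calls: int, write_multiplier: float = 1.25,
                read_multiplier: float = 0.10) -> float:
    """First call writes the cache; later calls read it (all within the TTL)."""
    if calls <= 0:
        return 0.0
    return write_multiplier + (calls - 1) * read_multiplier


for calls in (1, 2, 10):
    plain = uncached_cost(calls)
    cached = cached_cost(calls)
    change = (plain - cached) / plain * 100
    print(f"{calls:>2} call(s): uncached={plain:.2f} cached={cached:.2f} "
          f"savings={change:.1f}%")
```

Under these assumptions a single call costs 25% more with caching, while a second call inside the cache lifetime already makes caching cheaper overall. If the entry expires between calls, every call pays the write price and caching is a net loss.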
Cache Invalidation and Freshness
Set TTLs based on how frequently the underlying data or instructions change. For static system prompts, long TTLs or no expiration are appropriate. For queries against rapidly changing data, such as real-time pricing or inventory, semantic caching introduces stale response risk. User-specific dynamic queries with personal context should bypass the cache entirely.
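A TTL check for an application-layer cache can be sketched as follows. The entry layout (a dict with a `created_at` timestamp) and the helper names are assumptions for illustration; the `now` parameter exists only so the example is deterministic.

```python
import time
from typing import Optional


def is_fresh(entry: dict, ttl_seconds: float, now: Optional[float] = None) -> bool:
    """True if the entry was created within the last ttl_seconds."""
    current = time.time() if now is None else now
    return (current - entry["created_at"]) <= ttl_seconds


def prune_expired(cache: list, ttl_seconds: float, now: Optional[float] = None) -> list:
    """Return only the entries that are still within their TTL."""
    return [entry for entry in cache if is_fresh(entry, ttl_seconds, now)]


# Hypothetical cached entries with fixed timestamps for the demo.
cache = [
    {"query": "return policy", "response": "30 days", "created_at": 1_000.0},
    {"query": "store hours", "response": "9am-5pm", "created_at": 1_900.0},
]

live = prune_expired(cache, ttl_seconds=300, now=2_000.0)
print([entry["query"] for entry in live])
```

Different query categories can use different `ttl_seconds` values, with short ones for volatile data and long ones for static answers.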
Technique 3: Chain-of-Thought Pruning for Production
Why CoT Reasoning Inflates Output Costs
Chain-of-thought prompting is valuable during development and evaluation because it makes the model's reasoning auditable. In production, however, downstream systems consume only the final answer. CoT reasoning can inflate output length by 3x to 5x (this is a commonly observed range and varies by task), and since output tokens carry the highest per-token price, this represents a 3x to 5x increase in output cost that adds no value to the deployed system.
Techniques for Pruning CoT in Production
The most direct approach: instruct the model to return only the final answer. Combining this with structured output mode (JSON) constrains the response shape and eliminates explanatory prose.
Anthropic's extended thinking feature (available on Claude 3.7 Sonnet and later compatible models) provides a budget_tokens parameter that caps the number of tokens the model can spend on internal reasoning. Verify model support in Anthropic's extended thinking documentation before use. This allows controlled reasoning depth without unbounded output expansion.
import time

from openai import OpenAI, RateLimitError, APIError

client = OpenAI()

review = """The Bose QuietComfort Ultra earbuds deliver exceptional sound quality
with deep bass and clear highs. The noise cancellation is top-tier, rivaling
over-ear headphones. However, the fit can be uncomfortable during long sessions,
and the case is unnecessarily bulky. Battery life of 6 hours is decent but not
class-leading. At $299, they're expensive but justified for audiophiles."""

max_retries = 3

# Request 1: chain-of-thought enabled, output length unconstrained.
cot_response = None
for attempt in range(max_retries):
    try:
        cot_response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Extract product entities with sentiment. Think step by step."},
                {"role": "user", "content": review}
            ]
        )
        break
    except RateLimitError:
        wait = 2 ** attempt
        print(f"Rate limited. Retrying in {wait}s (attempt {attempt + 1}/{max_retries})")
        time.sleep(wait)
    except APIError as e:
        print(f"API error on attempt {attempt + 1}: {e}")
        if attempt == max_retries - 1:
            raise

if cot_response is None:
    raise RuntimeError("Exceeded max retries for CoT API call")
if cot_response.usage is None:
    raise ValueError("cot_response.usage is None: streaming mode is not supported here")

# Request 2: direct answer, JSON mode, capped output length.
direct_response = None
for attempt in range(max_retries):
    try:
        direct_response = client.chat.completions.create(
            model="gpt-4o",
            max_completion_tokens=256,
            response_format={"type": "json_object"},
            messages=[
                # Single quotes outside so the JSON example can use double quotes.
                {"role": "system", "content": 'Extract entities as JSON: {"entities": [{"name": str, "type": str, "sentiment": str}]}. Return ONLY the JSON.'},
                {"role": "user", "content": review}
            ]
        )
        break
    except RateLimitError:
        wait = 2 ** attempt
        print(f"Rate limited. Retrying in {wait}s (attempt {attempt + 1}/{max_retries})")
        time.sleep(wait)
    except APIError as e:
        print(f"API error on attempt {attempt + 1}: {e}")
        if attempt == max_retries - 1:
            raise

if direct_response is None:
    raise RuntimeError("Exceeded max retries for direct API call")
if direct_response.usage is None:
    raise ValueError("direct_response.usage is None: streaming mode is not supported here")

print(f"CoT output tokens: {cot_response.usage.completion_tokens}")
print(f"Direct output tokens: {direct_response.usage.completion_tokens}")

OUTPUT_PRICE_PER_MILLION = 10.0  # GPT-4o output price in USD; verify before use
cot_cost = (cot_response.usage.completion_tokens / 1_000_000) * OUTPUT_PRICE_PER_MILLION
direct_cost = (direct_response.usage.completion_tokens / 1_000_000) * OUTPUT_PRICE_PER_MILLION
print(f"CoT output cost: ${cot_cost:.6f}")
print(f"Direct output cost: ${direct_cost:.6f}")
The CoT version returns several paragraphs of reasoning followed by the extraction, while the direct version returns only the JSON object. On a task like this, expect a 3x or greater difference in output token count.
Preserving CoT for Debugging Without Paying for It
A practical pattern: gate CoT behind an environment variable or feature flag. Enable CoT during development and in error-analysis pipelines. Disable it in production. When production errors surface for investigation, replay the specific failing input with CoT enabled, producing the reasoning trace on demand rather than on every request.
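One way to sketch that gate is shown below. The environment variable name `ENABLE_COT` and the helper functions are hypothetical; the `env` parameter lets the example run without touching the real process environment.

```python
import os
from typing import Mapping, Optional

BASE_INSTRUCTION = "Extract product entities with sentiment."


def cot_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read the (hypothetical) ENABLE_COT flag; it defaults to off."""
    source = os.environ if env is None else env
    return source.get("ENABLE_COT", "0").strip().lower() in {"1", "true", "yes"}


def build_system_prompt(env: Optional[Mapping[str, str]] = None) -> str:
    """Append the reasoning instruction only when the flag is on."""
    if cot_enabled(env):
        return BASE_INSTRUCTION + " Think step by step, then give the final answer."
    return BASE_INSTRUCTION + " Return ONLY the final JSON answer."


production_prompt = build_system_prompt({"ENABLE_COT": "0"})
debug_prompt = build_system_prompt({"ENABLE_COT": "true"})
print(production_prompt)
print(debug_prompt)
```

Replaying a failing production input then only requires setting the flag for that one run.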
Technique 4: Output Length Constraints
Using max_tokens / max_completion_tokens Strategically
Many developers leave the maximum output length unset, allowing the model to generate as many tokens as it deems appropriate. This is expensive. For tasks with predictable output shapes, such as classification, extraction, or short-answer responses, setting a ceiling prevents runaway generation.
The parameter names differ by provider: OpenAI uses max_completion_tokens, Anthropic uses max_tokens, and Google Gemini uses maxOutputTokens. To find the right ceiling, sample outputs from representative inputs during development and set the limit at 1.5x to 2x the observed p95 output length (the 95th percentile, i.e., the length exceeded by only 5% of outputs in your sample).
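The ceiling calculation can be sketched as below. It assumes the nearest-rank definition of the 95th percentile (other definitions interpolate and give slightly different values), and the sampled lengths are made-up example data rather than measurements.

```python
import math


def p95_nearest_rank(lengths: list) -> int:
    """95th percentile by the nearest-rank method."""
    if not lengths:
        raise ValueError("need at least one sampled output length")
    ordered = sorted(lengths)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def suggested_ceiling(lengths: list, headroom: float = 1.5) -> int:
    """Token ceiling: headroom (1.5x to 2x per the text) times the p95 length."""
    return math.ceil(p95_nearest_rank(lengths) * headroom)


# Example data: 20 observed completion lengths, 81 through 100 tokens.
sampled_lengths = list(range(81, 101))

print("p95 length:", p95_nearest_rank(sampled_lengths))
print("ceiling at 1.5x:", suggested_ceiling(sampled_lengths, headroom=1.5))
print("ceiling at 2.0x:", suggested_ceiling(sampled_lengths, headroom=2.0))
```

The resulting number is what would be passed as max_completion_tokens, max_tokens, or maxOutputTokens, depending on the provider.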
Structured Output as a Cost Control Mechanism
Function calling and tool use schemas act as implicit output constraints. When the model must conform to a defined schema, it cannot generate preambles, explanations, or unnecessary fields. Note that when using tool_choice to force a function call, the model's response content will be null; the actual payload is in tool_calls[0].function.arguments, which must be parsed as JSON.
import OpenAI from "openai";

const openai = new OpenAI();

const OUTPUT_PRICE_PER_MILLION = 10.0; // GPT-4o output price in USD; verify before use

const review = `The Dyson V15 Detect has incredible suction power and the laser dust
detection is genuinely useful. But at $750 it's overpriced, and the battery only
lasts 25 minutes on max power. The attachments are well-designed.`;

// Request 1: free-form prose output.
let proseResponse;
try {
  proseResponse = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: "Extract product entities with sentiment from this review." },
      { role: "user", content: review },
    ],
  });
} catch (err) {
  if (err?.status === 429) {
    console.warn("Rate limited. Implement retry logic for production use.");
  }
  throw err;
}
if (!proseResponse?.usage) {
  throw new Error("proseResponse.usage is null: check for streaming mode");
}

// Request 2: output forced through a function-call schema.
let structuredResponse;
try {
  structuredResponse = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: "Extract product entities with sentiment." },
      { role: "user", content: review },
    ],
    tools: [
      {
        type: "function",
        function: {
          name: "extract_entities",
          description: "Extract entities from a product review",
          parameters: {
            type: "object",
            properties: {
              entities: {
                type: "array",
                items: {
                  type: "object",
                  properties: {
                    name: { type: "string" },
                    type: { type: "string", enum: ["product", "brand", "feature"] },
                    sentiment: { type: "string", enum: ["positive", "negative", "neutral"] },
                  },
                  required: ["name", "type", "sentiment"],
                },
              },
            },
            required: ["entities"],
          },
        },
      },
    ],
    tool_choice: { type: "function", function: { name: "extract_entities" } },
  });
} catch (err) {
  if (err?.status === 429) {
    console.warn("Rate limited. Implement retry logic for production use.");
  }
  throw err;
}
if (!structuredResponse?.usage) {
  throw new Error("structuredResponse.usage is null: check for streaming mode");
}

// With a forced tool call, message.content is null; the payload is in tool_calls.
const message = structuredResponse.choices[0].message;
if (!message.tool_calls || message.tool_calls.length === 0) {
  throw new Error("No tool_calls returned. Check tool_choice config.");
}
const rawArgs = message.tool_calls[0].function.arguments;
let entities;
try {
  entities = JSON.parse(rawArgs).entities;
} catch (e) {
  throw new Error(`Failed to parse tool arguments as JSON: ${rawArgs}`);
}

console.log(`Prose completion tokens: ${proseResponse.usage.completion_tokens}`);
console.log(`Structured completion tokens: ${structuredResponse.usage.completion_tokens}`);
console.log("Extracted entities:", entities);

const proseCost = (proseResponse.usage.completion_tokens / 1_000_000) * OUTPUT_PRICE_PER_MILLION;
const structuredCost = (structuredResponse.usage.completion_tokens / 1_000_000) * OUTPUT_PRICE_PER_MILLION;
console.log(`Prose output cost: $${proseCost.toFixed(6)}`);
console.log(`Structured output cost: $${structuredCost.toFixed(6)}`);
The structured response constrains the mannequin to populating solely the outlined fields, whereas the prose response contains introductory textual content, explanations of every entity, and a closing abstract. In apply, structured output produces 2x to 4x fewer tokens than unconstrained prose for extraction duties. Run the code above by yourself inputs and log the distinction.
Price Comparability Desk: Earlier than and After Throughout 5 Fashions
The next desk reveals estimated prices for a standardized job, extracting three entities from a two-paragraph product overview, run 1,000 occasions. Baseline makes use of a verbose 500-token system immediate with unconstrained output. Optimized makes use of a compressed 200-token immediate with structured JSON output.
Observe on pricing: GPT-4o: $2.50/$10.00 per million enter/output tokens. GPT-4o mini: $0.15/$0.60. Claude 3.5 Sonnet: $3.00/$15.00. Claude 3.5 Haiku (Anthropic’s lower-cost mannequin tier): $0.80/$4.00. Gemini 1.5 Flash: $0.075/$0.30 (underneath 128K tokens). All costs are as of the time of writing — confirm at every supplier’s pricing web page earlier than projecting prices.
| Model | Baseline Input (tokens) | Compressed Input (tokens) | Baseline Output (tokens) | Constrained Output (tokens) | Baseline Cost per 1K Runs | Optimized Cost per 1K Runs | Savings |
|---|---|---|---|---|---|---|---|
| GPT-4o | 580 | 280 | 350 | 120 | $4.95 | $1.90 | 62% |
| GPT-4o mini | 580 | 280 | 350 | 120 | $0.30 | $0.11 | 63% |
| Claude 3.5 Sonnet | 580 | 280 | 350 | 120 | $6.99 | $2.64 | 62% |
| Claude 3.5 Haiku | 580 | 280 | 350 | 120 | $1.86 | $0.70 | 62% |
| Gemini 1.5 Flash | 580 | 280 | 350 | 120 | $0.15 | $0.06 | 60% |
The savings percentages are consistent by construction, since token reductions are fixed and pricing scales linearly. Models with higher output-to-input price ratios, like Claude 3.5 Sonnet at 5x, show slightly higher absolute dollar savings. The Gemini 1.5 Flash savings, while proportionally similar, represent a much smaller absolute dollar figure because the base pricing is already very low. These figures do not include additional savings from semantic caching, which can further reduce costs in proportion to the cache hit rate.
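The table's arithmetic can be reproduced with a short script. The sketch below assumes the per-call token counts and per-million-token prices quoted above; the names `PRICES`, `cost_per_runs`, and `savings_table` are illustrative and not part of any provider SDK.

```python
# Prices in USD per million tokens (input, output), copied from the pricing note above.
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o mini": (0.15, 0.60),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Claude 3.5 Haiku": (0.80, 4.00),
    "Gemini 1.5 Flash": (0.075, 0.30),
}

RUNS = 1_000
BASELINE = (580, 350)   # (input tokens, output tokens) per call
OPTIMIZED = (280, 120)  # compressed prompt, constrained output


def cost_per_runs(tokens, prices, runs=RUNS):
    """Total USD cost for `runs` calls with the given per-call token counts."""
    input_tokens, output_tokens = tokens
    input_price, output_price = prices
    return runs * (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def savings_table():
    """Map each model to (baseline cost, optimized cost, fractional savings)."""
    rows = {}
    for model, prices in PRICES.items():
        baseline = cost_per_runs(BASELINE, prices)
        optimized = cost_per_runs(OPTIMIZED, prices)
        rows[model] = (baseline, optimized, (baseline - optimized) / baseline)
    return rows


if __name__ == "__main__":
    for model, (baseline, optimized, saved) in savings_table().items():
        print(f"{model:<18} ${baseline:.2f} -> ${optimized:.2f}  ({saved:.1%} saved)")
```

Computed from unrounded costs, every model lands at roughly 62% savings. The 63% and 60% entries in the table appear to result from taking the percentage after rounding the dollar amounts to cents.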
Combining All Four Techniques: A Real-World Optimization Pipeline
Recommended Order of Operations
Apply the techniques in order of effort-to-impact ratio:
- Compress prompts. This delivers the largest input savings and takes the least effort, since you only rewrite prompts.
- Constrain outputs using max_completion_tokens (OpenAI) or max_tokens (Anthropic) and structured output schemas. This targets the most expensive token category with minimal code changes.
- Prune chain-of-thought for production. This requires a conditional flag but yields 3x to 5x output token reductions.
- Add semantic caching. This demands the most infrastructure (embedding generation, a vector store) but delivers the highest long-term savings at scale because it eliminates API calls entirely.
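The list above is an adoption order. At request time the cache lookup runs first, because a hit skips everything else. The following is a minimal sketch of how the four pieces might fit together in one request path. Everything in it is an assumption for illustration: `embed` is a toy bag-of-words stand-in for a real embedding model, `SemanticCache` is an in-memory list rather than a vector store, and `call_model` is a caller-supplied function wrapping whichever provider SDK you use.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    """In-memory cache keyed by embedding similarity (hypothetical helper)."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def lookup(self, query):
        vec = embed(query)
        best = max(self.entries, key=lambda entry: cosine(vec, entry[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def store(self, query, answer):
        self.entries.append((embed(query), answer))


def build_request(query, production=True):
    """Compressed system prompt, optional chain-of-thought pruning, capped output."""
    system = "Extract entities. Return JSON only."
    if production:
        system += " Return only the final answer, no reasoning."
    return {"system": system, "user": query, "max_tokens": 150}


def answer(query, call_model, cache, production=True):
    """Return (result, served_from_cache); only calls the model on a cache miss."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached, True
    result = call_model(build_request(query, production))
    cache.store(query, result)
    return result, False
```

The similarity threshold of 0.9 and the 150-token cap are placeholders. Both should be tuned against your own evaluation set, since a threshold that is too loose returns stale answers for queries that only look similar.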
Estimating Your Savings
The savings formula: (baseline_cost - optimized_cost) / baseline_cost. As an estimate based on the token reductions demonstrated above, prompt compression saves 20% to 40% on input tokens. Output constraints save 30% to 50% on output tokens. Caching saves in proportion to the hit rate: even a 30% hit rate eliminates nearly a third of all API calls.
The 60%+ aggregate figure is realistic when at least three of the four techniques target a workload with repeated query patterns and predictable output shapes. Workloads with highly unique queries and variable-length outputs will see lower caching benefits but can still achieve 40% to 50% savings from compression and output constraints alone.
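To see how the three levers combine, the formula can be wrapped in a small estimator. This is a rough sketch under stated assumptions: a cache hit is treated as costing nothing (embedding and storage costs are ignored), and the reduction percentages are inputs you supply from your own measurements. The function names are illustrative.

```python
def savings(baseline_cost, optimized_cost):
    """Fractional savings: (baseline_cost - optimized_cost) / baseline_cost."""
    return (baseline_cost - optimized_cost) / baseline_cost


def expected_cost_per_call(input_tokens, output_tokens, input_price, output_price,
                           input_reduction=0.0, output_reduction=0.0, cache_hit_rate=0.0):
    """Expected USD cost per request; prices are per million tokens.

    Assumes a cache hit costs nothing, which ignores embedding and storage costs.
    """
    per_call = (
        input_tokens * (1 - input_reduction) * input_price
        + output_tokens * (1 - output_reduction) * output_price
    ) / 1_000_000
    return per_call * (1 - cache_hit_rate)


if __name__ == "__main__":
    # GPT-4o prices and the baseline token counts from the table above.
    base = expected_cost_per_call(580, 350, 2.50, 10.00)
    tuned = expected_cost_per_call(580, 350, 2.50, 10.00,
                                   input_reduction=0.30,
                                   output_reduction=0.40,
                                   cache_hit_rate=0.30)
    print(f"Estimated savings: {savings(base, tuned):.1%}")
```

With mid-range values (30% input reduction, 40% output reduction, 30% cache hit rate) the estimate comes out at about 56%. Dropping the cache term leaves about 37%, which shows how much of the aggregate figure depends on hit rate.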
Start With the Lowest-Hanging Fruit
The four techniques covered here (prompt compression, semantic caching, chain-of-thought pruning, and output length constraints) form a practical framework for LLM token optimization that works across providers and models. The highest-priority first step is not implementing any technique but instrumenting token logging on every API call. Without a baseline measurement, savings cannot be quantified or validated.
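A baseline can start as a small accumulator around each response's usage object. The sketch below assumes the OpenAI Chat Completions field names `prompt_tokens` and `completion_tokens`; Anthropic responses report `input_tokens` and `output_tokens` instead, so the attribute names would need adapting. `UsageTracker` is a hypothetical helper, not part of either SDK.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("llm_usage")


@dataclass
class UsageTracker:
    """Accumulates token counts and cost across calls (hypothetical helper)."""

    input_price: float   # USD per million input tokens
    output_price: float  # USD per million output tokens
    calls: int = 0
    input_tokens: int = 0
    output_tokens: int = 0

    def record(self, usage, label=""):
        """Add one response's usage; expects OpenAI-style attribute names."""
        self.calls += 1
        self.input_tokens += usage.prompt_tokens
        self.output_tokens += usage.completion_tokens
        logger.info(
            "call=%d label=%s prompt_tokens=%d completion_tokens=%d running_cost=$%.6f",
            self.calls, label, usage.prompt_tokens, usage.completion_tokens, self.total_cost,
        )

    @property
    def total_cost(self):
        """Running USD cost for all recorded calls."""
        return (
            self.input_tokens * self.input_price
            + self.output_tokens * self.output_price
        ) / 1_000_000
```

In use, create one tracker per model (for example `UsageTracker(2.50, 10.00)` for the GPT-4o prices quoted earlier) and call `tracker.record(response.usage, label="extract")` after every request. Logging a label per call site makes it easier to see which prompts dominate the bill.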
For implementation details, see the LLMLingua repository, OpenAI’s prompt caching guide, Anthropic’s prompt caching documentation, and Google’s context caching reference. Check current pricing on each provider’s pricing page before running cost projections.
