Table of Contents
- Semantic Caching for LLMs: FastAPI, Redis, and Embeddings
- Introduction: Why Semantic Caching Matters for LLM Systems
- How Semantic Caching Works for LLMs: Embeddings and Similarity Search Explained
- Semantic Caching Architecture and Request Flow
- Configuring Your Environment for Semantic Caching: FastAPI, Redis, and Ollama Setup
- Project Structure
- FastAPI Entry Point for Semantic Caching: Wiring the API Service
- FastAPI Ask Endpoint: End-to-End Semantic Caching Request Flow
- Embeddings: Turning Text into Semantic Vectors
- The Semantic Cache: Cosine Similarity, Redis Storage, and Reusing Meaning
- Cache Entries: What Exactly Gets Stored?
- End-to-End Demo: Verifying Core Cache Behavior
- Summary
Semantic Caching for LLMs: FastAPI, Redis, and Embeddings
In this lesson, you will learn how to build a semantic cache for LLM applications using FastAPI, Redis, and embedding-based similarity search, and how requests flow from exact matches to semantic matches before falling back to the LLM.
This lesson is the 1st in a 2-part series on Semantic Caching for LLMs:
- Semantic Caching for LLMs: FastAPI, Redis, and Embeddings (this tutorial)
- Lesson 2
To learn how to build a semantic cache for LLM applications using embeddings and Redis, just keep reading.
Introduction: Why Semantic Caching Matters for LLM Systems
Cost, Latency, and Redundant LLM Calls
Large language models are powerful, but they are not cheap. Every request to an LLM involves tokenization, inference, decoding, and network overhead. Even when models are hosted locally, response times are measured in hundreds of milliseconds or seconds rather than microseconds.
In real applications, this cost compounds quickly. Users often ask similar questions repeatedly, either across sessions or within the same workflow. Each request is treated as a fresh LLM invocation, even when the underlying intent has already been handled before.
This leads to 3 systemic problems:
- High latency: Users wait for responses that could have been reused instantly
- Increased cost: Identical reasoning is paid for multiple times
- Wasted capacity: LLM throughput is consumed by redundant requests
These issues become especially visible under load, where repeated paraphrased queries can overwhelm an otherwise well-sized system.
Why Exact-Match Caching Breaks Down for Natural Language
Traditional caching assumes that identical inputs produce identical outputs. This works well for APIs, database queries, and deterministic functions. It fails for natural language.
From a string-matching perspective, the following queries are completely unrelated:
- “What is semantic caching?”
- “Can you explain how semantic caching works?”
- “How does caching based on embeddings work for LLMs?”
A traditional cache keyed on raw strings will miss all three. As a result, the system calls the LLM 3 times, even though a human would expect the same answer.
This brittleness causes exact-match caches to have extremely low hit rates in LLM-backed systems. Worse, it gives a false sense of optimization. The cache exists, but it almost never helps in practice.
Where Semantic Caching Fits in Real Systems
Semantic caching addresses this mismatch by caching meaning instead of exact text.
Rather than asking “have I seen this string before?”, a semantic cache asks “have I answered something semantically similar before?”. It does this by converting queries into embeddings and comparing them using a similarity metric such as cosine similarity.
In a real system, semantic caching sits between the application layer and the LLM:
- The application sends a query
- The cache evaluates whether a prior response is reusable
- Only true cache misses reach the LLM
When designed correctly, this layer is invisible to the user. Responses feel faster, costs drop, and the system scales more gracefully without changing the frontend or prompt logic.
This lesson focuses on building that layer explicitly and transparently, using FastAPI, Redis, and embeddings, without hiding the mechanics behind heavy abstractions.

Exact-match caching treats paraphrased queries as unique requests, resulting in repeated LLM calls. Semantic caching groups similar queries by meaning, allowing responses to be reused and reducing both latency and cost.
How Semantic Caching Works for LLMs: Embeddings and Similarity Search Explained
Section 1 explained why semantic caching exists.
This section explains how it works, conceptually, before we touch any FastAPI, Redis, or code.
The goal here is to give the reader a mental execution model they can hold in their head while reading the implementation.
From Text to Meaning: Embeddings as the Cache Key
Semantic caching replaces raw text comparison with vector similarity.
Instead of caching responses under the literal query string, the system converts each query into an embedding: a high-dimensional numeric vector that captures semantic meaning. Queries that are worded differently but mean the same thing produce embeddings that are close together in vector space.
This is what allows the cache to recognize paraphrases as equivalent:
- “How do I reset my password?”
- “I forgot my password, what should I do?”
- “Guide me through password recovery”
Exact strings differ. Embeddings do not.
At a high level, semantic caching works by:
- Generating an embedding for the incoming query
- Comparing it against embeddings stored in the cache
- Reusing a cached response if similarity is high enough
The similarity metric used in this lesson is cosine similarity, which measures the angle between two vectors rather than their raw magnitude.
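As a quick illustration, cosine similarity can be computed with a few lines of NumPy. This is a minimal sketch; the toy 3-dimensional vectors are made up for readability, whereas real embeddings have hundreds of dimensions.
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity = dot product divided by the product of the magnitudes.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = [0.9, 0.1, 0.3]      # toy "embedding" of the incoming query
cached_vec = [0.85, 0.15, 0.35]  # toy "embedding" of a cached query
print(cosine_similarity(query_vec, cached_vec))  # close to 1.0 => very similar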
Why a Layered Cache Beats Semantic-Only Caching
While semantic matching is powerful, it is also computationally expensive.
Embedding generation requires a model call. Similarity search requires vector math. Doing this for every request, even when the exact same query has already been seen, would be wasteful.
That is why this lesson uses a layered caching strategy.
Layer 1: Exact Match (Fast Path)
The query is normalized and hashed.
If the same query has already been answered, the response is returned immediately.
- No embedding generation
- No similarity computation
- Minimal latency
This handles repeated identical queries efficiently.
Layer 2: Semantic Match (Flexible Path)
If no exact match exists, the query is embedded and compared against cached embeddings.
This layer catches:
- paraphrases
- minor wording variations
- reordered phrases
Semantic matches trade compute cost for much higher cache hit rates.
Layer 3: LLM Fallback (Slow Path)
If neither exact nor semantic matches succeed, the request is forwarded to the LLM.
The response is then stored in the cache so future requests can reuse it.
This layered approach ensures (a toy sketch of the ordering follows below):
- the cheapest checks happen first
- expensive operations are only used when necessary
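To make the ordering concrete, here is a self-contained toy version of the three layers using in-memory dictionaries instead of Redis, a fake character-count "embedding" instead of a real model, and a stubbed LLM call. It only illustrates the layering; none of these helpers exist in the real project.
from collections import Counter

exact_store: dict[str, str] = {}                 # normalized query -> response
semantic_store: list[tuple[Counter, str]] = []   # (fake embedding, response)

def fake_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: character frequencies.
    return Counter(text.lower())

def overlap(a: Counter, b: Counter) -> float:
    # Crude similarity in [0, 1] between two fake embeddings.
    shared = sum((a & b).values())
    return shared / max(sum(a.values()), sum(b.values()), 1)

def answer(query: str) -> str:
    key = " ".join(query.lower().split())
    if key in exact_store:                        # Layer 1: exact match (cheap)
        return exact_store[key]
    emb = fake_embed(query)
    for cached_emb, response in semantic_store:   # Layer 2: semantic match
        if overlap(emb, cached_emb) >= 0.8:
            return response
    response = f"LLM answer for: {query}"         # Layer 3: LLM fallback (stubbed)
    exact_store[key] = response
    semantic_store.append((emb, response))
    return response

print(answer("What is semantic caching?"))   # slow path (stubbed LLM)
print(answer("what is semantic caching?"))   # fast path (exact match)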
Confidence, Freshness, and Cache Safety
Semantic similarity alone is not enough to decide whether a cached response should be reused.
This lesson introduces the idea of confidence scoring, which combines:
- Similarity: how close the embeddings are
- Freshness: how old the cached entry is
A highly similar but stale response should not necessarily be trusted. Likewise, a fresh response with low similarity should be rejected.
In addition, cached entries are validated to prevent:
- expired responses
- poisoned entries (errors, empty outputs)
These checks ensure the cache improves correctness and performance rather than degrading them.
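One simple way to combine the two signals is a weighted sum with a linear age decay. The weights and decay below are illustrative assumptions, not the exact formula used in the lesson's code, but they show the shape of the idea.
import time

def confidence_score(similarity: float, created_at: int, ttl: int) -> float:
    # Freshness decays linearly from 1.0 (just created) to 0.0 (expired).
    age = max(0.0, time.time() - created_at)
    freshness = max(0.0, 1.0 - age / ttl)
    # Weight similarity more heavily than freshness (weights are assumptions).
    return 0.7 * similarity + 0.3 * freshness

# A very similar but half-expired entry scores lower than a similar, fresh one.
now = int(time.time())
print(confidence_score(0.95, created_at=now, ttl=3600))
print(confidence_score(0.95, created_at=now - 1800, ttl=3600))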
Incoming queries first attempt an exact-match lookup, then fall back to semantic similarity search using embeddings, and finally call the LLM only on a cache miss. This ordering minimizes latency and unnecessary model calls.
Note: In this lesson, we implement this flow using Redis as a simple embedding store with linear similarity scans, rather than a dedicated vector database.
Semantic Caching Architecture and Request Flow
In Section 2, you learned how semantic caching works conceptually.
In this section, we map that mental model to a real request flow in an LLM-backed service.
The goal is to answer one question clearly:
What happens, step by step, when a user sends a request to this system?
We will stay implementation-aware, but not code-specific yet. That comes next.
High-Level System Components
At a high level, the system consists of 5 logical components:
- API layer: Receives user requests and orchestrates the caching pipeline.
- Exact-match cache: Performs fast hash-based lookups for identical queries.
- Embedding model: Converts text queries into semantic vectors when needed.
- Semantic cache: Stores embeddings and responses and performs similarity matching.
- LLM: Acts as the final fallback when no cache entry is suitable.
Each component has a narrowly defined responsibility. This separation is intentional and keeps the system easy to reason about and extend.
In this implementation:
- The API layer is built using FastAPI and acts as the orchestration point.
- Redis is used as the backing store for both exact-match and semantic cache layers.
- Ollama provides both embedding generation and LLM inference locally.
These choices keep the system lightweight, self-contained, and easy to reason about while still reflecting real production patterns.
End-to-End Request Flow
When a user sends a query, the system processes it in the following order.
Step 1: Request enters the API
The API receives a text query along with optional flags, such as whether to use bypass_cache. Input validation happens immediately to prevent meaningless or malformed queries from entering the pipeline.
This ensures the cache is not polluted with empty or invalid entries.
Step 2: Exact-match cache lookup
The query is normalized and hashed.
The system checks whether an identical query has already been answered.
- If an exact match exists and is valid, the response is returned immediately.
- No embeddings are generated.
- The LLM is not touched.
This is the fastest possible path through the system.
Step 3: Embedding generation
If the exact-match lookup fails, the query is passed to the embedding model.
The model converts the text into a numeric vector that captures semantic meaning. This vector becomes the key for semantic comparison.
This step is intentionally skipped when an exact match succeeds.
Step 4: Semantic cache lookup
The embedding is compared against cached embeddings using a similarity metric.
A cached response is reused only if:
- similarity exceeds a defined threshold
- the entry has not expired
- the entry is not poisoned
- the computed confidence is high enough
If a suitable match is found, the response is returned to the user without calling the LLM.
Step 5: LLM fallback and cache population
If both cache layers miss, the request is forwarded to the LLM.
Once a response is generated:
- it is returned to the user
- it is stored in the cache with metadata, timestamps, and TTL (Time To Live)
This ensures future requests can reuse the result.
Why This Architecture Works Well
This architecture is intentionally conservative and explicit.
- Cheap operations happen first.
- Expensive operations are deferred.
- Every step is observable and debuggable.
- No component hides complexity behind opaque abstractions.
Most importantly, the system degrades gracefully. Even if the cache provides no benefit, the request still succeeds via the LLM.

User queries enter the API, attempt an exact-match lookup, fall back to semantic similarity search using embeddings, and call the LLM only when both cache layers miss. Successful LLM responses are stored for future reuse.
Configuring Your Environment for Semantic Caching: FastAPI, Redis, and Ollama Setup
To follow this guide, you need a small set of Python libraries and system services that support API orchestration, vector similarity, and LLM interaction. The goal is to keep the environment lightweight, reproducible, and easy to reason about.
At a minimum, you will need:
- Python 3.10 or newer
- Redis (used as the cache backing store)
- An LLM + embedding provider (Ollama in this tutorial)
All required Python dependencies are pip-installable.
Installing Python Dependencies
Create and activate a virtual environment (recommended), then install the required packages:
$ pip install fastapi uvicorn redis httpx python-dotenv numpy
These libraries provide the following functionality:
- fastapi: API layer and request orchestration
- uvicorn: ASGI server for running the service
- redis: Client communication with the cache store
- httpx: HTTP client for embedding and LLM calls
- numpy: Vector math for cosine similarity
- python-dotenv: Environment-based configuration
Verifying Redis
This lesson assumes Redis is running locally on the default port.
You can verify Redis is reachable with:
$ redis-cli ping
PONG
If Redis is not installed, you can start it quickly using Docker (or you can spin it up using the docker-compose.yml we provide in the code zip):
$ docker run -p 6379:6379 redis:7
Setting Up Ollama
This system uses Ollama for both embedding generation and LLM inference. Make sure Ollama is installed and running, and that the required models are available.
For example:
$ ollama pull nomic-embed-text
$ ollama pull llama3.2
Once running, Ollama exposes local HTTP endpoints that the application will call directly for embeddings and text generation.
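You can sanity-check that Ollama is reachable before wiring it into the application. For example, the embeddings endpoint (the same one the embedder calls later in this lesson) can be exercised directly; the default Ollama port is 11434, so adjust the host/port if your setup differs:
$ curl http://localhost:11434/api/embeddings \
    -d '{"model": "nomic-embed-text", "prompt": "hello world"}'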
Need Help Configuring Your Development Environment?

All that said, are you:
- Short on time?
- Learning on your employer’s administratively locked system?
- Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?
- Ready to run the code right now on your Windows, macOS, or Linux system?
Then join PyImageSearch University today!
Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides pre-configured to run on Google Colab’s ecosystem right in your web browser! No installation required.
And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!
Project Structure
Before diving into individual components, let’s take a moment to understand how the project is organized.
A clear directory structure is especially important in LLM-backed systems, where responsibilities span API orchestration, caching, embeddings, model calls, and observability. In this project, each concern is isolated into its own module so the request flow stays easy to trace and reason about.
After downloading the source code from the “Downloads” section, your directory structure should look like this:
.
├── app
│   ├── api
│   │   ├── __init__.py
│   │   └── ask.py
│   ├── cache
│   │   ├── __init__.py
│   │   ├── poisoning.py
│   │   ├── schemas.py
│   │   ├── semantic_cache.py
│   │   └── ttl.py
│   ├── config
│   │   ├── __init__.py
│   │   └── settings.py
│   ├── embeddings
│   │   ├── __init__.py
│   │   └── embedder.py
│   ├── llm
│   │   ├── __init__.py
│   │   └── ollama_client.py
│   ├── main.py
│   └── observability
│       └── metrics.py
├── complete-codebase.txt
├── docker-compose.yml
├── Dockerfile
├── README.md
└── requirements.txt
Let’s break this down at a high level.
The app/ Package
The app/ directory contains all runtime application code. Nothing outside this folder is imported at execution time.
This keeps the service self-contained and makes it easy to reason about deployment and dependencies.
app/main.py: Application Entry Point
This file defines the FastAPI application and registers all routers.
It contains no business logic, only service wiring. Every request into the system enters through this file.
app/api/: API Layer
The api/ package defines HTTP-facing endpoints.
ask.py: Implements the /ask endpoint and acts as the orchestration layer for the entire semantic caching pipeline.
The API layer is responsible for:
- input validation
- enforcing cache ordering
- coordinating cache, embeddings, and LLM calls
- returning structured debug information
It does not implement caching or similarity logic directly.
app/cache/: Caching Logic
This package contains all cache-related functionality.
- semantic_cache.py: Core semantic cache implementation (exact match, semantic match, Redis storage, similarity search).
- schemas.py: Defines the cache entry schema used for Redis storage.
- ttl.py: Application-level TTL configuration and expiration checks.
- poisoning.py: Safety checks to prevent invalid or error responses from being reused.
By isolating caching logic here, the API layer stays clean and reusable.
app/embeddings/: Embedding Generation
embedder.py: Handles embedding generation via Ollama’s embedding endpoint.
This module has a single responsibility: convert text into semantic vectors.
It does not cache, rank, or validate embeddings.
app/llm/: LLM Client
ollama_client.py: Wraps calls to the Ollama text-generation endpoint.
Keeping LLM interaction isolated allows the rest of the system to remain model-agnostic.
app/observability/: Metrics
metrics.py: Implements simple in-memory counters for cache hits, misses, and LLM calls.
These metrics are intentionally lightweight and meant for learning and debugging, not production monitoring.
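For orientation, a minimal version of such a metrics module could be as small as a handful of module-level counters. This is a sketch of the idea, not necessarily the exact contents of metrics.py:
# app/observability/metrics.py (illustrative sketch)
_counters = {"cache_hits": 0, "cache_misses": 0, "llm_calls": 0}

def cache_hit():
    _counters["cache_hits"] += 1

def cache_miss():
    _counters["cache_misses"] += 1

def llm_call():
    _counters["llm_calls"] += 1

def snapshot() -> dict:
    # Return a copy so callers cannot mutate the internal counters.
    return dict(_counters)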
Configuration and Infrastructure
Outside the app/ directory:
- config/settings.py: Centralizes environment-based configuration (Redis host, TTLs, model names). A sketch of what this can look like follows this list.
- Dockerfile and docker-compose.yml: Define a reproducible runtime environment for the API and Redis.
- requirements.txt: Lists all Python dependencies required to run the service.
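An environment-driven settings module might look roughly like the following. This is a sketch assuming plain environment variables loaded with python-dotenv; the variable names for the TTL and LLM model are assumptions, and the real file may use different names or defaults:
# config/settings.py (illustrative sketch)
import os
from dotenv import load_dotenv

load_dotenv()  # read values from a local .env file if present

REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = int(os.getenv("REDIS_PORT", "6379"))
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "localhost")
OLLAMA_PORT = int(os.getenv("OLLAMA_PORT", "11434"))
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "nomic-embed-text")
LLM_MODEL = os.getenv("LLM_MODEL", "llama3.2")                    # assumed name
CACHE_TTL_SECONDS = int(os.getenv("CACHE_TTL_SECONDS", "3600"))   # assumed name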
FastAPI Entry Point for Semantic Caching: Wiring the API Service
Before we look at caching logic, embeddings, or Redis, it is important to understand how the service itself is wired together. Every request to the semantic cache enters the system through a single FastAPI application, defined in app/main.py.
This file acts as the entry point of the service. Its responsibility is not to implement business logic, but to connect the application components and expose HTTP routes.
Application Entry Point (app/main.py)
from fastapi import FastAPI

from api.ask import router as ask_router

app = FastAPI(title="Semantic Cache Fundamentals")
app.include_router(ask_router)
Let’s break this down.
The FastAPI() call creates the application object. This object represents the entire web service and is what the ASGI (Asynchronous Server Gateway Interface) server (uvicorn) runs when the container starts.
The application itself contains no knowledge of caching, embeddings, or LLMs. It simply defines a runtime container that can host these capabilities.
Router Registration
Instead of defining endpoints directly in main.py, the application imports a router from api/ask.py and registers it using include_router().
This pattern serves several purposes:
- Separation of concerns: Routing and request handling live outside the application entry point.
- Scalability: As the system grows, additional routers (for health checks, metrics, or admin endpoints) can be added without modifying core application wiring.
- Readability: main.py stays easy to understand at a glance, even as the codebase expands.
At runtime, FastAPI merges the routes defined in ask_router into the main application. When a request arrives at the /ask endpoint, FastAPI resolves it through the registered router and forwards it to the appropriate handler function.
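For context, the router that main.py imports is created inside api/ask.py. A stripped-down sketch of that wiring (with the handler body and response model omitted, since they are covered in the next section) looks like this:
# api/ask.py (wiring only; the real handler and models are shown later)
from fastapi import APIRouter

router = APIRouter()

@router.post("/ask")
def ask_endpoint():
    ...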
Why This Matters
Keeping the entry point minimal is intentional. It ensures that:
- The application startup process is predictable
- Routing logic is easy to trace
- Core functionality can evolve independently of service wiring
With the application structure in place, we can now focus on what actually happens when a request reaches the system.
In the next section, we will walk through the /ask endpoint and see how it orchestrates exact-match caching, semantic search, and LLM fallback step by step.
FastAPI Ask Endpoint: End-to-End Semantic Caching Request Flow
This section makes the architecture concrete. We now walk through the /ask endpoint, which orchestrates the entire semantic caching pipeline from request arrival to response delivery.
The goal here is not to memorize code, but to understand why each step exists, where it lives, and how it protects performance, cost, and correctness.
The Role of the Ask Endpoint
The Ask endpoint is the control plane of the system.
It does not:
- Compute similarity
- Store embeddings
- Talk directly to Redis internals
Instead, it:
- Validates input
- Decides which cache layers to consult
- Enforces ordering between cheap and expensive operations
- Collects observability signals
- Guarantees a response even on cache failure
This separation is intentional. Cache logic stays reusable and testable, while orchestration logic remains explicit at the API boundary.
Defining the API Contract
We begin by defining the request and response models.
class AskRequest(BaseModel):
    query: str
    bypass_cache: bool = False
The request consists of a user query and an optional bypass_cache flag. This flag allows us to force a cache miss during debugging or testing, ensuring that the LLM and embedding pipeline still function correctly.
Before the request ever reaches the cache, the query field is validated.
@field_validator('query')
@classmethod
def validate_query(cls, v: str) -> str:
    if not v or not v.strip():
        raise ValueError("Query cannot be empty or whitespace-only")
    return v.strip()
This validation step protects the system at the boundary. Rejecting empty or whitespace-only queries prevents:
- wasted embedding computation
- cache pollution with meaningless entries
- unnecessary LLM calls
This is a recurring pattern in production systems: fail fast, before expensive operations are triggered.
class AskResponse(BaseModel):
    response: str
    from_cache: bool
    similarity: float
    debug: dict
The response model intentionally exposes diagnostic information via fields such as from_cache, similarity, and debug. During development, this makes cache behavior transparent rather than opaque.
Initializing the Cache
Before handling requests, we create a SemanticCache instance:
cache = SemanticCache()
The endpoint itself remains stateless. All persistence and reuse live inside the cache layer.
Step 1: Entering the Endpoint
The endpoint is registered using FastAPI’s routing mechanism:
@router.post("/ask", response_model=AskResponse)
def ask_endpoint(request: AskRequest):
FastAPI automatically validates incoming requests and outgoing responses using the schemas defined earlier. If invalid data enters or exits the system, FastAPI raises an error instead of silently failing.
Inside the handler, we extract the query and initialize tracking state:
query = request.query
miss_reason = None
The miss_reason variable exists purely for observability. Rather than treating cache misses as a black box, we explicitly track why a miss occurred.
Step 2: Exact-Match Cache Lookup (Fast Path)
The first decision point is the exact-match cache lookup:
if not request.bypass_cache:
    cached = cache.search(None, exact_query=query)
This is the cheapest path through the system.
If the same query has already been answered, the response can be returned immediately:
- no embeddings are generated
- no similarity computation happens
- the LLM is not touched
If a cached entry is found, it is validated:
if is_expired(cached):
    miss_reason = "expired"
elif is_poisoned(cached):
    miss_reason = "poisoned"
elif cached.get("confidence", 0.0) < 0.7:
    miss_reason = "low_confidence"
Only entries that are fresh, valid, and confident are allowed to short-circuit the pipeline.
When all checks pass, the endpoint returns immediately:
metrics.cache_hit()
return AskResponse(...)
This path typically completes in milliseconds and handles repeated identical queries efficiently.
Step 3: Embedding Generation (Escalation Point)
If the exact-match lookup fails or is bypassed, the endpoint escalates:
embedding = embed_text(query)
Embedding generation is expensive, even when running locally. For this reason, it is intentionally delayed until all cheaper options have been exhausted.
This single design choice has a significant impact on system efficiency.
Step 4: Semantic Cache Lookup
With the embedding available, the endpoint attempts a semantic search:
cached = cache.search(embedding)
This path catches paraphrased and reworded queries. As before, cached entries are validated to ensure they are safe to reuse.
If a suitable match is found, the response is returned without calling the LLM.
Step 5: Explicit Cache Bypass
The bypass_cache flag is handled explicitly:
if request.bypass_cache:
    miss_reason = "bypass"
This allows controlled testing and debugging without modifying code or disabling cache logic globally.
Step 6: LLM Fallback and Cache Population
If both cache layers miss, the request is forwarded to the LLM:
metrics.cache_miss()
response = generate_llm_response(query)
metrics.llm_call()
This is the slowest path through the system, but it guarantees correctness.
Successful responses are stored in the cache:
if not response.startswith("[LLM Error]"):
    cache.store(query, embedding, response, metadata=metadata)
Responses beginning with [LLM Error] are intentionally not cached, preventing cache poisoning and ensuring failures do not propagate to future requests.
Control Flow Summary
The endpoint follows a simple, explicit sequence:

Every expensive operation is deferred until absolutely necessary.
Embeddings: Turning Text into Semantic Vectors
Up to this point, we have treated embeddings as a black box: something expensive that we try to avoid unless absolutely necessary.
In this section, we will open that box just enough to understand what embeddings are, when they are generated, and why they enable semantic caching, without diving into vector math or model internals.
Why Embeddings Exist in This System
Exact-match caching works only when queries are identical at the string level. As soon as wording changes, exact matching breaks down.
Embeddings solve this problem by converting text into a numeric representation that captures meaning rather than surface form.
Queries that mean the same thing tend to produce vectors that are close together in vector space, even when their wording differs considerably.
This is the foundation that makes semantic caching possible.
Embedding Generation Happens on Demand
In our implementation, embeddings are generated only after the exact-match cache fails.
This decision is intentional.
Embedding generation involves:
- a model invocation
- network overhead
- serialization and deserialization
- non-trivial latency
Because of this cost, embeddings are treated as an escalation step, not a default operation.
This is why the /ask endpoint first attempts an exact-match lookup before calling embed_text().
The embed_text Function
def embed_text(text: str):
This function has one responsibility: convert input text into a semantic vector representation.
It does not perform caching, similarity search, or validation. Those concerns live elsewhere.
Calling the Embedding Model
url = f"http://{settings.OLLAMA_HOST}:{settings.OLLAMA_PORT}/api/embeddings"
Here, we construct the Ollama embedding endpoint using configuration values (e.g., settings.OLLAMA_HOST, settings.OLLAMA_PORT, etc.).
This allows the embedding service to run locally, inside Docker, or on a remote host without changing code.
resp = httpx.post(
    url,
    json={"model": settings.EMBEDDING_MODEL, "prompt": text},
    timeout=10.0
)
This request sends 2 key pieces of information to the embedding service:
- the embedding model name (e.g., nomic-embed-text)
- the input text to embed
The timeout ensures the request does not hang indefinitely. Embedding generation is expensive, but it should still fail fast if something goes wrong.
Handling the Response
resp.raise_for_status()
return resp.json().get("embedding", [])
If the request succeeds, the embedding model returns a numeric vector, typically a list of floating-point values.
This vector represents the semantic meaning of the input text and becomes the key used for similarity comparison in the cache.
At this stage, we treat the vector as an opaque object. We do not inspect its dimensionality or normalize it here.
Error Handling Strategy
except Exception as e:
    raise RuntimeError(f"Failed to generate embedding: {e}")
If embedding generation fails for any reason (network issues, model errors, timeouts), the function raises an exception.
This is intentional.
If embeddings cannot be generated, the system cannot safely perform semantic matching. Silently continuing would lead to unpredictable behavior, so we fail loudly instead.
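Putting the snippets above back together, the whole function is only a handful of lines. This reassembled version mirrors the excerpts shown in this section; the import path for settings is an assumption about how the modules resolve at runtime:
import httpx

from config import settings  # assumed import path

def embed_text(text: str):
    # Build the Ollama embeddings endpoint from configuration.
    url = f"http://{settings.OLLAMA_HOST}:{settings.OLLAMA_PORT}/api/embeddings"
    try:
        resp = httpx.post(
            url,
            json={"model": settings.EMBEDDING_MODEL, "prompt": text},
            timeout=10.0,
        )
        resp.raise_for_status()
        return resp.json().get("embedding", [])
    except Exception as e:
        # Fail loudly: without an embedding, semantic matching is unsafe.
        raise RuntimeError(f"Failed to generate embedding: {e}")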
Why the Embedder Is Intentionally Simple
Notice what this function does not do:
- it does not store embeddings
- it does not perform similarity search
- it does not retry failed requests
- it does not fall back to other models
These decisions are deliberate.
For Lesson 1, the embedder exists purely to convert text into vectors. Keeping it small and focused makes the system easier to understand and test.
How the Embedder Is Used in the Pipeline
At runtime, the embedder is called only when necessary:
- The exact-match cache fails
- The query is passed to embed_text()
- The returned vector is sent to the semantic cache
- Similarity is computed against stored embeddings
This ensures embeddings are generated only when cheaper paths have already failed.
Key Takeaways
- Embeddings are generated via a simple HTTP call to a local model
- The embedder has a single responsibility
- Errors are surfaced immediately
- Embeddings act as semantic keys for cache lookup
With embedding generation understood, we are now ready to look at the semantic cache itself: how embeddings and responses are stored, scanned, and matched.
In the next section, we will walk through the semantic cache implementation, starting with a deliberately naive but correct linear scan approach.
The Semantic Cache: Cosine Similarity, Redis Storage, and Reusing Meaning
At this point, we understand how queries enter the system and how text is converted into embeddings. What remains is the component that ties everything together: the semantic cache itself.
The semantic cache is responsible for 2 things:
- Storing past queries, embeddings, and responses
- Retrieving the best reusable response for a new query
In Lesson 1, we deliberately implement the cache in the simplest correct way possible: a linear scan over cached entries. This keeps the implementation easy to reason about and makes the request flow fully transparent.
The Semantic Cache Module
The cache logic lives in semantic_cache.py:
class SemanticCache:
This class encapsulates all Redis interaction and similarity logic. The API layer never talks to Redis directly.
Initializing the Cache
def __init__(self):
    self.r = redis.Redis(
        host=settings.REDIS_HOST,
        port=settings.REDIS_PORT,
        decode_responses=True
    )
    self.similarity_threshold = 0.85
    self.namespace = "semantic_cache:v1"
Here we establish a Redis connection and configure 2 important parameters:
- Similarity threshold: Only responses with sufficiently high semantic similarity are eligible for reuse.
- Namespace prefix: All Redis keys are namespaced to avoid collisions and allow future versioning.
For Lesson 1, the exact threshold value is not critical. What matters is that a threshold exists and is applied consistently.
Storing Cache Entries
The first core operation is storing new entries.
def store(self, query, embedding, response, metadata=None):
This method is called only after a successful LLM response.
Creating a Cache Entry
entry = CacheEntry(
    id=entry_uuid,
    query=query,
    query_hash=query_hash,
    embedding=json.dumps(embedding),
    response=response,
    created_at=int(time.time()),
    ttl=default_ttl(),
    metadata=metadata or {}
)
Each cache entry stores:
- the original query
- a normalized query hash (used for exact matching)
- the embedding (serialized for Redis storage)
- the LLM response
- timestamps and TTL
- optional metadata for observability
This structure allows the cache to support both exact-match and semantic lookups.
Writing to Redis
self.r.hset(redis_key, mapping=entry.dict())
self.r.sadd(f"{self.namespace}:keys", redis_key)
Each cache entry is stored as a Redis hash, and all entry keys are tracked in a Redis set.
This allows the cache to iterate over all entries during search operations.
For Lesson 1, this approach is intentionally simple and explicit.
Searching the Cache
The second core operation is lookup.
def search(self, embedding, exact_query=None):
This method supports 2 search modes, which map directly to the layered cache strategy used in the API.
Exact-Match Lookup (Fast Path)
if exact_query:
    query_hash = self._hash_query(exact_query)
When an exact query is provided, the cache first attempts a hash-based lookup.
Each cached entry is scanned until a matching hash is found. If found, the entry is returned immediately with a similarity score of 1.0.
No embeddings are involved in this path.
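The hash helper typically normalizes the query before hashing so that trivial differences map to the same key. The exact normalization in _hash_query may differ; shown here as a standalone sketch using lowercasing, whitespace collapsing, and SHA-256:
import hashlib

def _hash_query(query: str) -> str:
    # Normalize so that case, surrounding whitespace, and repeated spaces
    # do not produce different cache keys.
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()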
Semantic Lookup (Flexible Path)
If no exact match is found and an embedding is provided, the cache performs a semantic search:
sim = self.cosine_similarity(query_embedding, cached_embedding)
Each cached embedding is compared against the query embedding using cosine similarity.
Only entries that exceed the configured similarity threshold are considered candidates.
Selecting the Best Match
During the scan, the cache tracks the highest similarity score and returns the best matching entry.
This ensures that even when multiple entries are similar, the most relevant response is reused.
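Expressed as a method on SemanticCache, the scan-and-select logic can look roughly like this. The helper name _best_semantic_match is hypothetical, and the field names follow the cache entry schema described in the next section:
import json

def _best_semantic_match(self, query_embedding):
    # Hypothetical helper: linear scan over every cached entry.
    best_entry, best_sim = None, 0.0
    for key in self.r.smembers(f"{self.namespace}:keys"):
        entry = self.r.hgetall(key)
        if not entry:
            continue
        cached_embedding = json.loads(entry["embedding"])
        sim = self.cosine_similarity(query_embedding, cached_embedding)
        # Only candidates above the threshold count; keep the best one seen.
        if sim >= self.similarity_threshold and sim > best_sim:
            best_entry, best_sim = entry, sim
    return best_entry, best_sim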
Why This Implementation Is O(N)
Every search scans all cached entries.
This is not an accident.
For Lesson 1, a linear scan has 3 advantages:
- the behavior is easy to understand
- the logic is fully visible
- debugging is straightforward
More advanced indexing strategies belong in later lessons.
Why Expired Entries Are Cleaned During Search
While scanning entries, expired items are removed opportunistically.
This prevents stale data from accumulating indefinitely without introducing background workers or schedulers.
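The expiration check itself is a simple timestamp comparison. A sketch of what ttl.py's is_expired can look like (field names follow the cache entry schema; the real implementation may differ in details):
import time

def is_expired(entry: dict) -> bool:
    # An entry is expired once its age exceeds its TTL.
    created_at = int(entry.get("created_at", 0))
    ttl = int(entry.get("ttl", 0))
    return (time.time() - created_at) > ttl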
Key Takeaways
- The semantic cache owns all Redis interactions
- Exact-match lookup is attempted before semantic matching
- Semantic similarity is computed using embeddings
- A linear scan trades performance for clarity
- The cache returns the best reusable response, not just the first match
At this stage, the system is fully functional: queries can be answered, cached, and reused.
Cache Entries: What Exactly Gets Stored?
So far, we have treated the cache as a logical concept: something that stores queries, embeddings, and responses.
In this section, we will make that concrete by examining the structure of a cache entry. Understanding this structure is important because it explains why the cache can support both exact-match and semantic lookup without duplicating data or logic.
The Cache Entry Schema
Cache entries are defined using a Pydantic model:
class CacheEntry(BaseModel):
    id: str
    query: str
    query_hash: str
    embedding: str
    response: str
    created_at: int
    ttl: int
    metadata: Optional[Dict] = Field(default_factory=dict)
Each field exists for a specific reason. Let’s walk through them one by one.
Identification and Query Fields
id: str
query: str
query_hash: str
- id: uniquely identifies the cache entry and is used to construct the Redis key.
- query: stores the original user input. This is useful for debugging and inspection.
- query_hash: stores a normalized hash of the query and enables exact-match lookup.
At this stage, it is enough to know that the hash ensures identical queries can be matched quickly. We will revisit how and why this normalization matters in a later lesson.
Embedding Storage
embedding: str
Embeddings are stored as a JSON-serialized string, not as a raw Python list.
This choice is deliberate:
- Redis stores strings efficiently
- Serialization keeps the schema simple
- Deserialization happens only when similarity needs to be computed
For Lesson 1, the important takeaway is that embeddings are stored once, alongside the response they produced.
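Concretely, the round trip is plain JSON serialization on write and deserialization on read, as a minimal sketch shows:
import json

embedding = [0.12, -0.03, 0.88]   # toy example; real vectors are much longer
stored = json.dumps(embedding)    # what goes into the Redis hash field
restored = json.loads(stored)     # what the similarity code works with
assert restored == embedding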
Response and Timing Information
response: str
created_at: int
ttl: int
- response: the text returned by the LLM.
- created_at: records when the entry was generated.
- ttl: defines how long the entry is considered valid.
The cache does not rely on Redis expiration here. Instead, validity is checked at read time. This gives the application full control over when an entry should be reused or rejected.
We intentionally avoid deeper TTL semantics in this lesson.
Metadata and Safety
metadata: Optional[Dict] = Field(default_factory=dict)
Metadata allows the cache to store contextual information such as:
- pipeline name
- model identifier
- request origin
Using default_factory=dict avoids shared mutable state across cache entries, a subtle but important correctness detail.
At this stage, metadata is informational rather than functional.
Why This Schema Works Well
This schema supports the layered caching strategy naturally:
- Exact match uses query_hash
- Semantic match uses embedding
- Freshness checks use created_at and ttl
- Safety checks use response and metadata
All required information is co-located in a single cache entry, making lookup and validation straightforward.
End-to-End Demo: Verifying Core Cache Behavior
In this section, we will verify that the semantic cache behaves as expected under a small set of controlled conditions.
These examples are meant to be run locally by the reader. The responses shown below are representative and may differ slightly depending on the model and configuration.
Demo Case 1: Cold Request (LLM Fallback)
We begin with a query that has not been seen before.
curl -X POST http://localhost:8000/ask \
    -H "Content-Type: application/json" \
    -d '{"query": "What is semantic caching?"}'
Expected behavior
- Exact-match cache miss
- Semantic cache miss
- LLM call
- Cache population
Response

The key signal here is "from_cache": false, confirming the request fell back to the LLM.
Demo Case 2: Exact-Match Cache Hit
Now we send the same query again.
curl -X POST http://localhost:8000/ask \
    -H "Content-Type: application/json" \
    -d '{"query": "What is semantic caching?"}'
Expected behavior
- Exact-match cache hit
- No embedding generation
- No LLM call
Example response

Here, the cache reused the response immediately using an exact-match lookup.
Optional Demo: Whitespace Normalization
curl -X POST http://localhost:8000/ask \
    -H "Content-Type: application/json" \
    -d '{"query": " What is semantic caching? "}'
This will hit the exact-match cache due to query normalization.
Demo Case 3: Semantic Cache Hit (Paraphrased Query)
Next, we send a paraphrased version of the original query.
curl -X POST http://localhost:8000/ask \
    -H "Content-Type: application/json" \
    -d '{"query": "Can you explain how semantic caching works?"}'
Expected behavior
- Exact-match cache miss
- Embedding generation
- Semantic cache hit
- No LLM call
Example response

Although the query text is different, the cache successfully reused the response based on semantic similarity.
Demo Case 4: Forcing a Cache Miss with bypass_cache
The bypass_cache flag allows us to force the system to skip both cache layers.
curl -X POST http://localhost:8000/ask \
    -H "Content-Type: application/json" \
    -d '{"query": "What is semantic caching?", "bypass_cache": true}'
Expected behavior
- Exact-match cache skipped
- Semantic cache skipped
- LLM called unconditionally
Example response

Forcing a cache miss with bypass_cache ensures the LLM pipeline executes independently of cached responses (source: image by the author). This is useful for debugging and validating that the LLM pipeline still works independently of the cache.
Observing Cache Metrics (Optional)
You can inspect basic cache statistics using the /internal/metrics endpoint:
curl http://localhost:8000/internal/metrics
Example response

These metrics make cache behavior observable without requiring external tooling.
If you can reproduce these behaviors locally, you have successfully implemented a working semantic cache.
In the next lesson, we will take this system and begin hardening it for real-world use.
What’s next? We recommend PyImageSearch University.
86+ total classes • 115+ hours of on-demand code walkthrough videos • Last updated: April 2026
★★★★★ 4.84 (128 Ratings) • 16,000+ Students Enrolled
I strongly believe that if you had the right teacher you could master computer vision and deep learning.
Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?
That’s not the case.
All you need to master computer vision and deep learning is for someone to explain things to you in simple, intuitive terms. And that’s exactly what I do. My mission is to change education and how complex Artificial Intelligence topics are taught.
If you’re serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to successfully and confidently apply computer vision to your work, research, and projects. Join me in computer vision mastery.
Inside PyImageSearch University you’ll find:
- ✓ 86+ courses on essential computer vision, deep learning, and OpenCV topics
- ✓ 86 Certificates of Completion
- ✓ 115+ hours of on-demand video
- ✓ Brand new courses released regularly, ensuring you can keep up with state-of-the-art techniques
- ✓ Pre-configured Jupyter Notebooks in Google Colab
- ✓ Run all code examples in your web browser, works on Windows, macOS, and Linux (no dev environment configuration required!)
- ✓ Access to centralized code repos for all 540+ tutorials on PyImageSearch
- ✓ Easy one-click downloads for code, datasets, pre-trained models, etc.
- ✓ Access on mobile, laptop, desktop, etc.
Summary
In this lesson, we built a complete semantic caching system for LLM applications from the ground up. We started by wiring a FastAPI service and defining a clean request–response contract, then implemented a layered caching strategy that prioritizes cheap exact-match lookups before escalating to semantic similarity and, finally, LLM inference.
We walked through how text queries are converted into embeddings on demand, how cached responses and embeddings are stored in Redis, and how the cache decides whether a prior response can be safely reused. By keeping the implementation deliberately simple and explicit, every step in the request flow remains observable and easy to reason about.
Finally, we verified the system end-to-end by running controlled demos: a cold request falling back to the LLM, an exact-match cache hit, a semantic cache hit for a paraphrased query, and an explicit cache bypass. At this point, you have a working semantic cache that behaves correctly, makes its decisions visible, and serves as a solid foundation for further hardening and optimization.
Citation Information
Singh, V. “Semantic Caching for LLMs: FastAPI, Redis, and Embeddings,” PyImageSearch, S. Huot, A. Sharma, and P. Thakur, eds., 2026, https://pyimg.co/yso6f
@incollection{Singh_2026_semantic-caching-for-llms-fastapi-redis-and-embeddings,
  author = {Vikram Singh},
  title = {{Semantic Caching for LLMs: FastAPI, Redis, and Embeddings}},
  booktitle = {PyImageSearch},
  editor = {Susan Huot and Aditya Sharma and Piyush Thakur},
  year = {2026},
  url = {https://pyimg.co/yso6f},
}
To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), simply enter your email address in the form below!

Download the Source Code and FREE 17-page Resource Guide
Enter your email address below to get a .zip of the code and a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you’ll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!