Embeddings Aren’t Magic: The Predictable Failure Modes of RAG Retrieval

Scene 1: A RAG system over a few hundred pages of policy documents goes live for a small team.

The first thing that impresses everybody: it handles paraphrase. Somebody asks “how do I cancel?”, the document never uses the word cancel, it uses termination procedures, and the system finds it anyway.
Another user asks in French while the policy is in English, and the right page comes back. A typo here, a phonetic spelling there, no problem. After a few days the team is genuinely impressed. The closest thing RAG has to magic is sitting in front of them, and it didn’t take any hand-coded synonym table to make it work.

Scene 2: The same system, two weeks later.

The user asks “what’s the rule on contractor overtime?” The system answers “I couldn’t find that information.” The user, who happens to be the business expert who wrote half this manual, frowns, opens the PDF, types non-employee labor into Ctrl-F, and lands on the exact paragraph in three seconds. The right keyword wasn’t overtime. It was the term the document actually uses. The expert knew that; the embedding didn’t.
Pretty quickly, more cases like this surface. Negation breaks. Exact contract reference numbers break. An internal product code returns the wrong tier. None of it is fixable by swapping the embedding provider.

The position of the series, stated up front: most enterprise reliability gains come from strong upstream filtering (expert keywords, document structure), not from a reranker stacked on top of weak retrieval.

The classical stack ranks the layers by cost:

cheap embedding similarity at the bottom,
an optional cross-encoder reranker in between,
the chat-completion LLM on top.

None of them is magic; each breaks in specific ways.

This article is one piece of the broader Enterprise Document Intelligence Vol. 1 series, which builds enterprise RAG brick by brick from a baseline pipeline to corpus-scale architecture.

1. What embeddings nail

Before the failures, what embeddings actually excel at. The failures only make sense in contrast.

An embedding turns a piece of text into a vector. Texts with similar words end up close in vector space.

An embedding is a list of numbers that captures the meaning of a piece of text: a longer list can carry more nuance. Embeddings have improved with each generation. Every case below runs on the same four models, weakest to strongest: GloVe-avg (2014, 300-dim), all-MiniLM-L6-v2 (2021, 384-dim), text-embedding-ada-002 (2022, 1536-dim), and text-embedding-3-large (2024, 3072-dim).

Loading each is a one-liner. The two local models come from sentence-transformers (HuggingFace weights pulled to disk on first call); the two OpenAI models go through the API client. Same call shape across all four, returning a vector.

from sentence_transformers import SentenceTransformer
from openai import OpenAI

# Local models: weights downloaded from HuggingFace, run in-process.
glove  = SentenceTransformer("average_word_embeddings_glove.6B.300d")  # 2014, 300-dim
minilm = SentenceTransformer("all-MiniLM-L6-v2")                       # 2021, 384-dim

# OpenAI models: called through the API (the client reads OPENAI_API_KEY from the environment).
client = OpenAI()
def openai_embed(text: str, model: str) -> list[float]:
    return client.embeddings.create(input=text, model=model).data[0].embedding

# Same call shape across all four; each returns a vector of its own dimension.
v_glove  = glove.encode("policy renewal")
v_minilm = minilm.encode("policy renewal")
v_ada    = openai_embed("policy renewal", "text-embedding-ada-002")   # 2022, 1536-dim
v_large  = openai_embed("policy renewal", "text-embedding-3-large")   # 2024, 3072-dim

Each model lives in its own vector space with its own cosine distribution, so raw scores across columns are not comparable. What’s meaningful is the separation within a column: does the target win against the decoys, and by how much? Watching the gap widen across the gradient is the empirical evidence that embeddings really did get better.

The primitive every comparison table below uses is the same: embed the query and each candidate with the four models, score with cosine similarity, return a row per candidate:

import numpy as np
import pandas as pd

def _cos(u, v):
    """Cosine similarity: dot product of two vectors, normalised by their lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def compare_models(query, candidates, target=None):
    # target: optional label of the expected winner; the scoring itself does not use it.
    # Embed the query once per model, then score every candidate against it.
    qg = glove.encode(query)
    qm = minilm.encode(query)
    qa = openai_embed(query, "text-embedding-ada-002")
    ql = openai_embed(query, "text-embedding-3-large")
    rows = []
    for c in candidates:
        rows.append({
            "candidate": c,
            "GloVe-avg":  _cos(qg, glove.encode(c)),
            "MiniLM":     _cos(qm, minilm.encode(c)),
            "ada-002":    _cos(qa, openai_embed(c, "text-embedding-ada-002")),
            "3-large":    _cos(ql, openai_embed(c, "text-embedding-3-large")),
        })
    # One row per candidate, one column per model.
    return pd.DataFrame(rows).set_index("candidate")
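Reading "separation within a column" by eye gets tedious across many tables. The sketch below makes it mechanical under one assumption: scores arrive as a mapping of model name to per-candidate scores, which is the shape `DataFrame.to_dict()` produces. `target_margin` is a hypothetical helper, not part of the article's code, and the numbers are made up for illustration.

```python
def target_margin(scores: dict[str, dict[str, float]], target: str) -> dict[str, float]:
    """Per model: target score minus the best decoy score (positive = target ranks first)."""
    margins = {}
    for model, by_candidate in scores.items():
        # Best score among every candidate that is not the target.
        best_decoy = max(s for c, s in by_candidate.items() if c != target)
        margins[model] = by_candidate[target] - best_decoy
    return margins

# Made-up numbers for illustration only, not measured results.
toy_scores = {
    "model_a": {"TARGET": 0.40, "overlap decoy": 0.55, "off-topic": 0.10},
    "model_b": {"TARGET": 0.70, "overlap decoy": 0.30, "off-topic": 0.05},
}
margins = target_margin(toy_scores, "TARGET")
```

A negative margin means a decoy outranks the target in that column. Because each model has its own cosine distribution, margins are best compared by sign and rough size within a model, not as exact quantities across models.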

1.1 Conceptual proximity

`car` matches passages about automobiles, vehicles, motor vehicles. `fire damage` finds passages on smoke damage and scorching. `manager approval` matches a clause about executive approval. The model captures the semantic field, not just the surface words. This is what makes embeddings feel powerful: the user doesn’t have to guess the document’s vocabulary; the embedding bridges the rest.

Casual query bridges to formal paraphrase. All four models pick TARGET; bigger models widen the margin – Image by author

1.2 Synonyms and paraphrase

Phone number matches telephone. Policy cancellation matches a section titled termination procedures. Price matches cost. Monthly cost matches premium. Expiration matches policy end date. Doctor matches physician, lawyer matches attorney, car matches vehicle. Single words and multi-word compounds alike. The model has learned that two vocabularies say the same thing, including the gap between casual user phrasing and the formal language documents are written in. Nobody coded that mapping by hand.

The test: query `what is the monthly fee` against a synonym TARGET (`A flat cost of $9.99...`), a literal-overlap decoy (`Premium payments are due monthly...`, which shares the literal `monthly` token), and two off-topic decoys.

*Query `monthly fee`. Three models bridge `fee ↔︎ cost`; GloVe picks the literal-overlap decoy – Image by author*

Only GloVe-avg falls for the literal-overlap decoy. Sentence-encoder training (already in 2021’s MiniLM) is what gives real synonym handling. Without it, a candidate that just repeats the query’s tokens in any order wins. With it, the model bridges fee ↔︎ cost even though the two words share no letters. The query is also phrased as a question (`what is the monthly fee`) and the TARGET as an assertion (`A flat cost of $9.99...`). The synonym handling is what wins here. But the actual answer (the bare number $9.99 alone, or Yes for a yes/no question) wouldn’t necessarily win, regardless of model strength. Section 2.2 demonstrates that directly.

1.3 Typos and misspellings

`insurence` still embeds close to insurance. `polciy` still finds the policy section. `deductable` with the wrong vowel still lands on the deductible page. Diacritics dropped on French words (resiliation without the accent) still match the canonical form. Modern embedding models were trained on a web-scraped soup of text where these typos are constant, and they have learned to absorb the noise.

*Typoed query. GloVe collapses to negative cosines; margin to TARGET grows from MiniLM to 3-large – Image by author*

Look at the score gaps, not the absolute scores. GloVe-avg has no notion of typos. Misspelled tokens are out of vocabulary, so the embeddings collapse and the cosines go negative. The ordering is basically random. The OpenAI models absorb the typos cleanly. Character-level robustness is real, and it scales with model capacity.

1.4 Cross-lingual matching

Multilingual embeddings place premium, prime and Prämie in nearby regions of the space. Same for deductible / franchise / Selbstbeteiligung, for claim / sinistre / Schadensfall. A French keyword retrieves an English passage about the same concept. For enterprises with mixed-language corpora (French contracts, English correspondence, German policy schedules), this is genuinely useful when it works, and on modern models it usually does.

*French query against English candidates. GloVe and MiniLM struggle; ada-002 and 3-large bridge languages cleanly – Image by author*

GloVe fails outright: it picks `Coverage limit: $50,000 per year.` over `Annual premium: $1,200.` because the French annuelle lexically associates with year in its averaged word space, and it has no idea that prime means premium. MiniLM technically picks TARGET, but the cosines sit around 0.12, mostly noise. ada-002 and 3-large are multilingual by training, like BGE-M3 and multilingual-e5, and they bridge French to English cleanly. The choice isn’t “vector vs keyword”, it’s “multilingual vector model vs English-only one”.

1.5 Compound polysemy

Polysemous words have several meanings that the context disambiguates:

bank (financial institution / river edge),
claim (insurance event / assertion),
store (verb: put away / noun: retail outlet),
green card (immigration document / a card colored green),
hot dog (food / a dog that’s hot).

When a candidate uses the literal word in the wrong sense, a strong embedding should still pick the semantically right one. This is also where the literal-token bias of weak models shows clearest: GloVe-avg cannot distinguish the two readings of a compound and picks whichever candidate shares the most tokens with the query. Sentence encoders progressively recover the right sense, but how progressively depends on how well known the compound is in training data.

We test two compounds, easy first, then hard.

First, green card, the easy case. The immigration sense is so heavily attested in training corpora (news, legal text, Wikipedia) that even MiniLM resolves the compound. The test: query `green card`, three candidates. A paraphrase of the immigration document (TARGET, zero shared tokens), a gaming-context sentence that contains both green and card in literal senses (the trap), and one off-topic decoy.

*`green card` against immigration paraphrase vs gaming trap. Only GloVe falls in the trap – Image by author*

Only GloVe falls in the trap. Word-averaging models have no notion that “green card” as a compound refers to immigration. They see two tokens, look for candidates sharing those tokens, and the gaming trap wins. MiniLM is already enough to flip it, because sentence-level training captures the institutional sense. ada-002 picks TARGET by a comfortable margin; 3-large by a wide one. This is the kind of polysemy embeddings handle well, because the public web teaches the compound everywhere.

Now hot dog, the hard case. Same structural setup (a compound that also reads literally), but the literal reading (a dog that’s hot) is also heavily attested in training text. The model has seen plenty of sentences about hot weather and dogs in it. The food sense and the literal sense compete on near-equal footing, and the literal-token bias of weak and mid models wins.

*`hot dog` against food paraphrase vs literal-token trap. Only 3-large flips the polysemy cleanly – Image by author*

This is the section 1 case where the model gradient helps the most. GloVe-avg, MiniLM, and ada-002 all fall in the trap. They latch onto the shared hot + dog tokens despite the wrong sense. The same effect was already visible on GloVe in section 1.2 (the literal monthly token beating the fee ↔︎ cost synonym). Compound polysemy is the worst case of it: the literal tokens of the query appear in the decoy, so even ada-002 cannot tell the two senses apart. 3-large is the first model that recovers: it picks the food paraphrase by a wide margin even though TARGET shares zero tokens with the query.

So the practical question for your corpus isn’t “is there polysemy” but “how institutional is the polysemy I have”. An insurance corpus has plenty of compound polysemy that’s not in the public training distribution (claim handling as a verb in a workflow, pool as a risk-sharing instrument). On these, even ada-002 behaves the way GloVe behaves on hot dog. The 2024-class model is the reasonable fix; the rest of the series goes after the structural one.

1.6 What these wins really show, and don’t

The vocabulary in this section has one thing in common: it’s public. The model saw green card ↔︎ permanent resident card, prime ↔︎ premium, polciy → policy in millions of training documents. Embeddings handle them well because the equivalence is baked into the weights. What the literature calls the parametric memory of the model (the part that “knows” things from training, without any retrieval) is doing most of the work.

Two consequences worth naming before we move on.

1. For these cases, you might not need RAG at all. Ask GPT-4 “what’s another name for green card?” and you get the answer without retrieval. The parametric part of the model already knows. RAG earns its place exactly where the parametric part doesn’t: facts that aren’t on the public web, contract clauses that don’t generalise, internal product codes the model never saw. Section 1 used well-known vocabulary so the demos are reproducible and read cleanly. Production RAG isn’t used to answer these questions.

2. The section 1 wins don’t transfer to enterprise vocabulary. An insurance company has ShieldPro Elite (a product tier), pool (a risk-sharing instrument, not a swimming pool), non-employee labor (the contract’s word for contractor), regulatory citations like Solvency II Article 7. None of this is in the model’s training distribution. On enterprise terms, embeddings fail the same way GloVe fails on hot dog, because the institutional sense the embedding would need to recover isn’t institutionalised anywhere outside that company.

The fix isn’t a bigger embedding model. The fix is the expert who knows the vocabulary, codified as a keyword dictionary (section 3.3 develops this). Section 2.1 makes the failure concrete on the pool example.
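To make "codified as a keyword dictionary" concrete, here is a minimal sketch. It is an assumption about what such a dictionary could look like, not the workflow the series develops later: `EXPERT_KEYWORDS`, `expand_query` and `keyword_filter` are hypothetical names, and the two entries reuse the contractor and cancel examples from the opening scenes.

```python
# Hypothetical expert dictionary: user vocabulary -> the document's own terms.
EXPERT_KEYWORDS = {
    "contractor": ["non-employee labor"],
    "cancel": ["termination procedures"],
}

def expand_query(question: str, dictionary: dict[str, list[str]]) -> list[str]:
    """Return the document-side terms triggered by words found in the question."""
    lowered = question.lower()
    hits = []
    for user_term, doc_terms in dictionary.items():
        if user_term in lowered:  # plain substring match, deliberately simple
            hits.extend(doc_terms)
    return hits

def keyword_filter(lines: list[str], terms: list[str]) -> list[str]:
    """Keep only the lines that literally contain one of the expert terms."""
    return [ln for ln in lines if any(t.lower() in ln.lower() for t in terms)]

terms = expand_query("What's the rule on contractor overtime?", EXPERT_KEYWORDS)
kept = keyword_filter(
    ["Non-employee labor is capped at 40 hours per week.",  # made-up corpus line
     "Office hours are nine to five."],                     # made-up corpus line
    terms,
)
```

The lookup is exact and literal on purpose: it encodes what the expert already knows and leaves only the surviving lines for the embedding stage to rank.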

Section 2 catalogues the structural failures. Read them with this in mind: every one of them is the rule, not the exception, on enterprise corpora.

2. Where they break, and why

The abilities in section 1 are real; the failures below are equally real, equally reproducible, and persist across all four models. A bigger model doesn’t move the ranking. The fix is architectural, not “pick a stronger embedding”.

Section 1.6 already raised the obvious counter (“for these cases, just ask the LLM directly”). That doesn’t work at corpus scale: a 200k-document corpus can’t be passed through an LLM on every query. Some retrieval step has to come first. The mainstream pipeline stacks a reranker between embeddings and the LLM; the series’ answer is upstream filtering through expert keywords and document structure (articles 6, 7, 9). Either way, the failures catalogued below apply to the embedding stage. None of these layers is magic.

2.1 The simplest break: the term isn’t in the model

Before the structural failures, the most basic one. Section 1.6 said it in words. Here is the demo.

Take pool. In an insurance contract, pool is a risk-sharing instrument: a group of insureds that collectively absorb losses through aggregated premiums. In general English, pool is a body of water you swim in. Two senses of the same word, with one stark difference: the swimming sense is everywhere on the public web; the risk-pool sense is buried in actuarial textbooks, regulatory filings, and reinsurance treaties that the model barely saw at training time.

The test mirrors the hot-dog setup from section 1.5, with one twist. Query the bare word `pool`. Three candidates: a swimming paraphrase (the public sense, no pool token in the sentence), a reinsurance paraphrase using real industry jargon (the specialist sense, also no pool token), and a random control sentence about a train departure (no pool token, no insurance connection, no swimming).

*Query `pool`. The reinsurance sense ranks below a random control on three of four models – Image by author*

The swim paraphrase wins on every model, by a wide margin (0.353 to 0.843 cosine, depending on the model). The reinsurance paraphrase, written in real industry vocabulary, ranks below the random train-departure control on three of the four models. Even ada-002, the workhorse of most enterprise RAG deployments, puts the train timetable 0.010 ahead of the specialist sentence. Only 3-large gives the specialist sense a 0.006 lift over the control, well inside the noise of the measurement.

This is the most direct failure mode there is: the embedding space simply doesn’t encode the specialist sense of pool. A reranker stacked on top wouldn’t help, because the candidate scores it would re-evaluate are themselves noise. A bigger embedding model wouldn’t help, because the model that saw the swimming pool a million times and the reinsurance pool maybe a hundred times will keep weighting the swimming sense.

pool is in fact a soft out-of-vocabulary (OOV) case: the swim sense and the risk sense share a register and 3-large catches some signal. The harder cases are strict OOV terms: ShieldPro Elite (a fictional product tier), Solvency II Article 7 (a real regulatory citation), ZRX-2025 (an internal product code). For these the embedding has no anchor at all. The model treats them as random byte strings; ranking them against any other text is a coin flip biased by tokenization quirks.

The fix is the expert who knows the vocabulary, codified as a keyword dictionary. Section 3.3 develops the workflow.

The rest of section 2 walks through the structural failures that show up even when the term is in the model. The pool case is the simpler break that comes first.

2.2 The structural break: term similarity, not answer relevance

Section 2.1 covered the case where the term simply isn’t in the model. The rest of section 2 covers the case where the term is in the model, and the embedding still gives the wrong answer. These failures share one structural root. An embedding sees text and ranks it by term similarity. It doesn’t represent the question-to-answer relation at all. Two of the simplest queries you can ask make this concrete. They aren’t enterprise edge cases, they’re the most general questions in the world.

*Yes/no question. The bare keyword `Termination` beats the actual `Yes` answer on every model – Image by author*

“Yes” is the right answer to a yes/no question. It never wins. The literal copy of the query’s noun does. On every model from 2014 to 2024.

A subtlety worth naming. This particular failure is less harmful in practice than it looks. For a yes/no question, what we actually want from retrieval isn’t the literal word yes. We want the evidence about the topic: the page where the rule lives. The answer-phase LLM produces yes/no from that evidence. So retrieval pulling `Termination` or `Termination may be required.` (the topical matches) rather than `Yes, it's possible.` is closer to the right behaviour than the demo’s verdict suggests. The principle the article keeps surfacing is here too: the retrieval phase isn’t the answer phase, and they need to be separated and optimised as two distinct steps. Articles 6, 7, and 8 develop the separation.

The failure is sharper on the next example, where retrieval really needs to find the answer-bearing line.

Now the cleanest factoid in the world: “What’s the capital of France?” The internet has seen “Paris is the capital of France” millions of times. If question-answer mapping showed up anywhere in any embedding space, this is where it would show up.

*Query `Capital of France`. Paris never wins; topic-decoys sharing `Capital of` or `France` always do – Image by author*

Paris isn’t #1. On three of the four models (GloVe, ada-002, 3-large) the winner is `Capital of Italy`, the candidate that shares the literal phrase Capital of with the query. On MiniLM a different decoy wins: `France is in Europe.`, because it shares the token France. Different decoys, same root cause: topic similarity, not answer relevance. Going from a 300-dim 2014 bag-of-word-vectors model to a 3072-dim 2024 OpenAI model doesn’t flip the trap. For a factoid question, retrieval should fetch the line that contains the answer. Instead, every model picks the line that matches the query’s vocabulary topically.

A second nuance worth naming. Modern embedding models train on question-passage pairs (MS MARCO, Natural Questions, BEIR). This does push answer-bearing passages a little closer to the questions they answer. The bias exists. It’s weak. On very common factoids it sometimes flips the decision. On specialised vocabulary the model never saw at training (internal product codes, expert terminology, contract jargon), the bias vanishes. Topic similarity dominates again.

The sections below catalogue this root cause in four concrete failure shapes (negation, magnitudes, topical proximity, signal dilution) plus a survey of the obvious cases. Each is the same mechanism applied to a different query type.

2.3 Negation

A negation question turns the logical relation upside down: the user wants the candidate that’s the complement of the topic, not the candidate that’s closest to the topic. Embeddings can’t do that. They measure topical proximity, not logical complementation. The starker the test, the clearer the failure.

Query: “What’s NOT a city?” Four candidates: three are city-related (two specific cities plus the literal word City), and one is Table, a mundane object that happens to be the only candidate that answers the question correctly.

*Query `What's NOT a city?`. Every model ranks the correct answer last; negation is invisible – Image by author*

Every model fails the same way. The candidates that match the topic (City, Paris, New York) sit on top, and Table, the one candidate that actually answers the question, lands last. The query word NOT carries almost no signal in the embedding space: the embedding sees a bag containing “city” and ranks anything city-related higher than anything that isn’t. The fix isn’t a stronger embedding model. It’s a step that detects the negation at question-parsing time and inverts the retrieval (Article 6).

“Sure, but no real user writes a negation query.” A reasonable objection that holds for a moment and then breaks in production. Users don’t ask “what’s NOT a city?” They ask “what’s the premium amount on this policy?” The system returns the deductible by mistake. The user, frustrated, naturally tries to correct: “I want the premium amount, not the deductible.” That second query is a negation, and it’s exactly the moment a real enterprise user writes one.

The instinct is reasonable: a human reader treats not as an exclusion. The embedding does the opposite. By adding deductible to the query, even prefixed with not, the user pulls deductible-bearing lines closer, not further. The user’s correction makes the failure strictly worse than the original query.

This is the larger principle the section keeps surfacing: the raw question isn’t the right input to the retriever. The fix is upstream, in question parsing: negation gets detected, lifted out of the prose, encoded as a structured exclude-filter, and applied after retrieval, not embedded with the rest of the query. Sections 3.2 and 3.3 return to this point with a constructive version: what the retriever actually consumes is a structured representation (keywords, filters, exclusions), not the user’s free-form sentence.
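The "lift the negation out, apply it after retrieval" step can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the parser the series builds in Article 6: the regular expression only handles a trailing "not X" clause, and `parse_negation` and `apply_exclusions` are hypothetical names.

```python
import re

# Deliberately narrow: matches a trailing "not X", "not the X", "not a X" clause.
_NEGATION = re.compile(
    r"[,;]?\s*\bnot\s+(?:the\s+|a\s+|an\s+)?([\w\- ]+?)[.?!]*$",
    re.IGNORECASE,
)

def parse_negation(question: str) -> tuple[str, list[str]]:
    """Split a question into (text to retrieve with, terms to exclude)."""
    m = _NEGATION.search(question)
    if m is None:
        return question.strip(), []
    return question[: m.start()].strip(), [m.group(1).strip().lower()]

def apply_exclusions(hits: list[str], exclusions: list[str]) -> list[str]:
    """Drop retrieved lines that mention an excluded term."""
    return [h for h in hits if not any(x in h.lower() for x in exclusions)]

positive, excluded = parse_negation("I want the premium amount, not the deductible.")
retrieved = [
    "The deductible is $500 per claim.",  # made-up retrieved line
    "Annual premium: $1,200.",
]
filtered = apply_exclusions(retrieved, excluded)
```

Only `positive` would be sent to the retriever, so the word deductible never enters the embedded query; the exclusion acts as a literal filter on whatever comes back.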

2.4 Magnitudes and thresholds

Numerical comparisons, dates, contract amounts, account balances. Anything where the answer depends on the value itself. Take a stripped-down version: query `find value greater than 1M`, four candidates that are bare amounts.

*Query asks for value > 1M. `1M` wins everywhere; `3B`, the only correct answer, ranks last – Image by author*

Every model picks 1M, the candidate that equals the threshold but doesn’t strictly exceed it. The win is pure lexical match: the literal 1M token sits in the query. 3B, the only candidate that actually answers the question, lands at #4 (dead last) on both ada-002 and 3-large. The embedding has no concept of magnitude. It sees 1M next to 1M and that wins.

This generalizes to any value-comparison or threshold question: monetary thresholds, dates (“after 2020”), durations (“longer than 30 days”), counts. Embeddings are bad at this almost by design: they compress meaning into dense vectors, and the discriminating signal (the value itself, or the operator that picks among values) is exactly what compression destroys. The fix is well-known: BM25 / full-text indexing for the lexical match, plus a question-parsing step that lifts the operator and the threshold out as structured fields (Article 6) so a downstream filter can do the comparison.
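As an illustration of lifting the operator and threshold out as structured fields, here is a minimal sketch. It assumes amounts written with K/M/B suffixes and only two comparison phrases; `parse_amount` and `parse_threshold` are hypothetical helpers. In the candidate list, `1M` and `3B` come from the demo above, while `500K` and `20K` are made-up fillers.

```python
import operator
import re

_SUFFIX = {"K": 1e3, "M": 1e6, "B": 1e9}
_OPS = {"greater than": operator.gt, "less than": operator.lt}

def parse_amount(text: str) -> float:
    """Turn '1M', '3B', '$500K' or '42' into a plain number."""
    m = re.fullmatch(r"\s*\$?([\d.]+)\s*([KMB]?)\s*", text, re.IGNORECASE)
    if m is None:
        raise ValueError(f"not an amount: {text!r}")
    return float(m.group(1)) * _SUFFIX.get(m.group(2).upper(), 1.0)

def parse_threshold(query: str):
    """Return (comparison function, threshold value), or None if no operator is found."""
    for phrase, op in _OPS.items():
        m = re.search(phrase + r"\s+(\S+)", query, re.IGNORECASE)
        if m:
            return op, parse_amount(m.group(1))
    return None

op, threshold = parse_threshold("find value greater than 1M")
candidates = ["1M", "3B", "500K", "20K"]
matches = [c for c in candidates if op(parse_amount(c), threshold)]
```

The comparison runs on numbers, so the candidate that merely repeats the query's `1M` token gets no credit for lexical overlap.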

2.5 Topical proximity vs answer relevance

User question: “Who signed the contract?” The corpus has one passage describing how contracts must be signed (authorized representative, signature requirements) and one passage with the actual signature (“Signed: John Smith, Marketing Director, dated 2025-03-15”). The first passage talks about signing; the second is the signature. Which one wins?

*`Who signed the contract?`. The procedural passage about signing outranks the actual signature line – Image by author*

This is the structural failure that the model gradient doesn’t fix. Embedding similarity measures topical proximity, not question-to-answer relationship. A page that talks about a topic will often score higher than a page that answers a question about the topic. Definitions outscore values. Background sections outscore conclusions. Procedures outscore the concrete instances they describe.

Three of four models confirm the pattern here (GloVe, ada-002, 3-large). MiniLM is the exception: its sentence-pair training pushes the concrete-answer phrasing slightly higher than the procedural-density phrasing. The pattern is stable on the other three, and reproduces across most factoid-against-procedure pairs we have tried.

2.6 Signal dilution in long context

The previous tests used candidates roughly the length of the query. Real corpus pages are not. A real page is 300-500 words, dense with details, with the answer to a specific question buried in a single sentence somewhere in the middle. When you embed the whole page as a single vector, the signal of that one answer-bearing line gets averaged with everything else, and the page-level embedding drifts toward the centroid of the surrounding noise.

The cleanest way to see this is a one-variable experiment. Keep the answer sentence fixed. Prepend to it an increasing number of unrelated office-life sentences (office hours, parking rules, HR boilerplate, nothing about deductibles or water damage). Score against a fixed control candidate that shares no specific term with the query and just lives in the same broad insurance/claims vocabulary.

Query: deductible for water damage claims
Answer (varied): For water damage claims, the standard deductible is $500. prepended with N ∈ {0, 1, 2, 4, 8, 16} unrelated sentences
Control (constant across N): Claims must include photos, repair estimates, and police reports where applicable.

*Answer signal vs noise: prepending unrelated sentences makes the answer score collapse on every model – Image by author*

Each model fails in its own time, but they all fail. GloVe collapses immediately because bag-of-words averaging drags the embedding toward the noise after a single sentence. MiniLM holds out for four sentences before its sentence-encoder representation gives up. ada-002 and 3-large, both 2022+ OpenAI models trained on question-passage pairs, last the longest, but by the time the candidate is 144 words (eight unrelated sentences), the right answer ranks below a candidate that doesn’t contain the words deductible, water, or damage at all. Embedding a 300-word page is the production version of “answer + 16 noise sentences”.

This is why production pipelines that embed at the page level frequently miss the right page even when the answer is genuinely on it. The page vector averages 300-500 words of topical noise around one or two answer-bearing lines. Section 3.1 is the architectural fix: embed line by line, not page by page. Only aggregate up to the page when generation needs the surrounding context. The right line on a noisy page becomes findable again because its embedding isn’t averaged with everything else.
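A rough sketch of line-level scoring follows, under loud assumptions: `toy_embed` is a bag-of-words stand-in used only so the example runs without downloading a model, and `best_line_per_page` is a hypothetical helper, not the series' implementation. With a real encoder you would pass its encode function as `embed` and a matching vector cosine as `similarity`.

```python
import math
from collections import Counter

def toy_embed(text: str) -> dict[str, float]:
    """Stand-in embedding: lower-cased word counts (NOT a real embedding model)."""
    cleaned = text.lower().replace(".", " ").replace(",", " ")
    return dict(Counter(cleaned.split()))

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def best_line_per_page(query, pages, embed=toy_embed, similarity=cosine):
    """Score each page by its best-matching line instead of one page-level vector."""
    q = embed(query)
    results = []
    for page_id, lines in pages.items():
        score, line = max((similarity(q, embed(ln)), ln) for ln in lines)
        results.append((score, page_id, line))
    return sorted(results, reverse=True)  # best page first

pages = {
    "p1": ["Office hours are nine to five.",   # made-up noise line
           "Parking is on level two.",          # made-up noise line
           "For water damage claims, the standard deductible is $500."],
    "p2": ["Claims must include photos, repair estimates, and police reports where applicable."],
}
ranking = best_line_per_page("deductible for water damage claims", pages)
```

Each result keeps its page id, so the surrounding page can still be pulled in when generation needs context, while the ranking itself depends only on the single best line.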

2.7 The plain circumstances (no demo wanted)

Some question sorts break embeddings so plainly {that a} four-model comparability would simply repeat the identical consequence. They’re listed right here for completeness, and to make a broader level: no embedding improve rescues them. The repair is upstream (query parsing, Article 6) or in a distinct instrument fully (BM25, metadata filter, aggregation pipeline).

OOV identifiers and internal jargon: contract references (Section 4.2.1), regulatory citations (GDPR Art. 17.3), invoice numbers, ticket IDs, internal product names (ShieldPro Elite, SAP-MRP, KPI-Q4-V3). The embedding treats them as opaque sequences and cannot rank them semantically. Fix: BM25 or an exact-match index for the lookup, plus a glossary that maps aliases to canonical terms (ShieldPro Elite → top-tier homeowners plan) maintained as expert keywords (Article 6).
Boolean composition: “documents reviewed by Alice but not by Bob”, “claims with damage and witness”. Bag-of-words averaging erases the logical operators. Fix: parse the question into a structured filter (Article 6) and apply it after retrieval.
Counting and aggregation: “How many contracts did Alice sign?”, “List all open claims”. Embeddings return one most-similar passage; a counting answer needs a full scan or a SQL-style query over an index. Fix: route these to an aggregation pipeline (Articles 15-20).
Temporal predicates: “the latest version”, “claims filed after 2020”, “policies expiring before December”. Embeddings don’t represent temporal order. Fix: extract the temporal filter at question-parsing time and apply it as a metadata filter on the index.
Multi-hop reasoning: “Who is the manager of the person who signed contract X?” Each hop is a separate retrieval; the embedding gives you one shot. Fix: an agentic chain, or a graph traversal over a properly indexed corpus.

The pattern is consistent. When an embedding fails obviously, the answer isn’t “buy a bigger embedding model”. It’s “lift the query out of the embedding lane and into the right tool”.
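One way to picture “lifting the query into the right tool” is a small router that runs before retrieval. This is a minimal sketch, not the series’ question parser: the regular expressions, the tool names, and the first-match-wins order are all assumptions made for illustration, and multi-hop questions are deliberately left out.

```python
import re

# Hypothetical routing rules: every pattern and tool name is illustrative.
ROUTES = [
    ("aggregation", re.compile(r"^\s*(how many|list all|count)\b", re.I)),
    ("metadata_filter", re.compile(r"\b(latest|before|after|since)\b", re.I)),
    # Identifier-like tokens (SAP-MRP) or dotted references (4.2.1).
    ("exact_match", re.compile(r"\b[A-Z]{2,}-[A-Z0-9-]+\b|\b\d+(\.\d+){2,}\b")),
    ("structured_filter", re.compile(r"\bbut not\b|\band not\b", re.I)),
]

def route(question: str) -> str:
    """Return the first tool whose pattern matches, else fall back to embeddings."""
    for tool, pattern in ROUTES:
        if pattern.search(question):
            return tool
    return "embedding_search"

for q in [
    "How many contracts did Alice sign?",
    "claims filed after 2020",
    "What does SAP-MRP cover?",
    "documents reviewed by Alice but not by Bob",
    "how do I cancel?",
]:
    print(f"{route(q):18s} <- {q}")
```

A real question can need several tools at once (“list all claims filed after 2020”), so a production parser would return a set of filters rather than a single label.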

2.8 Same cracks at page scale (real document)

The four failures above were demonstrated on hand-written candidates. They show up identically when retrieval runs page by page on a real document. We embed every page of Attention Is All You Need (Vaswani et al. 2017; arXiv non-exclusive distribution license, declared on the arXiv abstract page; 15 pages) and run three questions; each surfaces a different ranking pathology at page granularity.

What each result shows.

Q1, barely wins. Three pages within 0.01 of each other; the right page (page 7, where the Adam learning-rate formula lives) wins by 0.007. That’s the margin of luck, not retrieval. A variant of section 2.5 (topical proximity) compounded with general ranking fragility.
Q2, top-3 saves us. Page 8 outranks page 9, but the answer (Table 3, the d_k row of the ablation) lives on page 9. Top-3 is enough; top-1 would have failed silently. Same flavour as section 2.4 (exact values inside a numeric table).
Q3, total failure. The answer page (page 8, ε_ls = 0.1) falls out of the top-3 entirely. Page 15 (with example sentences full of ε symbols in formulas) sneaks in instead. This is section 1.5 (compound polysemy) firing on ε: the embedding can’t tell the ε of the Adam optimizer (page 7), the ε_ls of label smoothing (page 8), and the ε of unrelated formulas (page 15) apart.

Same failure categories, scaled up to a real document. The fix is the same one section 3 develops.
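A win by 0.007 is easy to miss when only the top-1 result is logged. The sketch below flags rankings whose winner leads by less than a chosen margin. The scores, the runner-up page labels, and the 0.01 threshold are hypothetical values shaped like Q1, not the figures from the experiment.

```python
def ranking_report(scores: dict[str, float], min_margin: float = 0.01) -> dict:
    """Summarise how decisive a ranking is. Needs at least two candidates."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (top_id, top_score), (_, runner_up_score) = ranked[0], ranked[1]
    margin = top_score - runner_up_score
    return {
        "top": top_id,
        "margin": round(margin, 4),
        "fragile": margin < min_margin,  # a lead this small is luck, not retrieval
    }

# Hypothetical page scores: the right page leads by 0.007.
q1_scores = {"page 7": 0.412, "page 3": 0.405, "page 5": 0.403}
report = ranking_report(q1_scores)
print(report)
```

Logging the margin next to the top result makes a “barely wins” case visible before it turns into a silent top-1 failure on the next rephrasing.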

3. How to actually use them

Section 1 showed what embeddings are impressive at. Section 2 showed where they break, with two distinct roots: when the term simply isn’t in the model (section 2.1) and when the term is in the model but term similarity isn’t answer relevance (sections 2.2 onward). The natural next question: given that, how do we actually use them in production?

Four sections. Section 3.1: the right mental model (line-level synonym-tolerant search). Section 3.2: the trick that bridges the question-to-answer gap isn’t really about embeddings, it’s about extracting the keywords the answer would contain. Section 3.3: the production workflow that makes both work, by discovering the corpus’s vocabulary with experts, codifying it into a keyword dictionary, then running targeted retrieval on top. Section 3.4: the special case of sentiment-heavy corpora (HR feedback, customer surveys, support tickets), where the same discovery mechanism applies to emotional vocabulary.

3.1 The reframing: line-level synonym-tolerant search

The simplest way to think about embeddings: vector search is keyword search that handles synonyms, typos, and other languages, applied line by line. It’s not magic. It’s not “page-level semantic understanding”. On a single line, the model treats cancel and terminate as close. It absorbs polciy as policy. It bridges prime and premium across languages. Every match that worked in section 1 worked for this reason.

When you embed a whole page into a single vector (section 2.6 showed it directly), the signal of one good line gets averaged with the rest, and the right line hides inside a page that mostly talks about other things. So embed line by line. Only aggregate up to the page when generation needs the surrounding context.

Page-level embedding still earns its place in a few cases: when no single line carries the keyword (the page is about car insurance but never uses that phrase), when the topic is implied by surrounding vocabulary (a medical page mentioning A1C / insulin / blood sugar but never diabetes), when style or register matters, when the heading is generic (“Notes”, “Section 5”). Outside those cases, line-level wins almost every time.

The demo below makes it concrete on a real paper. The earlier sections embedded short hand-written candidates. Here we embed every line of the Attention Is All You Need paper (15 pages, ~1000 lines) and search by a short keyword anchor. The top-K results are lines, with their page and line number. You can read each match and see why it matched: the anchor’s keyword or a clear paraphrase is right there in the text.

Five operations on top of pandas and numpy: encode the query, stack the line embeddings into a matrix, batch-compute cosine in one matmul, sort by similarity, return the top-k. No vector database, no framework, no infra. The “vector store” is a DataFrame column plus a numpy dot product.

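import numpy as np
import pandas as pd

# `get_embedding` and `client` are the embedding helper and API client,
# assumed to be defined earlier in the notebook.
def top_lines_for(query: str, line_df: pd.DataFrame, k: int = 10) -> pd.DataFrame:
    """Rank each line by cosine similarity to `query`. Return the top-k."""
    q_vec = np.asarray(get_embedding(query, client=client))  # query vector, shape (d,)
    line_matrix = np.vstack(line_df["embedding"].values)     # shape (n_lines, d)
    # One matmul gives every dot product; dividing by the norms makes it cosine.
    sims = line_matrix @ q_vec / (
        np.linalg.norm(line_matrix, axis=1) * np.linalg.norm(q_vec)
    )
    return (
        line_df.assign(similarity=sims)
        .nlargest(k, "similarity")[["page_num", "line_num", "similarity", "text"]]
        .reset_index(drop=True)
    )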

*Top 10 lines for `multi-head attention`: paraphrases and literal matches from pages 1, 4, 5, 10 – Image by author*

Two things to take away from the line-level demo.

1. The matched lines literally show why each one matched. No magic, no ranking opacity. Every top result contains either the anchor’s keyword or a clear paraphrase of it. That’s line-level embedding in one phrase: a fuzzy, synonym-tolerant Ctrl-F over the document.

2. The matched line is an anchor, not the passage you send to generation. The line is a small thing the retriever can confidently locate. The passage that goes to the LLM is usually larger: the surrounding paragraph, the section, sometimes the whole page. Article 7 develops this as a two-step pattern: detect anchors first (line-level, keyword-level, structure-level), then choose a passage around each anchor based on what the question needs. Targeted retrieval = small N around a sharp anchor, not 30 fuzzy pages thrown at the LLM.
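The second step of that pattern, growing a passage around an anchor, can be sketched in a few lines. This is an illustration under simple assumptions: a fixed window of neighbouring lines stands in for “the surrounding paragraph”, and the sample contract lines are invented for the example. Article 7’s actual passage selection may differ.

```python
def passage_around(lines: list[str], anchor_idx: int,
                   before: int = 2, after: int = 2) -> str:
    """Return the anchor line plus its neighbours, clipped to the document bounds."""
    start = max(0, anchor_idx - before)
    end = min(len(lines), anchor_idx + after + 1)
    return "\n".join(lines[start:end])

# Invented sample lines; index 1 plays the role of the matched anchor.
doc_lines = [
    "7. Termination",
    "7.1 Either party may terminate this policy by written notice.",
    "7.2 Notice must be received 30 days before renewal.",
    "7.3 Premiums paid in advance are refunded pro rata.",
    "8. Claims",
]

passage = passage_around(doc_lines, anchor_idx=1, before=1, after=2)
print(passage)
```

The retriever only has to be confident about one line; the window size is then a per-question decision rather than a property of the index.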

3.2 HyDE: search what the answer would contain, not the question

Section 2.2 showed that embeddings don’t see questions; they see term similarity. The natural response: stop feeding the question into the retriever. Feed it text that looks like the answer instead. That’s the idea behind HyDE (Hypothetical Document Embeddings). Write (or have an LLM write) a sentence that plausibly answers the question, in the vocabulary the document would use, and embed that. The retriever compares the hypothetical-answer vector to the corpus.

The point everyone makes about HyDE is the embedding side: “the rewritten query lands in the document’s neighbourhood instead of the user’s”. That’s true and it helps. But the real value of HyDE, especially in enterprise contexts, is on a different layer. Writing a hypothetical answer is also an extraction step: it surfaces the keywords the answer would contain. “Termination procedures”, “rights of rescission”, “cancellation fee”. These are the words that anchor the search, whether the retriever is vector-based or keyword-based.

*Raw query ranks target #4; HyDE rewrite injects document vocabulary, target climbs to #1 – Image by author*

Why HyDE worked here, and what actually did the work. The raw query says cancel. The target line says rescission and terminate. Zero shared content tokens. Three lexical decoys in the candidate pool each repeat cancel/cancellation several times, and together they push the formal target down to rank #4. The HyDE rewrite is a fictional answer that happens to contain rescission, terminate, written notice, renewal, the exact vocabulary the target uses. Once those tokens enter the query side, the ranking flips and the target climbs to #1.

The dominant factor is the keywords the rewrite contains. Register matching (the rewrite’s formal declarative tone aligning with the document’s register) and latent semantic associations from the LLM’s training contribute smaller second-order effects (Article 6 decomposes them in depth); in enterprise vocab-bounded corpora, these don’t move the result. Run keyword search on the term set {rescission, terminate, written notice, renewal} and you get the same target with no embedding pass at all.

HyDE is implicit keyword expansion routed through an embedding step. The LLM writes a full hypothetical answer, the system embeds it, the retriever runs cosine over the corpus. All of that work to inject a handful of keywords into the query. Two simpler paths do the same vocabulary lift, explicitly:

Ask the LLM for the keywords directly. One prompt: “What terms would the answer to this question contain in a typical insurance contract?” Output: rescission, terminate, written notice, renewal. Use them in keyword search. No fictional document, no embed, no cosine.
Have the expert hand you the dictionary. Lawyers, claims adjusters, compliance officers already know that cancellation in user vocabulary equals rescission in contract vocabulary. Codifying that mapping once is durable; asking the LLM to rediscover it on every query is wasteful.

Both paths beat the HyDE pipeline on three fronts. Auditability: the matched keywords are visible to the team and to a regulator; a 0.83 cosine score isn’t. Latency: at most one LLM call, no embed round-trip per query. Durability: the keywords persist in a dictionary, reusable across queries; HyDE regenerates the hypothesis from scratch every time. Article 6 (Question Parsing) formalises this as the explicit expert keyword dictionary that grows with the corpus.
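The auditability argument is easiest to see in code. The sketch below runs a keyword search with the term set from the HyDE example and returns, for each hit, exactly which terms matched. The three corpus lines are invented for illustration, and plain substring matching is a simplifying assumption; a production pipeline would more likely use BM25 or an exact-match index.

```python
def keyword_hits(lines: list[str], terms: list[str]) -> list[tuple[int, list[str]]]:
    """Return (line_index, matched_terms) for every line matching at least one term,
    ordered by number of matched terms, most first."""
    hits = []
    for idx, line in enumerate(lines):
        lowered = line.lower()
        matched = [term for term in terms if term.lower() in lowered]
        if matched:
            hits.append((idx, matched))
    return sorted(hits, key=lambda hit: len(hit[1]), reverse=True)

# Invented candidate lines: one lexical decoy, one formal target, one off-topic line.
corpus = [
    "Cancellation fees may apply to cancelled bookings.",
    "The insured may exercise rights of rescission or terminate by written notice before renewal.",
    "Premium payments are due monthly.",
]
terms = ["rescission", "terminate", "written notice", "renewal"]

results = keyword_hits(corpus, terms)
for idx, matched in results:
    print(idx, matched)
```

The output is the explanation: the target line is returned because it contains four named terms, which is a statement a reviewer can check by reading the line.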

Consumer vs enterprise. On consumer-shaped corpora (general insurance FAQs, e-commerce help, public-service forms), the LLM has seen plenty of training text in the right register, so its keyword guess is usually decent. HyDE works without an expert in the loop. On enterprise corpora (internal product codes, regulatory citations, contract jargon, custom acronyms), the LLM falls back on generic legalese (“…will be defined in the terms and conditions…”) and misses the document’s actual vocabulary. The expert already knows that vocabulary. Asking the LLM to guess what the expert can hand you, on every single query, is the slow path.

3.3 The production answer: discover keywords with experts

The standard advice (“use embeddings for semantic retrieval”) is too vague. A sharper question: when do they actually earn their slot in the pipeline? Four answers, each pointing somewhere different.

Already know the right keywords? Use keyword search. It’s faster, cheaper, auditable, and not opaque the way a vector match is. If a regulator asks why a particular passage was retrieved, “the line contains force majeure and pandemic” is a defensible answer. “The cosine similarity was 0.83” isn’t.

Typos in the query? Fix the query. A single LLM call corrects polciy to policy and you’re back to clean keyword search. No embedding pipeline required.

Typos in the documents? Now embeddings genuinely earn their place. OCR’d contracts, scanned forms, hand-typed notes. Keyword search literally cannot match a misspelled token, but a line-level embedding still lands in the right neighbourhood. This is the case where vector search is structurally irreplaceable.

Multilingual corpus? Same answer, different mechanism. Contracts in French, correspondence in English, regulatory annexes in German. A multilingual embedding lets the user query in one language and surface lines from the others. prime annuelle finds Annual premium: $1,200. (section 1.4 showed it). Maintaining bilingual keyword dictionaries by hand is possible but expensive; the multilingual embedding bridges the languages for free, and the expert keeps the dictionary working in one language with embeddings as the cross-language fallback. Requires a multilingual model: ada-002, 3-large, BGE-M3 work; GloVe and English-only sentence encoders don’t.

Synonyms specific to your business that you don’t know yet? This is the most production-relevant case, and where embeddings are most useful: as a discovery mechanism, not as the retriever itself.

The reason matters. In legal, medical, insurance, financial corpora, the meaningful synonyms aren’t dictionary synonyms. Force majeure and act of God mean the same thing in a contract, but the embedding model doesn’t know that. They’re not lexical neighbours and not embedding-space neighbours either. They’re business-specific equivalences that only experts (lawyers, claims adjusters, compliance officers) know.

Concrete pairs across domains. What “domain synonyms” look like in practice:

Insurance contracts: cancellation ↔︎ rescission, termination, lapse of cover, surrender of the policy. deductible ↔︎ excess (UK), franchise (FR). claim ↔︎ loss notification, incident report. policyholder ↔︎ insured, assured, named party.
Medical records: blood sugar ↔︎ glycemia, A1C, HbA1c, fasting plasma glucose. heart attack ↔︎ myocardial infarction, MI, acute coronary event. high blood pressure ↔︎ hypertension, elevated BP reading.
Legal and contract clauses: force majeure ↔︎ act of God, unforeseeable circumstances, events beyond reasonable control. non-compete ↔︎ restrictive covenant, restraint of trade clause. confidentiality ↔︎ non-disclosure, NDA, proprietary information clause.
HR and employment: dismissal ↔︎ termination of employment, separation, severance event. salary ↔︎ compensation, base pay, gross remuneration. harassment ↔︎ unwanted conduct, hostile environment, inappropriate behaviour.

None of these aliases are dictionary synonyms in the usual sense. They’re domain-specific equivalences validated by an insurance underwriter, a clinician, a contract lawyer, an HR professional. The embedding finds them as candidates; the expert says yes or no. Force majeure equals act of God only if you know it does.

HyDE makes this implicit (the LLM invents the document’s likely vocabulary on the fly; section 3.2 showed where it falls short). The series makes it explicit: a curated keyword dictionary maintained by domain experts.

# Discovery loop. One corpus, seed terms the expert already knows.
# Same `top_lines_for` primitive from section 3.1: no new infrastructure.
# `corpus_lines` is the line-level DataFrame (one row per line, with an "embedding" column).

SEED_TERMS = ["cancellation", "deductible", "claim", "policyholder"]

draft_aliases = {
    seed: top_lines_for(seed, corpus_lines, 10)  # top 10 lines per seed
    for seed in SEED_TERMS
}
# Each draft is the top-k corpus phrasings closest to the seed.
# Hand to the expert: they keep the real aliases, drop the coincidences.

validated_dictionary = {
    "cancellation": ["rescission", "termination", "lapse of cover",
                     "surrender of the policy"],
    "deductible":   ["excess", "franchise"],
    "claim":        ["loss notification", "incident report"],
    "policyholder": ["insured", "assured", "named party"],
}

# Production retrieval hits this dictionary directly. No embedding call
# on the hot path; the embedding only ran once, at discovery time.

The results, on a small insurance corpus. Run the seed cancellation against seven candidate lines (four real aliases, three off-topic decoys) and the four aliases rise to the top.

*One seed query, seven candidates. The four real aliases rank top-4 on three of four models – Image by author*

The pattern is the discovery workflow at work. The model lists candidates ranked by similarity. The expert reads them, keeps rescission, termination, lapse of cover, surrender of the policy, drops premium payments and the other off-topic lines, and the dictionary entry for cancellation is built in one review pass. From that point on, retrieval is keyword search on the dictionary.

The workflow is progressive and runs with the experts, not around them. For the first few queries on a new corpus, run embeddings line by line as in section 3.1. They surface document phrasings nobody expected: the contract uses non-employee labor where the user said contractor; the medical record uses A1C where the user said blood sugar level; the procedure manual uses section 4.2 where the user said overtime rule. Capture these phrasings as keyword aliases in a growing dictionary, with the expert validating each one (they know which aliases are real equivalences and which are coincidences).

Subsequent queries go through keyword search with the enriched dictionary, no embedding call needed. Each retrieval is now auditable (we know which keywords matched) and faster (no LLM/embedding latency on the hot path), and the dictionary itself becomes a durable business asset that survives engineering turnover.
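On the hot path, the dictionary is used for query expansion before keyword search. A minimal sketch, assuming the dictionary is keyed by the term the user is expected to type; the function name `expand_terms`, the two-entry dictionary, and the exact-substring trigger are illustrative choices, not the series’ implementation.

```python
# Hypothetical excerpt of an expert-validated dictionary (canonical term -> aliases).
VALIDATED = {
    "cancellation": ["rescission", "termination", "lapse of cover"],
    "deductible": ["excess", "franchise"],
}

def expand_terms(question: str, dictionary: dict[str, list[str]]) -> list[str]:
    """Return each canonical term found in the question, followed by its aliases."""
    lowered = question.lower()
    expanded: list[str] = []
    for canonical, aliases in dictionary.items():
        if canonical in lowered:
            expanded.extend([canonical, *aliases])
    return expanded

terms = expand_terms("Is there a deductible on water damage?", VALIDATED)
print(terms)
```

Exact-substring triggering is the weak spot of this sketch: “How do I cancel?” does not contain cancellation, so user-side word forms would also need an entry or a normalisation step before lookup.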

The reframing is sharper than the standard one. Embeddings aren’t the production retriever. They’re the bootstrap that builds the production retriever, one keyword alias at a time, in collaboration with the people who already know the corpus. Article 6 (Understanding the Question) develops the dictionary engineering: domain hints, expert aliases, multiple alternative phrasings, the feedback loop with retrieval results. Article 7 (Retrieval) develops the targeted retrieval architecture that consumes the dictionary.

3.4 The HR and customer-feedback case

Most enterprise documents aren’t sentiment-heavy. Contracts, regulatory texts, financial reports, technical specifications are factual corpora; sections 3.1 through 3.3 are built for them. A subset of enterprise corpora is different: customer survey verbatims, employee barometer comments, support ticket free-text, brand mentions on social. The vocabulary here is emotional (tired, frustrated, delighted, let down) rather than technical (force majeure, Solvency II, cedent).

The discovery workflow still applies. An HR analyst building a burnout-signal lexicon types an explicit concept they care about, say feeling overwhelmed. The embedding surfaces phrasings from the corpus in the same emotional cluster. The top match below shares zero content words with the query; all four models, GloVe through 3-large, rank it #1.

*Query `feeling overwhelmed` against an emotional paraphrase with zero shared tokens. TARGET wins on every model – Image by author*

No emotional understanding here. Emotional vocabulary clusters in the model’s space the way insurance vocabulary does in section 1.2 (fee ↔︎ charge). TF-IDF + logistic regression hit roughly 88% on IMDB sentiment in 2010, before contextual embeddings, because emotional words carry signal on their own. Embeddings extend that with synonymy: overwhelmed, drained, empty, hollow, on the edge are automatically close in the space, so a query in one term surfaces sentences using any of them. The same mechanism as section 1.2, applied to a different vocabulary.

A useful split for production. If sentiment classification is the goal (score each feedback entry, aggregate trends, detect crisis spikes), a dedicated sentiment model outperforms a general embedding. The dedicated model is trained for the task; the embedding is trained for similarity. For vocabulary discovery (what phrasings express distress in our corpus?), the embedding stays the right tool. It surfaces the lexicon the expert validates. Two tasks, two tools. Sarcasm (“Oh great, another Monday”) breaks both, and reliability there needs context the verbatim usually doesn’t provide.

The pattern here is the article’s larger one. First impression: this looks like emergent emotional understanding. Look closer: it’s keyword similarity with a smarter notion of “close”. Apply accordingly: use the model to discover the vocabulary you didn’t have; don’t ask it to understand the intent behind the vocabulary.

4. Conclusion

Embeddings are one brick of Enterprise Document Intelligence Volume 1, which builds enterprise RAG brick by brick. The keyword dictionary this article ends on is what production retrieval (Article 7) reads at query time, fast and auditably.

Embeddings are powerful and limited in specific, predictable ways.

Section 1: what they handle. Synonyms, paraphrase, typos, cross-lingual queries, and polysemy work well, with each generation of model widening the safety margin.
Section 2: where they break. Two distinct roots. First, sometimes the term simply isn’t in the model. Section 2.1 made this concrete with pool: a random train-timetable sentence beat the reinsurance paraphrase on three of four models. Enterprise vocabulary lives here. Second, when the term is in the model, the embedding ranks by term similarity, not by question-to-answer mapping. Section 2.2 showed this directly on the simplest queries. From that second root cascade negation (section 2.3), exact values (section 2.4), topical proximity beating answer relevance (section 2.5), and signal dilution in long context (section 2.6). A whole catalog of “obvious” failures (OOV identifiers, Boolean composition, counting, temporal predicates, multi-hop reasoning) needs no demo.
Section 3: how to use them in production. Use embeddings line by line as a synonym-tolerant Ctrl-F (section 3.1). When you do need to bridge the question-to-answer gap, the load-bearing piece is the keywords that the answer would contain, not the embedding of the rewritten query (section 3.2). The production answer is a curated keyword dictionary, built by experts and bootstrapped by line-level embedding discovery (section 3.3). Embeddings aren’t the production retriever; they’re how you find the keywords that the production retriever then uses, fast, auditably, every time.

A case from real projects. A team built a RAG system over commercial insurance contracts and spent three months chasing recall. They started with OpenAI’s text-embedding-3-small at 71% recall, benchmarked Voyage, Cohere, BGE-M3 (recall moved between 69% and 73%), then fine-tuned BGE on synthetic question-passage pairs. Recall climbed to 76%. Five points after three months. Then they broke the 200 questions down by type: 92% on conceptual, 23% on negation, 31% on exact-reference, 18% on internal-acronym. The aggregate of 76% hid two categories at near-zero performance, and no fine-tuning could fix them. Adding BM25 alongside the vector search took two days, lifting exact-reference recall to 88%. Adding a query expansion step for acronyms via a company glossary took another day, lifting internal-acronym recall from 18% to 71%. One week of structural work outweighed three months of embedding fine-tuning.

Two signs a team has over-invested in embeddings: the roadmap features “fine-tune the embedding model” as the next milestone before anyone has broken down the actual failure cases; retrieval metrics are reported as a single recall number with no per-question-type breakdown, hiding the categories where embeddings are structurally wrong.
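The per-question-type breakdown is a few lines of bookkeeping once each evaluation question carries a type label. The sketch below assumes the evaluation log is a list of (question_type, hit) pairs; the toy counts are made up to show the shape of the report and are not the project’s data.

```python
from collections import defaultdict

def recall_by_type(results: list[tuple[str, bool]]) -> dict[str, float]:
    """results holds one (question_type, retrieved_correct_passage) pair per question."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for question_type, hit in results:
        totals[question_type] += 1
        hits[question_type] += int(hit)
    return {qtype: hits[qtype] / totals[qtype] for qtype in totals}

# Hypothetical evaluation log (toy counts).
log = (
    [("conceptual", True)] * 9 + [("conceptual", False)] * 1
    + [("negation", True)] * 1 + [("negation", False)] * 3
)

overall = sum(hit for _, hit in log) / len(log)
per_type = recall_by_type(log)

print(f"overall recall: {overall:.2f}")
for qtype, recall in per_type.items():
    print(f"{qtype:12s} {recall:.2f}")
```

The single overall number looks acceptable while one category sits at 25%, which is the effect the breakdown in the case above exposed.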

You just watched embeddings fail in predictable, structural ways. The reflex, especially for engineers from an ML background, is to fix the model: more training data, fine-tune, swap providers, run a sweep. Article 3 makes the case that this is the wrong frame. The failures you just saw are not bugs the model can learn its way out of. RAG isn’t machine learning, and treating it like one is how teams waste six months optimising the part of the system that wasn’t broken.

5. Further reading

The empirical pattern in this article (synonyms, typos, polysemy work; negation, exact identifiers, OOV acronyms fail) matches controlled studies of dense retrievers on out-of-domain enterprise corpora. Reimers and Gurevych (Sentence-BERT, 2019) is the reference for what embedding a line means technically. Ravichander et al. (CONDAQA, 2022) document the negation failure cleanly. The article reframes HyDE (Gao et al. 2023): the load-bearing piece is the keywords the hypothetical answer contains, not the embedding step itself; asking the LLM for the keywords directly recovers the same passage with less infrastructure. Fine-tuning embeddings on enterprise corpora is out of scope here and revisited in Article 21 (production).

Same direction as the article:

Reimers & Gurevych, Sentence-BERT, EMNLP 2019 (arXiv:1908.10084). The reference for what embedding a line means technically.
Ravichander et al., CONDAQA, EMNLP 2022 (arXiv:2211.00295). Documents that dense models systematically fail on negation. Same direction as the empirical pattern in this article.
Gao et al., HyDE: Precise Zero-Shot Dense Retrieval without Relevance Labels, ACL 2023 (arXiv:2212.10496). The HyDE technique the article reframes: keywords from the hypothetical answer are what does the work.
Formal et al., SPLADE, 2021 (arXiv:2107.05720). Learned sparse retrieval; a bridge between keyword and embedding worlds, in the same spirit as the “vector search is keyword search” framing.

Different angle, different context:

Karpukhin et al., Dense Passage Retrieval for Open-Domain QA, EMNLP 2020 (arXiv:2004.04906). The canonical “dense beats BM25” result on open-domain QA benchmarks. The context is in-domain training data; this article looks at out-of-domain enterprise corpora where the result doesn’t transfer cleanly.
Wang et al., Text Embeddings by Weakly-Supervised Contrastive Pre-training (E5), 2022 (arXiv:2212.03533) and Lee et al., NV-Embed, 2024 (arXiv:2405.17428). The scale-fixes-it line: larger contrastive pre-training corpora close the OOV gap. The article’s claim is that the failures are structural (compression destroys exact-value signal), not data-volume bound.
Khattab & Zaharia, ColBERT, SIGIR 2020 (arXiv:2004.12832). Late-interaction retrieval as an answer to exact-token matching at the embedding level; relevant to the “exact values, internal acronyms” failure mode.
Muennighoff et al., MTEB: Massive Text Embedding Benchmark, EACL 2023 (arXiv:2210.07316). The benchmark driving the “pick the highest-scoring embedding” mindset. Useful for shopping for models; the article’s claim is that the leaderboard isn’t the relevant axis for enterprise OOD vocabulary.