Friday, July 3, 2026

The Untaught Lessons of RAG Retrieval: Cosine Is Not the Foundation


This article is a companion to Enterprise Doc Intelligence, the series whose philosophy is set out in Amplify the Skilled. It zooms in on brick 3 (retrieval) of the four-brick structure and surfaces the lessons most tutorials skip.

The mainstream story has retrieval as: embed the question, return top-k by cosine, optionally rerank. We disagree with almost every part of it. Retrieval is filtering on structured tables, not searching free text. Embeddings are the optional fallback, not the foundation. Anchor and context are two granularities, not one. Each of these is a position we will defend, with consequences you can measure.

Where this article sits in the series: brick 3 (retrieval) highlighted – Image by author

📓 Runnable companion notebooks are on GitHub: doc-intel/notebooks-vol1.

The public companion-code repo at doc-intel/notebooks-vol1 – Image by author

The naive baseline this article pushes back on

The architectural difference: a single cosine signal over chunks vs three signals in parallel on structured tables – Image by author

The naive pipeline chunks the document, embeds every chunk, embeds the question, and ranks by cosine. That single signal is opaque, and it throws away the document’s structure. We keep the document as line_df + toc_df and run three retrieval signals in parallel (keyword on lines, TOC reasoning, embedding cosine), then let an LLM arbiter rank once at the end with all three sets of hits in view.

Keywords always run, the TOC always reasons, embeddings fire only when the vocabulary mismatches – Image by author

Below are the six untaught lessons of this brick.

Lesson 1 – Retrieval is filtering, not searching

Once parsing is done, retrieval is a SQL-like filtering problem over line_df and toc_df, the opposite of the chunk-embed-cosine-top-k framing. The shift is simple to state: the question has columns, the document has columns, and retrieval is the join.

Why it matters. Search and filter are not synonyms: the two operations have different mechanics. Search scores every candidate on a continuous similarity (cosine, BM25), forces a top-k cutoff, and always returns something, even when the answer is not in the document. Filter applies a boolean condition (line.contains("X"), toc.title in [...]), keeps every row that matches and no more, and can return zero rows when the document does not carry the answer. The audit consequence is the biggest part of the gap: a filter’s condition is one line of inspectable code that runs the same way in six months; a search’s ranking depends on which dimensions of the embedding mattered, and you cannot replay that judgment without re-running the model.

Concrete difference. The user asks “What positional encoding does the paper use?”. Naive RAG embeds the question, scores 300+ chunks, and returns the top 5. Series RAG filters line_df where the line contains "positional encoding" (4 hits), filters toc_df where the section title contains "positional" (1 section, 3.5 Positional Encoding), and the arbiter sees both: the line is the anchor, the section is the scope. No cosine needed.
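The two filters in this example can be sketched in a few lines. This is a minimal illustration, not the series' implementation: plain lists of dicts stand in for line_df and toc_df so the sketch has no dependencies, and the column names (line_no, section, text, number, title), the sample rows, and the helper names are all assumptions made up for the example.

```python
# Hypothetical stand-ins for line_df and toc_df (lists of dicts, made-up rows).
lines = [
    {"line_no": 10, "section": "3.2 Attention",
     "text": "Attention maps a query and key-value pairs to an output."},
    {"line_no": 41, "section": "3.5 Positional Encoding",
     "text": "We add positional encoding to the input embeddings."},
    {"line_no": 42, "section": "3.5 Positional Encoding",
     "text": "The positional encoding uses sine and cosine functions."},
]
toc = [
    {"number": "3.2", "title": "Attention"},
    {"number": "3.5", "title": "Positional Encoding"},
]


def filter_lines(rows, phrase):
    """Keep every row whose text contains the phrase (case-insensitive)."""
    needle = phrase.lower()
    return [r for r in rows if needle in r["text"].lower()]


def filter_toc(rows, word):
    """Keep every TOC row whose title contains the word (case-insensitive)."""
    needle = word.lower()
    return [r for r in rows if needle in r["title"].lower()]


anchors = filter_lines(lines, "positional encoding")  # candidate anchor lines
scopes = filter_toc(toc, "positional")                # candidate scope sections
missing = filter_lines(lines, "rotary embedding")     # a filter may return nothing
```

Unlike a top-k search, the last call returns an empty list when no row matches, which is the behaviour the paragraph above relies on.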

Article 7A: Retrieval is filtering, not search lays out the mental model.

Lesson 2 – Anchor and context, kept apart

You anchor on the one line that mentions “premium” (precise) but pass the whole surrounding section to generation (ample context); conflating them breaks precision and coverage in one move. Top-k forces you to choose: tiny chunks lose context, big chunks lose precision. We get both, by keeping them apart.

Concrete difference. For a definition question, the anchor is the one line ("the deductible is the amount the insured pays before coverage begins"), and the scope is the paragraph around it (three sentences of context the LLM needs to phrase the answer). Naive top-k returns either the line (no context) or the paragraph (anchor unclear). Series retrieval returns anchor + scope as a typed pair.
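One way to picture the typed pair is a small dataclass. This is a sketch under stated assumptions: the class name Hit, its fields, the retrieve helper, and the first and third sentences of the sample paragraph are invented for illustration and are not taken from the series' code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Hit:
    """Hypothetical typed pair: the precise match and the context around it."""
    anchor: str             # the single line that matched
    scope: Tuple[str, ...]  # the surrounding lines handed to generation


# Made-up three-sentence paragraph around the definition line.
paragraph = (
    "Definitions used in this policy are listed below.",
    "The deductible is the amount the insured pays before coverage begins.",
    "It applies once per claim unless stated otherwise.",
)


def retrieve(paragraph_lines, keyword) -> Optional[Hit]:
    """Return the first matching line as anchor, the whole paragraph as scope."""
    needle = keyword.lower()
    for line in paragraph_lines:
        if needle in line.lower():
            return Hit(anchor=line, scope=tuple(paragraph_lines))
    return None  # no anchor, so no scope either


hit = retrieve(paragraph, "deductible")
no_hit = retrieve(paragraph, "premium")
```

Keeping the two fields separate lets later steps cite the anchor line exactly while still giving the generator the full paragraph.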

Article 7A: Retrieval is filtering, not search draws the line between anchor and context.

Lesson 3 – Embeddings come final, not first

Keywords always run (cheap, deterministic); the document’s own TOC is a first-class retrieval method; embeddings are the optional final signal, used only when vocabulary mismatch is expected. The 2024-era reflex starts with embeddings; we leave them for the cases where the cheaper signals failed.

Concrete difference. A factual lookup on an insurance policy: “effective date?”. Naive RAG embeds and returns 5 chunks. The series runs keyword search on "effective" and "date" → 1 line found → done. Embeddings never ran. Cost: one regex pass over line_df, a few milliseconds. The 2-cent cosine search never happened.
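The ordering can be sketched as a keyword pass with an optional fallback. This is a simplified illustration, not the pipeline from Article 7B: the function names, the sample rows, and the rule that the fallback fires on zero keyword hits are assumptions, and fake_embedding_search is a placeholder that only records whether it was called.

```python
import re


def keyword_hits(rows, terms):
    """Rows whose text contains every term as a whole word (case-insensitive)."""
    patterns = [re.compile(rf"\b{re.escape(t)}\b", re.IGNORECASE) for t in terms]
    return [r for r in rows if all(p.search(r["text"]) for p in patterns)]


def retrieve(rows, terms, embedding_fallback=None):
    """Run keywords first; call the fallback only if they found nothing."""
    hits = keyword_hits(rows, terms)
    if hits or embedding_fallback is None:
        return hits, "keyword"
    return embedding_fallback(rows, terms), "embedding"


# Made-up policy lines standing in for line_df.
rows = [
    {"line_no": 3, "text": "Policy number: 00-123"},
    {"line_no": 4, "text": "Effective date: 1 January 2025"},
    {"line_no": 9, "text": "The premium is payable on the due date."},
]

fallback_calls = []


def fake_embedding_search(rows, terms):
    """Placeholder for a real embedding search; records that it ran."""
    fallback_calls.append(tuple(terms))
    return rows[:1]


hits, signal = retrieve(rows, ["effective", "date"], fake_embedding_search)
calls_after_keyword_lookup = len(fallback_calls)  # fallback did not run
_, fallback_signal = retrieve(rows, ["tachycardia"], fake_embedding_search)
```

In the first call the keyword pass resolves the question, so the placeholder embedding search is never invoked; only the second call, where no line contains the term, reaches it.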

Article 7B: Finding the right anchors builds the three-signal pipeline.

Lesson 4 – Keywords prove absence; embeddings can’t

A zero on keyword search means the answer is genuinely not there; a zero on embedding similarity could be absence or just different words, so embeddings are a refinement, not a decision gate. This asymmetry is the case for keywords as the primary signal in enterprise RAG.

Concrete difference. The user asks “does this contract cover earthquake damage?” on a flood-only policy. Keyword search for "earthquake" returns zero matches in line_df. The pipeline can return answer_found = False confidently. Embedding cosine returns 5 chunks (the nearest topically related lines about natural disasters) and the LLM, seeing them, may infer a wrong yes. Keywords saved the day.
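The asymmetry shows up even in a toy setting. In the sketch below, the policy lines and function names are made up, and nearest_by_overlap is a deliberately crude word-overlap ranking standing in for embedding similarity; it is not a real embedding search, but it shares the property that matters here: a top-k cutoff always returns k rows.

```python
def answer_found(rows, keyword):
    """True only if some line literally contains the keyword."""
    needle = keyword.lower()
    return any(needle in r["text"].lower() for r in rows)


def nearest_by_overlap(rows, query, k=2):
    """Toy similarity search: rank by shared-word count, always return k rows."""
    q = set(query.lower().split())
    ranked = sorted(
        rows,
        key=lambda r: len(q & set(r["text"].lower().split())),
        reverse=True,
    )
    return ranked[:k]


# Made-up lines from a flood-only policy.
policy = [
    {"text": "This policy covers flood damage to the insured property."},
    {"text": "Damage caused by rising water is covered up to the limit."},
    {"text": "Claims must be filed within 30 days."},
]

found = answer_found(policy, "earthquake")
top_k = nearest_by_overlap(policy, "does this contract cover earthquake damage", k=2)
```

The keyword check yields a clean False, while the similarity ranking still hands two flood-related lines to whatever reads them next.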

Article 7B: Finding the right anchors explains the keyword-first discipline.

Lesson 5 – Co-occurrence beats BM25 on narrow corpora

BM25 ranks by term frequency, but the enterprise answer shape is one mention of a topic next to a specific value, so co-occurrence boosts and high-value regex anchors beat statistical IDF on narrow corpora. The IDF assumptions break on a 20-document corpus where every term is “rare” by Wikipedia standards.

Concrete difference. The question is “what is the deductible amount?”. BM25 ranks by frequency of "deductible"; a glossary section where the term appears 12 times ranks first. Co-occurrence search ranks lines that contain both "deductible" and a number; the actual policy line ("the deductible is $1000") ranks first because it co-occurs with $1000, and the LLM can extract the value cleanly.
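A small sketch makes the contrast concrete. It is illustrative only: frequency_score is a raw term count, a simplification rather than BM25, and the amount regex, the 0/1 co-occurrence score, and the sample rows are assumptions chosen for this example.

```python
import re

# Assumed pattern for a dollar amount such as $1000 or $1,000.50.
AMOUNT = re.compile(r"\$\s?\d[\d,]*(?:\.\d+)?")


def frequency_score(text, term):
    """Raw count of the term in the line (a stand-in for frequency ranking)."""
    return text.lower().count(term.lower())


def cooccurrence_score(text, term):
    """1 if the term and a dollar amount share the line, else 0."""
    has_term = term.lower() in text.lower()
    has_amount = AMOUNT.search(text) is not None
    return int(has_term and has_amount)


# Made-up rows: a glossary-style line, the real policy line, an unrelated amount.
rows = [
    {"line_no": 5,
     "text": "Deductible: see deductible schedule; the deductible varies by deductible type."},
    {"line_no": 88, "text": "The deductible is $1000."},
    {"line_no": 90, "text": "The premium is $250 per month."},
]

by_frequency = max(rows, key=lambda r: frequency_score(r["text"], "deductible"))
by_cooccurrence = max(rows, key=lambda r: cooccurrence_score(r["text"], "deductible"))
```

The frequency ranking picks the glossary-style line that repeats the term, while the co-occurrence score picks the line that pairs the term with a value.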

Article 7B: Finding the right anchors measures co-occurrence against BM25.

Lesson 6 – One LLM pass over the TOC

Handing the 20-100 row toc_df to a small model and asking which sections answer the question costs one cached call and catches the paraphrases (“exit early” ≈ “Termination”) that keyword matching misses.

TOC reasoning is one of the most under-used retrieval signals in production RAG.

Concrete difference. The user asks “when can I leave the policy early?”. Substring matching on "leave" returns zero TOC entries. An LLM call on the full TOC (28 rows, fits in one small prompt) returns the section “Termination and Cancellation”, the correct paraphrase. One cached LLM call, deterministic afterwards, and the right anchor.
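The wiring of that single cached call might look like the sketch below. Everything model-related here is an assumption: stub_model is a hard-coded placeholder that always answers "[2]", the prompt wording and JSON reply format are invented, and the four-row TOC is made up. The sketch shows the prompt assembly and the caching, not the model's judgment.

```python
import json
from functools import lru_cache

# Made-up TOC titles standing in for toc_df; a tuple so it can be cached.
toc_titles = ("Definitions", "Premium Payment", "Termination and Cancellation", "Claims")


def build_prompt(question, titles):
    """Assemble one small prompt listing every section title with its index."""
    numbered = "\n".join(f"{i}. {t}" for i, t in enumerate(titles))
    return (
        "Which sections could answer the question? "
        "Reply with a JSON list of section numbers.\n"
        f"Question: {question}\nSections:\n{numbered}"
    )


def stub_model(prompt):
    """Placeholder for a real LLM call (assumption); always picks section 2."""
    return "[2]"


@lru_cache(maxsize=None)
def sections_for(question, titles, model=stub_model):
    """Ask the model once per (question, TOC); repeated calls hit the cache."""
    picked = json.loads(model(build_prompt(question, titles)))
    return tuple(titles[i] for i in picked if 0 <= i < len(titles))


question = "when can I leave the policy early?"
substring_hits = [t for t in toc_titles if "leave" in t.lower()]
llm_hits = sections_for(question, toc_titles)
llm_hits_again = sections_for(question, toc_titles)  # served from the cache
```

Swapping stub_model for a real client is the only change needed to try this against an actual model; the cache is what makes the result repeatable afterwards.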

Article 7B reasons over the TOC, and Article 7C: An LLM as arbiter adds the arbiter.

The six lessons share one move: refuse the chunk-embed-cosine reflex, and treat retrieval as filtering on structured tables instead. Keywords always run because they prove absence; the TOC is a first-class signal because the document already declared its structure; embeddings are the optional refinement, not the foundation. The deep-dives (7A, 7B, 7C, 7bis) ship runnable code on real documents; this piece is the catalogue that points at them.

Across sectors and professions

The same three-signal retrieval pattern (keyword on line_df + reasoning on toc_df + embedding fallback) holds in every domain. The vocabulary and the TOC depth differ; the signal hierarchy does not. Five sectors below, one retrieval pattern, one audit trace per call.

Embeddings fire only on the medical row where vocabulary diverges from the document – Image by author

Embeddings fire only on the medical row, where the user’s vocabulary (“tachycardia”) diverges from the document’s (“fast heart rate”). The other four rows resolve entirely on keyword + TOC. Keywords prove absence (Lesson 4), the TOC catches paraphrases (Lesson 6), and the anchor / scope split keeps precision and context apart (Lesson 2) in every row. The cost gradient is real: the four keyword-resolved rows run in milliseconds with zero LLM tokens; the medical row pays for one embedding pass and one arbiter call.

Sources and further reading

The mainstream literature on retrieval is shaped by web-scale search and shorter consumer corpora. The series stance assumes a small enterprise corpus where the structure is known and the vocabulary is the asset.
