From Regex to Vision Models: Which RAG Technique Fits Which Problem

Ms don’t deserve the classic playbook. Article 3 said there is no THE RAG technique. You still have to pick one. This article is the diagnostic that tells you which.

Most teams building RAG systems reach for the same playbook: parse the document into chunks, embed every chunk, drop them in a vector store, embed the question, retrieve the top-k by cosine similarity, hand the result to an LLM. Call it the classic RAG playbook. Every tutorial teaches it. Every demo runs on it.

The actual problems vary far more than the playbook suggests. A few real cases.

Three cases at three different extremes.

Templated, high-volume documents. Insurance certificates, KYC forms, regulatory filings, monthly brokerage statements. The same software writes the same layout on every document. A hundred lines of regex extract the fields in microseconds. The classic playbook runs here too, but it pays an LLM to do what the layout gave you for free.

Same shape across industries: payroll stubs, bank statements, lab test reports, tax filings, compliance attestations, supplier invoices from one ERP. Wherever one piece of software writes every document, the layout is a contract.

Sarcasm in customer-service transcripts. “Find every sarcastic remark in this month’s call recordings.” Standard sentiment scoring (anger, frustration, joy) is largely solved by a sentiment lexicon: unacceptable, ridiculous, frustrated all flag clearly. Sarcasm is the canonical exception. “Oh, fantastic service, only had to wait 45 minutes” scores positive on every lexicon, and the embedding clusters it with the sincere version because the surface words are nearly the same. The only honest method is an LLM that reads each call in full and judges the gap between what is said and what is meant.

Same shape across functions: HR exit interviews looking for hidden frustration, internal-chat archives looking for cultural red flags before an M&A close, earnings-call transcripts looking for places the CFO hedged, sales-call recordings looking for promises the contract didn’t authorise. Tone and intent, no anchor in the text.

Engineering schematics (a different axis altogether). Drawings, slides where data lives in the chart, technical specifications with embedded images. Pure-text RAG returns the caption and misses the schematic. Vision models fit here, and only here.

Same shape: architectural blueprints, scanned handwritten records, slide decks where data lives in the chart, lab notebook pages, medical imaging reports. Wherever the meaning lives in the pixels.

The classic playbook is overkill on templated documents (regex would do), dimensionally wrong on call transcripts (no anchor exists), and modality-blind on schematics (vision is required). It fits a middle band of problems and ships as if it covered everything. That middle band is real and Section 3.3 catalogues it; the cost of mismatch on the rest is what this article exists to prevent.

This article is the diagnostic. Three steps, in order.

Identify the two axes: RAG problems aren’t a single problem. They sit on a picture with two axes: how structured your documents are, and how controlled your questions are. Each combination calls for a different stack.
Identify the techniques per region: Each region of the picture has its own stack: regex, section retrieval, hybrid retrieval (lexical search + embedding similarity), vision, SQL aggregation. A third axis (the agentic dimension, section 2.4) sits on top of these and decides how much runtime control the LLM gets. The catalog later in the article maps each region to its technique zone.
Locate your own case: Where do your documents sit on the complexity axis? Where do your questions sit on the control axis? The intersection points to a region, and to the techniques that fit it.

You’re not here to build everything. You’re here to find where you sit, then read the parts of the series that match. Most readers will skip half of it.

A note before the article gets technical. Most enterprise RAG comes in two shapes: extracting fields from templated documents (the regex case in the opener), or answering free-form questions on heterogeneous documents like contracts and reports (where the rest of the series spends most of its time). Conversational transcripts are a real third shape, common in customer service, HR, and compliance; sarcasm is the hardest question they raise. Pure vision content (schematics, slide decks) and corpus-scale questions (Part IV) come up less often. You may meet one or two of these. The grid below lets you locate your case on sight.

This diagnostic is one piece of a larger framing: Enterprise Document Intelligence Volume 1 builds enterprise RAG brick by brick, and the regions of the grid this article maps point to the articles in the series where each technique gets built.

1. Two axes: document complexity and question control

Every problem we’ll meet in this series sits somewhere on two axes:

Document complexity: How redundant is the structure across your documents? Can a parser address fields by position, by heading, or do you need a model that sees the page?
Question control: Who frames the question? An engineer writing a fixed prompt, or a user typing freely into a chat box, possibly with no idea what to ask?

These two axes are almost independent. The one coupling: a fixed-template document (Tier 1, below) usually forces engineer-templated questions (Tier A), since the user never types a question. Outside that corner, any document tier can pair with any question tier.

1.1 Document axis: from a fixed template to a vision model

Volume 1 stays inside the PDF scope. Multi-format documents (Word, Excel, PowerPoint, mail) are Volume 2’s territory; everything below describes one PDF at a time.

Documents vary in structural redundancy: how much of their layout is shared across the corpus. Five tiers cover most enterprise situations.

*Five tiers of document complexity, with the technique that fits each – Image by author*

Tier 1: Fixed template: Every document has the same structure, the same fields in the same place, usually produced by the same software: insurance certificates from a single broker, KYC forms, tax filings, internal compliance attestations. The structure is so predictable that you can address fields by their coordinates on the page. Technique: regex or coordinate-based extraction, no model.

Tier 2: Family of templates: Documents follow a recognizable pattern with variations (different vendor, different software, different year): invoices across suppliers, leases across landlords, employment contracts across companies in the same legal framework. Technique: a regex per template plus a few-shot LLM as fallback when the template drifts.

Tier 3: Heterogeneous structured: Each document has its own structure (sections, headings, tables of contents) but the structures don’t repeat across documents: custom legal contracts, technical manuals from different vendors, financial reports. Technique: parse the structure, retrieve via the document’s own table of contents.

Tier 4: Unstructured / OCR’d: Scanned PDFs, photos of paper, emails, free-form notes: the text is there but the layout is degraded or absent. Technique: OCR with confidence scoring, then hybrid retrieval (lexical + embeddings) over the noisy text.

Tier 5: Visually rich: Documents where the meaning lives in the visuals: schematics, dense data tables embedded as images, slide decks with charts, engineering drawings. A pure-text parse loses the answer. Technique: a vision-capable model on the page image, often combined with text-side RAG.

The further down this axis you sit, the more you pay per document. The right move is to push every problem as far up as honest analysis allows. A team that decides their corpus is “too complex for regex” without checking the structural redundancy is choosing the expensive answer by default.
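To make the top of the axis concrete, here is a minimal sketch of Tier 1 extraction. The field labels, the patterns, and the sample certificate are hypothetical; in a real project the patterns come from inspecting what the producing software actually writes.

```python
import re
from typing import Optional

# Hypothetical patterns for one fixed template (assumed labels and formats).
FIELD_PATTERNS = {
    "policy_number": re.compile(r"Policy Number:\s*([A-Z]{2}-\d{6})"),
    "effective_date": re.compile(r"Effective Date:\s*(\d{4}-\d{2}-\d{2})"),
    "insured_name": re.compile(r"Insured:\s*(.+)"),
}


def extract_fields(text: str) -> dict[str, Optional[str]]:
    """Return one value per field, or None when the pattern does not match."""
    result: dict[str, Optional[str]] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1).strip() if match else None
    return result


# Invented sample text, for illustration only.
sample = """Certificate of Insurance
Policy Number: AB-123456
Effective Date: 2024-03-01
Insured: Example Logistics Ltd
"""
fields = extract_fields(sample)
```

A `None` value is the useful signal here: it marks the documents where the template drifted and a fallback or a human review is worth paying for.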

1.2 Question axis: from a fixed prompt to a multi-turn chatbot

The question axis is the one most teams skip. Two questions can look identical syntactically yet require completely different stacks. The dimension that matters is who controls the question and how much.

*Four tiers of question control, from a fixed engineer prompt to a free user query with clarification – Image by author*

Tier A: Engineer-templated: The question is a parameter of the system: “Extract the effective date.”, “What is the policy number?”. The engineer wrote the prompt, calibrated it, tested it on a thousand documents. The user, if any, doesn’t even type a question. Technique: field extraction, structured output, no question-parsing step needed.

Tier B: User fills slots: The question is a template with user-supplied values: “Show me the clause about {topic} in this contract.” The user picks the topic from a list, or types a tag. The shape of the query is fixed, only one slot varies. Technique: section retrieval, lookup against a known taxonomy.

Tier C: Free user query, one-shot: The user types whatever they want, the system answers in one pass: “Why does this contract differ from last year’s?”. This is the classic chat-with-your-document setup, where the pipeline must parse the question, decide what to retrieve, and answer. Technique: single-document RAG with question parsing.

Tier D: Free query plus clarification: Same as C, but the system can ask the user back when the question is ambiguous: “Which page do you mean? Did you mean the sub-tenant or the main tenant?” This is what real chatbots do, and it dramatically widens the range of questions a system can serve. Technique: question parsing plus a clarification loop.

A small example to make the clarification idea concrete. Imagine a user asks: “What is the deductible?” on a single insurance contract that mentions deductibles in three sections (home, auto, travel coverage). A naive pipeline retrieves something plausible and returns a confident wrong answer. A system that can ask back (“Which coverage: home, auto, or travel?”) fixes the problem at the source.
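The deductible example can be sketched as a tiny decision rule: answer when exactly one section is a candidate, ask back when several are. This is only an illustration under simple assumptions; the `Section` class, the keyword matching, and the sample amounts are hypothetical stand-ins for real retrieval.

```python
from dataclasses import dataclass


@dataclass
class Section:
    coverage: str  # hypothetical label: "home", "auto", "travel"
    text: str


def answer_or_clarify(question: str, sections: list, keyword: str) -> dict:
    """Answer when one section is a candidate; ask back when several are."""
    q = question.lower()
    hits = [s for s in sections if keyword in s.text.lower()]
    if not hits:
        return {"type": "no_answer"}
    # If the question names a coverage, restrict to it; otherwise keep all hits.
    named = [s for s in hits if s.coverage in q]
    candidates = named or hits
    if len(candidates) == 1:
        only = candidates[0]
        return {"type": "answer", "source": only.coverage, "text": only.text}
    options = ", ".join(s.coverage for s in candidates)
    return {"type": "clarify", "question": f"Which coverage: {options}?"}


# Invented contract sections, for illustration only.
contract = [
    Section("home", "Home coverage: the deductible is 500 EUR."),
    Section("auto", "Auto coverage: the deductible is 300 EUR."),
    Section("travel", "Travel coverage: the deductible is 100 EUR."),
]
ambiguous = answer_or_clarify("What is the deductible?", contract, "deductible")
specific = answer_or_clarify("What is the auto deductible?", contract, "deductible")
```

The point of the sketch is the return type: a clarification is a first-class outcome next to an answer, not an error path.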

This pushes a constraint upstream into parsing. To detect that the user mentioned “page 3” or “the second appendix”, your parser must have preserved page numbers, section indices, and heading text as metadata on every chunk. The page number sounds trivial when you look at any single document, but it is the simplest example of a parsing decision that the question side depends on. Article 5 covers this in detail.
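A minimal sketch of that dependency, assuming a hypothetical `Chunk` record: the question-side filter below can only work because the parser kept the page number on every chunk. The field names and the regex are illustrative, not the series' actual schema.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    page: int            # 1-based page number preserved at parse time
    section_index: str   # e.g. "2.1"
    heading: str


# Matches references such as "page 3" or "Page 12" in a user question.
PAGE_REF = re.compile(r"\bpage\s+(\d+)\b", re.IGNORECASE)


def filter_by_page_reference(question: str, chunks: list) -> list:
    """Narrow the candidate chunks when the question names a page."""
    match = PAGE_REF.search(question)
    if not match:
        return chunks  # no page mentioned: leave the scope untouched
    page = int(match.group(1))
    return [c for c in chunks if c.page == page]


# Invented chunks, for illustration only.
chunks = [
    Chunk("Parties and definitions.", 1, "1", "Definitions"),
    Chunk("Rent is due monthly.", 3, "2.1", "Rent"),
    Chunk("Late fees apply after ten days.", 3, "2.2", "Late payment"),
]
scoped = filter_by_page_reference("What does page 3 say about late fees?", chunks)
```

If the parser had dropped `page`, no amount of question parsing could recover this filter afterwards.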

Question scale is a separate question, not a tier on this axis. “How many PDFs are in your corpus, and are they homogeneous or heterogeneous?” is a data-side concern, picked up by section 3.2 of the diagnostic and developed in Part IV (Articles 14-17). Mixing it into the question axis blurs two different things, so it stays out.

1.3 From case to technique zone

Cross the two axes and every single-PDF RAG problem lands somewhere on the picture. Each region calls for a different stack. Most teams build for one or two regions and pretend the rest don’t exist. The grid below is a thinking tool, not a strict taxonomy: real problems often sit between two cases, and the boundaries between zones are fuzzy on purpose.

*Each case (a document tier × question tier) maps to the simplest technique that fits – Image by author*

The top-left corner (rows 1-2, columns A-B) is deterministic territory. Fixed templates, controlled questions. No LLM is needed for the field extraction itself; the LLM appears at most as a fallback when the template drifts. Most enterprise document workflows fall here, and most of them are over-engineered. The broker case from the opening is the canonical example: an LLM stack at sixty thousand euros a year when a hundred-line regex would do.

The middle band (rows 2-4, columns C-D) is single-document RAG. The chat-with-your-PDF use case every vendor demo shows. It’s real, it’s hard, and the rest of the series spends most of its time here. Chunking (splitting the document into searchable units), retrieval (picking the right ones), reranking (a precision pass on the shortlist), and evaluation (knowing it works) all matter when the document is heterogeneous and the question is open.
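One common way to combine a lexical ranking with an embedding ranking in this band is reciprocal rank fusion. The article does not say which fusion method the series uses, so treat this as a swapped-in illustration; the chunk ids and the two input rankings are hypothetical.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists of chunk ids into one ranking.

    Each chunk scores 1 / (k + rank) per list it appears in; k = 60 is a
    conventional default, not a tuned value.
    """
    scores: dict = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical top-3 results from each retriever.
lexical_ranking = ["c3", "c1", "c7"]
embedding_ranking = ["c1", "c4", "c3"]
fused = reciprocal_rank_fusion([lexical_ranking, embedding_ranking])
```

Chunks that both retrievers agree on rise to the top, which is the behaviour hybrid retrieval is after; a reranker would then make its precision pass on this shortlist.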

The bottom row (row 5, all columns) is vision territory. Charts, schematics, dense tables. A text parser loses the answer no matter how clever the retrieval is. Vision models fit here, and only here. Article 10 discusses when the vision step is worth its cost and when it isn’t.

Corpus-scale cases sit off the grid, since the grid is one PDF at a time. When the question targets many PDFs at once (“find every supplier contract with a liability cap below one million”), the diagnostic routes to Part IV (Articles 14-17): classification at ingestion, structured fields, SQL on the structured side, RAG on the residual unstructured questions.
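As a rough sketch of the structured side, assume the fields extracted at ingestion land in a table; the liability-cap question then becomes a plain SQL filter. The table name, column names, and rows below are hypothetical.

```python
import sqlite3

# In-memory database standing in for the structured store built at ingestion.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contracts ("
    "doc_id TEXT PRIMARY KEY, doc_type TEXT, liability_cap_eur REAL)"
)
conn.executemany(
    "INSERT INTO contracts VALUES (?, ?, ?)",
    [
        ("c-001", "supplier", 500_000),
        ("c-002", "supplier", 2_000_000),
        ("c-003", "lease", 250_000),
        ("c-004", "supplier", None),  # cap not extracted at ingestion
    ],
)


def supplier_contracts_below(cap_eur: float) -> list:
    """Return ids of supplier contracts whose extracted cap is below cap_eur."""
    rows = conn.execute(
        "SELECT doc_id FROM contracts "
        "WHERE doc_type = ? AND liability_cap_eur < ? ORDER BY doc_id",
        ("supplier", cap_eur),
    ).fetchall()
    return [row[0] for row in rows]


matches = supplier_contracts_below(1_000_000)
```

Note that the contract with a missing cap is silently excluded by the comparison. Those rows are the residual that still needs retrieval or a human look.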

The grid isn’t a recipe. It’s a sanity check. Locate your problem, look at the technique zone, and ask whether the system you’re building matches. If you’re building deeper than the case requires, you’re paying for nothing. If you’re building shallower, you’ll discover the gap in production.
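The three zones described in the prose can be written down as a small lookup. This sketch encodes only the regions the text names; cells the prose does not describe return a placeholder rather than a guess, and the function name and zone labels are hypothetical.

```python
def technique_zone(doc_tier: int, question_tier: str) -> str:
    """Map a (document tier, question tier) case to the zone named in the text."""
    if doc_tier not in range(1, 6) or question_tier not in {"A", "B", "C", "D"}:
        raise ValueError("doc_tier must be 1-5 and question_tier one of A-D")
    if doc_tier == 5:
        return "vision"                 # bottom row, all columns
    if doc_tier in (1, 2) and question_tier in ("A", "B"):
        return "deterministic"          # top-left corner
    if doc_tier in (2, 3, 4) and question_tier in ("C", "D"):
        return "single-document RAG"    # middle band
    # The prose does not assign these cells; consult the grid figure.
    return "see grid"


zone = technique_zone(1, "A")
```

A lookup like this is a sanity check in the same spirit as the grid: boundaries are fuzzy on purpose, so the result is a starting point for discussion, not a verdict.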

2. The techniques per case, and what isn’t a technique

Once you’ve placed your problem on the grid, you know roughly which family of techniques applies. The rest of the series develops each technique in detail.

*Each card is one technique with its dedicated article; read the ones that match your case, skip the rest – Image by author*

The deterministic family (regex, section anchors that locate a heading by name, coordinate-based extraction that pulls a field from a fixed bounding box on the page) doesn’t have its own article. It’s the baseline: every engineer reading this series should already know how to write a regex. The point of including it on the map is to remind you that it’s an option. When the structure of your input is fixed, it’s the option.

The single-document RAG family is what Parts II and III of the series are about. Layout-aware parsing (Article 5), question parsing and calibration (Article 6), retrieval as scope selection (Article 7), generation as controlled execution (Article 8), hybrid retrieval and TOC routing (Article 9), adaptive parsing including vision (Article 10), cross-references (Article 11), listing and synthesis (Article 12), composite pipelines with feedback loops (Article 13). Each of these is a technique you’ll reach for in the central band of the grid.

The corpus-scale family is Part IV. The corpus problem (Article 14), preparing a queryable corpus from a folder of PDFs (Article 15), the corpus ontology (Article 16), querying with SQL filter first and retrieval second (Article 17). These come in when you go from one PDF to a corpus of PDFs.

If your problem is in the top-left corner of the grid, you can stop reading the series after Article 5 (parsing) and skip ahead to Article 15 (preparing a queryable corpus). If your problem sits in the middle band, you’ll need Parts II and III. If your problem is corpus-scale, you’ll need Part IV on top of the foundation. The map tells you which.

2.1 Pick the simplest technique that works

The instinct of every engineering team is to build the most powerful pipeline they can justify. That instinct is wrong here. The right instinct is to pick the least powerful technique that solves the actual problem. Three reasons:

Cost: At two million docs a year, a regex on a VM is a rounding error; an LLM per doc is sixty thousand euros.
Latency: Microseconds vs seconds, the difference between “feels instant” and “feels like waiting”.
Reliability: A regex either matches or it doesn’t and the engineer can read the rule; an LLM produces answers that are sometimes subtly wrong, with failure modes that are harder to detect, which disqualifies it for audit-grade extraction.
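The cost figure above implies a per-document price, and the arithmetic is worth writing out. The two headline numbers come from the text; the 5% fallback rate is a hypothetical value chosen only to show how the bill changes when the LLM handles a small residual.

```python
DOCS_PER_YEAR = 2_000_000        # from the text
LLM_ANNUAL_COST_EUR = 60_000     # from the text

# Implied LLM cost per document: 60,000 / 2,000,000 = 0.03 EUR.
implied_cost_per_doc_eur = LLM_ANNUAL_COST_EUR / DOCS_PER_YEAR


def annual_llm_cost(docs_sent_to_llm: int, cost_per_doc_eur: float) -> float:
    """Yearly LLM spend when only some documents reach the model."""
    return docs_sent_to_llm * cost_per_doc_eur


# Hypothetical assumption: 5% of documents drift and need the LLM fallback.
fallback_docs = DOCS_PER_YEAR * 5 // 100
hybrid_cost_eur = annual_llm_cost(fallback_docs, implied_cost_per_doc_eur)
```

Under that assumed fallback rate, the LLM bill shrinks in proportion to the share of documents that actually need it, which is the economic argument for checking structural redundancy first.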

Most production document workflows land on a hybrid: a deterministic core handling the bulk cleanly, with an LLM fallback for the cases where the layout breaks. That hybrid is almost always the right shape, and almost never what teams build first.
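A minimal sketch of that hybrid shape, assuming a small registry of per-template patterns and a fallback callable. The template names, the patterns, and `stub_llm` are all hypothetical; a real fallback would call whichever model the team uses.

```python
import re
from typing import Callable, Optional

# Hypothetical registry: one pattern per known invoice template.
TEMPLATES = {
    "vendor_a": re.compile(r"Invoice No\.\s*(\d+)"),
    "vendor_b": re.compile(r"INV#\s*(\d+)"),
}


def extract_invoice_number(
    text: str, llm_fallback: Callable[[str], Optional[str]]
) -> dict:
    """Try each known template first; call the fallback only when none match."""
    for template_name, pattern in TEMPLATES.items():
        match = pattern.search(text)
        if match:
            return {"value": match.group(1), "method": f"regex:{template_name}"}
    return {"value": llm_fallback(text), "method": "llm_fallback"}


def stub_llm(text: str) -> Optional[str]:
    """Placeholder for a real model call; always reports 'not found'."""
    return None


known = extract_invoice_number("INV# 90210 due in 30 days", stub_llm)
drifted = extract_invoice_number("Ref: 2024/117 payable on receipt", stub_llm)
```

Recording the `method` alongside the value keeps the output auditable: every field says whether a readable rule or a model produced it.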

2.2 Long context isn’t a way out

Every few months someone announces that “RAG is dead” because context windows just got bigger. The argument: dump the whole document in the prompt and let the model figure it out.

This works for one document and one user. It doesn’t work in production for four reasons:

Wasteful: A typical question doesn’t need the whole document. The effective date of a contract sits on one page; sending the other thirty-nine pays for tokens that won’t be used.
Misses information: Models tend to use what’s at the start and end of a long context more reliably than what’s in the middle (Liu et al.), so the relevant page may be effectively ignored even when it’s in the prompt.
Doesn’t scale: Real use cases involve many documents. No context window will ever hold a corporate archive; at any meaningful scale you have to choose what to send, and that choice is retrieval.
No grounded answer: Without explicit retrieval and citation, you can’t tell which part of the document the answer came from, you can’t verify it, you can’t audit it. For any enterprise use case where the answer has to be traceable, that’s disqualifying.

Long contexts are useful as a tool, especially for single-document deep analysis. They’re not a substitute for retrieval. Anyone telling you otherwise is selling something.

2.3 Fancy techniques are usually keyword work in disguise

Techniques sold as “advanced” often turn out to be keyword work in another form, and often the wrong form. HyDE (Hypothetical Document Embeddings, Gao et al., 2022) is the clearest example. The protocol asks an LLM to write the hypothetical document that would answer the query, then retrieves against the embedding of that hypothetical. The pitch is that the hypothetical carries the vocabulary a real answer would use, widening the cosine margin.

The companion notebook tests this on the Attention paper: ask why multi-head attention, let HyDE generate its passage, compare against the actual vocabulary of section 3.2.2. The two lists overlap on exactly one word, the section title. HyDE writes ML-textbook vocabulary (semantic relationships, contextual dependencies, parallel processing, attention patterns); the paper writes operational vocabulary (attention layers, encoder-decoder attention, different positions, linear transformations).

HyDE understood the question. It never read the document. In enterprise the keywords exist somewhere on the page and the domain expert who has read the page knows them. HyDE pays per query to invent vocabulary that often doesn’t even land on the page. The expert dictionary (Article 6), a curated list of the corpus’s actual vocabulary built once with the domain expert, gets the same job done at a fraction of the cost, reused across every future question.
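The vocabulary comparison can be sketched as a set overlap. The term lists below reuse the phrases quoted in this section and assume the single shared term is the section title, Multi-Head Attention; they illustrate the measurement and are not the notebook's actual output.

```python
def vocabulary_overlap(generated_terms: list, actual_terms: list) -> tuple:
    """Return the shared terms and the Jaccard overlap of two term lists."""
    generated = {term.lower() for term in generated_terms}
    actual = {term.lower() for term in actual_terms}
    shared = generated & actual
    union = generated | actual
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard


# Terms quoted in the text, plus the assumed shared section title.
hyde_terms = [
    "Multi-Head Attention",
    "semantic relationships",
    "contextual dependencies",
    "parallel processing",
    "attention patterns",
]
paper_terms = [
    "Multi-Head Attention",
    "attention layers",
    "encoder-decoder attention",
    "different positions",
    "linear transformations",
]
shared_terms, overlap = vocabulary_overlap(hyde_terms, paper_terms)
```

The same function works for auditing an expert dictionary: run it against the document's vocabulary and a low overlap tells you the list was written from memory, not from the page.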

2.4 Letting the LLM pick the case

Each combination of document tier and question tier is an elementary case, with one matching technique. In Volume 1, the engineer picks the case at compile-time and ships the technique. The dispatcher (Article 13) encodes the team’s routing knowledge in Python; the LLM reviews outputs inside fixed loops; every brick is auditable. That’s enough for the vast majority of enterprise RAG.

A natural extension has the LLM itself pick the case at runtime, looking at the question, classifying it into a case, and choosing the technique to apply. That’s what the industry in 2026 calls agentic RAG. Volume 3 (Agentic Bricks) builds that runtime-pick layer on top of the bricks Volume 1 produces. The shift is about who decides when, not about the bricks themselves: agentic stacks still reach for the same parsing, retrieval, and generation primitives that Volume 1 audits and tests.

3. Locate your case, in practice

3.1 Position the system around the expert who exists

The diagnostic below needs one input most teams skip: who is the user of this system?

For almost all enterprise RAG, the answer is the expert who already knows the documents. Not an open-domain user typing any question. Not a curious browser exploring a public archive. The lawyer reading a contract. The underwriter checking a quote. The compliance officer auditing a clause. Someone who has read documents like these for years, and who knows the vocabulary, the cases where one term means two things, and the failure modes to watch for.

The job of the system is then clear: amplify that expert, not replace them. Codify their vocabulary, their disambiguations, their year-by-year heuristics. Let the pipeline handle the volume; let the expert stay the source of truth.

This matters before the grid, because it changes which cases are realistic. A team that says “anyone can ask anything across the whole archive” is choosing the bottom-right case by default: open question, mixed corpus, the hardest one. A team that says “our underwriter checks a known field on a known document type” is choosing the top-left, often regex territory.

The framing is rarely a property of the documents or the questions. It’s a choice the team makes. Most teams inherit it from consumer chatbots without noticing. First, position the system around the expert who is already there. Then read the case on the grid the answer points to.

3.2 The diagnostic questions

Before writing any code, work through these questions. Out loud, in front of a whiteboard, with the domain experts in the room.

About the documents: How alike are they across the corpus? Native text or OCR? How many PDFs do you have, and are they homogeneous or heterogeneous? (This is where corpus-scale concerns enter the diagnostic; they route to Part IV.) Static or daily ingestion? Where on the document axis do they sit?

About the questions: Who frames them? An engineer at design time, or a user at run time? Is the system one-shot or can it ask back for clarification? Is the answer always in one document, or distributed across several? What does no answer mean: acceptable, or unacceptable? Where on the question axis do they sit?

About the constraints: Does the answer need to be traceable to the source? How precise (best-effort, or audit-grade: every citation traceable to a source line, every answer replayable)? What is the cost budget per document? Often the difference between regex and LLM is the difference between profitable and not.

The answers point you to a case on the grid. The case points you to a technique zone. The technique zone points you to the articles in the rest of the series you’ll need.

3.3 Common enterprise cases on the grid

A handful of patterns show up repeatedly in real engagements. Most readers will recognize themselves in one of these.

Field extraction from a fixed-template form: Think insurance certificates from one broker, KYC forms from one bank, tax filings from one administration: the same software writes the same layout on every page. Case: document tier 1, question A, top-left corner. Stack: regex on coordinate-addressable fields, with an LLM fallback for the rare drift. The classic playbook is overkill here, and that’s the most common mistake we meet in real projects.

Field extraction across template variants: Think invoices across hundreds of suppliers, leases across landlords, employment contracts across companies in the same legal framework: every document follows one of a handful of recognizable patterns. Case: document tier 2, question A or B. Stack: a regex per recognized template, plus a few-shot LLM extraction when the document doesn’t match anything in the registry. Classification before extraction.

Q&A on a long custom contract: Each contract is structured differently, sections vary, ten-page glossaries don’t repeat. The user asks free-form questions about the contract in front of them. Case: document tier 3, question C or D, middle band. Stack: full single-document RAG with TOC routing, hybrid retrieval, schema-driven generation. This is where the four bricks of the series each carry their own weight.

Reading a slide deck or a schematic: Think engineering drawings, financial decks where data lives in the chart, technical specifications with embedded images: pure-text parsing loses the answer outright. Case: document tier 5, any question column, bottom row. Stack: vision-capable model on the page image, combined with text-side RAG for the prose around the visuals.

Off the grid – corpus territory: “Find every supplier contract with a liability cap below one million” on hundreds or thousands of contracts. The single-PDF grid stops being the right frame; the question targets the corpus, not one document. Stack: field extraction at ingestion, structured fields stored in a database, SQL on the structured side, RAG only as a fallback for the residual unstructured questions. Articles 14-17 (Part IV) develop this.

Off the grid – no structure to anchor on: a novel, an intent classification, sarcasm detection. The document has no structure, the vocabulary has no characteristic words, and the question requires understanding tone or intent rather than locating a passage. Stack: an LLM that scans the whole text paragraph by paragraph, deciding what to flag. Not a RAG problem in Volume 1’s sense; section 2.4 hints at where this kind of runtime decision-making belongs (Volume 3).

If your case doesn’t quite fit any of these, walk the diagnostic in section 3.2 and the result will tell you which of the patterns above is closest.

4. Conclusion

Run the diagnostic on your own corpus before writing code, ideally with the domain experts in the room. The output is the list of articles in the rest of the series you need to read, and the list you can skip. Teams that get RAG to ship in production are the ones that placed their problem on the grid first. Teams still tuning six months in are usually the ones that started building before they did.

The next article opens Part II with the first brick: document parsing. Everything lost there can’t be recovered later, no matter how clever the retrieval.

5. Sources and further reading

The two-axis grid is a map of where each technique fits across document complexity and question control on a single PDF. The long-context-doesn’t-replace-retrieval claim the grid leans on is grounded by Liu et al. (Lost in the Middle, TACL 2024) and Lee et al. (long-context benchmark, 2024). The vision row maps to Faysse et al. (ColPali, 2024). The HyDE demo uses the technique from Gao et al. (HyDE, 2022). The agentic extension hinted at in section 2.4 (the LLM picking the case at runtime) is the direction Volume 3 develops on top of the bricks built here.

Same direction as the article:

Liu et al., Lost in the Middle: How Language Models Use Long Contexts, TACL 2024 (arXiv:2307.03172). Models systematically miss information mid-input. Supports the claim that long context is not a way out.
Lee et al., Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?, 2024 (arXiv:2406.13121). Concrete data on where long-context replaces retrieval and where it breaks.
Faysse et al., ColPali: Efficient Document Retrieval with Vision Language Models, 2024 (arXiv:2407.01449). Vision-language retrieval on the page image itself. Anchors the visual row of the grid.
Gao et al., Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE), 2022 (arXiv:2212.10496). The hypothetical-document-embedding technique tested in section 2.3.

Different angle, different context:

Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models, ICLR 2023 (arXiv:2210.03629). Founding paper of the LLM-picks-tools-at-runtime line. Volume 3 develops this on top of the bricks Volume 1 builds.
Schick et al., Toolformer: Language Models Can Teach Themselves to Use Tools, NeurIPS 2023 (arXiv:2302.04761). Same direction as ReAct.
Gao et al., Retrieval-Augmented Generation for Large Language Models: A Survey, 2024 (arXiv:2312.10997). RAG survey; treats RAG as one paradigm with shared concerns (retriever quality, generator faithfulness).