Links:
Paper | Code | Data
LumberChunker lets an LLM decide where a long story should be split, creating more natural chunks that help Retrieval-Augmented Generation (RAG) systems retrieve the right information.
Introduction
Long-form narrative documents usually have an explicit structure, such as chapters or sections, but these units are often too broad for retrieval tasks. At a lower level, important semantic shifts happen within these larger segments without any visible structural break. When we split text solely by formatting cues, like paragraphs or fixed token windows, passages that belong to the same narrative unit may be separated, while unrelated content may be grouped together. This misalignment between structure and meaning produces chunks that contain incomplete or mixed context, which reduces retrieval quality and hurts downstream RAG performance. Consequently, segmentation should aim to create chunks that are semantically independent, rather than relying solely on document structure.
So how can we preserve the story's flow and still keep chunking practical?
In many cases, a reader can easily recognize where the narrative begins to shift: for example, when the text moves to a different scene, introduces a new entity, or changes its purpose. The problem is that most automated chunking methods do not take this semantic signal into account and instead rely solely on surface structure. As a result, they may produce segmentations that look reasonable from a formatting perspective but break the underlying narrative coherence.
To make this concrete, consider the short passage below and identify the optimal chunking boundary!
The LumberChunker Method
In the example above, Option C provides the most coherent segmentation. The boundary aligns with the point where the narrative becomes semantically independent from the preceding context.
Our goal is to make this kind of segmentation decision practical at scale. The challenge is that human-quality boundary detection requires understanding narrative context, which is expensive to apply across thousands of paragraphs in long-form documents.
LumberChunker approaches this by treating segmentation as a boundary-finding problem: given a short sequence of consecutive paragraphs, we ask a language model to identify the earliest point where the content clearly shifts. This formulation allows segments to vary in length while remaining aligned with the underlying narrative structure. In practice, LumberChunker consists of the following steps:
1) Document Paragraph Extraction
Cleanly split the book into paragraphs and assign stable IDs (ID:1, ID:2, …). This preserves the document's natural discourse units and gives us safe candidate boundaries.
Example: From a novel, we extract:
- ID:1 “The morning sun filtered through the dusty windows…”
- ID:2 “She walked slowly to the door, hesitating…”
- ID:3 “Meanwhile, across town, Detective Morrison reviewed the case files…”
- ID:4 “The previous night's events had left him puzzled…”

Each paragraph gets a unique ID for tracking boundaries.
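This step can be sketched in a few lines of Python. The function name and the blank-line splitting rule below are illustrative assumptions, not taken from the paper:

```python
import re

def extract_paragraphs(text: str) -> list[tuple[int, str]]:
    # Split on blank lines and assign stable, 1-based paragraph IDs.
    paras = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return list(enumerate(paras, start=1))

book = (
    "The morning sun filtered through the dusty windows...\n\n"
    "She walked slowly to the door, hesitating...\n\n"
    "Meanwhile, across town, Detective Morrison reviewed the case files..."
)
for pid, para in extract_paragraphs(book):
    print(f"ID:{pid} {para}")
```

Real books may need extra cleaning (headers, page numbers) before this split, which is the "carefully cleaned" part of the pipeline.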
2) ID Grouping for the LLM
Build a group G_i by appending paragraphs until the group's length reaches a token budget θ. This provides enough context for the model to judge when a topic or scene truly shifts.
Example: With θ = 550 tokens, we might build:

G_1 = [ID:1, ID:2, ID:3, ID:4, ID:5, ID:6]

By spanning several paragraphs, this window increases the chance that at least one meaningful narrative shift is present within the context.
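A minimal sketch of this grouping step, using a whitespace word count as a crude stand-in for a real tokenizer (the paper counts model tokens):

```python
def build_group(paragraphs, start, theta=550):
    """Append consecutive (id, text) paragraphs from `start` until the
    cumulative length reaches the token budget theta."""
    group, total = [], 0
    for pid, text in paragraphs[start:]:
        group.append(pid)
        total += len(text.split())  # crude token proxy
        if total >= theta:
            break
    return group

paras = [(i, "word " * 100) for i in range(1, 11)]  # ten 100-word paragraphs
print(build_group(paras, 0, theta=550))  # → [1, 2, 3, 4, 5, 6]
```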
3) LLM Query
Prompt the model with the paragraphs in G_i and ask it to return the first paragraph where the content clearly changes relative to what came before. Use the returned ID as the chunk boundary; start the next group at that paragraph and repeat until the end of the book.
Example: Given G_1 = [p1, p2, p3, p4, p5, p6], the LLM responds: p3
Answer Extraction:
We extract p3 as the boundary. This creates:
- Chunk 1: [p1, p2]
- Next group (G_2) starts at p3
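The full loop over the three steps can be sketched as below. Here `ask_llm` is a hypothetical callable standing in for the prompted model, and the guard against a degenerate answer (the model naming the first paragraph of the window) is our own addition:

```python
def lumber_chunk(paragraphs, ask_llm, theta=550):
    """Chunk `paragraphs` (a list of (id, text) pairs) by repeatedly asking
    an LLM for the earliest paragraph where the content shifts."""
    count = lambda s: len(s.split())  # crude token proxy
    chunks, start, n = [], 0, len(paragraphs)
    while start < n:
        # Build group G_i: take paragraphs until the token budget is reached.
        end, total = start, 0
        while end < n and total < theta:
            total += count(paragraphs[end][1])
            end += 1
        group = paragraphs[start:end]
        if end == n:
            chunks.append(group)  # last window: the remainder is the final chunk
            break
        boundary_id = ask_llm(group)  # hypothetical LLM call returning an ID
        ids = [pid for pid, _ in group]
        cut = ids.index(boundary_id) if boundary_id in ids else len(group)
        if cut == 0:
            cut = len(group)  # guard: never emit an empty chunk
        chunks.append(group[:cut])
        start += cut  # next group starts at the boundary paragraph
    return chunks

# Toy run with a mock "LLM" that always answers paragraph 3.
paras = [(i, "word " * 10) for i in range(1, 7)]
chunks = lumber_chunk(paras, ask_llm=lambda g: 3, theta=35)
print([[pid for pid, _ in c] for c in chunks])  # → [[1, 2], [3, 4, 5, 6]]
```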
GutenQA: A Benchmark for Long-Form Narrative Retrieval
To evaluate our chunking approach, we introduce GutenQA, a benchmark of 100 carefully cleaned public-domain books paired with 3,000 needle-in-a-haystack style questions. This allows us to measure retrieval quality directly and then observe how better retrieval leads to more accurate answers in a RAG system.
Key Findings
Retrieval: LumberChunker leads ⭐
LumberChunker leads on both DCG@k and Recall@k. By k=20, it reaches DCG ≈ 62.1% and Recall ≈ 77.9%, showing that better segmentation improves not only which passages appear first, but also how reliably the right context is retrieved.
Retrieval Performance Comparison (DCG@k, %)
| Method | k=1 | k=2 | k=5 | k=10 | k=20 |
|---|---|---|---|---|---|
| Semantic Chunking | 29.50 | 35.31 | 40.67 | 43.14 | 44.74 |
| Paragraph-Level | 36.54 | 42.11 | 45.87 | 47.72 | 49.00 |
| Recursive Chunking | 39.04 | 45.37 | 50.66 | 53.25 | 54.72 |
| HyDE† | 33.47 | 39.74 | 45.06 | 48.14 | 49.92 |
| Proposition-Level | 36.91 | 42.42 | 44.88 | 45.65 | 46.19 |
| LumberChunker | 48.28 | 54.86 | 59.37 | 60.99 | 62.09 |
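For reference, the two metrics behind these numbers can be computed as follows. This sketch assumes binary relevance (a retrieved chunk either contains the gold passage or it doesn't), which fits the needle-in-a-haystack setup; the exact matching criterion in the paper may differ:

```python
import math

def dcg_at_k(rels: list[int], k: int) -> float:
    # Discounted cumulative gain: a hit at rank r contributes 1 / log2(r + 1).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def recall_at_k(rels: list[int], k: int) -> float:
    # 1.0 if the gold passage appears anywhere in the top k, else 0.0.
    return float(any(rels[:k]))

ranks = [0, 0, 1, 0, 0]  # gold chunk retrieved at rank 3
print(dcg_at_k(ranks, 5))     # → 0.5  (i.e. 1 / log2(4))
print(recall_at_k(ranks, 5))  # → 1.0
```

Averaging these per-question scores over the 3,000 GutenQA questions yields table entries like the ones above.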
Downstream QA: Targeted Retrieval Outperforms Large Context Windows
We find that even with very large context windows, a non-retrieval setup still performs worse than RAG, showing that selecting focused, relevant passages is more effective than simply increasing the amount of raw context. In this setting, when integrated into a standard RAG pipeline on a GutenQA subset, our RAG-LumberChunker is second only to RAG-Manual, which uses hand-segmented ground-truth chunks.
Downstream QA Accuracy (%)
A Sweet Spot Around θ ≈ 550 Tokens
We sweep θ ∈ [450, 1000] tokens and find that θ ≈ 550 consistently maximizes retrieval quality: large enough to provide context, small enough to keep the model focused on the current turn in the story.
DCG@k vs. Token Budget (θ)
This does not mean the resulting chunks are large. In practice, as the table shows, the average chunk size is about 334 tokens, meaning that LumberChunker often detects a semantic shift well before the window's token budget is exhausted.
| Method | Avg. #Tokens / Chunk | Total #Chunks |
|---|---|---|
| Semantic Chunking | 185 tokens | 191059 |
| Paragraph-Level | 79 tokens | 248307 |
| Recursive Chunking | 399 tokens | 31787 |
| Proposition-Level | 12 tokens | 914493 |
| LumberChunker | 334 tokens | 36917 |
Conclusion
LumberChunker reframes document chunking as a semantic boundary detection problem. Instead of relying on fixed token limits or surface structure, it uses a rolling context window to identify the earliest point where the meaning of the text becomes independent from what came before, producing segments that better align with the underlying narrative structure.
On the GutenQA benchmark, LumberChunker consistently improves retrieval and downstream QA over traditional fixed-size and recursive methods, approaching the quality of manual, human-curated segmentations.
These results suggest that segmentation is not just a preprocessing step, but a core design choice for retrieval systems. By creating semantically independent chunks, LumberChunker provides a practical way to improve how long-form documents are retrieved and used in RAG pipelines.
Citation
If you find LumberChunker useful in your research, please consider citing:
@inproceedings{duarte-etal-2024-lumberchunker,
    title = "{L}umber{C}hunker: Long-Form Narrative Document Segmentation",
    author = "Duarte, Andr{\'e} V. and Marques, Jo{\~a}o DS and Gra{\c{c}}a, Miguel and Freire, Miguel and Li, Lei and Oliveira, Arlindo L.",
    editor = "Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.377/",
    doi = "10.18653/v1/2024.findings-emnlp.377",
    pages = "6473--6486",
    abstract = "LumberChunker reframes document chunking as a semantic boundary detection problem..."
}
Blog created by Raymond Jiang and André Duarte
