Summary
When a sweep includes the semantic chunking method together with multiple overlaps, every overlap value produces byte-for-byte identical chunks, because semantic never receives the overlap parameter.
Where
server/core/chunkers/__init__.py:39 dispatches chunk_semantic(text, chunk_size) — the overlap argument is dropped.
server/core/chunkers/semantic.py:9 — def chunk_semantic(text, chunk_size) doesn't accept overlap.
server/models/config.py: expand_sweep still takes the Cartesian product chunk_sizes × overlaps for every method, including semantic.
Impact
With e.g. overlaps: [50, 100, 150], the three semantic runs are identical but are each embedded, stored in Atlas, queried, and scored — 3× the API/storage/compute cost, plus three indistinguishable rows in the results table that look like a comparison but aren't.
Reproduce
Run any sweep with methods: [semantic] and overlaps: [50, 100, 150] → three runs, identical chunk output.
Proposed fix
Implement a real overlap for semantic chunking: carry the trailing sentence(s) of each semantic group into the start of the next group (sentence-granular, consistent with how the sentence chunker handles overlap). This makes overlap a meaningful dimension for semantic rather than a no-op.
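A rough sketch of what that could look like, to make the proposal concrete. The function name apply_sentence_overlap is hypothetical, and I'm assuming that semantic groups are available as lists of sentences and that overlap is a character budget; neither is confirmed against the current code, so treat this as an illustration rather than the final implementation.

```python
from typing import List


def apply_sentence_overlap(groups: List[List[str]], overlap: int) -> List[List[str]]:
    """Prepend trailing sentences of the previous group to each group.

    Hypothetical helper. Assumes `groups` is the list of semantic groups
    (each a list of sentences) and `overlap` is a character budget.
    Only whole sentences are carried, so the result is sentence-granular.
    """
    if overlap <= 0:
        return [list(group) for group in groups]

    result: List[List[str]] = []
    previous: List[str] = []
    for group in groups:
        carried: List[str] = []
        budget = overlap
        # Walk the previous group backwards, taking whole sentences
        # until the next one would exceed the remaining budget.
        for sentence in reversed(previous):
            if len(sentence) > budget:
                break
            carried.insert(0, sentence)
            budget -= len(sentence)
        result.append(carried + list(group))
        # Carry from the original group only, so overlap never compounds.
        previous = group
    return result
```

One caveat with this approach: if the budget is smaller than the last sentence of a group, nothing is carried, so very small overlap values could still yield identical chunks.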
Alternative considered: dedupe semantic in expand_sweep so it runs once. Rejected in favour of giving overlap real meaning, but happy to go that route if you prefer.
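For comparison, the dedupe alternative could be as small as the sketch below. expand_runs and OVERLAP_AGNOSTIC are made-up names, and the tuple shape is an assumption; I haven't matched this to the real expand_sweep signature or run model.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

# Hypothetical: methods whose output does not depend on overlap.
OVERLAP_AGNOSTIC = frozenset({"semantic"})

Run = Tuple[str, int, Optional[int]]


def expand_runs(
    methods: Sequence[str],
    chunk_sizes: Sequence[int],
    overlaps: Sequence[int],
) -> List[Run]:
    """Expand a sweep, emitting overlap-agnostic methods once per chunk size."""
    runs: List[Run] = []
    for method in methods:
        # Collapse the overlap dimension to a single placeholder
        # for methods that ignore it.
        method_overlaps = [None] if method in OVERLAP_AGNOSTIC else list(overlaps)
        for chunk_size, overlap in product(chunk_sizes, method_overlaps):
            runs.append((method, chunk_size, overlap))
    return runs
```

With methods [semantic, sentence], one chunk size, and overlaps [50, 100, 150], this would produce one semantic run and three sentence runs instead of six runs in total.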
I'm happy to open a PR for the proposed fix if you're on board with the direction.