semantic chunker ignores overlap — produces duplicate runs across overlap sweep values #44

Description

@entzyeung

Summary

When a sweep includes the semantic chunking method together with multiple overlaps, every overlap value produces byte-for-byte identical chunks, because semantic never receives the overlap parameter.

Where

  • server/core/chunkers/__init__.py:39 dispatches chunk_semantic(text, chunk_size) — the overlap argument is dropped.
  • server/core/chunkers/semantic.py:9: def chunk_semantic(text, chunk_size) doesn't accept overlap.
  • server/models/config.py: expand_sweep still builds the Cartesian product chunk_sizes × overlaps for all methods, including semantic.

Impact

With e.g. overlaps: [50, 100, 150], the three semantic runs are identical but are each embedded, stored in Atlas, queried, and scored — 3× the API/storage/compute cost, plus three indistinguishable rows in the results table that look like a comparison but aren't.

Reproduce

Run any sweep with methods: [semantic] and overlaps: [50, 100, 150] → three runs, identical chunk output.
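
For illustration, the behaviour can be sketched outside the project with stand-in functions. This is not the repository's code: the body of chunk_semantic and the dispatch helper below are hypothetical placeholders that only mirror the call shape described under "Where" (overlap is accepted by the dispatcher but never forwarded).

```python
# Stand-in for the real semantic chunker: it only mirrors the signature
# chunk_semantic(text, chunk_size). The slicing logic is a placeholder.
def chunk_semantic(text, chunk_size):
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Hypothetical dispatcher with the same flaw as described in the issue:
# overlap is a parameter here but is dropped for the semantic method.
def dispatch(method, text, chunk_size, overlap):
    if method == "semantic":
        return chunk_semantic(text, chunk_size)  # overlap is not passed on
    raise ValueError(f"unknown method: {method}")


text = "One sentence here. Another sentence there. A third one follows. " * 10
outputs = [dispatch("semantic", text, 200, overlap) for overlap in (50, 100, 150)]

# Every overlap value yields exactly the same chunks.
identical = all(out == outputs[0] for out in outputs)
print(identical)  # True
```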

Proposed fix

Implement a real overlap for semantic chunking: carry the trailing sentence(s) of each semantic group into the start of the next group (sentence-granular, consistent with how the sentence chunker handles overlap). This makes overlap a meaningful dimension for semantic rather than a no-op.
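
A minimal sketch of what this could look like, under assumptions that are mine rather than the codebase's: groups arrive as lists of sentences, overlap is a character budget, and trailing sentences of the previous group are carried over as long as they fit within that budget. The name carry_overlap is hypothetical, and the actual sentence chunker may count overlap differently.

```python
def carry_overlap(groups, overlap):
    """Prefix each group with trailing sentences of the previous group.

    groups:  list of semantic groups, each a list of sentence strings
             (assumed input shape, not confirmed by the codebase).
    overlap: character budget for the carried sentences (assumed unit).
    """
    if overlap <= 0:
        return [list(group) for group in groups]

    result = []
    previous = []  # the previous group as originally produced, without carry
    for group in groups:
        tail = []
        budget = overlap
        # Walk the previous group backwards, keeping whole sentences only.
        for sentence in reversed(previous):
            if len(sentence) > budget:
                break
            tail.insert(0, sentence)
            budget -= len(sentence)
        result.append(tail + list(group))
        previous = group
    return result


groups = [["Aaaa.", "Bbbbbb."], ["Cc."], ["Dddd."]]
small = carry_overlap(groups, 7)
large = carry_overlap(groups, 12)
```

With this input, a budget of 7 carries only "Bbbbbb." into the second group, while a budget of 12 carries both "Aaaa." and "Bbbbbb.", so different overlap values produce different chunks. Carrying from the original previous group (not the already-extended one) keeps the overlap from compounding across groups.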

Alternative considered: dedupe semantic in expand_sweep so it runs once. Rejected in favour of giving overlap real meaning, but happy to go that route if you prefer.
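
For comparison, the rejected alternative could be sketched roughly as follows. The function name expand_sweep_deduped, the OVERLAP_IGNORING set, and the dict-shaped run configs are all hypothetical; the real expand_sweep in server/models/config.py may be structured quite differently.

```python
from itertools import product

# Assumption: methods listed here ignore the overlap parameter entirely.
OVERLAP_IGNORING = {"semantic"}


def expand_sweep_deduped(methods, chunk_sizes, overlaps):
    """Expand a sweep, collapsing overlap for methods that ignore it."""
    runs = []
    seen = set()
    for method, chunk_size, overlap in product(methods, chunk_sizes, overlaps):
        # For overlap-ignoring methods, treat every overlap as the same run.
        effective = None if method in OVERLAP_IGNORING else overlap
        key = (method, chunk_size, effective)
        if key in seen:
            continue
        seen.add(key)
        runs.append({"method": method, "chunk_size": chunk_size, "overlap": effective})
    return runs


semantic_only = expand_sweep_deduped(["semantic"], [512], [50, 100, 150])
mixed = expand_sweep_deduped(["sentence", "semantic"], [512], [50, 100])
```

Here semantic_only contains a single run instead of three, and mixed contains two sentence runs plus one semantic run. This removes the duplicated cost but leaves overlap meaningless for semantic, which is why the proposed fix is preferred.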

I'm happy to open a PR for the proposed fix if you're on board with the direction.
