Patot

פָּת֤וֹת אֹתָהּ֙ פִּתִּ֔ים וְיָצַקְתָּ֥ עָלֶ֖יהָ שָׁ֑מֶן מִנְחָ֖ה הִֽוא

Patot is a Python toolkit for Hebrew/English-aware semantic chunking. It chunks texts into semantically coherent, model-ready units for downstream AI workflows such as embedding, retrieval, and question answering.

Sefaria is an open-source library and database of Jewish texts. To download Sefaria data for use with Patot, visit developers.sefaria.org.

Why Patot Exists

Working with long-form Jewish texts in AI systems requires balancing three goals:

Preserve meaning — chunks should follow shifts in topic, not arbitrary character counts.
Preserve structure — Sefaria's segment boundaries are meaningful and should be respected.
Respect model limits — every chunk must fit the embedding model's token constraints.

Patot implements a practical multi-pass chunking pipeline to satisfy all three.

Core Chunking Approach

Patot processes one Sefaria section at a time.

Input: ordered Sefaria segments for a single section.
Chunks may combine adjacent segments within that section.
Chunks never cross section boundaries.

Pass 1: Inter-segment semantic chunking

Patot uses Aurelio Labs' semantic-chunkers library (specifically StatisticalChunker) over the ordered segment list.

Each Sefaria segment is treated as an atomic unit.
Segment embeddings are compared against local semantic context.
Split points are chosen where semantic continuity drops.
Semantic continuities/drops are based on comparing Gemini embeddings of the adjacent windows
The chunker auto-selects thresholds to keep median chunk size near a configured token target.

Result: coherent multi-segment chunks where appropriate, while preserving original segment boundaries.

Pass 2: Intra-segment chunking (singleton-only)

Only singleton segments left untouched by Pass 1 are eligible for further splitting.

The segment is split into sentence/clause units.
The same statistical chunking method is applied inside that segment.

Result: every output chunk is either a combination of whole Sefaria segments, or a subdivision of one single segment. Patot never mixes partial text from multiple segments in one chunk.

Pass 3: Hard token limit enforcement

Semantic chunking optimizes coherence but does not strictly guarantee maximum token size compliance. Patot performs a final validation against max_split_tokens.

Oversized multi-segment chunks are split on segment boundaries.
Oversized single-segment chunks are split into fixed token windows as a final fallback.

Result: all chunks are semantically informed and model-safe.

Design Guarantees

No cross-section chunking
No splitting inside grouped multi-segment chunks after Pass 1
No chunk containing partial text from multiple segments
All output chunks bounded by max_split_tokens

Conceptual Example

Given one section with ordered segments:

Segments 1–3 cover one topic → grouped together.
Segment 4 shifts topic → split into a new chunk.
Segment 5 is very long and standalone → internally chunked by sentence/clause semantics.
Any chunk exceeding hard token limits → split safely in final enforcement.

Use Cases

Patot is suitable for:

RAG pipelines over Sefaria corpora
Source-aware Q&A assistants
Curriculum and class prep tools
Topic clustering and thematic analysis
Source sheet recommendation workflows

Installation

Patot is not on PyPI. Install directly from GitHub:

pip install "patot[chunking,pdf] @ git+https://github.com/Sefaria/patot@v0.1.0"

Core install (pip install patot) provides JSON segment loading and the Gemini embedding client/cache.
[chunking] adds the statistical chunker (PatotChunker), which depends on transformers, huggingface-hub, semantic-chunkers, and semantic-router.
[pdf] adds patot.debug_report for rendering chunking debug traces to PDF, which depends on reportlab and python-bidi.

Usage

from patot import ChunkerConfig, PatotChunker, load_segment_records_from_section

config = ChunkerConfig(debug=False)
chunker = PatotChunker(api_key="...", config=config)  # Google Gemini API key
result = chunker.chunk_segments(segment_records)

Development

pip install -e ".[chunking,pdf]"
pip install pytest
pytest

External Dependency

Patot's semantic-first strategy is built on semantic-chunkers by Aurelio Labs.

License

GPL-3.0

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
.github/workflows		.github/workflows
assets		assets
src/patot		src/patot
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Patot

Why Patot Exists

Core Chunking Approach

Pass 1: Inter-segment semantic chunking

Pass 2: Intra-segment chunking (singleton-only)

Pass 3: Hard token limit enforcement

Design Guarantees

Conceptual Example

Use Cases

Installation

Usage

Development

External Dependency

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Patot

Why Patot Exists

Core Chunking Approach

Pass 1: Inter-segment semantic chunking

Pass 2: Intra-segment chunking (singleton-only)

Pass 3: Hard token limit enforcement

Design Guarantees

Conceptual Example

Use Cases

Installation

Usage

Development

External Dependency

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages