
Commit 309a28c

tbitcs and oz-agent committed
feat: integrate Merkur patents — architecture, requirements, test specs
Patent: US 2024/0248922 A1 (Michael Merkur, filed Jan 2024)
Title: System and Methods for Searching Text Utilizing Categorical Touch Inputs

Core concepts integrated into Glossa Lab architecture:

1. Kandles phonetic color-coding system (7 consonant groups → 7 colors)
   - Visual 'phonetic fingerprints' of text via color-coded grids
   - Cross-language comparison through shared sound→color mapping
   - Complements entropy analysis: entropy says 'is it language?', Kandles shows 'what does the language look like?'
2. Hierarchical text decomposition (volumes → stories → slices → blocks)
   - Structured corpus navigation for ancient inscriptions
   - Each level independently analysable
3. Semantic cluster tagging (Culture, Nations, Nature, Religion, etc.)
   - Concept-based text exploration with configurable taxonomy
   - Manual tagging for human-in-the-loop annotation

Architecture updates:
- DEC-007: Merkur patent integration (accepted)
- 4-layer analysis framework (Statistical → Structural → Visual → Semantic)
- Pipeline architecture diagram showing full analysis flow
- Kandles phonetic mapping specification

New requirements (11):
- REQ-PIPE-001..003: pipeline engine (implemented)
- REQ-KDL-001..004: Kandles system (draft)
- REQ-HTD-001..002: hierarchical decomposition (draft)
- REQ-SEM-001..002: semantic tagging (draft)

New test specifications (11):
- TEST-PIPE-001..003, TEST-KDL-001..004, TEST-HTD-001..002, TEST-SEM-001..002

Patent PDFs copied to docs/patents/ with reference README.

Co-Authored-By: Oz <oz-agent@warp.dev>
1 parent a3093ee commit 309a28c

6 files changed

Lines changed: 502 additions & 0 deletions


docs/REQUIREMENTS.md

Lines changed: 143 additions & 0 deletions
@@ -347,3 +347,146 @@ The `version` field in the health endpoint response MUST match the version decla
347347
- **Testable:** yes
348348
- **Test:** TEST-INT-002
349349
- **Status:** draft
350+
351+
---
352+
353+
## Analysis Pipelines
354+
355+
Components: `PIPE` — Pipeline/analysis engine
356+
357+
### REQ-PIPE-001 — Block entropy pipeline
358+
359+
The system MUST compute normalised block entropy H_N/ln(L) for block sizes N=1..6 on any uploaded text corpus. Results MUST include raw (nats) and normalised values.
360+
361+
- **Priority:** P1
362+
- **Platform:** all
363+
- **Testable:** yes
364+
- **Test:** TEST-PIPE-001
365+
- **Status:** implemented
366+
- **Reference:** Rao et al. (2009), Science 324:1165
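
A minimal sketch of how the block entropy computation in REQ-PIPE-001 might look. It assumes that L is the number of distinct symbols in the corpus and that blocks are overlapping windows of length N; neither detail is stated in this requirement. The function name `block_entropies` is hypothetical, while the `raw_nats` and `normalized` field names follow TEST-PIPE-001.

```python
import math
from collections import Counter


def block_entropies(symbols, max_n=6):
    """Return raw (nats) and normalised block entropy for N = 1..max_n.

    Assumption: L is the alphabet size (number of distinct symbols).
    """
    alphabet_size = len(set(symbols))
    # ln(1) == 0, so a one-symbol corpus cannot be normalised.
    log_l = math.log(alphabet_size) if alphabet_size > 1 else None
    results = []
    for n in range(1, max_n + 1):
        # Overlapping blocks of length n (an assumption of this sketch).
        blocks = [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]
        total = len(blocks)
        counts = Counter(blocks)
        raw = 0.0
        if total:
            raw = -sum((c / total) * math.log(c / total) for c in counts.values())
        results.append({
            "n": n,
            "raw_nats": raw,
            "normalized": raw / log_l if log_l else None,
        })
    return results
```

Under these assumptions the normalised value for block size N cannot exceed N, which is consistent with the range check in TEST-PIPE-001.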
367+
368+
### REQ-PIPE-002 — Character frequency pipeline
369+
370+
The system MUST compute symbol frequencies, rank-frequency distribution, and Zipf exponent for any uploaded text corpus.
371+
372+
- **Priority:** P1
373+
- **Platform:** all
374+
- **Testable:** yes
375+
- **Test:** TEST-PIPE-002
376+
- **Status:** implemented
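
The outputs named in REQ-PIPE-002 could be computed roughly as follows. This is a sketch only: it estimates the Zipf exponent as the negated least-squares slope of log(frequency) against log(rank), which is one common choice and not necessarily the method used in the implemented pipeline. The function name `char_frequencies` is hypothetical; the result keys follow TEST-PIPE-002.

```python
import math
from collections import Counter


def char_frequencies(symbols):
    """Count symbols, rank them by frequency, and estimate a Zipf exponent."""
    counts = Counter(symbols)
    ranked = counts.most_common()  # [(symbol, count), ...], highest count first

    zipf_exponent = None
    if len(ranked) >= 2:
        xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
        ys = [math.log(count) for _, count in ranked]
        mean_x = sum(xs) / len(xs)
        mean_y = sum(ys) / len(ys)
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        # Zipf's law: frequency ~ rank^(-s), so s is the negated slope.
        zipf_exponent = -sxy / sxx

    return {
        "total_symbols": sum(counts.values()),
        "unique_symbols": len(counts),
        "frequencies": dict(ranked),  # insertion order is the rank order
        "zipf_exponent": zipf_exponent,
    }
```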
377+
378+
### REQ-PIPE-003 — Pipeline engine
379+
380+
The system MUST process queued jobs asynchronously via a background engine. Jobs MUST transition through pending → running → completed/failed states. Results MUST be stored and retrievable.
381+
382+
- **Priority:** P1
383+
- **Platform:** all
384+
- **Testable:** yes
385+
- **Test:** TEST-PIPE-003
386+
- **Status:** implemented
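
The job lifecycle in REQ-PIPE-003 can be modelled as a small state machine. The sketch below covers only the permitted transitions and result storage; it runs synchronously and leaves out the queue and background worker. The names `Job`, `JobStatus`, and `run_job` are hypothetical and are not taken from the Glossa Lab codebase.

```python
from enum import Enum


class JobStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Transitions permitted by the requirement: pending → running → completed/failed.
ALLOWED = {
    JobStatus.PENDING: {JobStatus.RUNNING},
    JobStatus.RUNNING: {JobStatus.COMPLETED, JobStatus.FAILED},
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: set(),
}


class Job:
    def __init__(self, pipeline, text_id):
        self.pipeline = pipeline
        self.text_id = text_id
        self.status = JobStatus.PENDING
        self.results = None
        self.error = None

    def transition(self, new_status):
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        self.status = new_status


def run_job(job, pipeline_fn, text):
    """Run one job to a terminal state, storing results or the error message."""
    job.transition(JobStatus.RUNNING)
    try:
        job.results = pipeline_fn(text)
        job.transition(JobStatus.COMPLETED)
    except Exception as exc:  # any pipeline error marks the job as failed
        job.error = str(exc)
        job.transition(JobStatus.FAILED)
    return job
```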
387+
388+
---
389+
390+
## Kandles Phonetic-Visual Analysis
391+
392+
Components: `KDL` — Kandles system (per US 2024/0248922 A1, Merkur)
393+
394+
### REQ-KDL-001 — Kandles phonetic mapping
395+
396+
The system MUST implement the Kandles phonetic-to-color mapping: 7 consonant sound groups mapped to 7 colors (Yellow, Grey, Red, Blue, Green, Purple, Brown). Vowel-initial words MUST be mapped to a distinct group (group 0).
397+
398+
- **Priority:** P1
399+
- **Platform:** all
400+
- **Testable:** yes
401+
- **Test:** TEST-KDL-001
402+
- **Status:** draft
403+
- **Patent:** US 2024/0248922 A1 [0109]-[0110]
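
As a rough illustration of REQ-KDL-001, the sketch below maps a word to a Kandles group from its leading letters. The group/color pairing follows the order used in TEST-KDL-001. Matching on spelling instead of actual phonetics is a simplifying assumption (for example, treating every initial "c" as a K sound), so words such as "city" or "knight" would be misclassified. The names `KANDLES_GROUPS`, `kandles_group`, and `kandles_color` are hypothetical.

```python
# Group number -> (color, spelled onsets). Digraphs are listed with single letters.
KANDLES_GROUPS = {
    1: ("Yellow", ("ch", "k", "g", "j", "c")),  # "c" as K is an assumption
    2: ("Grey", ("m", "n")),
    3: ("Red", ("th", "t", "d")),
    4: ("Blue", ("r", "l")),
    5: ("Green", ("kh", "y", "w", "h")),
    6: ("Purple", ("p", "b", "f", "v")),
    7: ("Brown", ("sh", "s", "z")),
}
VOWELS = "aeiou"


def kandles_group(word):
    """Return the group (0 for vowel-initial, 1-7 otherwise) or None if unmapped."""
    w = word.lower()
    if not w:
        return None
    if w[0] in VOWELS:
        return 0
    # Try two-letter onsets first so "th", "sh", "ch", "kh" win over "t", "s", "c", "k".
    for length in (2, 1):
        prefix = w[:length]
        for group, (_, onsets) in KANDLES_GROUPS.items():
            if prefix in onsets:
                return group
    return None


def kandles_color(word):
    """Return the color name for a word, or None for group 0 and unmapped words."""
    group = kandles_group(word)
    return KANDLES_GROUPS[group][0] if group in KANDLES_GROUPS else None
```

With this table, `[kandles_color(w) for w in "The cat sat".split()]` yields Red, Yellow, Brown, matching the expectations listed in TEST-KDL-002.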
404+
405+
### REQ-KDL-002 — Kandles color-coded text
406+
407+
The system MUST generate color-coded text output where each word is assigned a color based on the phonetic sound at the beginning of the word, per the Kandles mapping.
408+
409+
- **Priority:** P1
410+
- **Platform:** all
411+
- **Testable:** yes
412+
- **Test:** TEST-KDL-002
413+
- **Status:** draft
414+
- **Patent:** US 2024/0248922 A1 [0007], [0117]
415+
416+
### REQ-KDL-003 — Kandles color grid
417+
418+
The system MUST generate a color-coded grid (equal rows and columns) from any text, where each cell corresponds to a word and is colored by the Kandles system. The grid MUST also encode the Kandles number (1-7).
419+
420+
- **Priority:** P1
421+
- **Platform:** all
422+
- **Testable:** yes
423+
- **Test:** TEST-KDL-003
424+
- **Status:** draft
425+
- **Patent:** US 2024/0248922 A1 [0124]-[0125], FIG. 29 step 2916
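
One way to lay words out in the square grid described by REQ-KDL-003 is sketched below. It takes cells that have already been classified, so it is independent of any particular mapping code. Padding the last row with `None` when the word count is not a perfect square is an assumption of this sketch; the requirement does not say how that case is handled. The function name `kandles_grid` is hypothetical.

```python
import math


def kandles_grid(cells):
    """Arrange cells row by row into the smallest square grid that fits them.

    `cells` is a list of already-classified entries, for example
    (word, kandles_number, color) tuples. Unused trailing positions are None.
    """
    n = len(cells)
    side = math.isqrt(n)
    if side * side < n:
        side += 1  # round up to the next square
    padded = list(cells) + [None] * (side * side - n)
    return [padded[row * side:(row + 1) * side] for row in range(side)]
```

A 36-word input therefore yields a 6x6 grid, as TEST-KDL-003 expects, while a 10-word input would yield a 4x4 grid with six empty cells.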
426+
427+
### REQ-KDL-004 — Cross-language Kandles comparison
428+
429+
The system MUST be able to generate Kandles grids for texts in different languages/scripts and compare the resulting color patterns. The comparison MUST produce a similarity metric.
430+
431+
- **Priority:** P2
432+
- **Platform:** all
433+
- **Testable:** yes
434+
- **Test:** TEST-KDL-004
435+
- **Status:** draft
436+
- **Patent:** US 2024/0248922 A1 [0110], FIG. 20
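
REQ-KDL-004 does not define the similarity metric. As one possible choice, the sketch below compares the distribution of Kandles groups in two texts using histogram overlap, which always lies in [0, 1] and works for texts of different lengths. It ignores word position, so a position-sensitive metric may be preferable if grid layout matters. The function name `kandles_similarity` is hypothetical.

```python
from collections import Counter


def kandles_similarity(groups_a, groups_b):
    """Histogram overlap of two sequences of Kandles group numbers.

    Returns 1.0 when both texts have identical group proportions and 0.0
    when they share no group. Empty input yields 0.0.
    """
    if not groups_a or not groups_b:
        return 0.0
    counts_a, counts_b = Counter(groups_a), Counter(groups_b)
    total_a, total_b = len(groups_a), len(groups_b)
    shared = set(counts_a) | set(counts_b)
    # Counter returns 0 for groups absent from one side.
    return sum(min(counts_a[g] / total_a, counts_b[g] / total_b) for g in shared)
```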
437+
438+
---
439+
440+
## Hierarchical Text Decomposition
441+
442+
Components: `HTD` — Hierarchical text decomposition (per US 2024/0248922 A1, Merkur)
443+
444+
### REQ-HTD-001 — Text decomposition into stories and slices
445+
446+
The system MUST support organizing a written work into one or more stories, where each story consists of one or more slices. Each slice MUST be independently addressable.
447+
448+
- **Priority:** P2
449+
- **Platform:** all
450+
- **Testable:** yes
451+
- **Test:** TEST-HTD-001
452+
- **Status:** draft
453+
- **Patent:** US 2024/0248922 A1 [0072], [0095], FIG. 29 steps 2902-2904
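
The work → story → slice hierarchy in REQ-HTD-001 could be represented with a few dataclasses. In this sketch the caller supplies the slice boundaries, and each slice gets a random hexadecimal ID so that it can be addressed on its own. The class and method names (`WrittenWork`, `Story`, `Slice`, `add_story`, `get_slice`) are hypothetical.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Slice:
    text: str
    # Unique ID so each slice is independently addressable.
    slice_id: str = field(default_factory=lambda: uuid.uuid4().hex)


@dataclass
class Story:
    title: str
    slices: list = field(default_factory=list)


@dataclass
class WrittenWork:
    title: str
    stories: list = field(default_factory=list)

    def add_story(self, title, slice_texts):
        """Add a story made of one or more slices and return it."""
        if not slice_texts:
            raise ValueError("a story needs at least one slice")
        story = Story(title, [Slice(text) for text in slice_texts])
        self.stories.append(story)
        return story

    def get_slice(self, slice_id):
        """Look up a single slice by ID across all stories."""
        for story in self.stories:
            for piece in story.slices:
                if piece.slice_id == slice_id:
                    return piece
        raise KeyError(slice_id)
```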
454+
455+
### REQ-HTD-002 — Slice filtering by clusters and tags
456+
457+
The system MUST support filtering slices by user-selected semantic clusters and/or manual tags. Multiple clusters MUST be combinable (AND/OR).
458+
459+
- **Priority:** P2
460+
- **Platform:** all
461+
- **Testable:** yes
462+
- **Test:** TEST-HTD-002
463+
- **Status:** draft
464+
- **Patent:** US 2024/0248922 A1 [0095]-[0098], FIG. 29 step 2906
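
The AND/OR combination in REQ-HTD-002 maps naturally onto set operations. The sketch below assumes slices are given as a mapping from slice ID to the set of clusters and tags attached to that slice; this representation and the function name `filter_slices` are hypothetical.

```python
def filter_slices(slice_tags, selected, mode="OR"):
    """Return IDs of slices matching the selected clusters/tags.

    mode="AND": the slice must carry every selected label (subset test).
    mode="OR":  the slice must carry at least one selected label.
    """
    wanted = set(selected)
    if mode == "AND":
        matches = lambda labels: wanted <= labels
    elif mode == "OR":
        matches = lambda labels: bool(wanted & labels)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return [sid for sid, labels in slice_tags.items() if matches(set(labels))]
```

Note that with an empty selection this sketch returns every slice in AND mode and none in OR mode; a real implementation may want to define that case explicitly.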
465+
466+
---
467+
468+
## Semantic Cluster Tagging
469+
470+
Components: `SEM` — Semantic analysis
471+
472+
### REQ-SEM-001 — Configurable semantic taxonomy
473+
474+
The system MUST support a configurable taxonomy of semantic clusters. Default clusters MUST include at least: Culture, Nations, Nature, Religion, People, and Spiritual.
475+
476+
- **Priority:** P2
477+
- **Platform:** all
478+
- **Testable:** yes
479+
- **Test:** TEST-SEM-001
480+
- **Status:** draft
481+
- **Patent:** US 2024/0248922 A1 [0010], [0080]
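
A configurable taxonomy with the six default clusters from REQ-SEM-001 might be sketched as follows. The `Taxonomy` class and its methods are hypothetical. The sketch treats the defaults as always present and lets callers add further clusters, which is one reading of "configurable".

```python
DEFAULT_CLUSTERS = ("Culture", "Nations", "Nature", "Religion", "People", "Spiritual")


class Taxonomy:
    """Ordered set of semantic cluster names, seeded with the defaults."""

    def __init__(self, extra=()):
        self._clusters = list(DEFAULT_CLUSTERS)
        for name in extra:
            self.add(name)

    def add(self, name):
        # Ignore duplicates so the cluster list stays unique.
        if name not in self._clusters:
            self._clusters.append(name)

    def clusters(self):
        return tuple(self._clusters)
```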
482+
483+
### REQ-SEM-002 — Manual tagging
484+
485+
The system MUST support manual tagging of text segments with user-defined labels.
486+
487+
- **Priority:** P3
488+
- **Platform:** all
489+
- **Testable:** yes
490+
- **Test:** TEST-SEM-002
491+
- **Status:** draft
492+
- **Patent:** US 2024/0248922 A1 [0011], [0104]

docs/TEST_SPEC.md

Lines changed: 213 additions & 0 deletions
@@ -628,3 +628,216 @@ Test cases for Glossa Lab, linked to requirements in `docs/REQUIREMENTS.md`.
628628
**Expected result:** Versions match exactly
629629
**Pass criteria:** Versions are identical strings
630630
**Fail criteria:** Version mismatch
631+
632+
---
633+
634+
## Analysis Pipelines
635+
636+
### TEST-PIPE-001 — Block entropy pipeline produces valid output
637+
638+
**Requirement:** REQ-PIPE-001
639+
**Type:** smoke
640+
**Platform:** all
641+
**Automated:** yes (test_study_synthetic.py, test_study_rao2009.py)
642+
643+
**Steps:**
644+
1. Submit a text corpus to the block_entropy pipeline
645+
2. Verify result contains block_entropies array with N=1..6
646+
3. Verify each entry has raw_nats and normalized fields
647+
4. Verify normalized values are in the plausible range [0, max_n], where max_n is the largest block size (6)
648+
649+
**Expected result:** Valid block entropy results
650+
**Pass criteria:** All fields present, values in range
651+
**Fail criteria:** Missing fields or out-of-range values
652+
653+
### TEST-PIPE-002 — Character frequency pipeline produces valid output
654+
655+
**Requirement:** REQ-PIPE-002
656+
**Type:** smoke
657+
**Platform:** all
658+
**Automated:** planned
659+
660+
**Steps:**
661+
1. Submit a text corpus to the char_freq pipeline
662+
2. Verify result contains total_symbols, unique_symbols, frequencies, zipf_exponent
663+
3. Verify frequencies sum to total_symbols
664+
665+
**Expected result:** Valid frequency results
666+
**Pass criteria:** All fields present, frequencies consistent
667+
**Fail criteria:** Missing fields or inconsistent counts
668+
669+
### TEST-PIPE-003 — Pipeline engine processes jobs
670+
671+
**Requirement:** REQ-PIPE-003
672+
**Type:** smoke
673+
**Platform:** all
674+
**Automated:** yes (test_jobs.py)
675+
676+
**Steps:**
677+
1. Create a job with pipeline="block_entropy" and valid text_id
678+
2. Wait for engine to process
679+
3. Verify job status transitions to completed
680+
4. Verify results are retrievable via GET /api/v1/jobs/{id}/results
681+
682+
**Expected result:** Job processed, results stored
683+
**Pass criteria:** Job completed, results accessible
684+
**Fail criteria:** Job stuck in pending/running, or no results
685+
686+
---
687+
688+
## Kandles Phonetic-Visual Analysis
689+
690+
### TEST-KDL-001 — Kandles phonetic mapping is correct
691+
692+
**Requirement:** REQ-KDL-001
693+
**Type:** unit
694+
**Platform:** all
695+
**Automated:** planned
696+
**Patent:** US 2024/0248922 A1
697+
698+
**Steps:**
699+
1. Map the word "cat" → expect group 1 (K/G/J/Ch), color Yellow
700+
2. Map the word "moon" → expect group 2 (M/N), color Grey
701+
3. Map the word "tree" → expect group 3 (T/D/Th), color Red
702+
4. Map the word "river" → expect group 4 (R/L), color Blue
703+
5. Map the word "water" → expect group 5 (Y/W/H/Kh), color Green
704+
6. Map the word "fire" → expect group 6 (P/B/F/V), color Purple
705+
7. Map the word "sun" → expect group 7 (S/Z/Sh), color Brown
706+
8. Map the word "apple" → expect group 0 (vowel-initial)
707+
708+
**Expected result:** Each word maps to the correct Kandles group
709+
**Pass criteria:** All 8 mappings correct
710+
**Fail criteria:** Any mapping incorrect
711+
712+
### TEST-KDL-002 — Kandles color-coded text output
713+
714+
**Requirement:** REQ-KDL-002
715+
**Type:** unit
716+
**Platform:** all
717+
**Automated:** planned
718+
**Patent:** US 2024/0248922 A1
719+
720+
**Steps:**
721+
1. Input: "The cat sat on the mat"
722+
2. Generate Kandles color-coded output
723+
3. Verify "The" → Red (T group), "cat" → Yellow (K group), "sat" → Brown (S group), etc.
724+
4. Verify output includes both color name and hex code
725+
726+
**Expected result:** Each word correctly color-coded
727+
**Pass criteria:** All words have correct color assignments
728+
**Fail criteria:** Any word miscolored
729+
730+
### TEST-KDL-003 — Kandles grid generation
731+
732+
**Requirement:** REQ-KDL-003
733+
**Type:** unit
734+
**Platform:** all
735+
**Automated:** planned
736+
**Patent:** US 2024/0248922 A1
737+
738+
**Steps:**
739+
1. Input: a text of 36 words
740+
2. Generate Kandles grid
741+
3. Verify grid is 6x6 (equal rows and columns)
742+
4. Verify each cell has color, number (1-7), and original word
743+
5. Verify grid matches expected Kandles mapping for each word
744+
745+
**Expected result:** Valid Kandles grid with correct dimensions and coloring
746+
**Pass criteria:** Grid dimensions correct, all cells properly mapped
747+
**Fail criteria:** Wrong dimensions or incorrect color assignments
748+
749+
### TEST-KDL-004 — Cross-language Kandles comparison
750+
751+
**Requirement:** REQ-KDL-004
752+
**Type:** integration
753+
**Platform:** all
754+
**Automated:** planned
755+
**Patent:** US 2024/0248922 A1
756+
757+
**Steps:**
758+
1. Generate Kandles grid for an English text
759+
2. Generate Kandles grid for a transliterated Tamil text
760+
3. Compute similarity metric between the two grids
761+
4. Verify similarity metric is a number in [0, 1]
762+
763+
**Expected result:** Valid cross-language comparison with similarity score
764+
**Pass criteria:** Similarity metric computed, in valid range
765+
**Fail criteria:** Comparison fails or metric out of range
766+
767+
---
768+
769+
## Hierarchical Text Decomposition
770+
771+
### TEST-HTD-001 — Text decomposition into stories and slices
772+
773+
**Requirement:** REQ-HTD-001
774+
**Type:** unit
775+
**Platform:** all
776+
**Automated:** planned
777+
**Patent:** US 2024/0248922 A1
778+
779+
**Steps:**
780+
1. Upload a multi-section text
781+
2. Decompose into stories and slices
782+
3. Verify each slice is independently addressable (has unique ID)
783+
4. Verify slices can be retrieved individually
784+
785+
**Expected result:** Text decomposed into navigable hierarchy
786+
**Pass criteria:** All slices addressable and retrievable
787+
**Fail criteria:** Slices not independently accessible
788+
789+
### TEST-HTD-002 — Slice filtering by clusters and tags
790+
791+
**Requirement:** REQ-HTD-002
792+
**Type:** unit
793+
**Platform:** all
794+
**Automated:** planned
795+
**Patent:** US 2024/0248922 A1
796+
797+
**Steps:**
798+
1. Create slices with different cluster tags
799+
2. Filter by a single cluster → verify correct subset returned
800+
3. Filter by multiple clusters (AND) → verify intersection
801+
4. Filter by multiple clusters (OR) → verify union
802+
803+
**Expected result:** Filtering returns correct subsets
804+
**Pass criteria:** All filter operations return expected slices
805+
**Fail criteria:** Incorrect filtering results
806+
807+
---
808+
809+
## Semantic Cluster Tagging
810+
811+
### TEST-SEM-001 — Default semantic taxonomy exists
812+
813+
**Requirement:** REQ-SEM-001
814+
**Type:** smoke
815+
**Platform:** all
816+
**Automated:** planned
817+
**Patent:** US 2024/0248922 A1
818+
819+
**Steps:**
820+
1. Query the system for available semantic clusters
821+
2. Verify at least Culture, Nations, Nature, Religion, People, Spiritual are present
822+
823+
**Expected result:** Default taxonomy available
824+
**Pass criteria:** All 6 default clusters present
825+
**Fail criteria:** Any default cluster missing
826+
827+
### TEST-SEM-002 — Manual tagging
828+
829+
**Requirement:** REQ-SEM-002
830+
**Type:** unit
831+
**Platform:** all
832+
**Automated:** planned
833+
**Patent:** US 2024/0248922 A1
834+
835+
**Steps:**
836+
1. Upload a text segment
837+
2. Apply a manual tag "test-label"
838+
3. Retrieve the segment
839+
4. Verify the tag is present
840+
841+
**Expected result:** Manual tag persisted and retrievable
842+
**Pass criteria:** Tag stored and returned correctly
843+
**Fail criteria:** Tag lost or incorrect
