ConceptSketch

A high-performance corpus-based collocation analysis tool built on BlackLab corpus search software (which relies on Apache Lucene). This project implements word and dependency sketch functionality (grammatical relations and collocations), semantic field exploration, and conceptual mining for corpus linguistics research and NLP applications.

Features

Fast Collocation Analysis: O(1) instant lookup with precomputed collocations
BCQL Grammar: 40+ grammatical relations defined as BCQL numbered-label patterns (1:, 2:), covering surface patterns and dependency relations
logDice Scoring: Association strength metric (0-14 scale)
Dependency Sketches: 20 dependency-based relations (nsubj, obj, amod, obl, conj, etc.) leveraging CoNLL-U dependency parses
Concordance Examples: View real corpus sentences for any word pair with highlighting
REST API: HTTP server with 14+ endpoints for sketches, semantic field exploration, concordance, and visualization
Web Interface: Interactive Semantic Field Explorer with D3.js visualization
Multi-Seed Exploration: Explore semantic fields using multiple seed words

Quick Start (5 minutes)

Prerequisites

Java 17+ (Java 21+ recommended)
Maven 3.6+
Python 3 (for web server)

1. Build

mvn clean package

Corpus Data for Testing

You can test ConceptSketch by downloading this indexed and tagged corpus:

Frontiers in Psychology Corpus, https://doi.org/10.18150/4LJ9WD

It is sufficiently large to provide interesting insights about the language of contemporary psychology (2010-2021, before the advent of AI-generated papers).

2. Create an Index

Step 1 — Prepare a CoNLL-U corpus

Tag your text with any CoNLL-U-producing tool. The project includes a Stanza GPU script for efficient tagging:

Option A: Use the Stanza script (recommended)

# Download model (one-time)
python tag_with_stanza.py --download --lang en

# Tag corpus (uses GPU automatically if available)
python tag_with_stanza.py \
  --input corpus.txt \
  --output corpus.conllu \
  --lang en

For GPU tuning and more options, see STANZA_GPU.md.

Option B: Use UDPipe 2 directly

udpipe --tokenize --tag --parse --output=conllu english.udpipe corpus.txt > corpus.conllu

Option C: Use another CoNLL-U tagger (Stanza in Python without GPU, spaCy, etc.)

Step 2 — Preprocess: add `<s>` sentence markers

BlackLab's tabular parser requires explicit inline tags for sentence boundaries. The project ships a script that converts CoNLL-U blank-line sentence boundaries into <s> / </s> inline tags:

python scripts/conllu_to_wpl.py corpus.conllu corpus_s.conllu

Move the output file into a dedicated input directory:

mkdir input_dir
mv corpus_s.conllu input_dir/

Step 3 — Index with BlackLab

The shaded JAR bundles BlackLab's IndexTool. Run it from the project root (so --format-dir . can find conllu-sentences.blf.yaml):

java -cp target/concept-sketch-1.6.0-shaded.jar \
  nl.inl.blacklab.tools.IndexTool create \
  --format-dir . \
  my_index/ input_dir/ conllu-sentences

Argument	Meaning
`--format-dir .`	Directory containing `conllu-sentences.blf.yaml`
`my_index/`	Output index directory (created automatically)
`input_dir/`	Directory with preprocessed `.conllu` files
`conllu-sentences`	Format name (matches the `.blf.yaml` filename)

3. Start API Server

# Terminal 1
java -jar target/concept-sketch-1.6.0-shaded.jar server --index my_index/ --port 8080

CORS configuration: By default the API allows requests from http://localhost:3000. To allow a different origin, pass the cors.allow.origin JVM system property:
java -Dcors.allow.origin=https://myapp.example.com \
     -jar target/concept-sketch-1.6.0-shaded.jar server --index my_index/ --port 8080

Server startup output:

API server started on port 8080
Endpoints:
  GET  /health
  GET  /api/sketch/{lemma}
  GET  /api/sketch/{lemma}/{relation}
  GET  /api/sketch/{lemma}/dep
  GET  /api/sketch/{lemma}/dep/{deprel}
  GET  /api/relations
  GET  /api/relations/dep
  GET  /api/semantic-field/explore
  GET  /api/semantic-field/explore-multi
  GET  /api/semantic-field/compare
  GET  /api/semantic-field/examples
  GET  /api/concordance/examples
  POST /api/visual/radial
  POST /api/bcql

4. Start Web Interface

# Terminal 2
python -m http.server 3000 --directory webapp

Open browser to: http://localhost:3000

The web interface can draw radial plots of collocates and offers semantic field exploration features.

5. Try a Query

# Get the full word sketch for "house" (all relations, including adjectival modifiers)
curl "http://localhost:8080/api/sketch/house"

# Get example sentences for "house" + "big"
curl "http://localhost:8080/api/concordance/examples?seed=house&collocate=big&top=5"

# Explore semantic field from "theory" (noun_adj_predicates, alias "adj_predicate")
curl "http://localhost:8080/api/semantic-field/explore?seed=theory&relation=adj_predicate"

# Multi-seed exploration
curl "http://localhost:8080/api/semantic-field/explore-multi?seeds=theory,model,hypothesis&top=10"
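
The same requests can be issued from a script. The sketch below is a minimal Python client, assuming the API server from step 3 is listening on localhost:8080. The helper names build_url and get_json are illustrative and are not part of the project.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8080"  # assumption: server started as in step 3


def build_url(path, **params):
    """Join BASE_URL, a percent-encoded path, and optional query parameters."""
    url = BASE_URL + quote(path)
    if params:
        url += "?" + urlencode(params)
    return url


def get_json(path, **params):
    """Send a GET request and decode the JSON body (needs a running server)."""
    with urlopen(build_url(path, **params)) as response:
        return json.load(response)
```

With the server running, get_json("/api/sketch/house") should fetch the same word sketch as the first curl command, and get_json("/api/concordance/examples", seed="house", collocate="big", top=5) the same examples as the second.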

Core Usage

Index a Corpus

Prerequisites

A corpus in CoNLL-U format (columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC)
The project's conllu-sentences.blf.yaml format file (in the project root)
Java 17+ (Java 21+ recommended) and the shaded JAR (target/concept-sketch-1.6.0-shaded.jar)

Step 1 — Preprocess CoNLL-U: add sentence markers

BlackLab's tabular parser needs explicit <s> / </s> inline tags to index sentence spans. The bundled script converts CoNLL-U blank-line boundaries:

python scripts/conllu_to_wpl.py corpus.conllu corpus_s.conllu

What the script does:

Skips comment lines (#), multi-word token range lines (e.g., 1-2), and empty-node lines (e.g., 1.1, …)
Emits <s> before the first token of each sentence
Emits </s> after the last token
Preserves all 10 CoNLL-U columns as tab-separated values
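
The steps listed above can be sketched in a few lines of Python. This is not the bundled scripts/conllu_to_wpl.py, only an illustration of the described behaviour. The function name add_sentence_markers is hypothetical, and placing each marker on its own line is an assumption.

```python
def add_sentence_markers(lines):
    """Yield token lines wrapped in <s> ... </s>, one pair per blank-line-delimited sentence."""
    in_sentence = False
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Blank line: close the sentence that is currently open, if any.
            if in_sentence:
                yield "</s>"
                in_sentence = False
            continue
        if line.startswith("#"):
            continue  # comment / metadata line
        columns = line.split("\t")
        token_id = columns[0]
        if "-" in token_id or "." in token_id:
            continue  # multi-word token range (1-2) or empty node (1.1)
        if not in_sentence:
            yield "<s>"
            in_sentence = True
        yield "\t".join(columns)  # all columns preserved, tab-separated
    if in_sentence:
        yield "</s>"  # input did not end with a blank line
```

Passing an open file object works because the function only iterates over lines, so large corpora are processed as a stream without being loaded into memory.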

Step 2 — Create a BlackLab index

mkdir input_dir
cp corpus_s.conllu input_dir/

# Run from the project root (so --format-dir finds conllu-sentences.blf.yaml)
java -cp target/concept-sketch-1.6.0-shaded.jar \
  nl.inl.blacklab.tools.IndexTool create \
  --format-dir . \
  my_index/ input_dir/ conllu-sentences

To add more documents to an existing index later:

java -cp target/concept-sketch-1.6.0-shaded.jar \
  nl.inl.blacklab.tools.IndexTool add \
  --format-dir . \
  my_index/ more_input_dir/ conllu-sentences

Indexed annotations

Annotation	Source column	Forward index
`word`	FORM (col 2)	✓
`lemma`	LEMMA (col 3)	✓
`pos`	UPOS (col 4)	✓
`xpos`	XPOS (col 5)	✓
`deprel`	DEPREL (col 8)	✓
`wordnum`	ID (col 1)	—
`feats`	FEATS (col 6)	—
`head`	HEAD (col 7)	—

Query via Command Line

# Find all collocations for "theory"
java -jar target/concept-sketch-1.6.0-shaded.jar \
  blacklab-query --index my_index/ --lemma theory

# Find adjectival modifiers of "theory" (deprel=amod)
java -jar target/concept-sketch-1.6.0-shaded.jar \
  blacklab-query --index my_index/ --lemma theory --deprel amod

# Increase result count and filter by logDice
java -jar target/concept-sketch-1.6.0-shaded.jar \
  blacklab-query --index my_index/ --lemma theory \
  --deprel nsubj --limit 50 --min-logdice 4.0

Grammar Configuration

The grammar configuration is externalized in JSON. Relations use BCQL numbered-label patterns where 1: marks the head word and 2: marks the collocate.

Config file: grammars/relations.json (version 2.0)

{
  "version": "2.0",
  "description": "BCQL grammar — positions derived from numbered labels (1: = head, 2: = collocate)",
  "bcql": true,
  "relations": [
    {
      "id": "noun_adj_predicates",
      "name": "Adjectives (predicative)",
      "description": "Adjective predicates with copula (e.g., 'hypothesis is valid')",
      "pattern": "1:[xpos=\"NN.*\"] [lemma=\"be|appear|seem|...\"] 2:[xpos=\"JJ.*\"]",
      "relation_type": "SURFACE",
      "dual": false
    },
    {
      "id": "noun_modifiers",
      "name": "Modifiers (adjectives)",
      "description": "Adjectives modifying nouns (e.g., 'big house')",
      "pattern": "2:[xpos=\"JJ.*\"] 1:[xpos=\"NN.*\"]",
      "relation_type": "SURFACE",
      "dual": false
    },
    {
      "id": "dep_nsubj",
      "name": "Dependency: nominal subject",
      "description": "Verb with its nominal subject (e.g., 'theory explains')",
      "pattern": "2:[xpos=\"NN.*\" & deprel=\"nsubj\"] 1:[xpos=\"VB.*\"]",
      "relation_type": "DEP"
    },
    ...
  ]
}

Fields:

Field	Description
`id`	Unique relation identifier (used in API queries)
`name`	Human-readable display name
`description`	Natural-language explanation of the relation
`pattern`	BCQL pattern with `1:` (head) and `2:` (collocate) positional labels
`relation_type`	`SURFACE` or `DEP` (dependency-based)
`dual`	(optional) `true` for head/collocate-symmetric relations

Pattern syntax:

1:[xpos="NN.*"] — head word must be a noun (XPOS tag)
2:[xpos="JJ.*"] — collocate must be an adjective
[lemma="be|appear|..."] — intervening copula (positional label omitted = not counted as head or collocate)
[xpos="NN.*" & deprel="nsubj"] — constraints combined with &
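
To make the role of the numbered labels concrete, the following Python sketch derives the token position of the head (1:) and the collocate (2:) from a pattern string. It is an illustration only and not the project's parser. The regular expression is a simplification that assumes no closing bracket appears inside an attribute value, and label_positions is a hypothetical name.

```python
import re

# One BCQL token: an optional numeric label ("1:") followed by a bracketed constraint.
# Simplification: assumes no "]" inside attribute values.
TOKEN_RE = re.compile(r'(?:(\d+):)?\[[^\]]*\]')


def label_positions(pattern):
    """Map each numeric label to the zero-based index of its token in the pattern."""
    positions = {}
    for index, match in enumerate(TOKEN_RE.finditer(pattern)):
        label = match.group(1)
        if label is not None:
            positions[int(label)] = index
    return positions


copula = '1:[xpos="NN.*"] [lemma="be|appear|seem"] 2:[xpos="JJ.*"]'
modifier = '2:[xpos="JJ.*"] 1:[xpos="NN.*"]'
print(label_positions(copula))    # head at token 0, collocate at token 2
print(label_positions(modifier))  # collocate at token 0, head at token 1
```

The unlabeled copula token still occupies a position, which is why the collocate in the first pattern sits at index 2 and not 1.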

API endpoint:

To view active relations, use GET /api/relations (surface) and GET /api/relations/dep (dependency).

To modify relations or add new ones, edit grammars/relations.json and restart the server.

REST API Endpoints

Health Check

curl http://localhost:8080/health

Get Word Sketch

curl "http://localhost:8080/api/sketch/house"

To filter a full sketch to relations whose head is a specific POS group, use query parameters:

# Only show relations where the head is a verb
curl "http://localhost:8080/api/sketch/theory?head_pos=verb"

Accepted values: noun, verb, adj, adv.

Response:

{
  "status": "ok",
  "lemma": "house",
  "patterns": {
    "noun_modifiers": {
      "name": "Modifiers (adjectives)",
      "cql": "2:[xpos=\"JJ.*\"] 1:[xpos=\"NN.*\"]",
      "total_matches": 3421,
      "collocations": [
        {
          "lemma": "big",
          "frequency": 287,
          "logDice": 11.24,
          "relativeFrequency": 0.084
        }
      ]
    }
  }
}
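
A client will usually want a ranked list from this response. The sketch below, in Python, extracts the collocates of one relation and sorts them by logDice. The dictionary is an abbreviated copy of the sample response above, and top_collocates and its min_log_dice parameter are illustrative names, not part of the API.

```python
# Abbreviated copy of the sample response shown above.
SAMPLE_SKETCH = {
    "status": "ok",
    "lemma": "house",
    "patterns": {
        "noun_modifiers": {
            "name": "Modifiers (adjectives)",
            "total_matches": 3421,
            "collocations": [
                {"lemma": "big", "frequency": 287, "logDice": 11.24,
                 "relativeFrequency": 0.084},
            ],
        }
    },
}


def top_collocates(sketch, relation_id, min_log_dice=0.0):
    """Return (lemma, logDice, frequency) tuples for one relation, strongest first."""
    relation = sketch.get("patterns", {}).get(relation_id, {})
    rows = [
        (c["lemma"], c["logDice"], c["frequency"])
        for c in relation.get("collocations", [])
        if c["logDice"] >= min_log_dice
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


for lemma, score, freq in top_collocates(SAMPLE_SKETCH, "noun_modifiers"):
    print(f"{lemma}\t{score}\t{freq}")
```

An unknown relation ID yields an empty list instead of raising, which keeps the loop safe when a lemma has no matches for a relation.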

Single-Seed Semantic Field Exploration

curl "http://localhost:8080/api/semantic-field/explore?seed=theory&relation=adj_predicate&top=15&min_logdice=2"

Common relation IDs for noun-head exploration:

Relation ID	Pattern	Example
`noun_adj_predicates`	"X is ADJ" (copula)	"theory is correct"
`noun_modifiers`	"ADJ X"	"correct theory"
`subject_of`	"X VERB" (strict local)	"theory suggests"
`noun_verbs`	"X ... VERB" (looser window)	verbs near "theory"
`object_of`	"VERB X" (strict local)	"develop theory"
`noun_compounds`	"X NOUN"	"theory development"
`noun_prepositions`	"X PREP"	"theory of"

Any relation from GET /api/relations can be used. For dependency-based relations (e.g., dep_amod, dep_nsubj), use GET /api/sketch/{lemma}/dep/{deprel} instead.

Response:

{
  "status": "ok",
  "seed": "theory",
  "seed_collocates": [
    {"word": "correct", "log_dice": 4.21, "frequency": 142},
    {"word": "practical", "log_dice": 3.73, "frequency": 98}
  ],
  "core_collocates": [...],
  "discovered_nouns": [
    {
      "word": "development",
      "shared_count": 5,
      "shared_collocates": ["correct", "practical", "quantum"]
    }
  ]
}

Multi-Seed Semantic Field Exploration

curl "http://localhost:8080/api/semantic-field/explore-multi?seeds=theory,model,hypothesis&relation=adj_predicate&top=10"

Response:

{
  "status": "ok",
  "seeds": ["theory", "model", "hypothesis"],
  "seed_collocates": [
    {"word": "correct", "log_dice": 4.21, "frequency": 142}
  ],
  "seed_collocates_count": 23,
  "core_collocates": [],
  "common_collocates": [],
  "common_collocates_count": 0,
  "discovered_nouns": ["theory", "model", "hypothesis"],
  "edges": [
    {"source": "theory", "target": "correct", "log_dice": 4.21, "type": "SURFACE"}
  ]
}

Note: All seed_collocates items have the same shape {word, log_dice, frequency} across both endpoints.
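
The edges array is the part of this response that a graph view would consume. As a rough Python illustration, the sketch below groups edges by source seed and orders each seed's neighbours by log_dice. The function name edges_to_adjacency is hypothetical, and only the field names shown in the sample response are assumed.

```python
from collections import defaultdict


def edges_to_adjacency(response):
    """Group edges by source: {seed: [(collocate, log_dice), ...]}, strongest first."""
    adjacency = defaultdict(list)
    for edge in response.get("edges", []):
        adjacency[edge["source"]].append((edge["target"], edge["log_dice"]))
    for neighbours in adjacency.values():
        neighbours.sort(key=lambda pair: pair[1], reverse=True)
    return dict(adjacency)


sample = {
    "edges": [
        {"source": "theory", "target": "correct", "log_dice": 4.21, "type": "SURFACE"}
    ]
}
print(edges_to_adjacency(sample))
```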

Concordance Examples for Word Pairs

curl "http://localhost:8080/api/concordance/examples?seed=house&collocate=big&top=10"

Get actual example sentences from the corpus containing both words (lemmas). This feature validates collocations by showing real usage contexts.

How It Works:

Uses SpanNearQuery to efficiently find sentences where both lemmas appear within 10 words
Decodes token data (word, lemma, tag, position) from BinaryDocValues (tokens field)
Generates HTML with <mark> tags highlighting both target words
Returns sentence text, highlighted HTML, and position arrays

Technical Details:

The HYBRID index stores tokens as BinaryDocValues, decoded via TokenSequenceCodec
Lemma field is indexed with positions, enabling fast SpanQueries
No need to store lemma/word/tag as separate StoredFields - DocValues provide O(1) lookup
Query complexity: O(log N) for SpanQuery + O(k) for decoding k matching documents

Parameters:

seed (required) - Headword (lemma)
collocate (required) - Collocate word (lemma)
top (optional) - Number of examples to return (default: 10)
relation (optional) - Grammatical relation ID (default: noun_adj_predicates)

Response:

{
  "status": "ok",
  "seed": "house",
  "collocate": "big",
  "relation": "noun_adj_predicates",
  "top": 10,
  "total_results": 3,
  "examples": [
    {
      "sentence": "The big house! - The big house.",
      "raw": "The big house ! - The big house ."
    },
    {
      "sentence": "Houses Big and beautiful house with 4 bedrooms Houses big...",
      "raw": "Houses Big and beautiful house with 4 bedrooms Houses big ..."
    }
  ]
}

Response Fields:

sentence - Raw sentence text from the corpus
raw - Tokenized sentence (space-separated)
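
Because raw is space-separated, a client can add its own highlighting. The Python sketch below wraps matching tokens in <mark> tags. It is a simplification: it compares surface forms case-insensitively, whereas the server matches lemmas, so an inflected form such as "Houses" is not marked here. The function name highlight is hypothetical.

```python
from html import escape


def highlight(raw, targets):
    """Wrap tokens of a space-separated sentence in <mark> if they match a target word."""
    wanted = {target.lower() for target in targets}
    parts = []
    for token in raw.split():
        text = escape(token)  # keep the output safe to embed in HTML
        if token.lower() in wanted:
            text = "<mark>" + text + "</mark>"
        parts.append(text)
    return " ".join(parts)


print(highlight("The big house ! - The big house .", ["house", "big"]))
```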

Use Cases:

Validate collocations before citing in research
Understand usage contexts and frequency patterns
Discover idiomatic expressions and multi-word units
Quality check corpus tagging and lemmatization

Integration with Web UI:

Word Sketch tab: Click any collocation word to see inline examples
Semantic Field Explorer: Click graph edges to see example sentences
Examples appear in expandable panels below the visualization
Up to 10 examples shown with "Load More" option for additional contexts

Web Interface (Semantic Field Explorer)

The webapp/ directory contains an interactive web interface built with D3.js.

Features

Word Sketch Search
- Browse collocations for any lemma
- Filter by POS tags
- Click any collocation to see example sentences from the corpus
- Examples appear in a panel below with highlighted target words
- Adjust logDice thresholds
Single-Seed Exploration
- Bootstrap from one seed word
- Select grammatical relation
- Discover semantically similar words
- Force-directed graph visualization
Multi-Seed Exploration
- Explore from multiple seeds at once
- See all collocates per seed
- Identify common patterns
- Cluster-based semantic field analysis

Start Both Services

# Terminal 1: API Server
java -jar target/concept-sketch-1.6.0-shaded.jar server --index <corpus_path> --port 8080

# Terminal 2: Web Server
python -m http.server 3000 --directory webapp

# Open browser to http://localhost:3000

To configure a non-default CORS origin (e.g., for production), use the cors.allow.origin JVM property:

java -Dcors.allow.origin=https://myapp.example.com \
    -jar target/concept-sketch-1.6.0-shaded.jar server --index <corpus_path> --port 8080

CQL Pattern Syntax

Basic Patterns

Pattern	Meaning
`"house"`	Match lemma "house"
`[tag="NN.*"]`	Match POS tag regex (nouns)
`[tag="JJ"]`	Match exact POS tag
`[word="the"]`	Match word form

Constraints

[tag="JJ.*"]              # Adjectives (any type)
[tag="VB.*"]              # Verbs (any type)
[tag="NN.*"]              # Nouns
[tag!="NN.*"]             # NOT nouns
[tag="JJ"|tag="RB"]       # Adjectives OR adverbs

Distance Modifiers

[tag="JJ"]                # Adjacent (distance = 1)
[tag="JJ"] ~ {0,3}        # Within 0-3 words
[tag="JJ"] ~ {1,5}        # 1-5 words apart

Examples

# Adjectives modifying a noun
[tag="jj.*"]

# Verbs taking noun as object
[tag="vb.*"]

# Adjectives within 3 words
[tag="jj.*"] ~ {0,3}

Architecture

Query Pipeline

User Input
    ↓
Grammar Config (grammars/relations.json)
    ↓
BCQL Pattern → BlackLab CQL Parser (library)
    ↓
Lucene SpanQuery Compiler (BlackLab library)
    ↓
Index Lookup (Lucene — BlackLab-managed)
    ↓
logDice Scorer (LogDiceUtils.java)
    ↓
Response (JSON via SketchResponseAssembler / ExploreResponseAssembler)

Index Structure (BlackLab Annotations)

BlackLab manages the index; annotations are derived from CoNLL-U columns:

Annotation	Source column	Forward index	Purpose
`word`	FORM (col 2)	✓	Raw word form
`lemma`	LEMMA (col 3)	✓	Lemma for search
`pos`	UPOS (col 4)	✓	Universal POS tag
`xpos`	XPOS (col 5)	✓	Language-specific POS tag (used in grammar patterns)
`deprel`	DEPREL (col 8)	✓	Dependency relation label (used in DEP relations)
`wordnum`	ID (col 1)	—	Token position in sentence
`feats`	FEATS (col 6)	—	Morphological features
`head`	HEAD (col 7)	—	Dependency head ID

Collocation Computation

logDice (Default)

logDice = log₂(2 * f(A,B) / (f(A) + f(B))) + 14

Scale: 0-14 in practice (14 = perfect association, reached when f(A,B) = f(A) = f(B); the formula can drop below 0 for very weak pairs)
Symmetric measure - same value regardless of direction
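
The formula translates directly into code. The Python sketch below is an independent illustration of the formula above, not a port of LogDiceUtils.java, and the function name log_dice is an assumption.

```python
import math


def log_dice(f_ab, f_a, f_b):
    """logDice = 14 + log2(2 * f(A,B) / (f(A) + f(B)))."""
    if f_ab <= 0 or f_a <= 0 or f_b <= 0:
        raise ValueError("all frequencies must be positive")
    return 14 + math.log2(2 * f_ab / (f_a + f_b))


# Perfect association: the two words only ever occur together.
print(log_dice(100, 100, 100))
# Halving the co-occurrence count lowers the score by exactly one point.
print(log_dice(50, 100, 100))
```

Corpus size does not appear in the formula, which is why logDice scores can be compared across corpora of different sizes.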

MI3 (Mutual Information)

MI3 = log₂((f(A,B)³ * N) / (f(A) * f(B)))

Higher values indicate stronger association
Cubing f(A,B) offsets the bias of plain mutual information toward rare collocations

T-Score

T = (f(A,B) - expected) / sqrt(f(A,B))
where expected = (f(A) * f(B)) / N

Measures statistical significance
Higher absolute values indicate more significant associations

Log-Likelihood (G-squared)

G2 = 2 * f(A,B) * log(f(A,B) / expected)

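Measures deviance from expected co-occurrence
Higher values indicate greater statistical significance
The formula above is the single-cell term; the full statistic sums the same term over all four cells of the 2×2 contingency table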

Parameters:

f(A,B) = co-occurrence frequency (collocate with headword)
f(A) = headword frequency
f(B) = collocate total frequency
N = total tokens in corpus
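
As a worked illustration of these parameters, the Python sketch below computes the expected co-occurrence frequency and the simplified log-likelihood term exactly as printed above. The counts in the usage lines are invented for illustration and do not come from any corpus, and the function names are hypothetical.

```python
import math


def expected_frequency(f_a, f_b, n):
    """Co-occurrence count expected if A and B were independent: f(A) * f(B) / N."""
    return f_a * f_b / n


def g2_single_term(f_ab, f_a, f_b, n):
    """Simplified G2 as printed above: 2 * f(A,B) * ln(f(A,B) / expected)."""
    if f_ab == 0:
        return 0.0  # the term tends to 0 as f(A,B) goes to 0
    return 2 * f_ab * math.log(f_ab / expected_frequency(f_a, f_b, n))


# Invented counts: headword 1,000 times, collocate 500 times, 1,000,000 tokens.
print(expected_frequency(1000, 500, 1_000_000))
print(g2_single_term(30, 1000, 500, 1_000_000))
```

When the observed count equals the expected count, the logarithm is zero and so is the score, which matches the reading of G2 as deviance from expectation.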

Query API:

The server uses logDice scoring by default. Simply query the sketch endpoint:

curl "http://localhost:8080/api/sketch/house"

Project Structure

concept-sketch/
├── src/main/java/pl/marcinmilkowski/word_sketch/
│   ├── Main.java                    # CLI entry point
│   ├── api/
│   │   ├── WordSketchApiServer.java          # REST API server (14+ endpoints)
│   │   ├── ComparisonResponseAssembler.java   # Builds JSON responses for comparison results
│   │   ├── ConcordanceHandlers.java          # Handlers for concordance/examples endpoints
│   │   ├── CorpusQueryHandlers.java          # Handler for BCQL corpus query endpoint
│   │   ├── ExplorationHandlers.java          # Handlers for semantic field exploration endpoints
│   │   ├── ExploreResponseAssembler.java     # Builds JSON response maps for exploration results
│   │   ├── ExportUtils.java                  # CSV/TSV export utilities
│   │   ├── HttpApiUtils.java                 # HTTP utilities: sendJsonResponse, CORS, method enforcement
│   │   ├── RequestEntityTooLargeException.java  # RuntimeException for HTTP 413 responses
│   │   ├── SketchHandlers.java               # Handlers for word sketch endpoints
│   │   ├── SketchResponseAssembler.java      # Builds JSON responses for word sketch results
│   │   ├── VisualizationHandlers.java        # Handler for radial plot endpoint (POST)
│   │   └── model/                            # API-specific DTOs
│   │       ├── CollocateEntry.java
│   │       ├── CollocateProfileEntry.java
│   │       ├── ComparisonResponse.java
│   │       ├── CoreCollocateEntry.java
│   │       ├── DiscoveredNounEntry.java
│   │       ├── EdgeEntry.java
│   │       ├── ExampleEntry.java
│   │       ├── ExamplesResponse.java
│   │       ├── ExploreResponse.java
│   │       ├── RelationEntry.java
│   │       ├── RelationListEntry.java
│   │       ├── RelationListResponse.java
│   │       ├── SeedCollocateEntry.java
│   │       └── SketchResponse.java
│   ├── config/
│   │   ├── GrammarConfig.java                # Immutable grammar configuration (relations, version)
│   │   ├── GrammarConfigLoader.java          # Loads grammar config from JSON
│   │   ├── RelationConfig.java               # Single relation: pattern, relation_type
│   │   └── RelationUtils.java               # Relation validation, alias resolution
│   ├── exploration/
│   │   ├── CollocateProfileComparator.java   # Compares adjective profiles across seed nouns
│   │   ├── ExplorationException.java         # Unchecked exception for corpus access failures
│   │   ├── MultiSeedExplorer.java            # Multi-seed semantic field exploration
│   │   ├── SemanticFieldExplorer.java        # Coordination facade for SEF (single + multi seed)
│   │   ├── SingleSeedExplorer.java           # Core single-seed exploration algorithm
│   │   └── spi/
│   │       └── ExplorationService.java       # Public SPI interface for all exploration operations
│   ├── indexer/
│   │   └── blacklab/
│   │       ├── BlackLabConllUIndexer.java    # CoNLL-U corpus indexer for BlackLab
│   │       └── ConlluConverter.java          # Converts CoNLL-U to WPL chunk format
│   ├── model/
│   │   ├── PosGroup.java                     # POS group enum: NOUN, VERB, ADJ, ADV, OTHER
│   │   ├── RelationType.java                 # Enum: SURFACE | DEP
│   │   ├── exploration/
│   │   │   ├── CollocateProfile.java         # Adjective collocate profile for SEF comparison
│   │   │   ├── ComparisonResult.java         # Result DTO for compareCollocateProfiles()
│   │   │   ├── CoreCollocate.java            # High-coverage shared collocate
│   │   │   ├── DiscoveredNoun.java           # Noun discovered via shared adjectives
│   │   │   ├── Edge.java                     # Graph edge for D3.js visualization
│   │   │   ├── ExplorationOptions.java       # Base options for SEF exploration
│   │   │   ├── ExplorationResult.java        # Top-level result DTO for SEF exploration
│   │   │   ├── FetchExamplesOptions.java     # Options for fetchExamples
│   │   │   ├── FetchExamplesResult.java      # Result DTO for fetchExamples()
│   │   │   ├── RelationEdgeType.java         # Enum for edge types in exploration graphs
│   │   │   ├── SharingCategory.java          # Enum: FULLY_SHARED, PARTIALLY_SHARED, SPECIFIC
│   │   │   └── SingleSeedExplorationOptions.java  # Options for single-seed exploration
│   │   └── sketch/
│   │       ├── BcqlPage.java                 # Paginated BCQL query result
│   │       ├── CollocateResult.java          # A single collocate hit with sentence context
│   │       ├── ConcordanceHit.java           # A concordance hit for a word pair
│   │       ├── ConcordanceResult.java        # A concordance (KWIC) result entry
│   │       └── WordSketchResult.java         # Top-level word sketch result with logDice score
│   ├── query/
│   │   ├── BlackLabQueryExecutor.java        # BlackLab-backed query executor
│   │   ├── BlackLabSnippetParser.java        # Parses BlackLab XML snippets
│   │   ├── CollocateQueryHelper.java         # Low-level collocate frequency/example lookup
│   │   ├── QueryExecutor.java               # Wide query executor interface (extends SPI ports)
│   │   └── spi/
│   │       ├── CollocateQueryPort.java       # Narrow SPI: collocate-frequency-focused queries
│   │       └── SketchQueryPort.java          # Narrow SPI: word-sketch-pattern queries
│   ├── utils/
│   │   ├── CqlUtils.java                    # CQL parsing: splitCqlTokens, escapeForRegex
│   │   ├── JsonUtils.java                   # JSON serialization helpers
│   │   ├── LogDiceUtils.java                # logDice scoring
│   │   └── MathUtils.java                   # Math utilities: round2dp
│   └── viz/
│       └── RadialPlot.java                  # Radial plot data builder
├── webapp/
│   ├── index.html                   # Web UI (D3.js visualization)
│   └── assets/                      # CSS, D3.js
├── grammars/
│   └── relations.json               # BCQL grammar config (40+ relations)
├── scripts/
│   └── conllu_to_wpl.py             # CoNLL-U to WPL preprocessor
├── src/test/java/                   # 40+ unit tests
├── pom.xml                          # Maven config
└── README.md                        # This file

Technical Deep Dive

Concordance Examples Implementation

The concordance feature efficiently retrieves example sentences containing word pairs using a two-stage approach:

Stage 1: SpanQuery for Fast Document Retrieval

// Build SpanNearQuery: both lemmas within 10 words
SpanTermQuery span1 = new SpanTermQuery(new Term("lemma", "house"));
SpanTermQuery span2 = new SpanTermQuery(new Term("lemma", "big"));

SpanNearQuery nearQuery = SpanNearQuery.newUnorderedNearQuery("lemma")
    .addClause(span1)
    .addClause(span2)
    .setSlop(10)  // Max distance: 10 tokens
    .build();

TopDocs results = searcher.search(nearQuery, limit);

Stage 2: DocValues Decoding for Token Details

// For each matching document, decode tokens from BinaryDocValues.
// reader is the LeafReader of the segment; docId is segment-relative.
BinaryDocValues tokensDV = reader.getBinaryDocValues("tokens");

// advanceExact returns false when the document has no value for this field
if (tokensDV != null && tokensDV.advanceExact(docId)) {
    BytesRef tokensBytes = tokensDV.binaryValue();

    // Decode using TokenSequenceCodec
    List<Token> tokens = TokenSequenceCodec.decode(tokensBytes);

    // Each token contains: position, word, lemma, tag, startOffset, endOffset
}

Why This Design?

Compact Storage: Tokens stored as binary (varint encoding) instead of separate fields
- Typical sentence (~20 tokens): 400-600 bytes vs 1-2KB for separate fields
- 62M sentence corpus: ~30GB vs ~80GB storage
Fast Retrieval:
- SpanQuery uses inverted index with positions → O(log N) lookup
- DocValues provide O(1) document access (memory-mapped)
- No need to reconstruct from stored text
Position Accuracy:
- Positions preserved from tagging pipeline
- Support for multi-word tokens and contractions
- Exact alignment with original text offsets

Binary Encoding Format (TokenSequenceCodec):

[token_count: varint]
For each token:
  [position: varint]
  [word_length: varint][word: UTF-8]
  [lemma_length: varint][lemma: UTF-8]
  [tag_length: varint][tag: UTF-8]
  [start_offset: varint]
  [end_offset: varint]

Varint encoding saves space for common cases (positions < 128 = 1 byte).
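
The layout above can be exercised with a short round-trip sketch in Python. It is not TokenSequenceCodec itself. It assumes unsigned LEB128-style varints (7 data bits per byte, high bit as continuation flag), which the format description does not spell out, and all function names are hypothetical.

```python
def write_varint(value, out):
    """Append an unsigned varint (7 data bits per byte) to a bytearray."""
    if value < 0:
        raise ValueError("varints here are unsigned")
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return


def read_varint(data, pos):
    """Read a varint starting at pos; return (value, next position)."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def write_string(text, out):
    raw = text.encode("utf-8")
    write_varint(len(raw), out)  # length prefix counts bytes, not characters
    out.extend(raw)


def read_string(data, pos):
    length, pos = read_varint(data, pos)
    return data[pos:pos + length].decode("utf-8"), pos + length


def encode_tokens(tokens):
    """tokens: iterable of (position, word, lemma, tag, start_offset, end_offset)."""
    out = bytearray()
    write_varint(len(tokens), out)
    for position, word, lemma, tag, start, end in tokens:
        write_varint(position, out)
        for text in (word, lemma, tag):
            write_string(text, out)
        write_varint(start, out)
        write_varint(end, out)
    return bytes(out)


def decode_tokens(data):
    count, pos = read_varint(data, 0)
    tokens = []
    for _ in range(count):
        position, pos = read_varint(data, pos)
        word, pos = read_string(data, pos)
        lemma, pos = read_string(data, pos)
        tag, pos = read_string(data, pos)
        start, pos = read_varint(data, pos)
        end, pos = read_varint(data, pos)
        tokens.append((position, word, lemma, tag, start, end))
    return tokens
```

Under this encoding, any value below 128 occupies a single byte, so for short sentences the per-token overhead beyond the strings themselves is six bytes: position, three length prefixes, and two offsets.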

Dependency Sketches

What are Dependency Sketches?

Dependency sketches are visual or data-driven representations of how words relate to each other based on syntactic dependencies in the corpus. They help users understand grammatical and semantic relationships beyond simple collocations, leveraging dependency parsing to reveal patterns such as subject, object, modifier, and predicate relations.

Usage

Dependency sketches are generated from parsed corpora (e.g., CoNLL-U format) and can be explored via the API and web UI. They provide insights into grammatical structures and are useful for linguistic analysis, semantic field exploration, and advanced querying. Dependency relations are defined in grammars/relations.json with "relation_type": "DEP" and constrain collocates by deprel annotation.

API Endpoints

# Full dependency sketch for a lemma
curl "http://localhost:8080/api/sketch/theory/dep"

# Specific dependency relation
curl "http://localhost:8080/api/sketch/theory/dep/dep_nsubj"

# List available dependency relations
curl "http://localhost:8080/api/relations/dep"

See also: MULTI_SEED_EXPLORATION.md for advanced semantic field features.

Usage Examples

Example 1: Adjectives Describing "Theory"

curl "http://localhost:8080/api/sketch/theory"

Result: Top collocates for "theory"

correct (logDice: 4.21)
practical (logDice: 3.73)
wrong (logDice: 3.58)
mathematical (logDice: 3.47)
quantum (logDice: 2.89)

Example 2: Verbs That Take "House" as Object

curl "http://localhost:8080/api/semantic-field/explore?seed=house&relation=object_of&top=10"

Result: Verbs that take "house" as object

locate (logDice: 5.12)
build (logDice: 4.89)
buy (logDice: 4.21)

Discovered nouns (words that share these verbs):

hotel (shared: build, locate)
apartment (shared: build, buy, locate)
property (shared: buy, locate)
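
The discovery step shown in this example amounts to intersecting collocate sets. The Python sketch below reproduces the listing above from the example's own data. It illustrates the idea and is not the SingleSeedExplorer implementation. The min_shared threshold and the function name discover_nouns are assumptions.

```python
def discover_nouns(seed_collocates, candidates, min_shared=2):
    """Rank candidate nouns by how many collocates they share with the seed."""
    seed_set = set(seed_collocates)
    found = []
    for noun, collocates in candidates.items():
        shared = sorted(seed_set & set(collocates))
        if len(shared) >= min_shared:
            found.append({
                "word": noun,
                "shared_count": len(shared),
                "shared_collocates": shared,
            })
    # Most shared collocates first; ties broken alphabetically.
    found.sort(key=lambda entry: (-entry["shared_count"], entry["word"]))
    return found


house_verbs = ["locate", "build", "buy"]
candidates = {
    "hotel": ["build", "locate"],
    "apartment": ["build", "buy", "locate"],
    "property": ["buy", "locate"],
}
for entry in discover_nouns(house_verbs, candidates):
    print(entry["word"], entry["shared_count"], entry["shared_collocates"])
```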

Example 3: Multi-Seed Cluster Analysis

curl "http://localhost:8080/api/semantic-field/explore-multi?seeds=dog,cat,horse&relation=subject_of&top=8"

Result: What do dogs, cats, and horses do?

All seeds can: eat, run, live
Dog-specific: bark, beg, fetch
Cat-specific: meow, purr, scratch
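
The grouping in this result can be sketched as a classification of each collocate by the number of seeds it occurs with. The Python sketch below borrows the category names from the SharingCategory enum listed under Project Structure. The thresholds (all seeds, more than one seed, exactly one seed) are an assumption about what those names mean, and the horse entry contains only the shared verbs because the example lists no horse-specific ones.

```python
from collections import defaultdict


def categorize(collocates_by_seed):
    """Label each collocate by how many seeds it co-occurs with."""
    seeds_for = defaultdict(set)
    for seed, collocates in collocates_by_seed.items():
        for collocate in collocates:
            seeds_for[collocate].add(seed)
    total = len(collocates_by_seed)
    categories = {}
    for collocate, seeds in seeds_for.items():
        if len(seeds) == total:
            categories[collocate] = "FULLY_SHARED"
        elif len(seeds) > 1:
            categories[collocate] = "PARTIALLY_SHARED"
        else:
            categories[collocate] = "SPECIFIC"
    return categories


verbs = {
    "dog": ["eat", "run", "live", "bark", "beg", "fetch"],
    "cat": ["eat", "run", "live", "meow", "purr", "scratch"],
    "horse": ["eat", "run", "live"],
}
labels = categorize(verbs)
print(sorted(v for v, c in labels.items() if c == "FULLY_SHARED"))
print(sorted(v for v, c in labels.items() if c == "SPECIFIC"))
```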

Development

Run Tests

mvn test

Build Documentation

See plans/ directory for:

concept-sketch-spec.md - Overall technical specification
precomputed-collocations-spec.md - Precomputed algorithm details
hybrid-index-spec.md - Hybrid index architecture

Code Quality

Tests cover:

Grammar config loading and validation
BCQL pattern construction and alias resolution
logDice calculation
API endpoints (sketch, exploration, concordance, visualization)
Multi-seed and single-seed exploration
Concordance retrieval and snippet parsing
Indexer (CoNLL-U conversion and BlackLab indexing)
