Tokink is the accompanying library to ScribeTokens: Fixed-Vocabulary Tokenization of Digital Ink, a Byte-Pair Encoding (BPE) tokenizer designed specifically for digital ink (online handwriting). It enables a compressed and discrete representation of digital ink, improving compatibility with the transformer architecture.
pip install tokink

from tokink import Ink, Tokinkizer
from tokink.processor import scale, to_int
# Load or create digital ink
ink = Ink.example() # or Ink.from_json("path/to/ink.json")
# Preprocess: scale down for better compression
ink = to_int(scale(ink, 1/16))
# Initialize tokenizer
tokenizer = Tokinkizer.from_pretrained(vocab_size=32_000)
# Encode ink to tokens
tokens = tokenizer.encode(ink)
# Decode tokens back to ink
reconstructed_ink = tokenizer.decode(tokens)
# Visualize
reconstructed_ink.plot()

Try it interactively: check out examples/quickstart.ipynb for a hands-on notebook walkthrough.
Digital ink is naturally represented as lists of strokes and points, a verbose, continuous-valued format that is awkward for transformers, which work best with discrete, compressed sequences. Naive discretization (one token per coordinate) leads to massive vocabularies and out-of-vocabulary problems.
Tokink takes a different approach inspired by Bresenham's line algorithm, which rasterizes lines on pixelated displays. We decompose all pen movements into 8 directional arrows: ←, →, ↑, ↓, ↖, ↗, ↙, ↘.
For example, rendering a line from (0, 0) to (10, 4):
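A minimal sketch of this decomposition, assuming the standard integer Bresenham walk and a y-axis that increases upward. The names `ARROWS`, `bresenham_steps`, and `to_arrows` are illustrative and are not part of the Tokink API; the library's internal implementation may differ.

```python
# Arrow for each unit step (dx, dy); assumes y increases upward.
ARROWS = {
    (1, 0): "→", (1, 1): "↗", (0, 1): "↑", (-1, 1): "↖",
    (-1, 0): "←", (-1, -1): "↙", (0, -1): "↓", (1, -1): "↘",
}


def bresenham_steps(x0, y0, x1, y1):
    """Return the unit (dx, dy) steps of a Bresenham walk between two integer points."""
    dx, dy = abs(x1 - x0), abs(y1 - y0)
    sx = 1 if x1 > x0 else -1
    sy = 1 if y1 > y0 else -1
    err = dx - dy
    x, y = x0, y0
    steps = []
    while (x, y) != (x1, y1):
        e2 = 2 * err
        step_x = step_y = 0
        if e2 > -dy:  # advance along x
            err -= dy
            x += sx
            step_x = sx
        if e2 < dx:  # advance along y
            err += dx
            y += sy
            step_y = sy
        steps.append((step_x, step_y))
    return steps


def to_arrows(x0, y0, x1, y1):
    """Render the walk as a string of directional arrows."""
    return "".join(ARROWS[step] for step in bresenham_steps(x0, y0, x1, y1))


print(to_arrows(0, 0, 10, 4))  # →↗→↗→→↗→↗→
```

The line needs 10 unit steps (six horizontal, four diagonal), which equals max(|dx|, |dy|) for an 8-direction walk.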
Combined with special [UP] and [DOWN] tokens for pen state, any digital ink can be expressed using just 10 base tokens, giving us a tiny vocabulary, high BPE compression, and zero out-of-vocabulary issues.
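To make the 10-token base vocabulary concrete, here is a rough sketch that turns a list of integer strokes into base tokens. It assumes that [DOWN] is emitted at the start of each stroke, [UP] at its end, and that pen-in-air movement between strokes is also written as arrows; the exact ordering Tokink uses is not specified here, and `unit_steps` and `to_base_tokens` are hypothetical helpers, not library functions.

```python
# Arrow for each unit step (dx, dy); assumes y increases upward.
ARROWS = {
    (1, 0): "→", (1, 1): "↗", (0, 1): "↑", (-1, 1): "↖",
    (-1, 0): "←", (-1, -1): "↙", (0, -1): "↓", (1, -1): "↘",
}


def unit_steps(p, q):
    """Bresenham walk from point p to point q as unit (dx, dy) steps."""
    (x, y), (x1, y1) = p, q
    dx, dy = abs(x1 - x), abs(y1 - y)
    sx = 1 if x1 > x else -1
    sy = 1 if y1 > y else -1
    err = dx - dy
    steps = []
    while (x, y) != (x1, y1):
        e2 = 2 * err
        mx = my = 0
        if e2 > -dy:
            err -= dy
            x += sx
            mx = sx
        if e2 < dx:
            err += dx
            y += sy
            my = sy
        steps.append((mx, my))
    return steps


def to_base_tokens(strokes):
    """Encode strokes (lists of integer points) with arrows plus [DOWN]/[UP]."""
    if not strokes:
        return []
    tokens = []
    pen = strokes[0][0]  # assumption: the pen starts at the first point
    for stroke in strokes:
        # Pen-in-air movement to the start of this stroke.
        tokens += [ARROWS[s] for s in unit_steps(pen, stroke[0])]
        tokens.append("[DOWN]")
        pen = stroke[0]
        for point in stroke[1:]:
            tokens += [ARROWS[s] for s in unit_steps(pen, point)]
            pen = point
        tokens.append("[UP]")
    return tokens


strokes = [[(0, 0), (2, 0)], [(2, 2), (0, 2)]]
print(to_base_tokens(strokes))
# ['[DOWN]', '→', '→', '[UP]', '↑', '↑', '[DOWN]', '←', '←', '[UP]']
```

Whatever the input, every emitted token comes from the same fixed set of 8 arrows and 2 pen-state markers, which is why out-of-vocabulary tokens cannot occur.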
Example Tokenization: Pen strokes are decomposed into unit directional steps via Bresenham's algorithm, then compressed with BPE. Each color denotes a distinct BPE token; faint colors indicate pen-in-air movement between strokes. The zoom shows the sequence of arrows making up an example token.
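The BPE step can be illustrated with a toy merge loop over an arrow sequence. This is a generic byte-pair-encoding sketch, not Tokink's training code: `most_frequent_pair`, `merge_pair`, and `learn_merges` are illustrative names, and a real tokenizer would learn merges over a whole corpus rather than one sequence.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the adjacent token pair that occurs most often in seq."""
    return Counter(zip(seq, seq[1:])).most_common(1)[0][0]


def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of pair with one merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_merges(seq, num_merges):
    """Greedily apply up to num_merges BPE merges; return the new sequence and merges."""
    merges = []
    for _ in range(num_merges):
        if len(seq) < 2:
            break
        pair = most_frequent_pair(seq)
        seq = merge_pair(seq, pair)
        merges.append(pair)
    return seq, merges


base = list("→↗→↗→→↗→↗→")  # 10 base tokens for a line from (0, 0) to (10, 4)
compressed, merges = learn_merges(base, num_merges=2)
print(len(base), "->", len(compressed))  # 10 -> 4
```

Each merged token is just a longer run of arrows, so decoding amounts to concatenating the tokens back into unit steps.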
Complete pipeline for recognizing handwritten text:
from tokink import Ink, Tokinkizer
from tokink.processor import jitter, rotate, scale, to_int
SCALE_FACTOR = 1 / 16
VOCAB_SIZE = 32_000
def preprocess_ink(ink: Ink) -> Ink:
    """Scale down coordinates for better tokenization compression."""
    return scale(ink, SCALE_FACTOR)


def augment_ink(ink: Ink) -> Ink:
    """Apply rotation and jittering for data augmentation."""
    ink = rotate(ink, angle_degrees=5)
    ink = jitter(ink, sigma=0.5)
    return ink
# Load dataset
dataset = [
    (Ink.example(), "By Trevor Williams. A move"),
    # Add more (ink, label) pairs...
]
# Preprocess and augment
processed_data = []
for ink, label in dataset:
    # Original (preprocessed)
    processed_data.append((to_int(preprocess_ink(ink)), label))
    # Augmented (preprocess then augment)
    processed_data.append((to_int(augment_ink(preprocess_ink(ink))), label))
# Tokenize
tokenizer = Tokinkizer.from_pretrained(vocab_size=VOCAB_SIZE)
tokenized_data = [(tokenizer.encode(ink), label) for ink, label in processed_data]
# Train your model with tokenized data
# model.train(tokenized_data)

See examples/htr.py for the complete example.
Generate handwriting from text prompts:
from tokink import Ink, Tokinkizer
from tokink.processor import resample, scale, smooth, to_int
SCALE_FACTOR = 1 / 16
VOCAB_SIZE = 32_000
def postprocess_generated(ink: Ink) -> Ink:
    """
    Post-process generated ink for smooth, natural appearance.

    Steps:
    1. Scale back to original coordinate space
    2. Resample to increase point density
    3. Apply Savitzky-Golay smoothing to reduce tokenization artifacts
    """
    ink = scale(ink, 1 / SCALE_FACTOR)
    ink = resample(ink, sample_every=2)
    ink = smooth(ink)
    return ink
# Initialize tokenizer
tokenizer = Tokinkizer.from_pretrained(vocab_size=VOCAB_SIZE)
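# Generate tokens from your model
# generated_tokens = model.generate("Hello world")
# Stand-in so the snippet runs without a model: tokenize the example ink
generated_tokens = tokenizer.encode(to_int(scale(Ink.example(), SCALE_FACTOR)))
# Decode and post-process
raw_ink = tokenizer.decode(generated_tokens)
smooth_ink = postprocess_generated(raw_ink)
smooth_ink.plot()

See examples/htg.py for the complete example.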
Train a custom tokenizer on your own digital ink dataset:
from tokink import Ink, Tokinkizer
from tokink.processor import scale, to_int
# Load your dataset
dataset = [
    Ink.from_json("sample1.json"),
    Ink.from_json("sample2.json"),
    # ... more ink samples
]
# Preprocess: scale down for better compression
SCALE_FACTOR = 1 / 16
preprocessed = (to_int(scale(ink, SCALE_FACTOR)) for ink in dataset)
# Train tokenizer with custom vocabulary size
tokenizer = Tokinkizer.train(preprocessed, vocab_size=50_000)
# Save for later use
tokenizer.save("my_tokenizer/")
# Load your custom tokenizer
custom_tokenizer = Tokinkizer.from_pretrained("my_tokenizer/")

When to train your own:
- Your ink has unique characteristics (e.g., specific writing styles, languages, or symbols)
- You need a different vocabulary size for your model architecture
- You want to optimize compression for your specific use case
See examples/train_tokenizer.py for a complete example with evaluation and best practices.
- Ink: Represents digital ink as strokes
  - Ink.example(): Load example ink
  - Ink.from_json(path): Load from JSON file
  - plot(): Visualize the ink
- Tokinkizer: BPE tokenizer for digital ink
  - from_pretrained(vocab_size): Load pretrained tokenizer
  - encode(ink): Convert ink to token IDs
  - decode(tokens): Convert token IDs back to ink
All available in tokink.processor:
- scale(ink, factor): Scale coordinates
- rotate(ink, angle_degrees): Rotate ink
- jitter(ink, sigma): Add Gaussian noise for augmentation
- resample(ink, sample_every): Resample points for density control
- smooth(ink): Apply Savitzky-Golay smoothing
- to_int(ink): Convert float coordinates to integers
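To see why scaling down before tokenizing helps compression, the sketch below counts the unit steps needed to trace a stroke before and after scaling. It works on plain coordinate tuples rather than Ink objects, and `scale_and_round` is only a stand-in for `scale` followed by `to_int`; whether `to_int` rounds or truncates is an assumption here.

```python
def count_unit_steps(points):
    """Count the 8-direction unit steps needed to trace a polyline of integer points.

    An 8-direction Bresenham walk between two points takes max(|dx|, |dy|) steps.
    """
    return sum(
        max(abs(x1 - x0), abs(y1 - y0))
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


def scale_and_round(points, factor):
    """Stand-in for scale + to_int: multiply coordinates, then round to integers."""
    return [(round(x * factor), round(y * factor)) for x, y in points]


stroke = [(0, 0), (160, 64), (320, 0)]
print(count_unit_steps(stroke))                            # 320
print(count_unit_steps(scale_and_round(stroke, 1 / 16)))   # 20
```

The base-token count shrinks roughly in proportion to the scale factor, at the cost of spatial resolution, so the factor is a trade-off between sequence length and reconstruction detail.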
MIT License - see LICENSE for details.
Contributions are welcome! Please feel free to submit a Pull Request.