Fuzzy Matching: Near-Duplicate Detection

Overview

When comparing extracted Notepad tabs against historical sessions, notepad-cleanup identifies two kinds of duplicates:

Exact matches: Identical content (same SHA-256 hash after normalization)
Near-matches: Content that differs by a small number of characters (e.g., a to-do list with one new item added)

Near-match detection uses a heuristic threshold that scales with file size -- smaller files get stricter matching, larger files get more tolerance.

The Threshold Formula

allowed_diff = 1.396 * ln(size)^2 - 6.75 * ln(size) + 10.14

Where:

size = file length in characters (after normalization)
ln = natural logarithm
allowed_diff = maximum number of character differences for a file to be considered a near-match (minimum 1)

What the formula produces

File Size	Allowed Diff	Percentage
10 chars	2 chars	20.0%
25 chars	3 chars	12.0%
50 chars	5 chars	10.0%
100 chars	9 chars	9.0%
200 chars	14 chars	7.0%
500 chars	22 chars	4.4%
1,000 chars	30 chars	3.0%
2,000 chars	39 chars	1.9%
5,000 chars	54 chars	1.1%
10,000 chars	66 chars	0.7%
50,000 chars	101 chars	0.2%

You can view this table anytime with:

notepad-cleanup compare <folder> --show-threshold

How We Derived the Formula

Step 1: Heuristic anchor points

We started with human intuition about what "similar enough" means at different file sizes:

A 10-character note changing by 2 characters is still recognizably the same
A 25-character note changing by 3 characters is borderline
A 50-character note with 5 character changes is probably the same note with minor edits
A 1,000-character note (a typical short document) with 30 character changes is likely an updated version
A 50,000-character file with 100 character changes is almost certainly the same file

These anchor points reflect a key observation: tolerance should grow sublinearly. Doubling the file size should not double the allowed differences -- a 100KB file with 2000 changes is clearly a different file, even though 2000/100000 is only 2%.

Step 2: Candidate curve fitting

We evaluated five families of curves against the anchor points:

Candidate	Formula	Avg Error
A. Square root	`0.707 * sqrt(size)`	20.9%
B. Logarithmic	`8.34 * ln(size) - 27.6`	37.4%
C. Power law	`0.539 * size^0.569`	34.2%
D. Sqrt capped	`sqrt + 10% cap`	26.2%
E. Power capped	`power + 10% cap`	41.9%

None of these fit well across all anchor points. The problem: the anchors follow different growth rates at different scales -- strict for tiny files, generous in the "typical note" range, flattening for large files.

Step 3: Log-quadratic discovery

Plotting allowed_diff against ln(size) revealed an approximately quadratic relationship. We solved the system of equations using three anchor points (size=50, size=1000, size=50000):

5   = a * ln(50)^2    + b * ln(50)    + c
30  = a * ln(1000)^2  + b * ln(1000)  + c
100 = a * ln(50000)^2 + b * ln(50000) + c

Solving:

a = 1.396
b = -6.75
c = 10.14

Step 4: Validation

The resulting formula was validated against all seven anchor points:

Size	Expected	Actual	Error
10	2	2.0	0.0%
25	3	2.9	4.1%
50	5	5.1	2.0%
200	15	13.6	9.6%
1,000	30	30.1	0.4%
5,000	50	53.9	7.8%
50,000	100	100.5	0.5%

Average error: 3.5% -- every anchor point within 10%.

The exploration scripts used during derivation are preserved in tests/one-offs/test_threshold_curve.py.

How Character Differences Are Counted

We use Python's difflib.SequenceMatcher to find the longest common subsequences between two normalized texts, then count the characters that don't appear in any matching block:

matcher = difflib.SequenceMatcher(None, text_a, text_b, autojunk=False)
matching_chars = sum(block.size for block in matcher.get_matching_blocks())
diff_count = max(len(text_a), len(text_b)) - matching_chars

This means:

Insertions count as differences (adding "- walk dog\n" to a list = ~11 chars)
Deletions count as differences
Substitutions count as differences ("Tuesday" -> "Wednesday" = a few chars)
Reordering lines counts as many differences (the matching blocks algorithm won't see them as the same)

Text normalization before comparison

Before hashing or comparing, text is normalized:

Line endings converted to \n (strips \r)
Trailing whitespace stripped from each line
Trailing empty lines stripped

This prevents false differences from Windows vs Unix line endings or trailing spaces.

Using a Custom Formula

If the default threshold doesn't match your workflow, you can customize it.

Option 1: Override the coefficients

Set environment variables to change the curve shape:

# More permissive (allow more differences)
set NOTEPAD_CLEANUP_THRESH_A=1.8
set NOTEPAD_CLEANUP_THRESH_B=-8.0
set NOTEPAD_CLEANUP_THRESH_C=12.0

# More strict (fewer allowed differences)
set NOTEPAD_CLEANUP_THRESH_A=1.0
set NOTEPAD_CLEANUP_THRESH_B=-5.0
set NOTEPAD_CLEANUP_THRESH_C=7.0

Option 2: Derive your own formula

Decide on your own anchor points. Ask yourself: "For a file of N characters, how many character changes still mean 'same file'?"

# Your anchor points: {file_size: allowed_diff}
anchors = {
    50: 3,       # stricter than default
    1000: 20,    # stricter
    50000: 80,   # stricter
}

Solve for coefficients. With three anchor points (size_1, y_1), (size_2, y_2), (size_3, y_3), solve:

y_1 = a * ln(size_1)^2 + b * ln(size_1) + c
y_2 = a * ln(size_2)^2 + b * ln(size_2) + c
y_3 = a * ln(size_3)^2 + b * ln(size_3) + c

Or use the provided fitting script:

python tests/one-offs/test_threshold_curve.py

Modify the anchors dict at the top to use your values, and it will evaluate multiple curve families and show the best fit.

Set your coefficients via environment variables or config.

Option 3: Size-based fuzzy control

Instead of changing the formula, control which files get fuzzy matching at all:

notepad-cleanup compare <folder> --fuzzy "lte 100KB"    # only fuzzy for files up to 100KB
notepad-cleanup compare <folder> --fuzzy "lte 10KB"     # very conservative
notepad-cleanup compare <folder> --fuzzy all             # fuzzy everything (slow for large files)
notepad-cleanup compare <folder> --no-fuzzy              # exact matches only

Size operators: lt, lte, gt, gte, eq. Size units: B, KB, MB, GB, TB.

Performance Notes

Fuzzy matching is O(n * m) per comparison where n and m are the file sizes in characters. For each new file that isn't an exact hash match, it's compared against every unique file in the historical index.

With 100 new files and 150 unique historical files, that's up to 15,000 pairwise comparisons. Pre-filters reduce this:

Exact hash match eliminates true duplicates instantly
Size ratio filter (>2x difference) skips obviously different files
Size threshold (default 50KB) skips large files entirely
Only after all pre-filters pass does the expensive SequenceMatcher run

For typical notepad-cleanup sessions (50-100 small text files), comparison completes in seconds. If you have large files (>50KB), use --fuzzy small (the default) to skip them.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Fuzzy Matching: Near-Duplicate Detection

Overview

The Threshold Formula

What the formula produces

How We Derived the Formula

Step 1: Heuristic anchor points

Step 2: Candidate curve fitting

Step 3: Log-quadratic discovery

Step 4: Validation

How Character Differences Are Counted

Text normalization before comparison

Using a Custom Formula

Option 1: Override the coefficients

Option 2: Derive your own formula

Option 3: Size-based fuzzy control

Performance Notes

Uh oh!

Uh oh!

FilesExpand file tree

fuzzy-matching.md

Latest commit

History

fuzzy-matching.md

File metadata and controls

Fuzzy Matching: Near-Duplicate Detection

Overview

The Threshold Formula

What the formula produces

How We Derived the Formula

Step 1: Heuristic anchor points

Step 2: Candidate curve fitting

Step 3: Log-quadratic discovery

Step 4: Validation

How Character Differences Are Counted

Text normalization before comparison

Using a Custom Formula

Option 1: Override the coefficients

Option 2: Derive your own formula

Option 3: Size-based fuzzy control

Performance Notes