When comparing extracted Notepad tabs against historical sessions, notepad-cleanup identifies two kinds of duplicates:
- Exact matches: Identical content (same SHA-256 hash after normalization)
- Near-matches: Content that differs by a small number of characters (e.g., a to-do list with one new item added)
Near-match detection uses a heuristic threshold that scales with file size -- smaller files get stricter matching, larger files get more tolerance.
allowed_diff = 1.396 * ln(size)^2 - 6.75 * ln(size) + 10.14
Where:
size= file length in characters (after normalization)ln= natural logarithmallowed_diff= maximum number of character differences for a file to be considered a near-match (minimum 1)
| File Size | Allowed Diff | Percentage |
|---|---|---|
| 10 chars | 2 chars | 20.0% |
| 25 chars | 3 chars | 12.0% |
| 50 chars | 5 chars | 10.0% |
| 100 chars | 9 chars | 9.0% |
| 200 chars | 14 chars | 7.0% |
| 500 chars | 22 chars | 4.4% |
| 1,000 chars | 30 chars | 3.0% |
| 2,000 chars | 39 chars | 1.9% |
| 5,000 chars | 54 chars | 1.1% |
| 10,000 chars | 66 chars | 0.7% |
| 50,000 chars | 101 chars | 0.2% |
You can view this table anytime with:
notepad-cleanup compare <folder> --show-thresholdWe started with human intuition about what "similar enough" means at different file sizes:
- A 10-character note changing by 2 characters is still recognizably the same
- A 25-character note changing by 3 characters is borderline
- A 50-character note with 5 character changes is probably the same note with minor edits
- A 1,000-character note (a typical short document) with 30 character changes is likely an updated version
- A 50,000-character file with 100 character changes is almost certainly the same file
These anchor points reflect a key observation: tolerance should grow sublinearly. Doubling the file size should not double the allowed differences -- a 100KB file with 2000 changes is clearly a different file, even though 2000/100000 is only 2%.
We evaluated five families of curves against the anchor points:
| Candidate | Formula | Avg Error |
|---|---|---|
| A. Square root | 0.707 * sqrt(size) |
20.9% |
| B. Logarithmic | 8.34 * ln(size) - 27.6 |
37.4% |
| C. Power law | 0.539 * size^0.569 |
34.2% |
| D. Sqrt capped | sqrt + 10% cap |
26.2% |
| E. Power capped | power + 10% cap |
41.9% |
None of these fit well across all anchor points. The problem: the anchors follow different growth rates at different scales -- strict for tiny files, generous in the "typical note" range, flattening for large files.
Plotting allowed_diff against ln(size) revealed an approximately quadratic relationship. We solved the system of equations using three anchor points (size=50, size=1000, size=50000):
5 = a * ln(50)^2 + b * ln(50) + c
30 = a * ln(1000)^2 + b * ln(1000) + c
100 = a * ln(50000)^2 + b * ln(50000) + c
Solving:
a = 1.396b = -6.75c = 10.14
The resulting formula was validated against all seven anchor points:
| Size | Expected | Actual | Error |
|---|---|---|---|
| 10 | 2 | 2.0 | 0.0% |
| 25 | 3 | 2.9 | 4.1% |
| 50 | 5 | 5.1 | 2.0% |
| 200 | 15 | 13.6 | 9.6% |
| 1,000 | 30 | 30.1 | 0.4% |
| 5,000 | 50 | 53.9 | 7.8% |
| 50,000 | 100 | 100.5 | 0.5% |
Average error: 3.5% -- every anchor point within 10%.
The exploration scripts used during derivation are preserved in tests/one-offs/test_threshold_curve.py.
We use Python's difflib.SequenceMatcher to find the longest common subsequences between two normalized texts, then count the characters that don't appear in any matching block:
matcher = difflib.SequenceMatcher(None, text_a, text_b, autojunk=False)
matching_chars = sum(block.size for block in matcher.get_matching_blocks())
diff_count = max(len(text_a), len(text_b)) - matching_charsThis means:
- Insertions count as differences (adding "- walk dog\n" to a list = ~11 chars)
- Deletions count as differences
- Substitutions count as differences ("Tuesday" -> "Wednesday" = a few chars)
- Reordering lines counts as many differences (the matching blocks algorithm won't see them as the same)
Before hashing or comparing, text is normalized:
- Line endings converted to
\n(strips\r) - Trailing whitespace stripped from each line
- Trailing empty lines stripped
This prevents false differences from Windows vs Unix line endings or trailing spaces.
If the default threshold doesn't match your workflow, you can customize it.
Set environment variables to change the curve shape:
# More permissive (allow more differences)
set NOTEPAD_CLEANUP_THRESH_A=1.8
set NOTEPAD_CLEANUP_THRESH_B=-8.0
set NOTEPAD_CLEANUP_THRESH_C=12.0
# More strict (fewer allowed differences)
set NOTEPAD_CLEANUP_THRESH_A=1.0
set NOTEPAD_CLEANUP_THRESH_B=-5.0
set NOTEPAD_CLEANUP_THRESH_C=7.0- Decide on your own anchor points. Ask yourself: "For a file of N characters, how many character changes still mean 'same file'?"
# Your anchor points: {file_size: allowed_diff}
anchors = {
50: 3, # stricter than default
1000: 20, # stricter
50000: 80, # stricter
}- Solve for coefficients. With three anchor points (size_1, y_1), (size_2, y_2), (size_3, y_3), solve:
y_1 = a * ln(size_1)^2 + b * ln(size_1) + c
y_2 = a * ln(size_2)^2 + b * ln(size_2) + c
y_3 = a * ln(size_3)^2 + b * ln(size_3) + c
Or use the provided fitting script:
python tests/one-offs/test_threshold_curve.pyModify the anchors dict at the top to use your values, and it will evaluate multiple curve families and show the best fit.
- Set your coefficients via environment variables or config.
Instead of changing the formula, control which files get fuzzy matching at all:
notepad-cleanup compare <folder> --fuzzy "lte 100KB" # only fuzzy for files up to 100KB
notepad-cleanup compare <folder> --fuzzy "lte 10KB" # very conservative
notepad-cleanup compare <folder> --fuzzy all # fuzzy everything (slow for large files)
notepad-cleanup compare <folder> --no-fuzzy # exact matches onlySize operators: lt, lte, gt, gte, eq. Size units: B, KB, MB, GB, TB.
Fuzzy matching is O(n * m) per comparison where n and m are the file sizes in characters. For each new file that isn't an exact hash match, it's compared against every unique file in the historical index.
With 100 new files and 150 unique historical files, that's up to 15,000 pairwise comparisons. Pre-filters reduce this:
- Exact hash match eliminates true duplicates instantly
- Size ratio filter (>2x difference) skips obviously different files
- Size threshold (default 50KB) skips large files entirely
- Only after all pre-filters pass does the expensive
SequenceMatcherrun
For typical notepad-cleanup sessions (50-100 small text files), comparison completes in seconds. If you have large files (>50KB), use --fuzzy small (the default) to skip them.