All notable changes to notepad-cleanup will be documented in this file.
- `compare` now finds new-format `nc-*` session folders (#14). Since v0.2.2, `extract` saves sessions using the `nc-YYYY-MM-DD__hh-mm-ss` naming format, but session discovery still only matched the legacy `notepad-cleanup-*` pattern. As a result, any `nc-*` folders in search directories were silently skipped during compare, leading to missed duplicates
- `find_session_dirs()` now iterates both patterns (`notepad-cleanup-*` and `nc-*`) and validates each candidate by checking for `manifest.json`. This rejects false positives like `nc-backups`, `nc-scratch`, or any folder starting with `nc-` that isn't a real extraction
- `_get_session_dir()` helper now matches both formats when walking up from a file
- `DEFAULT_SESSION_PATTERN` (single) is now `DEFAULT_SESSION_PATTERNS` (list). The singular constant is kept for backward compatibility
- Test helper `make_session()` now creates a `manifest.json` marker so test fixtures match the validation logic
- `test_find_sessions_both_formats`: verifies both old and new formats discovered
- `test_find_sessions_rejects_false_positives`: verifies `nc-backups` and similar folders without `manifest.json` are not treated as sessions
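
The discovery rule described above (match either name pattern, then require a `manifest.json`) can be sketched roughly as follows. This is an illustration only, not the project's code: the contents of `DEFAULT_SESSION_PATTERNS` and the function signature are assumptions inferred from the entries above.

```python
from pathlib import Path

# Assumed contents: the changelog names the constant but not its value.
DEFAULT_SESSION_PATTERNS = ["notepad-cleanup-*", "nc-*"]


def find_session_dirs(search_dir, patterns=DEFAULT_SESSION_PATTERNS):
    """Return folders matching any pattern that also contain manifest.json."""
    root = Path(search_dir)
    found = set()
    for pattern in patterns:
        for candidate in root.glob(pattern):
            # A name match alone is not enough: nc-backups or nc-scratch
            # match "nc-*" but hold no manifest, so they are skipped.
            if candidate.is_dir() and (candidate / "manifest.json").is_file():
                found.add(candidate)
    return sorted(found)
```

Checking for the manifest rather than tightening the glob keeps the name patterns simple while still rejecting unrelated folders.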
- Link-aware organize: `organize` now creates symlinks in `organized/` for dedup-linked files instead of copying data. Preserves the connection network so linked files point back to their canonical provenance root. Fallback chain: symlink -> hardlink -> dazzlelink -> copy
- `links` command with `separate` and `join` actions:
  - `links separate --last`: moves symlinked files from `organized/` into `organized-links/` preserving category structure. Shows only new files
  - `links join --last`: moves them back, restoring the full collection
  - Both support `--dry-run` and `--dir-name` for custom directory names
- Organized link manifest (`organized/_organized_links.json`): tracks which files in `organized/` are symlinks vs copies for reliable detection
- Previous session reference in AI prompt: when linked files exist, Claude receives a reference section with category names from previous sessions for naming consistency
- `load_link_manifest()` and `get_linked_paths()` in dedup.py as shared data layer for link-aware operations
- `execute_plan()` now accepts `linked_paths` parameter; checks each file against the dedup link manifest before deciding to copy or symlink
- `generate_prompt()` now accepts `linked_paths` for reference section
- Organize summary shows separate counts for copied vs linked files
- Prompt template (`organize.md`) gains `{skip_section}` and `{reference_section}` template variables
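
The link fallback chain mentioned above might look something like this in outline. The function name `link_or_copy` is hypothetical, and the dazzlelink step is left out because its format is project-specific; only the symlink, hardlink, and copy steps are shown.

```python
import os
import shutil
from pathlib import Path


def link_or_copy(canonical, dest):
    """Try symlink, then hardlink, then copy; return the method used.

    Hypothetical helper: the dazzlelink step of the real chain is omitted.
    """
    canonical, dest = Path(canonical), Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    attempts = (
        ("symlink", lambda: os.symlink(canonical.resolve(), dest)),
        ("hardlink", lambda: os.link(canonical, dest)),
    )
    for name, attempt in attempts:
        try:
            attempt()
            return name
        except (OSError, NotImplementedError):
            # e.g. symlinks may need extra privileges on Windows, and
            # hardlinks cannot cross volumes; try the next method.
            continue
    shutil.copy2(canonical, dest)
    return "copy"
```

Returning the method name makes it straightforward to report separate counts for linked and copied files.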
- docs/parameters.md: full command reference with all options, flags, and examples
- docs/install.md: installation guide (pip, venv, source, Claude CLI)
- Backfill script for ghtraf daily history (tests/one-offs/backfill_ghtraf_history.py)
- README slimmed down: moved per-command details to docs/parameters.md, installation details to docs/install.md. Kept How It Works, Output Structure, and Features
- Restored tree-style output structure with visual hierarchy indicators
- GitHub Traffic Tracker (ghtraf): badge gists, archive gist, traffic-badges workflow with CI trigger, stats dashboard at docs/stats/
- PyPI publishing via Trusted Publisher (OIDC): publish.yml workflow triggers on GitHub Release, builds and uploads automatically
- pyproject.toml (modern Python packaging metadata, replaces setup.py as primary)
- README badges: PyPI version, Release Date, Installs (via ghtraf endpoint)
- setup.py updated with long_description, project_urls, additional classifiers
- README updated with v0.2.0 features, new workflow section, links to docs
- Deduplication system (`compare` command): detect exact and near-duplicate files across historical extraction sessions before organizing with AI
  - Heuristic fuzzy matching with log-quadratic threshold curve (3.5% fit error across anchor points). See `docs/fuzzy-matching.md` for derivation
  - Configurable fuzzy modes: `--fuzzy small` (default, <50KB), `--fuzzy all`, `--fuzzy "lte 100KB"`, `--no-fuzzy`
  - Progress bar with per-file and per-candidate detail showing fuzzy pipeline stage (`[vs: filename [chk:4]]`)
  - Hash caching for fast repeat scans (mtime + size invalidation)
  - Compare results caching (`_compare_results.json`) with staleness detection
  - Historical session indexing: prefers `organized/` files over raw `window*/` when both exist; only indexes known text file extensions
- Filesystem linking (`--link` flag on `compare`): replace duplicates with hardlinks, symlinks, or DazzleLink JSON descriptors
  - Auto-detect best strategy per platform (`--link auto`)
  - Backup originals as `.orig` before linking
  - Confirmation prompt before modifying files
  - Link manifest (`_dedup_links.json`) tracks all operations
- Diff script generation: `compare` auto-generates `_compare_diffs.cmd` (Windows) / `_compare_diffs.sh` (Unix) to spot-check each matched pair in Beyond Compare, WinMerge, VS Code, or other configured diff tool
- `diff` command: find and launch the generated diff script (`diff --last`)
- Configuration system (`config` command, `~/.notepad-cleanup.json`):
  - Unified folder registry with `...` notation (`...` = output, `...1`/`...2` = other folders, `...-1`/`...-2` = recent extractions MRU)
  - `ConfigManager` class in dedicated `config.py` module
  - `config show`, `config add`, `config remove`, `config set`, `config unset`
  - Folder roles: output and search are independent assignments
  - Persistent diff tool, MRU depth, search dirs
  - `...` expansion in all path arguments (resolved at runtime, never stored)
  - `config show <...ref>` resolves any `...` reference for scripting
  - Windows case-insensitive path comparison (`_paths_equal`)
  - Environment variable expansion (`%USERPROFILE%`, `$HOME`)
  - Stray quote stripping for trailing-backslash shell escaping issues
  - Too-broad path detection (warns on home dir, drive roots)
- `--last` flag on `compare`, `organize`, and `diff` commands: auto-uses most recent extraction from MRU without copy-pasting paths
- MRU (Most Recently Used) extraction history: configurable depth (default 10), referenced as `...-1`, `...-2`, etc.
- Search dir composition: `-s` for explicit-only search, `-ss` for additive (includes saved dirs), `-nsp` to exclude parent folder
- `docs/fuzzy-matching.md`: threshold formula derivation, customization via environment variables, fitting script reference
- `docs/config.md`: full configuration reference covering folders, roles, MRU, settings, `...` notation, and search behavior
- Path shortening in display (`~\Desktop` instead of `C:\Users\...\Desktop`)
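
As an illustration of the hash caching idea listed above (mtime + size invalidation), here is a minimal sketch. The function name `cached_sha256`, the in-memory dict cache, and the choice of SHA-256 are all assumptions; the changelog does not say which hash or storage the project uses.

```python
import hashlib
import os


def cached_sha256(path, cache):
    """Return the file's SHA-256, rehashing only if mtime or size changed.

    `cache` is a plain dict mapping path -> (mtime_ns, size, digest).
    The hash algorithm and cache layout are illustrative assumptions.
    """
    path = os.fspath(path)
    st = os.stat(path)
    entry = cache.get(path)
    if entry and entry[0] == st.st_mtime_ns and entry[1] == st.st_size:
        return entry[2]  # cache hit: file looks unchanged
    with open(path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    cache[path] = (st.st_mtime_ns, st.st_size, digest)
    return digest
```

A repeat scan then costs one `stat` call per unchanged file instead of a full read.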
- Default output directory: `~/Desktop/notepad-cleanup/nc-TIMESTAMP` (was `~/Desktop/notepad-cleanup-TIMESTAMP`). Consolidates extractions into one folder
- Extract now auto-registers output parent as a search dir in folder registry
- Extract hints now show both `compare --last` and `organize --last` as next steps
- Help text updated across all commands to reflect new workflow: `extract -> compare -> organize`
- Config functions extracted from `dedup.py` into dedicated `config.py` module
- `--dry-run` flag on `extract`: preview what would be extracted without saving files
- `-h` alias for `--help` on all commands
- `-V` alias for `--version`
- Detailed help text with examples for all commands (`extract`, `organize`, `run`)
- Auto-versioning system (ported from wingather): `_version.py` as canonical version source, pre-commit/post-commit hooks auto-stamp branch, build number, date, and commit hash into version string
- Version scripts: `scripts/update-version.sh`, `scripts/install-hooks.sh`, `scripts/paths.sh`, `scripts/hooks/` (pre-commit, post-commit, pre-push)
- CHANGELOG.md
- GitHub Discussions enabled
- `setup.py` reads version from `_version.py` via `get_pip_version()` (PEP 440)
- `__init__.py` imports version from `_version.py` (single source of truth)
- README badges: added Discussions, Platform
- Phase 2 now correctly identifies newly loaded RichEditD2DPT controls by tracking handle snapshots before/after each `tab.select()`, instead of blindly reading the last handle (which often re-read an already-loaded tab)
- Increased Phase 2 tab switch delay from 0.08s to 0.15s for more reliable control loading
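
The before/after snapshot approach described above reduces to a set difference. The sketch below is deliberately decoupled from any UI library: `select_tab` and `list_handles` are hypothetical callables standing in for the tab selection call and whatever enumerates edit-control handles.

```python
import time


def select_and_find_new(select_tab, list_handles, settle_delay=0.15):
    """Return handles that appear only after the tab is selected.

    select_tab and list_handles are hypothetical callables supplied by
    the caller; settle_delay mirrors the 0.15s tab switch delay.
    """
    before = set(list_handles())
    select_tab()
    time.sleep(settle_delay)  # give the newly selected tab time to load
    # Anything not in the earlier snapshot was loaded by this selection.
    return [h for h in list_handles() if h not in before]
```

Comparing snapshots avoids assuming the new control is last in enumeration order.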
- README with features, installation, usage, and architecture docs
- GPL-3.0 license
- FUNDING.yml (GitHub, Ko-fi, Buy Me A Coffee)
- Issue templates for bug reports and feature requests
- CONTRIBUTING.md with development setup guide
- CI workflow switched to Windows runners (lint + build)
- CODEOWNERS updated to @djdarcy
- setup.py: added GPL-3.0 classifier, updated author
- Phase 2 duplicate extraction: global dedup across all windows using normalized text hashing (line endings + trailing whitespace stripped)
- UIA cross-window bleed: use `app.window(handle=)` instead of `app.top_window()` since all Notepad instances share one PID
- Phase 2 reads via WM_GETTEXT (same as Phase 1) instead of UIA `Document.window_text()`; eliminates hash mismatch between methods
- Ctrl+C during Claude CLI: use `time.sleep()` + `process.poll()` instead of `thread.join()` which swallows KeyboardInterrupt on Windows
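
The normalized text hashing used for global dedup could be approximated as below. This is a sketch under assumptions: the function name, SHA-256, and the stripping of trailing blank lines are not confirmed by the changelog, which only says line endings and trailing whitespace are stripped.

```python
import hashlib


def normalized_hash(text):
    """Hash text after unifying line endings and stripping trailing whitespace.

    Assumption: trailing blank lines are dropped as well, so the same tab
    read by two different methods hashes identically.
    """
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    normalized = "\n".join(line.rstrip() for line in lines).rstrip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

With this, `"a  \r\nb\r\n"` and `"a\nb"` produce the same digest, while text with different content still hashes differently.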
- Each tab preserved as individual file; removed quick-notes.md compaction
- Output folder renamed from `_reorganized/` to `organized/`
- `get_tab_count()` rewritten to use UIA `descendants(control_type="TabItem")` instead of NotepadTextBox child count, which only counted loaded tabs and prevented Phase 2 from triggering
- Phase 2 tab enumeration: use `descendants()` instead of `children()` chain since WinUI TabItems aren't direct children of the Tab control
- Organizer switched from inline content embedding to Claude Read tool approach: short prompt with `--allowedTools Read,Grep`, Claude reads files from disk
- Removed `build_file_listing()` and stdin piping (no longer needed)
- Added threaded stdout reader for Ctrl+C support during Claude CLI subprocess
- Two-phase extraction: silent WM_GETTEXT (Phase 1) + UIA tab switching (Phase 2)
- CLI with `extract`, `organize`, `run` commands (Click + Rich)
- AI organization via Claude Code CLI: returns JSON plan, Python executes file ops
- Manifest.json tracking for all extracted files
- Spike scripts in `tests/one-offs/` for UIA exploration