Measured on a real Android codebase (12,142 Kotlin files, 51,281 symbols):
| Operation | Time | Notes |
|---|---|---|
| Cold index (parse all files) | ~3s | First run, no cache |
| Warm index (load from cache) | ~90ms | 70 MB bincode file |
| Semantic tokens (CST-only) | 19ms | Default CLI mode |
| Semantic tokens (with resolve) | ~92ms | --resolve flag, loads index |
find/refs (rg fallback) |
~50ms | --fast mode |
find/refs (indexed) |
~100ms | Includes cache load |
System: Linux (CachyOS), Rust release build (opt-level=3, thin LTO).
The tokens command defaults to CST-only mode (Phase 1 classification). This is instant
(~20ms) and sufficient for syntax highlighting preview. Use --resolve to opt-in to
Phase 2 cross-file resolution, which loads the full workspace index.
For find/refs, the --fast mode uses ripgrep directly and skips cache loading — this
is often faster than the indexed path for simple name lookups.
Add the [profile.profiling] section to Cargo.toml (already present):
[profile.profiling]
inherits = "release"
debug = 2
strip = false
lto = falsecargo install samply
# Ensure perf events are accessible
echo '1' | sudo tee /proc/sys/kernel/perf_event_paranoid
sudo sysctl kernel.perf_event_mlock_kb=8192
# Build with debug symbols
cargo build --profile profiling
# Profile indexing
rm -rf ~/.cache/kmp-lsp/
cd /path/to/kotlin/project
samply record -- /path/to/target/profiling/kmp-lsp index
# Profile semantic tokens
samply record -- /path/to/target/profiling/kmp-lsp tokens --resolve src/BigFile.kt
# Save profile without opening browser
samply record --save-only -o profile.json -- ...
samply load profile.json # open laterSamply opens Firefox Profiler UI with interactive flamegraph, call tree, and timeline.
Based on profiling the indexing pipeline:
-
Cache deserialization — bincode deserialization of the 70 MB index dominates startup for CLI commands that need the full index. Mitigated by defaulting
tokensto CST-only. -
File discovery —
fdsubprocess to enumerate workspace files. Fast (~200ms for 12K files) but could be skipped with warm-manifest mode when cache is fresh. -
Tree-sitter parsing — parallelized across all cores. Individual file parse is ~0.3ms. Bulk parsing 12K files takes ~2.5s thanks to Rayon thread pool.
-
Semantic token generation — Phase 1 (CST walk) is O(nodes), ~10ms for 1700-line file. Phase 2 (index lookups) adds ~5ms per file when index is already in memory.
- zstd compression for the on-disk cache. The 70 MB bincode compresses to ~15 MB with zstd level 3. Disk I/O reduction should halve load time on spinning disks and NVMe alike (CPU cost of decompression is negligible vs I/O savings).
-
rkyv zero-copy deserialization — replace bincode with rkyv. Memory-map the cache file and access data in-place without heap allocation. "Loading" becomes ~0ms (just an mmap syscall). Per-access cost is minimal. This would make indexed CLI commands instant.
-
Per-file cache sharding — instead of one monolithic blob, store one cache file per source file (or per module). Only load what's needed. Benefits:
- Incremental updates: edit one file → rewrite one entry (not 70 MB)
- Partial loads:
tokens --resolveonly needs imports of the target file - Better filesystem caching: OS page cache can evict cold entries
-
Incremental indexing — on file-change notification, re-parse only the changed file and update the in-memory index. Avoids full workspace re-index on every save. Already partially implemented for
textDocument/didChange. -
Lazy resolution — for semantic tokens, resolve type references on-demand using a lightweight import graph rather than loading the full symbol table. Would allow Phase 2 without full index load.