System-tray voice dictation for every desktop app. Press a hotkey, speak, and text appears at your cursor. No input method switching, no app switching, no friction. A Typeless/Wispr Flow-style experience with BYOK (bring-your-own-key) pricing.
| Product | Model | Price | Our Advantage |
|---|---|---|---|
| Wispr Flow | SaaS, closed | $10/mo | BYOK = near-zero cost, cross-platform |
| Typeless | SaaS, closed | $12-30/mo | BYOK, open, desktop-native |
| macOS Dictation | Built-in, free | Free | Multi-provider, LLM refinement, multilingual |
| Whisper.cpp CLI | Open source | Free | GUI, auto-paste, refinement, no terminal needed |
Positioning: Wispr Flow UX + BYOK model + cross-platform (macOS/Windows/Linux)
- VoxPen Android v1 shipped (prompts, API patterns proven)
- Rust toolchain installed (`rustup`)
- Tauri v2 CLI installed (`cargo install tauri-cli`)
- Node.js + pnpm for React frontend
- Tauri v2 project with React + TypeScript + Tailwind frontend
- Configure `tauri.conf.json`: app id `com.voxpen.desktop`, hidden settings window, tray icon
- Capabilities: global-shortcut, store, sql, autostart permissions
- All Rust dependencies in workspace `Cargo.toml` + `voxpen-core/Cargo.toml`
- `AppError` enum with `thiserror` (6 variants, serializable for Tauri)
- `PipelineState` enum (7 states matching Android `ImeUiState`)
- `Language` enum with `code()` and `prompt()` methods
- `RecordingMode` enum (HoldToRecord default, Toggle)
- `audio::encoder::pcm_to_wav()`: 44-byte RIFF header, pure Rust
- `pipeline::prompts::for_language()`: 4 prompts verbatim from Android
- API constants: timeouts, base URL matching Android
- Tray icon (placeholder, idle state)
- Tray menu: Settings / Quit
- Click Settings → show settings window
- App starts with hidden window (tray-only)
Created src-tauri/crates/voxpen-core/ as a separate workspace crate containing all business logic (types, encoder, prompts, API modules). This crate has no Tauri dependency and can be built/tested independently without Linux system libraries (libgtk, libwebkit, etc.). The Tauri app crate depends on voxpen-core and handles only framework glue (tray, hotkey, IPC commands).
- 41 unit tests passing
- clippy clean (`-D warnings`)
- Frontend builds (Vite + React + Tailwind)
- Full Tauri binary build requires Linux dev libraries (libglib2.0-dev, libgtk-3-dev, libwebkit2gtk-4.1-dev, libasound2-dev)
Empty Tauri app living in system tray, with core types (AppError, PipelineState, Language, RecordingMode) defined and tested.
This is the entire core loop. Get this working end-to-end first.
Note: `audio/encoder.rs` (`pcm_to_wav`) already implemented and tested in Phase 0. Phase 1 focuses on the remaining pipeline modules. All new modules go in `src-tauri/crates/voxpen-core/src/` (framework-independent, testable without Tauri).
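For reference, a 44-byte RIFF header encoder of the kind the note refers to could look roughly like the sketch below. It assumes the 16 kHz mono 16-bit format named for the recorder in this plan; the constants and the exact signature are illustrative and may differ from the real `pcm_to_wav()`.

```rust
// Assumed capture format (16 kHz, mono, 16-bit), taken from the recorder spec in this plan.
const SAMPLE_RATE: u32 = 16_000;
const CHANNELS: u16 = 1;
const BITS_PER_SAMPLE: u16 = 16;

/// Wrap raw 16-bit PCM samples in a canonical 44-byte RIFF/WAVE header.
pub fn pcm_to_wav(samples: &[i16]) -> Vec<u8> {
    let data_len = (samples.len() * 2) as u32;
    let byte_rate = SAMPLE_RATE * u32::from(CHANNELS) * u32::from(BITS_PER_SAMPLE) / 8;
    let block_align = CHANNELS * BITS_PER_SAMPLE / 8;

    let mut wav = Vec::with_capacity(44 + samples.len() * 2);
    wav.extend_from_slice(b"RIFF");
    wav.extend_from_slice(&(36 + data_len).to_le_bytes()); // total size minus 8
    wav.extend_from_slice(b"WAVE");
    wav.extend_from_slice(b"fmt ");
    wav.extend_from_slice(&16u32.to_le_bytes()); // fmt chunk size for PCM
    wav.extend_from_slice(&1u16.to_le_bytes()); // audio format 1 = PCM
    wav.extend_from_slice(&CHANNELS.to_le_bytes());
    wav.extend_from_slice(&SAMPLE_RATE.to_le_bytes());
    wav.extend_from_slice(&byte_rate.to_le_bytes());
    wav.extend_from_slice(&block_align.to_le_bytes());
    wav.extend_from_slice(&BITS_PER_SAMPLE.to_le_bytes());
    wav.extend_from_slice(b"data");
    wav.extend_from_slice(&data_len.to_le_bytes());
    // Samples are stored little-endian, as the WAV format requires.
    for sample in samples {
        wav.extend_from_slice(&sample.to_le_bytes());
    }
    wav
}
```

Because the function is pure (slice in, bytes out), it can be unit-tested without any audio device or Tauri dependency.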
File: voxpen-core/src/api/groq.rs
- `SttConfig` struct with api_key, model, language, response_format
- `WhisperResponse` struct for Groq API response
- `transcribe()` async fn with reqwest multipart POST, per-call Bearer auth
- HTTP error mapping: 401 → ApiKeyMissing, 413 → Transcription, 500 → Transcription
- 5 wiremock tests: success, 401, 413, 500, custom model
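The HTTP error mapping above can be expressed as a small pure function. The sketch below is an assumption about its shape: only two of the six `AppError` variants are shown, the function name `map_stt_status` is hypothetical, and the handling of status codes other than 401, 413, and 500 is a guess.

```rust
/// Reduced stand-in for the real `AppError` enum (which has six variants).
#[derive(Debug, PartialEq)]
pub enum AppError {
    ApiKeyMissing,
    Transcription(String),
}

/// Map an HTTP status from the STT endpoint to a pipeline result.
/// `map_stt_status` is a hypothetical name used for illustration.
pub fn map_stt_status(status: u16, body: &str) -> Result<(), AppError> {
    match status {
        200..=299 => Ok(()),
        // 401: the key was rejected, surfaced as a key problem.
        401 => Err(AppError::ApiKeyMissing),
        // 413: the uploaded audio was too large for the endpoint.
        413 => Err(AppError::Transcription("audio file too large".to_string())),
        // 500 and (by assumption) every other status become a transcription error.
        other => Err(AppError::Transcription(format!("HTTP {}: {}", other, body))),
    }
}
```

Keeping the mapping separate from the `reqwest` call makes each case testable without a mock server.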
File: voxpen-core/src/pipeline/transcribe.rs
- `transcribe()` composing `encoder::pcm_to_wav` + `groq::transcribe`
- Empty PCM validation → AppError::Audio
- 3 wiremock tests: encode+call, empty rejection, error propagation
File: voxpen-core/src/pipeline/controller.rs
- `PipelineController` with `tokio::sync::watch` for state broadcast
- `PipelineConfig` struct (API key, language, model)
- `on_start_recording()`: validates API key → emits Recording (or Error)
- `on_stop_recording(pcm)`: emits Processing → transcribe → Result (or Error)
- `current_state()`, `subscribe()`, `update_config()`, `reset()`
- 7 tests: idle start, recording transition, error on missing key, processing on stop, empty pcm, reset, config update
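The state transitions listed above can be sketched synchronously with the standard library alone. This is not the real controller: the actual one is async and broadcasts over `tokio::sync::watch`, whereas here a plain `Vec` of emitted states stands in for the channel, the transcription step is an injected closure, and the error strings are placeholders.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum PipelineState {
    Idle,
    Recording,
    Processing,
    Result(String),
    Error(String),
}

pub struct PipelineController<F>
where
    F: Fn(&[i16]) -> Result<String, String>,
{
    api_key: String,
    transcribe: F,
    /// Every emitted state, oldest first; stands in for the watch channel.
    emitted: Vec<PipelineState>,
}

impl<F> PipelineController<F>
where
    F: Fn(&[i16]) -> Result<String, String>,
{
    pub fn new(api_key: &str, transcribe: F) -> Self {
        Self {
            api_key: api_key.to_string(),
            transcribe,
            emitted: vec![PipelineState::Idle],
        }
    }

    fn emit(&mut self, state: PipelineState) {
        self.emitted.push(state);
    }

    pub fn current_state(&self) -> &PipelineState {
        self.emitted.last().expect("emitted always holds at least Idle")
    }

    /// Validate the API key, then move to Recording (or Error).
    pub fn on_start_recording(&mut self) {
        if self.api_key.trim().is_empty() {
            self.emit(PipelineState::Error("API key missing".to_string()));
        } else {
            self.emit(PipelineState::Recording);
        }
    }

    /// Emit Processing, run transcription, then emit Result (or Error).
    pub fn on_stop_recording(&mut self, pcm: &[i16]) {
        self.emit(PipelineState::Processing);
        let outcome = if pcm.is_empty() {
            Err("no audio captured".to_string())
        } else {
            (self.transcribe)(pcm)
        };
        match outcome {
            Ok(text) => self.emit(PipelineState::Result(text)),
            Err(message) => self.emit(PipelineState::Error(message)),
        }
    }

    pub fn reset(&mut self) {
        self.emit(PipelineState::Idle);
    }
}
```

Injecting the transcription step is what lets the state machine be tested without network access, which matches the trait-based design described elsewhere in this plan.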
File: voxpen-core/src/audio/recorder.rs + src-tauri/src/audio.rs
- `AudioRecorder` trait: start/stop/is_recording (mockall automock)
- `CpalRecorder` concrete implementation (16kHz mono i16, Arc buffer)
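As an illustration of the trait boundary, the sketch below defines an `AudioRecorder` trait with the three methods named above and a purely in-memory implementation. The method signatures and the `BufferRecorder` type are assumptions; the real `CpalRecorder` fills its shared buffer from a cpal input stream, which is not shown here.

```rust
use std::sync::{Arc, Mutex};

/// Hardware-facing boundary; return types are assumed for this sketch.
pub trait AudioRecorder {
    fn start(&mut self) -> Result<(), String>;
    fn stop(&mut self) -> Result<Vec<i16>, String>;
    fn is_recording(&self) -> bool;
}

/// Hypothetical in-memory recorder: a capture callback would push samples
/// into the shared buffer obtained from `sink()`.
pub struct BufferRecorder {
    buffer: Arc<Mutex<Vec<i16>>>,
    recording: bool,
}

impl BufferRecorder {
    pub fn new() -> Self {
        Self { buffer: Arc::new(Mutex::new(Vec::new())), recording: false }
    }

    /// Shared handle for whatever produces samples (an audio callback in practice).
    pub fn sink(&self) -> Arc<Mutex<Vec<i16>>> {
        Arc::clone(&self.buffer)
    }
}

impl AudioRecorder for BufferRecorder {
    fn start(&mut self) -> Result<(), String> {
        if self.recording {
            return Err("already recording".to_string());
        }
        // Discard samples left over from a previous session.
        self.buffer.lock().map_err(|e| e.to_string())?.clear();
        self.recording = true;
        Ok(())
    }

    fn stop(&mut self) -> Result<Vec<i16>, String> {
        if !self.recording {
            return Err("not recording".to_string());
        }
        self.recording = false;
        let mut buffer = self.buffer.lock().map_err(|e| e.to_string())?;
        // Hand the captured samples to the caller and leave the buffer empty.
        let samples = std::mem::take(&mut *buffer);
        Ok(samples)
    }

    fn is_recording(&self) -> bool {
        self.recording
    }
}
```

Code in `voxpen-core` that accepts `&mut dyn AudioRecorder` can then be exercised with a fake like this one, or with a generated mock, without touching audio hardware.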
File: voxpen-core/src/input/clipboard.rs + input/paste.rs + src-tauri/src/clipboard.rs + src-tauri/src/keyboard.rs
- `ClipboardManager` trait: get_text/set_text (mockall automock)
- `KeySimulator` trait: paste (mockall automock)
- `paste_text()` orchestration: save → write → paste → sleep(100ms) → restore
- 4 mock tests: save/restore, write+paste, graceful restore failure, paste failure propagation
- `ArboardClipboard` concrete implementation (arboard crate)
- `EnigoKeyboard` concrete implementation (Cmd+V on macOS, Ctrl+V on others)
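The save → write → paste → sleep → restore sequence could be orchestrated along these lines. The trait method signatures, the `FakeClipboard` and `FakeKeys` types, and the choice to restore the clipboard even when the paste keystroke fails are assumptions made for this sketch.

```rust
use std::thread::sleep;
use std::time::Duration;

pub trait ClipboardManager {
    fn get_text(&mut self) -> Result<String, String>;
    fn set_text(&mut self, text: &str) -> Result<(), String>;
}

pub trait KeySimulator {
    fn paste(&mut self) -> Result<(), String>;
}

pub fn paste_text(
    clipboard: &mut dyn ClipboardManager,
    keys: &mut dyn KeySimulator,
    text: &str,
) -> Result<(), String> {
    // 1. Save: an unreadable clipboard simply means there is nothing to restore.
    let saved = clipboard.get_text().ok();
    // 2. Write the transcription.
    clipboard.set_text(text)?;
    // 3. Simulate the paste shortcut; a failure is returned to the caller below.
    let pasted = keys.paste();
    // 4. Give the target app time to read the clipboard.
    if pasted.is_ok() {
        sleep(Duration::from_millis(100));
    }
    // 5. Restore; a failed restore is ignored (graceful degradation).
    if let Some(previous) = saved {
        let _ = clipboard.set_text(&previous);
    }
    pasted
}

/// In-memory fakes, for illustration only.
pub struct FakeClipboard {
    pub text: String,
}

impl ClipboardManager for FakeClipboard {
    fn get_text(&mut self) -> Result<String, String> {
        Ok(self.text.clone())
    }
    fn set_text(&mut self, text: &str) -> Result<(), String> {
        self.text = text.to_string();
        Ok(())
    }
}

#[derive(Default)]
pub struct FakeKeys {
    pub fail: bool,
    pub presses: u32,
}

impl KeySimulator for FakeKeys {
    fn paste(&mut self) -> Result<(), String> {
        if self.fail {
            return Err("paste failed".to_string());
        }
        self.presses += 1;
        Ok(())
    }
}
```

With the fakes, a test can check both that the original clipboard content survives and that a paste failure propagates as an error.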
Files: src-tauri/src/lib.rs, commands.rs, hotkey.rs, state.rs
- Register global hotkey in Tauri `setup()` via `tauri-plugin-global-shortcut`
- On hotkey press → `PipelineController::on_start_recording()` + start audio capture
- On hotkey release → stop capture → `PipelineController::on_stop_recording(pcm)` → auto-paste
- Emit `pipeline-state` Tauri events on each state change (background task with watch channel)
- Tauri command: `save_api_key(provider, key)` → Tauri store
- Tauri command: `get_settings()` / `save_settings()` → Tauri store
- Tauri command: `test_api_key(provider, key)` → silent WAV test transcription
- `GroqSttProvider` / `GroqLlmProvider` concrete impls reading settings at call time
- `AppState` shared state with `Arc<Mutex<>>` for thread-safe Tauri access
- Short recording guard (<0.5s → ignore, don't send to API)
- History commands wired to SQLite via `rusqlite` (replaced `tauri-plugin-sql`)
- Tray tooltip state changes (Ready/Recording/Processing/Done/Error)
- Hotkey debounce guard (AtomicBool prevents concurrent pipeline runs)
- Errors emitted to frontend overlay (not just stderr)
- API key read from Tauri store before env var fallback
- Settings sync to shared state on save/load
- History saved after each transcription (original + refined, with UUID, timestamp, duration)
- Tray tooltip state changes: Ready → Recording... → Processing... → Done → Ready
- Overlay window: secondary always-on-top frameless transparent window
- Overlay show/hide driven by pipeline state (visible during recording/processing, hidden on idle)
- Window routing: App.tsx detects window label and renders Settings vs Overlay
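The tooltip text and overlay visibility rules above amount to a small mapping from pipeline state to UI. The sketch below uses a simplified `UiState` enum invented for illustration. Treating Done and Error as visible is an assumption, on the basis that the overlay dismisses those states with its own auto-hide timers.

```rust
/// Simplified UI-facing state, hypothetical and reduced from `PipelineState`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum UiState {
    Idle,
    Recording,
    Processing,
    Done,
    Error,
}

/// Tray tooltip text for each state, using the labels from this plan.
pub fn tray_tooltip(state: UiState) -> &'static str {
    match state {
        UiState::Idle => "Ready",
        UiState::Recording => "Recording...",
        UiState::Processing => "Processing...",
        UiState::Done => "Done",
        UiState::Error => "Error",
    }
}

/// Whether the overlay window should be shown for a state.
/// Only Idle hides it; Done and Error stay visible by assumption.
pub fn overlay_visible(state: UiState) -> bool {
    !matches!(state, UiState::Idle)
}
```

Keeping this mapping as plain functions lets the Rust side drive both the tray and the overlay from a single state change.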
Full Tauri app compilation requires system libs (libgtk-3-dev, libwebkit2gtk-4.1-dev, libasound2-dev). The CI workflow validates full build. Local verification: voxpen-core tests (89 passing) + frontend build + clippy clean.
Working end-to-end: press hotkey → speak → text appears at cursor. This is a usable MVP.
- `api/groq.rs`: Chat completion endpoint (`chat_completion()`, `ChatConfig`)
  - Models: `openai/gpt-oss-120b` (default) / `openai/gpt-oss-20b`
  - Temperature: `0.3`, max_tokens: `2048`
  - 5 wiremock tests: success, 401, 500, default model, empty choices
- `pipeline/refine.rs`: Orchestrate text → LLM → refined text
  - Compose `prompts::for_language()` + `groq::chat_completion()`
  - 3 wiremock tests: success, empty rejection, error propagation
- `LlmProvider` trait + updated `PipelineController<S, L>`
  - After STT success: if refinement enabled, emit `Refining` → call LLM → emit `Refined`
  - If refinement fails: fall back to `Result` with original text (graceful degradation)
  - 5s timeout via `tokio::time::timeout`, falling back to raw text
  - 5 new tests: refining→refined, fallback on failure, skip when disabled, timeout, debug masking
- `pipeline/settings.rs`: `Settings` struct (Serialize/Deserialize/Default)
- Tabbed sidebar layout (General, Speech, Refinement, Appearance)
- GeneralSection: hotkey display, recording mode, auto-paste, launch at login
- SttSection: provider dropdown, API key (masked + save), language, model
- RefinementSection: enable toggle, provider, API key, model
- AppearanceSection: theme (system/light/dark), UI language
- `useSettings` hook with debounced save
- `src/lib/tauri.ts` invoke wrappers
- Settings struct with defaults (actual tray menu in Phase 1.6)
Full-featured voice dictation with refinement and configurable settings.
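The refinement fallback described above (use the LLM output if it arrives in time, otherwise keep the raw transcription) can be illustrated without an async runtime. Note the swap: this sketch uses a worker thread and `recv_timeout` from the standard library in place of `tokio::time::timeout`, and the function name `refine_or_fallback` is hypothetical.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `refine` on a worker thread and fall back to `raw` if it fails,
/// returns empty text, or does not finish within `timeout`.
pub fn refine_or_fallback<F>(raw: &str, timeout: Duration, refine: F) -> String
where
    F: FnOnce(String) -> Result<String, String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    let input = raw.to_string();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore that error.
        let _ = tx.send(refine(input));
    });

    match rx.recv_timeout(timeout) {
        Ok(Ok(refined)) if !refined.trim().is_empty() => refined,
        // Timeout, LLM error, or empty output: degrade to the raw text.
        _ => raw.to_string(),
    }
}
```

In the plan's terms, a caller would pass `Duration::from_secs(5)`. Either way the user always gets text, which is the point of the graceful-degradation rule.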
- Overlay React component with pipeline-state event listener
- States: Recording (red pulse), Processing (spinner), Done (green check, auto-hide 1.5s), Error (red X, auto-hide 3s)
- Semi-transparent background with backdrop-blur
- Tauri secondary window (always-on-top, frameless, transparent, show/hide from Rust)
- `history.rs`: `TranscriptionEntry` struct with `display_text()` method
- SQL schema constants (CREATE, INSERT, QUERY, SEARCH, DELETE, CLEANUP, COUNT)
- History UI (React): search, expand/collapse, copy, delete with confirmation
- History tab integrated into Settings sidebar
- Language badges (colored pills per language)
- SQLite integration via `rusqlite` with `HistoryDb` wrapper (query, search, insert, delete)
- `audio/chunker.rs`: WAV-aware chunking for files > 25MB
  - Validates RIFF/WAVE headers, splits PCM data, updates chunk headers
  - 6 tests: single chunk, multi-chunk, header preservation, size updates, error cases
- File transcription UI (deferred to post-v1.0)
Polished experience with visual feedback and history, plus the chunking backend for file transcription (the file transcription UI is deferred to post-v1.0).
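To make the WAV-aware chunking concrete, here is a minimal sketch of splitting an oversized file into pieces that each carry a valid header. It assumes a canonical 44-byte header with the `data` chunk at offset 36 and no trailing chunks. The function name `split_wav` and its error strings are hypothetical, and the real `audio/chunker.rs` may handle more layouts.

```rust
const HEADER_LEN: usize = 44;

/// Split a canonical WAV file into chunks of at most `max_chunk_bytes`,
/// each with its own header whose size fields are updated.
pub fn split_wav(wav: &[u8], max_chunk_bytes: usize) -> Result<Vec<Vec<u8>>, String> {
    if wav.len() < HEADER_LEN || &wav[0..4] != b"RIFF" || &wav[8..12] != b"WAVE" {
        return Err("not a RIFF/WAVE file".to_string());
    }
    if &wav[36..40] != b"data" {
        return Err("unsupported layout: data chunk not at offset 36".to_string());
    }
    if wav.len() <= max_chunk_bytes {
        return Ok(vec![wav.to_vec()]);
    }
    if max_chunk_bytes <= HEADER_LEN {
        return Err("chunk limit too small to hold a header".to_string());
    }

    // Bytes per audio frame; cutting on this boundary never splits a sample.
    let block_align = u16::from_le_bytes([wav[32], wav[33]]) as usize;
    if block_align == 0 {
        return Err("invalid block align".to_string());
    }
    let mut payload = max_chunk_bytes - HEADER_LEN;
    payload -= payload % block_align;
    if payload == 0 {
        return Err("chunk limit smaller than one audio frame".to_string());
    }

    let mut chunks = Vec::new();
    for part in wav[HEADER_LEN..].chunks(payload) {
        let mut chunk = Vec::with_capacity(HEADER_LEN + part.len());
        chunk.extend_from_slice(&wav[..HEADER_LEN]);
        // Patch the RIFF size (file size minus 8) and the data chunk size.
        chunk[4..8].copy_from_slice(&((36 + part.len()) as u32).to_le_bytes());
        chunk[40..44].copy_from_slice(&(part.len() as u32).to_le_bytes());
        chunk.extend_from_slice(part);
        chunks.push(chunk);
    }
    Ok(chunks)
}
```

Each returned chunk is a standalone WAV file, so it can be uploaded to the STT endpoint independently and the transcripts concatenated in order.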
- Clean, minimal settings window with sidebar navigation
- Smooth animations in overlay (CSS pulse, spin transitions)
- Dark mode support (follows system, manual override)
- `useTheme` hook: system/light/dark with media query listener
- App icons generated (all sizes, .icns, .ico), completed in Phase 5
- `react-i18next` + `i18next` setup
- `en.json` + `zh-TW.json` complete (all UI strings)
- All React components wired with `t()` translation calls
- Language switching in AppearanceSection with `i18n.changeLanguage()`
- Tray menu localized (deferred; English-only acceptable for v1.0 MVP)
- System notifications localized (deferred to post-v1.0)
- macOS: accessibility/mic permissions, menu bar, code signing, DMG, Apple Silicon + Intel
- Windows: global hotkey, paste simulation, system tray, NSIS, Win 10+11
- Linux: X11/Wayland, paste simulation, AppImage, Ubuntu 22.04+
- No internet → error emitted to overlay (network error from reqwest)
- Invalid API key → error emitted to overlay (ApiKeyMissing error)
- Very short recording (<0.5s) → skip and reset (8000 sample guard)
- Rapid hotkey presses → debounce via AtomicBool processing guard
- Microphone in use → handle gracefully (deferred — cpal returns error)
- Very long recording (>5 min) → auto-stop + warn (deferred)
- System sleep/wake → re-register hotkey (deferred)
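Two of the guards in this list are simple enough to sketch directly: the 8000-sample minimum (0.5 s at 16 kHz) and the `AtomicBool` processing guard. The names `is_too_short` and `ProcessingGuard` are hypothetical, and the RAII release on drop is one possible design rather than a description of the actual code.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

const SAMPLE_RATE_HZ: usize = 16_000;
/// Half a second of audio at 16 kHz, i.e. 8000 samples.
const MIN_SAMPLES: usize = SAMPLE_RATE_HZ / 2;

/// Recordings shorter than 0.5 s are skipped instead of being sent to the API.
pub fn is_too_short(pcm: &[i16]) -> bool {
    pcm.len() < MIN_SAMPLES
}

/// Holds the "pipeline busy" flag and clears it again when dropped.
pub struct ProcessingGuard<'a> {
    flag: &'a AtomicBool,
}

impl<'a> ProcessingGuard<'a> {
    /// Returns `None` if another pipeline run already holds the flag.
    pub fn try_acquire(flag: &'a AtomicBool) -> Option<Self> {
        flag.compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .ok()
            .map(|_| ProcessingGuard { flag })
    }
}

impl<'a> Drop for ProcessingGuard<'a> {
    fn drop(&mut self) {
        // Release the flag even if the pipeline returned early with an error.
        self.flag.store(false, Ordering::Release);
    }
}
```

A hotkey handler would call `try_acquire` first and return immediately on `None`, so rapid presses cannot start overlapping pipeline runs.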
Production-ready app tested on all three platforms.
- Generated `icon.icns` (macOS), `icon.ico` (Windows) via `tauri icon`
- All icon sizes: 32x32, 128x128, 128x128@2x, icon.png
- DMG with hardened runtime (macOS, min 10.15)
- NSIS installer with currentUser mode, WebView2 bootstrapper (Windows)
- AppImage + .deb (Linux)
- Updater artifacts enabled (`createUpdaterArtifacts: "v1Compatible"`)
- `tauri-plugin-updater` + `tauri-plugin-process` added (Rust + JS)
- Updater plugin registered in `lib.rs`
- Updater + process permissions in `desktop.json` capabilities
- `UpdateChecker` React component with download progress bar
- i18n strings for update UI (en + zh-TW)
- Updater signing key generation (first real release)
- GitHub repo URL in updater endpoint (placeholder set)
- `.github/workflows/ci.yml`: test core + build frontend on push/PR
- `.github/workflows/release.yml`: cross-platform build on tag push
  - macOS (arm64 + x64), Windows (x64), Linux (x64)
  - `tauri-apps/tauri-action` with Rust cache
  - Code signing env vars ready (secrets need to be set)
  - Draft releases with platform-specific asset descriptions
- Apple Developer account + code signing certificate
- Apple notarization credentials
- Windows code signing certificate (optional)
- `pnpm test:core`: run voxpen-core tests
- `pnpm lint:core`: run clippy on voxpen-core
- `pnpm tauri:build` / `pnpm tauri:dev` convenience scripts
- Product page with demo GIF/video
- Download links per platform
- Setup guide (get API key, install, configure)
- Homebrew cask formula (optional)
Distribution infrastructure ready. First release requires: signing keys, Apple Developer account, and GitHub repo configuration.
- User selects text in any app
- Press different hotkey (e.g., `Cmd+Shift+E`)
- Speak editing command: "翻成英文" ("translate into English") / "make it shorter" / "改成正式語氣" ("make the tone formal")
- App reads selected text (simulate `Cmd+C`)
- Send to LLM: selected text + voice command
- Replace selection with LLM output (simulate paste)
- Send audio chunks while recording (real-time STT)
- Display partial transcription in overlay
- Finalize on recording stop
- Integrate `whisper.cpp` via `whisper-rs` crate (Rust backend ready)
- Download model on first use (4 tiers: quick/balanced/quality/maximum)
- Model management commands (download, status, delete)
- Fallback when offline
- User choice: cloud (faster, needs internet) vs local (private, slower)
Status: Backend complete, UI temporarily removed from settings.
`local-whisper` is an opt-in Cargo feature (`--features local-whisper`), not in default build.
Windows build blocker: `whisper-rs-sys` (bindgen + whisper.cpp) fails to compile on Windows MSVC due to libclang struct layout mismatches. bindgen generates Linux glibc types (`_G_fpos_t`, `_IO_FILE`) or produces opaque structs (size=1) depending on environment configuration. Attempted solutions:
- `WHISPER_DONT_GENERATE_BINDINGS=1` → bundled bindings are Linux-only
- `BINDGEN_EXTRA_CLANG_ARGS=--target=x86_64-pc-windows-msvc` + MSVC include paths → still fails
- x64 Native Tools Command Prompt → missing scoop tools (cargo, git, cmake)
- Combined VS dev shell + scoop → still produces wrong struct sizes
Decision: Ship v1.0 without local whisper in the default build. Enable via `cargo build --features local-whisper` on macOS/Linux where it compiles cleanly. Revisit Windows support when `whisper-rs-sys` upstream fixes MSVC bindgen compatibility.
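One way the opt-in feature could surface in code is a backend list that only includes local Whisper when the crate is built with `--features local-whisper`. The `SttBackend` enum and both functions below are hypothetical; the sketch shows feature gating and the offline fallback idea, not the actual backend selection logic.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SttBackend {
    Cloud,
    LocalWhisper,
}

/// Backends compiled into this build. Local Whisper appears only when the
/// `local-whisper` Cargo feature is enabled.
pub fn available_backends() -> Vec<SttBackend> {
    let mut backends = vec![SttBackend::Cloud];
    if cfg!(feature = "local-whisper") {
        backends.push(SttBackend::LocalWhisper);
    }
    backends
}

/// Pick a backend from the user preference and connectivity.
/// Returns `None` when offline in a build without local Whisper.
pub fn choose_backend(prefer_local: bool, online: bool) -> Option<SttBackend> {
    let local = available_backends().contains(&SttBackend::LocalWhisper);
    match (prefer_local && local, online, local) {
        (true, _, _) => Some(SttBackend::LocalWhisper),
        (false, true, _) => Some(SttBackend::Cloud),
        // Offline fallback: use the local model if this build has one.
        (false, false, true) => Some(SttBackend::LocalWhisper),
        (false, false, false) => None,
    }
}
```

In a default build (for example on Windows) this reduces to cloud-only STT, which matches the decision above.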
- Add ko, es, fr, de, th, vi with refinement prompts
- Whisper STT works out-of-box, only prompts need tuning
- Same research as Android: 意傳科技 (Ithuan) API, 教育部 (Ministry of Education) ASR engine
- If API available, integrate as custom STT provider
- Shared prompt library (export/import JSON)
- Shared history (optional cloud sync, user-hosted)
- Testability: Core business logic (encoder, API client, state machine, prompts) can be tested without Linux system libraries (libgtk, libwebkit, etc.)
- Separation of concerns: Framework-agnostic code (pure Rust) vs framework glue (Tauri setup, IPC commands, device access)
- Faster iteration: `cargo test -p voxpen-core` runs in seconds, no Tauri build needed
- Trait boundaries: Hardware-dependent code (audio, clipboard, keys) uses traits in voxpen-core, with concrete impls in the Tauri crate
- Future portability: Core crate could be reused in CLI, other frameworks
- Bundle size: Tauri ~5-15MB vs Electron ~150MB+
- Memory usage: Tauri uses system webview vs Electron's bundled Chromium
- Rust backend: system-level operations (hotkey, paste simulation, audio) are natural in Rust
- Startup speed: Tauri apps launch near-instantly
- Security: Tauri's permission model is stricter than Electron
API calls contain secret keys. Running them in Rust means:
- Keys never enter the webview/JS context
- No CORS issues (Rust HTTP client is not browser-bound)
- Can be called without any window open (tray-only mode)
- Better error handling and retry logic in Rust
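Keeping keys on the Rust side also makes it easy to keep them out of logs. The sketch below shows one possible approach, a newtype whose `Debug` output is masked, in the spirit of the "debug masking" test mentioned for the pipeline controller. The `ApiKey` type and its methods are hypothetical, and the sample key is fake.

```rust
use std::fmt;

/// Wrapper that prevents the raw key from appearing in `{:?}` output.
pub struct ApiKey(String);

impl ApiKey {
    pub fn new(raw: &str) -> Self {
        ApiKey(raw.trim().to_string())
    }

    /// Explicit access for building the Authorization header.
    pub fn expose(&self) -> &str {
        &self.0
    }

    /// All but the last four characters replaced with `*`.
    pub fn masked(&self) -> String {
        let n = self.0.chars().count();
        if n <= 4 {
            return "*".repeat(n);
        }
        let tail: String = self.0.chars().skip(n - 4).collect();
        format!("{}{}", "*".repeat(n - 4), tail)
    }
}

impl fmt::Debug for ApiKey {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "ApiKey({})", self.masked())
    }
}
```

Any struct that derives `Debug` and holds an `ApiKey` then prints the masked form automatically.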
The overlap between Android IME and desktop tray app is minimal:
- Shared: API endpoint URLs, prompt text, language list
- Not shared: UI (Compose vs React), audio recording (AudioRecord vs cpal), system integration (IME vs hotkey+paste), architecture (MVVM vs Tauri commands)
Copying prompts and API patterns is simpler than maintaining a shared KMP module.
- Proven patterns: Android v1 shipped successfully with these patterns
- Consistency: Same team maintains both codebases — familiar mental model
- Portability: Same state machine, same API contracts, same error handling strategy
- Testing: Same TDD approach, same test naming, same coverage targets
Default is hold-to-dictate because:
- Matches muscle memory (like walkie-talkie)
- Natural endpoint — release = done
- Prevents accidental long recordings
- Faster for short dictations (most common use case)
Toggle mode available for accessibility and long dictations.
1. Read current clipboard → save to temp
2. Write transcription to clipboard
3. Simulate Cmd+V / Ctrl+V
4. Wait 100ms
5. Restore original clipboard from temp
Same as Typeless/Wispr Flow/1Password. Fragile on some Linux Wayland compositors but reliable on macOS and Windows.
| Risk | Impact | Mitigation |
|---|---|---|
| macOS Accessibility permission UX | High | Clear first-run permission flow |
| Paste simulation fails in some apps | Medium | Clipboard-only fallback + notification |
| Wayland paste issues (Linux) | Medium | Detect Wayland → clipboard-only mode |
| Apple notarization rejection | Medium | Follow guidelines strictly, no private APIs |
| Windows SmartScreen warning | Medium | Consider EV code signing certificate |
| cpal audio device issues | Low | Fallback to system default, clear error messages |
| Tauri v2 breaking changes | Low | Pin versions, follow release notes |
| whisper-rs Windows MSVC build | High | local-whisper opt-in feature, excluded from default build. Windows ships cloud-only STT. |
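For the Wayland mitigation in the table, session detection might look like the sketch below. It relies on the `XDG_SESSION_TYPE` and `WAYLAND_DISPLAY` environment variables, which desktop sessions commonly set but which are not guaranteed to be present. The `PasteMode` enum and function names are hypothetical.

```rust
#[derive(Debug, PartialEq)]
pub enum PasteMode {
    /// Simulate Cmd+V / Ctrl+V after writing the clipboard.
    Simulated,
    /// Only write the clipboard and let the user paste manually.
    ClipboardOnly,
}

/// Pure check so the decision can be tested without touching the environment.
pub fn is_wayland(session_type: Option<&str>, wayland_display: Option<&str>) -> bool {
    session_type.map_or(false, |s| s.eq_ignore_ascii_case("wayland"))
        || wayland_display.map_or(false, |d| !d.is_empty())
}

pub fn paste_mode(session_type: Option<&str>, wayland_display: Option<&str>) -> PasteMode {
    if is_wayland(session_type, wayland_display) {
        PasteMode::ClipboardOnly
    } else {
        PasteMode::Simulated
    }
}

/// Reads the real environment; intended to be called once at startup.
pub fn paste_mode_from_env() -> PasteMode {
    let session = std::env::var("XDG_SESSION_TYPE").ok();
    let display = std::env::var("WAYLAND_DISPLAY").ok();
    paste_mode(session.as_deref(), display.as_deref())
}
```

On macOS and Windows neither variable is normally set, so the function falls through to simulated paste.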
| Layer | Status | Details |
|---|---|---|
| Core types & state machine | ✅ 89 tests | AppError, PipelineState, Language, RecordingMode |
| Groq STT API client | ✅ tested | Multipart POST, error mapping, wiremock |
| LLM refinement pipeline | ✅ tested | Chat completion, prompts, 5s timeout fallback |
| Transcription pipeline | ✅ tested | PCM→WAV→STT orchestration |
| Audio encoder | ✅ tested | 44-byte RIFF header, pure Rust |
| Audio chunker | ✅ tested | WAV-aware splitting for >25MB files |
| Pipeline controller | ✅ tested | State machine with SttProvider + LlmProvider traits |
| Auto-paste logic | ✅ tested | save→write→paste→restore with trait mocks |
| Hardware traits | ✅ defined | AudioRecorder, ClipboardManager, KeySimulator |
| Settings UI | ✅ built | 5-tab sidebar: General, Speech, Refinement, Appearance, History |
| Floating overlay | ✅ built | Recording/Processing/Done/Error states with animations |
| History UI | ✅ built | Search, expand, copy, delete with language badges |
| i18n | ✅ built | en + zh-TW, all components wired |
| Theme system | ✅ built | System/light/dark with media query listener |
| Auto-update UI | ✅ built | UpdateChecker with progress bar, bilingual |
| Bundle config | ✅ configured | DMG, NSIS, AppImage, .deb, updater artifacts |
| CI/CD | ✅ configured | GitHub Actions: CI + cross-platform release |
| App icons | ✅ generated | All sizes, .icns, .ico |
| Tauri integration layer | ✅ written | Hotkey, commands, state, event emission |
| CpalRecorder | ✅ written | 16kHz mono, Arc buffer |
| ArboardClipboard | ✅ written | arboard get/set_text |
| EnigoKeyboard | ✅ written | Cmd+V / Ctrl+V simulation |
| IPC commands | ✅ written | get/save_settings, save/test_api_key |
| Global hotkey handler | ✅ written | Press→record, release→transcribe→paste |
| Pipeline state emission | ✅ written | watch channel → Tauri events → React |
| Missing Item | Blocked By | Effort |
|---|---|---|
| Platform compilation + smoke test | System libs / CI | ~2-3 days |
| Platform testing (macOS/Win/Linux) | Platform access | ~1 week |
| Total | | ~1.5 weeks |
Phase 4.3 — Platform Testing (~1 week after compilation):
- macOS: accessibility/mic permissions, menu bar, code signing, DMG
- Windows: global hotkey, paste simulation, system tray, NSIS
- Linux: X11/Wayland, paste simulation, AppImage, Ubuntu 22.04+
Phase 5 — External Setup (non-code):
- Apple Developer account + signing certificate
- Updater signing key generation
- GitHub repo URL configuration
We're at ~95% of v1.0. All business logic is proven (89 core tests), all UI is complete, all Tauri integration is wired:
- SQLite history: real insert/query/search/delete via `rusqlite`
- Overlay: secondary always-on-top frameless window with show/hide
- Tray: dynamic tooltip updates (Ready/Recording/Processing/Done/Error)
- Hotkey: debounced with AtomicBool, errors emitted to frontend
- API keys: read from Tauri store first, env var fallback for dev
- Settings: synced to shared state on save/load
- History: entries saved automatically after each transcription
What remains is non-code work:
- Install system libs and compile the full Tauri app
- Test on real hardware (macOS, Windows, Linux)
- Apple Developer account + signing for distribution
- Total: ~1-1.5 weeks of platform testing
- End-to-end latency < 3s (hotkey release → text pasted) on Groq
- Works in top 10 desktop apps: VS Code, Chrome, Slack, Notion, Word, LINE, Discord, Terminal, Notes, Mail
- Zero-crash in 1-week dogfooding
- macOS + Windows builds stable
- Auto-paste success rate > 95% on macOS and Windows
- Traditional Chinese (繁中) + English UI complete
- Settings persist across restarts (Tauri store + shared state sync)
- History searchable (SQLite LIKE search via rusqlite)
- Test coverage ≥ 80% for Rust pipeline modules (89 tests, all core modules covered)
| Phase | Duration | Status | Deliverable |
|---|---|---|---|
| Phase 0: Scaffold | 2 days | ✅ COMPLETE | Tray app + core types defined and tested |
| Phase 1.1-1.5: Core logic | 2 weeks | ✅ COMPLETE | STT, LLM, pipeline, state machine, paste (89 tests) |
| Phase 1.6: Tauri integration | ~1 week | ✅ COMPLETE | Hotkey, commands, state, events, history, API keys |
| Phase 1.7: Visual feedback | ~0.5 week | ✅ COMPLETE | Tray tooltip, overlay window, show/hide |
| Phase 2: Refinement + Settings | 2 weeks | ✅ COMPLETE | LLM pipeline + full settings UI |
| Phase 3: Overlay + History | 1 week | ✅ COMPLETE | Overlay + history UI + SQLite + chunker |
| Phase 4.1-4.2: i18n + Theme | 1 week | ✅ COMPLETE | Bilingual UI + dark mode |
| Phase 4.3: Platform testing | ~1 week | ❌ PENDING | Needs system libs for compilation |
| Phase 4.4: Edge cases | ~3 days | ✅ PARTIAL | Debounce, error emission, short guard done |
| Phase 5: Distribution infra | 1 week | ✅ COMPLETE | CI/CD, bundle config, auto-update, icons |
| Total to v1.0 | ~1-1.5 weeks remaining | ~95% done | Code-complete, needs platform testing |