Real-time speech translation with prosody preservation for resource-constrained devices
Status: 💡 Concept stage - Implementation timeline TBD
Goal: Privacy-preserving translation that runs locally on Chromebooks while preserving speaker emotion and emphasis
Current speech translation forces a choice among cloud processing, expensive hardware, or robotic output that loses emotion and emphasis.
Anumanchipalli, Oliveira, & Black (2012) demonstrated that cross-lingual prosody transfer is feasible:
- Showed prominence patterns are consistent across languages
- Developed word-level TILT parameterization for intonation transfer
- Validated on English↔Portuguese and English↔German
- Full CMU paper
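For orientation, here is a minimal sketch of the standard Tilt parameters (amplitude, duration, and tilt of a single intonational event) as they are usually defined in Taylor's Tilt intonation model. It is not taken from the paper above; the word-level variant used there may differ in detail, and the `TiltEvent` and `tilt_parameters` names are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass
class TiltEvent:
    """One rise-fall intonational event (hypothetical container for this sketch)."""
    rise_amp: float  # F0 rise amplitude in Hz
    fall_amp: float  # F0 fall amplitude in Hz (sign is ignored)
    rise_dur: float  # rise duration in seconds
    fall_dur: float  # fall duration in seconds


def tilt_parameters(ev: TiltEvent) -> dict:
    """Collapse a rise-fall event into amplitude, duration, and tilt.

    tilt is +1.0 for a pure rise, -1.0 for a pure fall, 0.0 for a symmetric event.
    """
    amp_total = abs(ev.rise_amp) + abs(ev.fall_amp)
    dur_total = ev.rise_dur + ev.fall_dur
    # Guard against empty events so the ratios below never divide by zero.
    tilt_amp = (abs(ev.rise_amp) - abs(ev.fall_amp)) / amp_total if amp_total else 0.0
    tilt_dur = (ev.rise_dur - ev.fall_dur) / dur_total if dur_total else 0.0
    return {
        "amplitude": amp_total,
        "duration": dur_total,
        "tilt": 0.5 * (tilt_amp + tilt_dur),
    }
```

A compact shape description like this is what makes transfer plausible: the same three numbers can be re-applied to a word (or syllable) in the target language regardless of its segmental content.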
This system extends their work with three innovations:
- Syllable-level prosody mapping (not word-level): better handles morphological differences between languages
- Cooperative pacing: an audio cue tells the speaker when to pause, creating a natural rhythm
- Local execution: runs on personal devices
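One way the syllable-level mapping could work, as a rough sketch only: take a per-syllable prominence contour for each source word, and stretch or squeeze it onto the syllables of the aligned target word. Everything here is an assumption for illustration: the function names, the use of linear interpolation, keeping the peak when a word shrinks to one syllable, and the idea that word alignments and syllable counts are already available from earlier stages.

```python
def resample_contour(values, n):
    """Linearly resample a per-syllable prominence contour to n syllables."""
    if n <= 0:
        return []
    if not values:
        return [0.0] * n
    if len(values) == 1:
        return [values[0]] * n
    if n == 1:
        # Assumption: keep the peak so emphasis survives when a word shrinks.
        return [max(values)]
    out = []
    for i in range(n):
        pos = i * (len(values) - 1) / (n - 1)  # fractional index into the source contour
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


def transfer_prosody(src_prominence, alignment, tgt_syllable_counts, neutral=0.0):
    """Map source-word contours onto target words, syllable by syllable.

    src_prominence: one list of per-syllable prominence values per source word
    alignment: (source_word_index, target_word_index) pairs
    tgt_syllable_counts: number of syllables in each target word
    Unaligned target words keep the neutral value.
    """
    result = [[neutral] * n for n in tgt_syllable_counts]
    for src_idx, tgt_idx in alignment:
        result[tgt_idx] = resample_contour(
            src_prominence[src_idx], tgt_syllable_counts[tgt_idx]
        )
    return result
```

For example, a two-syllable source word with contour `[0.0, 1.0]` aligned to a three-syllable target word yields `[0.0, 0.5, 1.0]`, so the rising emphasis is kept even though the word got longer. A word-level scheme would have only one value to assign to the whole target word.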
Possible technical approach: Whisper (STT) → Prosody extraction → NLLB (translation) → Syllable-level prosody transfer → TTS with prosody control → Lag monitoring
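The chain above could be wired together roughly as follows. This is a sketch under stated assumptions, not a working implementation: each stage is passed in as a plain callable so no real Whisper, NLLB, or TTS API is assumed, and the names `run_pipeline` and `LagMonitor` are hypothetical. The lag thresholds are placeholders; good values are still an open question.

```python
from dataclasses import dataclass


def run_pipeline(audio_chunk, stt, extract_prosody, translate, transfer, tts):
    """Run one audio chunk through the stages; every stage is a caller-supplied callable."""
    text = stt(audio_chunk)                         # e.g. a Whisper wrapper
    prosody = extract_prosody(audio_chunk, text)    # source-side prosody features
    translated = translate(text)                    # e.g. an NLLB wrapper
    target_prosody = transfer(prosody, text, translated)
    return tts(translated, target_prosody)          # prosody-controllable TTS


@dataclass
class LagMonitor:
    """Decides when to play the 'please pause' cue for cooperative pacing."""
    pause_threshold_s: float = 3.0   # placeholder value, not tuned
    resume_threshold_s: float = 1.0  # placeholder value, not tuned
    cue_active: bool = False

    def update(self, speech_end_s: float, playback_end_s: float) -> bool:
        """Return True while the speaker should be asked to pause.

        Lag is how far translated playback trails the end of the source speech.
        Two thresholds give hysteresis so the cue does not flicker on and off.
        """
        lag = playback_end_s - speech_end_s
        if not self.cue_active and lag > self.pause_threshold_s:
            self.cue_active = True
        elif self.cue_active and lag < self.resume_threshold_s:
            self.cue_active = False
        return self.cue_active
```

Keeping the stages as injected callables means any one of them can be swapped (a smaller STT model, a different TTS engine) without touching the pacing logic, which matters on low-end hardware where the right model sizes are not yet known.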
- Runs locally on personal devices
- Better preservation of speaker emotion/emphasis
- Open source (adaptable)
Want to build this first? Please do! Goal is impact, not credit.
Want to collaborate? Open an issue.
Interested in: Prosody-controllable TTS engines, syllabification for different languages, optimal lag thresholds, evaluation metrics for education.
Apache 2.0 - Free to use, can't be locked down by patents.
Core Philosophy: Technology should be a public good.