Portable real-time speech-to-text node powered by Whisper and NVIDIA Jetson. Captures speech via ReSpeaker 4 Mic Array, transcribes on-device using the LocalAgreement streaming algorithm, and streams results through Hypha RPC.
Privacy Guarantee: Your voice is never recorded or stored. All processing happens on-device. Read our Privacy Policy.
It's live! Give it a try:
Open Live Transcript Viewer · SSE Stream · Health · Logs
| Component | Details |
|---|---|
| Compute | Jetson AGX Orin 64GB (JetPack 6.x, CUDA 12.x) |
| Microphone | ReSpeaker 4 Mic Array v2.0 (UAC1.0, 16 kHz, beamformed ch0) |
| Speaker (test) | Dell AC511 USB SoundBar or HDMI/DisplayPort monitor speakers |
| Power | USB-C PD power bank |
| Enclosure | 3D printed shell (ventilation + antenna ports) |
Mic auto-detection: ReSpeaker is tried first; falls back to HIK 1080P Camera if not found. Override with --mic "name-substring".
Speaker auto-detection: Tests use Dell AC511 USB SoundBar first, falling back to HDMI/DisplayPort monitor speakers if not found.
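
The microphone fallback order described above amounts to a substring match over the available device names. The following is only an illustrative sketch: `pick_input_device`, `DEFAULT_PREFERENCE`, and the sample device names are hypothetical and are not taken from the project's `audio/capture.py`.

```python
from typing import Optional, Sequence

# Preference order described above: ReSpeaker first, then the HIK camera mic.
DEFAULT_PREFERENCE = ("ReSpeaker", "HIK 1080P Camera")


def pick_input_device(
    device_names: Sequence[str],
    override: Optional[str] = None,
    preference: Sequence[str] = DEFAULT_PREFERENCE,
) -> Optional[int]:
    """Return the index of the first device matching the override or the preference list."""
    wanted = [override] if override else list(preference)
    for needle in wanted:
        for index, name in enumerate(device_names):
            if needle.lower() in name.lower():
                return index
    return None  # nothing matched; the caller decides how to fail


# Hypothetical device names, for illustration only.
devices = ["HDA Intel PCH", "HIK 1080P Camera: USB Audio", "ReSpeaker 4 Mic Array (UAC1.0)"]
print(pick_input_device(devices))                  # ReSpeaker is preferred when present
print(pick_input_device(devices[:2]))              # falls back to the HIK camera mic
print(pick_input_device(devices, override="HIK"))  # mirrors a --mic "HIK" style override
```

Matching case-insensitively on a substring keeps the override forgiving about the exact device string reported by the audio backend.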
- Continuous streaming transcription using whisper_streaming LocalAgreement: no word-boundary errors from chunk splitting
- On-device Whisper inference via PyTorch (GPU, works offline)
- Real-time transcript streaming via Hypha ASGI service (SSE)
- Live transcript viewer: browser-based HTML page at `/`
- ReSpeaker 4 Mic Array: hardware beamforming + 4-mic noise suppression via ch0
- Direction annotation: ReSpeaker USB DOA angle tags each utterance with the speaker's direction (e.g. 45°); note that speaker grouping is best-effort only (see known limitations)
- LED control: automatically turns off ReSpeaker's RGB LED ring on startup (no more annoying lights!)
- Auto-reconnect to Hypha on network loss (exponential backoff)
- systemd service with watchdog (`WatchdogSec=180`) and auto-restart
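
The LocalAgreement idea mentioned in the feature list can be illustrated with its two-hypothesis variant: a word is committed only once two consecutive hypotheses agree on it. The sketch below is a simplification under stated assumptions. It works on plain word lists and assumes every new hypothesis still begins with the already committed words, whereas the vendored `whisper_online.py` works with timestamps and trims its audio buffer. The class and function names are hypothetical.

```python
from typing import List


def common_prefix(a: List[str], b: List[str]) -> List[str]:
    """Longest shared prefix of two word lists."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


class LocalAgreement2:
    """Commit words once two consecutive hypotheses agree on them."""

    def __init__(self) -> None:
        self.committed: List[str] = []
        self.previous: List[str] = []

    def update(self, hypothesis: List[str]) -> List[str]:
        """Feed the newest full hypothesis; return the newly committed words."""
        # Compare only the parts that lie beyond the committed prefix.
        new_tail = hypothesis[len(self.committed):]
        old_tail = self.previous[len(self.committed):]
        agreed = common_prefix(old_tail, new_tail)
        self.committed.extend(agreed)
        self.previous = hypothesis
        return agreed


policy = LocalAgreement2()
print(policy.update(["hello", "wor"]))                   # nothing to agree with yet
print(policy.update(["hello", "world", "how"]))          # "hello" is now stable
print(policy.update(["hello", "world", "how", "are"]))   # "world how" becomes stable
```

Because unstable trailing words are held back until the next pass confirms them, a word cut in half by a chunk boundary is never emitted in its partial form.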
    hypha-whisper-node Architecture

    +--------------+     +--------------+     +------------------+     +--------------------+
    | ReSpeaker    |     | MicCapture   |     | StreamingEngine  |     | Hypha Client       |
    | 4-Mic Array  |---->| (PyAudio)    |---->| (Whisper ASR)    |---->|                    |
    | (USB Audio)  |     |              |     |                  |     |                    |
    +--------------+     +--------------+     +------------------+     +--------------------+
           |                    |                      |                         |
           v                    v                      v                         v
    +--------------+     +--------------+     +------------------+     +--------------------+
    | XMOS         |     | 6-channel    |     | LocalAgreement   |     | ASGI Service       |
    | XVF-3000 DSP |     | audio:       |     | Algorithm        |     | Endpoints:         |
    |              |     | * ch0=ASR    |     |                  |     |   /                |
    | * Beamform   |     | * ch1-4=DOA  |     | * VAD (Silero)   |     |   /transcript_feed |
    | * DOA        |     |              |     | * Buffer 3-5s    |     |   /health          |
    | * AEC/NS     |     |              |     | * Commit text    |     |   /logs            |
    +--------------+     +--------------+     +------------------+     +--------------------+
                                                                                 |
                                                                                 v
                                                                          +--------------+
                                                                          | Browser      |
                                                                          | (SSE)        |
                                                                          +--------------+
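
As the architecture diagram indicates, the capture stage receives six interleaved channels and only channel 0 (the beamformed signal) goes to ASR. A minimal sketch of that de-interleaving step follows, assuming signed 16-bit native-endian PCM as a PyAudio stream opened with `paInt16` would deliver. `extract_channel` is a hypothetical helper, not the project's code.

```python
from array import array

CHANNELS = 6  # the ReSpeaker capture stream carries 6 interleaved channels


def extract_channel(frames: bytes, channel: int = 0, channels: int = CHANNELS) -> array:
    """Pull one channel out of interleaved signed 16-bit PCM."""
    samples = array("h")        # 'h' = signed 16-bit integers
    samples.frombytes(frames)   # raises ValueError on a truncated sample
    return samples[channel::channels]


# Two frames of fake audio: frame 1 holds samples 1..6, frame 2 holds 7..12.
raw = array("h", range(1, 13)).tobytes()
print(list(extract_channel(raw)))             # channel 0 of each frame
print(list(extract_channel(raw, channel=1)))  # channel 1 of each frame
```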
    Data Flow (Privacy-First)

    Speech --> Memory --> Whisper (local GPU) --> Text --> SSE --> Discard
               (temp)     (no cloud)               |
                                                   v
                                              +----------+
                                              | DOA      |
                                              | Buffer   |
                                              | (timing  |
                                              |  fix)    |
                                              +----------+

    No audio stored | No transcript history | No cloud processing
    DOA Time-Alignment (Key Fix)

    Problem: LocalAgreement buffers 3-5s, so text commits AFTER the audio was captured

    WRONG (old):   DOA at commit time --> misattributes if the speaker changes

    CORRECT (new):
      1. DOA polled from firmware every 50ms --> store as time intervals
      2. When text commits, get its actual time range [begin, end]
      3. Query: which DOA angle had the longest overlap with [begin, end]?

      Speaker 1: 0-4.9s @ 90°   --+
                                  +--> Dominant = 90° (4.9s > 0.1s)
      Speaker 2: 4.9-5s @ 341°  --+
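
The three steps of the time-alignment fix can be sketched in a few lines. This is a minimal illustration that assumes the DOA readings have already been merged into `(start, end, angle)` intervals. `dominant_angle` and the `Interval` alias are hypothetical names, and a linear scan stands in for whatever interval structure `doa_reader.py` actually uses.

```python
from collections import defaultdict
from typing import List, Optional, Tuple

Interval = Tuple[float, float, int]  # (start_s, end_s, angle_deg)


def dominant_angle(intervals: List[Interval], begin: float, end: float) -> Optional[int]:
    """Return the angle with the longest total overlap with [begin, end]."""
    overlap_by_angle = defaultdict(float)
    for start, stop, angle in intervals:
        overlap = min(stop, end) - max(start, begin)
        if overlap > 0:
            overlap_by_angle[angle] += overlap
    if not overlap_by_angle:
        return None  # no DOA data covers this time range
    return max(overlap_by_angle, key=overlap_by_angle.get)


# The example from the box above: 4.9 s at 90 degrees, 0.1 s at 341 degrees.
intervals = [(0.0, 4.9, 90), (4.9, 5.0, 341)]
print(dominant_angle(intervals, 0.0, 5.0))  # 90
```

Summing the overlap per angle (rather than taking the single longest interval) keeps the result stable when one speaker's readings are split into several short intervals.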
    hypha-whisper-node/
    ├── main.py                        # Entry point, orchestrates all components
    ├── audio/
    │   ├── capture.py                 # PyAudio microphone capture
    │   ├── doa_reader.py              # ReSpeaker USB DOA + IntervalBuffer
    │   └── led_control.py             # Turn off ReSpeaker LEDs on startup
    ├── transcribe/
    │   ├── streaming_engine.py        # Whisper + LocalAgreement + DOA alignment
    │   ├── speaker_registry.py        # Direction-based speaker labeling
    │   └── whisper_online.py          # Vendored from whisper_streaming
    ├── rpc/
    │   └── hypha_client.py            # Hypha RPC ASGI service (SSE endpoints)
    └── tests/
        └── test_hardware_loopback.py  # Acoustic WER + DOA verification
| Component | Technology | Purpose |
|---|---|---|
| Audio Capture | PyAudio | ReSpeaker 6-channel (ch0=ASR, ch1-4=raw) |
| DOA Estimation | XMOS XVF-3000 | On-chip direction detection via USB |
| ASR Engine | Whisper + LocalAgreement | Streaming transcription with buffering |
| VAD | Silero VAD | Voice activity detection |
| DOA Alignment | Duration-weighted overlap | Correct attribution during speaker changes |
| Streaming | Hypha RPC + SSE | Real-time text delivery to browsers |
| Watchdog | systemd + sd_notify | Auto-restart on hang or crash |
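
The watchdog row relies on the systemd notification protocol: the service sends short datagrams to the Unix socket named in the `NOTIFY_SOCKET` environment variable. The sketch below shows that protocol with the standard library only. It is an assumption about how such a ping can be sent, not a copy of the project's implementation, and the `sd_notify` helper here is a local function rather than an imported library call.

```python
import os
import socket


def sd_notify(message: str) -> bool:
    """Send a notification datagram to systemd; do nothing when not run by systemd."""
    address = os.environ.get("NOTIFY_SOCKET")
    if not address:
        return False
    if address.startswith("@"):
        # A leading '@' denotes a Linux abstract-namespace socket.
        address = "\0" + address[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), address)
    return True


# Typical use inside the main loop (interval chosen well below WatchdogSec=180):
#   sd_notify("READY=1")      once, after the model has loaded
#   sd_notify("WATCHDOG=1")   periodically while the pipeline is healthy
```

If the pings stop for longer than `WatchdogSec`, systemd treats the service as hung and restarts it according to its restart policy.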
hypha-whisper-node is built with privacy as a foundational principle:
| Privacy Feature | Status |
|---|---|
| Audio Storage | Never saved: audio is processed in real time and immediately discarded |
| Transcript Storage | Never persisted: transcripts exist only in memory |
| Cloud Transcription | None: all processing is on-device (Jetson GPU) |
| Telemetry | None: no analytics or usage data collected |
| Open Source | 100%: fully auditable codebase |
| Offline Mode | Yes: works without any network connection |
    Microphone --> Memory --> Whisper (local) --> SSE Stream --> Discard
                     |               |                |
                (temporary)     (no cloud)       (live only)
- No audio files are ever written to disk
- No transcript history is stored: when the service restarts, everything is gone
- No voice data is sent to external servers for transcription
- Optional Hypha streaming only sends text (never audio) to your configured endpoint
Run completely offline:

    python3 main.py --server ""

Read the full Privacy Policy.
| Model | Avg latency | Load time |
|---|---|---|
| tiny.en | 0.19 s | 6 s |
| base.en | 0.40 s | 4 s |
| small.en (default) | 0.92 s | 26 s |
    git clone https://github.com/reef-imaging/hypha-whisper-node
    cd hypha-whisper-node
    sudo ./setup.sh

`setup.sh` installs system packages, PyTorch (NVIDIA JP6.1 wheel), all Python dependencies, and creates `/etc/hypha-whisper/config.env`.

numpy pin: `numpy==1.26.4` is pinned in `requirements.txt`. Do not upgrade it: `whisper-timestamped` pulls in numpy 2.x, which breaks PyTorch's ABI on Jetson.
Edit the configuration file:

    sudo nano /etc/hypha-whisper/config.env

and set the Hypha connection values:

    HYPHA_SERVER=https://hypha.aicell.io/
    HYPHA_WORKSPACE=my-workspace
    HYPHA_WORKSPACE_TOKEN=my-token

Install the systemd service:

    # Option 1: Auto-install with setup script (recommended)
    sudo ./setup.sh --install-service

    # Option 2: Manual install
    sudo cp deploy/hypha-whisper.service /etc/systemd/system/
    sudo systemctl daemon-reload
    sudo systemctl enable --now hypha-whisper

All commands below also start/stop/restart the watchdog automatically.
| Task | Command |
|---|---|
| Start | sudo systemctl start hypha-whisper |
| Stop | sudo systemctl stop hypha-whisper |
| Restart | sudo systemctl restart hypha-whisper |
| Status | systemctl status hypha-whisper |
| Watchdog status | systemctl status hypha-whisper-watchdog |
| Live logs | journalctl -u hypha-whisper -f |
| Watchdog logs | journalctl -u hypha-whisper-watchdog -f |
| Last 100 lines | journalctl -u hypha-whisper -n 100 |
| Disable autostart | sudo systemctl disable hypha-whisper hypha-whisper-watchdog |
Logs look like:

    2026-03-09T12:55:19 INFO [MicCapture] Found 'ReSpeaker 4 Mic Array...' at device index 24 (capture ch=6, use ch=0)
    2026-03-09T12:55:25 INFO [StreamingEngine] Ready
    2026-03-09T12:55:31 INFO [hypha] Connected to https://hypha.aicell.io (workspace: reef-imaging)
    2026-03-09T12:55:54 INFO [transcript] Transcript sent to 1 client(s)

Privacy Note: Transcript text is intentionally NOT logged. Only metadata (timestamps, client counts, connection events) appears in logs.
If the Hypha server drops, the service reconnects automatically (exponential backoff, max 60 s).
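
A reconnect schedule of this kind can be sketched as a generator of delays. Only the 60 s ceiling comes from the description above, read here as a cap on the wait between attempts. The 1 s starting delay, the doubling factor, and the name `backoff_delays` are assumptions made for illustration.

```python
import itertools
from typing import Iterator


def backoff_delays(base: float = 1.0, factor: float = 2.0, cap: float = 60.0) -> Iterator[float]:
    """Yield reconnect delays that grow geometrically and stop growing at `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


print(list(itertools.islice(backoff_delays(), 8)))
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

A reconnect loop would sleep for the next delay after each failed attempt and create a fresh generator once a connection succeeds, so the next outage starts again from the short delay.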
Once running, the service exposes these endpoints via Hypha:
| Endpoint | Description |
|---|---|
| `GET /` | Live transcript viewer: open in any browser |
| `GET /transcript_feed` | SSE stream: one `data: <json>` event per committed phrase |
| `GET /health` | JSON: `{"status":"ok","model":"small.en","uptime_seconds":123}` |
| `GET /logs?tail=N` | SSE stream of all Python log records; `tail=N` replays the last N lines first |
Live deployment (reef-imaging workspace):
| URL | Description |
|---|---|
| hypha.aicell.io/reef-imaging/apps/hypha-whisper/ | Live transcript viewer |
| hypha.aicell.io/reef-imaging/apps/hypha-whisper/transcript_feed | SSE transcript stream |
| hypha.aicell.io/reef-imaging/apps/hypha-whisper/health | Health check |
| hypha.aicell.io/reef-imaging/apps/hypha-whisper/logs?tail=100 | Live log stream |
Full URL pattern:
https://<HYPHA_SERVER>/<WORKSPACE>/apps/hypha-whisper/
https://<HYPHA_SERVER>/<WORKSPACE>/apps/hypha-whisper/transcript_feed
https://<HYPHA_SERVER>/<WORKSPACE>/apps/hypha-whisper/health
https://<HYPHA_SERVER>/<WORKSPACE>/apps/hypha-whisper/logs?tail=100
Open https://<HYPHA_SERVER>/<WORKSPACE>/apps/hypha-whisper/ in a browser. The page connects automatically to transcript_feed via EventSource, accumulates text, and auto-scrolls. A Clear button resets the display. The connection indicator shows green when live and retries automatically on disconnect.
Each transcript segment is tagged with a coloured direction badge (e.g. 45°) showing the DOA angle when the speech was detected. Consecutive segments from the same direction are grouped into one line.
Fixed: DOA attribution. Previously, speaker angles could be misattributed when multiple people spoke during LocalAgreement's 3-5 s buffering period. The fix uses duration-weighted overlap: each transcript segment is tagged with the DOA angle that had the longest overlap with its actual time range (an approach adapted from WhisperX's interval-tree speaker assignment).
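
The viewer's grouping of consecutive segments from the same direction can be illustrated with `itertools.groupby`. The real viewer runs in the browser, so this Python version is only a sketch of the logic. It assumes events shaped like the transcript feed's JSON objects (`text` and `speaker` fields), and `group_by_direction` is a hypothetical name.

```python
from itertools import groupby
from typing import Dict, Iterable, List, Tuple


def group_by_direction(events: Iterable[Dict]) -> List[Tuple[str, str]]:
    """Merge runs of consecutive events that share a speaker label into one line."""
    lines = []
    for speaker, run in groupby(events, key=lambda e: e.get("speaker", "")):
        lines.append((speaker, " ".join(e["text"] for e in run)))
    return lines


events = [
    {"text": "Hello everyone.", "speaker": "45°", "angle": 45},
    {"text": "Welcome back.", "speaker": "45°", "angle": 45},
    {"text": "Thanks.", "speaker": "270°", "angle": 270},
    {"text": "Let's begin.", "speaker": "45°", "angle": 45},
]
for speaker, text in group_by_direction(events):
    print(f"[{speaker}] {text}")
```

Only adjacent events are merged, so a speaker who returns after someone else has spoken starts a new line rather than being appended to an earlier one.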
Each `data:` event is a JSON object:

    {"text": "Hello everyone.", "speaker": "45°", "angle": 45}

| Field | Type | Description |
|---|---|---|
| text | string | Committed transcript phrase |
| speaker | string | Speaker direction label, e.g. "45°" (empty if DOA unavailable) |
| angle | int \| null | Raw DOA angle in degrees, or null |
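Consume the feed from Python:

    import httpx

    url = "https://hypha.aicell.io/reef-imaging/apps/hypha-whisper/transcript_feed"

    # timeout=None: an SSE stream can stay silent for longer than httpx's default timeout
    with httpx.stream("GET", url, timeout=None) as r:
        for line in r.iter_lines():
            if line.startswith("data: "):
                print(line[6:])  # JSON payload after the "data: " prefix

The `/logs` stream is designed for AI agents and automated monitoring. Each SSE event is a JSON object: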

    {"ts": 1741694400.123, "level": "INFO", "logger": "transcribe.streaming_engine", "msg": "12:00:00 INFO ... - Engine ready"}

| Field | Description |
|---|---|
| ts | Unix timestamp (float) |
| level | DEBUG / INFO / WARNING / ERROR / CRITICAL |
| logger | Python logger name (e.g. rpc.hypha_client) |
| msg | Fully formatted log line |

Query parameter `tail=N` replays the last N buffered records (up to 2000) before streaming live, which is useful for catching up after connecting:

    import json

    import httpx

    url = "https://hypha.aicell.io/reef-imaging/apps/hypha-whisper/logs"

    # timeout=None: keep the long-lived SSE connection open between log records
    with httpx.stream("GET", url, params={"tail": 100}, timeout=None) as r:
        for line in r.iter_lines():
            if line.startswith("data: "):
                record = json.loads(line[6:])
                print(f"[{record['level']}] {record['msg']}")

Run manually with explicit options:

    python3 main.py \
      --server https://hypha.aicell.io/ \
      --workspace my-workspace \
      --token my-token \
      --model small.en \
      --backend whisper-timestamped

Override microphone:

    python3 main.py --mic "ReSpeaker"   # force ReSpeaker
    python3 main.py --mic "HIK"         # force HIK camera mic

Offline mode (transcribe to stdout, no Hypha):

    python3 main.py --server ""

Run the unit tests, skipping hardware, integration, and slow tests:

    pip install -r requirements-dev.txt
    pytest tests/ -m "not hardware and not integration and not slow"

The hardware loopback tests play pre-recorded reference audio (tests/test-audio-male.wav) through the Dell AC511 speaker, record via ReSpeaker, transcribe, and measure Word Error Rate.

    # One-time: allow passwordless sudo for service management during tests
    # Replace <username> with your actual username
    echo "<username> ALL=(ALL) NOPASSWD: /bin/systemctl start hypha-whisper, /bin/systemctl stop hypha-whisper" \
      | sudo tee /etc/sudoers.d/hypha-whisper-tests

    # Run all hardware tests (auto stops/restarts hypha-whisper service)
    pytest tests/test_hardware_loopback.py -m hardware -v

    # Run a specific test
    pytest tests/test_hardware_loopback.py -m hardware -k wer
    pytest tests/test_hardware_loopback.py -m hardware -k "rms or playback"

The `suspend_service` pytest fixture automatically stops hypha-whisper at the start of the test session (to release the mic) and restarts it when done. It runs a background keeper thread that re-stops the service every 3 s to prevent `Restart=always` from reclaiming the microphone mid-test.
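
The keeper-thread idea can be sketched as a small context manager. This is not the project's fixture. `ServiceKeeper` and the injected `stop_service` callable are hypothetical, and in a real fixture that callable would presumably shell out to `sudo systemctl stop hypha-whisper`.

```python
import threading
from typing import Callable


class ServiceKeeper:
    """Call stop_service() immediately, then every `interval` seconds until released."""

    def __init__(self, stop_service: Callable[[], None], interval: float = 3.0) -> None:
        self._stop_service = stop_service
        self._interval = interval
        self._released = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        while True:
            self._stop_service()
            # wait() returns True as soon as __exit__ sets the event.
            if self._released.wait(self._interval):
                return

    def __enter__(self) -> "ServiceKeeper":
        self._thread.start()
        return self

    def __exit__(self, *exc_info) -> None:
        self._released.set()
        self._thread.join()  # no further stop calls happen after this point
```

A session-scoped fixture could wrap its `yield` in `with ServiceKeeper(stop_fn):` and start the service again after the block exits. Waiting on an `Event` instead of sleeping lets teardown finish immediately rather than after a full interval.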
| Test | What it checks |
|---|---|
| test_speaker_playback_only | Speaker plays without error (USB SoundBar or HDMI/DP) |
| test_mic_capture_rms | ReSpeaker picks up speaker audio (RMS > 0.001) |
| test_acoustic_loopback_wer | Full pipeline WER < 30% against reference transcript |
| test_speaker_identification | Left vs right channel → 2 distinct angle labels |
| test_speaker_stability_under_variation | Same direction, varied audio → same angle label |
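
For reference, Word Error Rate is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below computes it with a standard dynamic-programming Levenshtein distance. It lowercases and splits on whitespace only. Any punctuation normalization the project's test applies is not reproduced here, and `word_error_rate` is a hypothetical name.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # previous[j] = edit distance between the reference words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(substitution, previous[j] + 1, current[j - 1] + 1))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("the quick brown fox", "the quick fox"))  # 0.25 (one deletion)
```

With this definition, the `WER < 30%` threshold in the table corresponds to `word_error_rate(...) < 0.30`.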
- Live conference captioning
- AI agent voice interfaces
- Edge speech processing nodes
- Lab automation voice control
| Project | Reference | Usage |
|---|---|---|
| OpenAI Whisper | GitHub • Paper | ASR backbone for speech recognition |
| whisper_streaming | GitHub | LocalAgreement streaming algorithm |
| WhisperX | GitHub • Bain, M., et al. (2023) | Duration-weighted speaker assignment (Section 3.1, interval tree approach) |
| ReSpeaker Mic Array | Wiki • GitHub | Hardware DOA via XMOS XVF-3000 |
| Hypha RPC | PyPI • Docs | RPC framework for bioimaging |
| FastAPI | Docs | ASGI service framework |
- Bain, M., et al. (2023). WhisperX: Time-Accurate Speech Transcription of Long-Form Audio. Interspeech 2023.
- Radford, A., et al. (2023). Robust Speech Recognition via Large-Scale Weak Supervision. ICML 2023.
  - Whisper architecture and training methodology.
  - Paper
- Plátek, O., et al. (2023). Streaming ASR for Long-Form Audio. UFAL MFF UK.
  - LocalAgreement algorithm for streaming transcription.
  - GitHub
- XMOS XVF-3000 (Datasheet): voice processor with on-chip DOA, AEC, beamforming
- ReSpeaker 4-Mic Array v2.0 (Hardware Overview): USB microphone array with 4x MP34DT01TR-M MEMS mics
Including AI assistants: Claude and Kimi contribute via co-authored commits.
MIT
