A reusable Copilot ASR benchmark framework for Azure Speech-to-Text services. It measures Word Error Rate (WER), Sentence Error Rate (SER), command recognition accuracy, and latency (LBL, UPL) across Azure Speech Fast Transcription, Speech SDK realtime recognition, LLM enhanced modes, and multiple languages.
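This is an Azure Speech ASR benchmark tool aimed at Copilot, AI voice agent, and smart voice-command scenarios. It evaluates speech recognition accuracy, real-time transcription, command recognition, error types, and latency.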
In a recent full validation run, the benchmark completed 604 test cases across 5 recognition modes without service invocation failures. The results showed Azure Speech was stable and usable for this English home-command scenario, with high command-level recognition accuracy in both fast_llm and realtime modes.
Comes with a bundled 30-command smart speaker testset in five locales (en-GB, de-DE, es-ES, fr-FR, it-IT). Generate audio with one command, run the benchmark, and get detailed markdown reports.
This project is designed for teams building Copilot-style voice experiences, voice agents, smart device commands, car assistants, and speech-driven product workflows. It helps answer practical ASR questions:
- Can Azure Speech reliably recognize short voice commands?
- Which recognition mode works best for command-and-control scenarios?
- Where do errors come from: ASR quality, text normalization, reference labels, or pronunciation variants?
- How do `fast_default`, `fast_llm`, `fast_mai`, `realtime`, and `realtime_refine` compare?
- What are the recurring issues in numbers, word forms, homophones, and command phrase normalization?
The benchmark has been used to analyze realistic command recognition behavior, including:
- Number and formatting mismatches, such as `one` versus `1`, or `ten` versus `10`
- Near-homophone and phrase errors, such as `filter` versus `future`, `fan` versus `phone`/`fast`, and `auto` versus `autumn`
- Command phrase variations, such as `start` versus `star`
- Reference text and audio mismatches that can distort accuracy statistics
- Product-specific command phrases that may need domain adaptation, such as `turn on`, `filter`, `turbo`, `oscillation`, `dehumidification`, `firefly light effect`, and `running lights effect`
These notes make the benchmark useful beyond a raw WER score: it can separate model behavior, dataset quality, reference normalization, and product vocabulary issues.
git clone https://github.com/shawnq-msft/test-copilot-asr-benchmarking.git
cd test-copilot-asr-benchmarking
pip install -r requirements.txt
The repository was renamed from `test-copliot-asr-benchmarking` to `test-copilot-asr-benchmarking` to fix the Copilot spelling. Existing GitHub links to the old URL continue to redirect to this repository.
copy .env.example .env # Windows
# cp .env.example .env # macOS/Linux
Edit `.env` and fill in your Azure Speech resource details:
AZURE_SPEECH_KEY=your_key_here
AZURE_SPEECH_REGION=eastus
AZURE_SPEECH_ENDPOINT=https://eastus.api.cognitive.microsoft.com
For `fast_llm` and `fast_mai` (LLM/MAI preview services), you may need a separate Speech resource. If so, add `LLM_AZURE_SPEECH_KEY`, `LLM_AZURE_SPEECH_REGION`, and `LLM_AZURE_SPEECH_ENDPOINT`.
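The following is a minimal sketch of how these variables might be resolved, assuming the values from `.env` are already present in the process environment and that an unset `LLM_`-prefixed variable falls back to its base counterpart. `SpeechCredentials` and `load_credentials` are hypothetical names for illustration and are not the project's `benchmark/config.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SpeechCredentials:
    key: str
    region: str
    endpoint: str


def load_credentials(env: Optional[Mapping[str, str]] = None,
                     prefix: str = "") -> SpeechCredentials:
    """Read AZURE_SPEECH_* values, preferring `<prefix>AZURE_SPEECH_*` when set."""
    env = os.environ if env is None else env
    values = {}
    for field in ("KEY", "REGION", "ENDPOINT"):
        base_name = f"AZURE_SPEECH_{field}"
        # Assumption: a missing or empty prefixed variable falls back to the base one.
        value = env.get(prefix + base_name) or env.get(base_name)
        if not value:
            raise KeyError(f"{base_name} is not set; check your .env file")
        values[field.lower()] = value
    return SpeechCredentials(**values)
```

Under these assumptions, `load_credentials()` returns the main resource and `load_credentials(prefix="LLM_")` returns the preview resource when one is configured.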
The testset reference texts are included. Generate the WAV files using Azure TTS:
# All 5 locales (takes ~3 minutes)
python scripts/generate_testset.py
# Just English to get started
python scripts/generate_testset.py --locale en-GB
Test 5 samples × 2 services × 1 locale to verify everything works (~1-2 minutes):
python run_full.py --languages en-GB --max-per-dataset 5 --no-pace --services fast_default realtime
Open the generated report:
results/asr-bench_en-GB_<timestamp>_report.md
results/asr-bench_en-GB_<timestamp>_error_analysis.md
| Service | Description |
|---|---|
| `fast_default` | Azure Fast Transcription (REST, api-version 2024-11-15). Locale-locked. |
| `fast_llm` | Fast Transcription + LLM enhanced mode (preview). No locale set; the language is auto-detected. May hallucinate or confuse languages. |
| `fast_mai` | Fast Transcription + mai-transcribe-1 model (preview). Requires LLM Speech preview enabled. |
| `realtime` | Speech SDK continuous recognition (WebSocket, word-level timestamps). |
| `realtime_refine` | Realtime + Post-Stream Refinement (MRS preview, requires SDK ≥ 1.49.0). |
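As a rough illustration of what "locale-locked" means for `fast_default`, the sketch below assembles the URL, header, and JSON definition for a Fast Transcription REST call with api-version 2024-11-15. It only builds the request pieces and sends nothing. `build_fast_transcription_request` is a hypothetical helper, not the code in `benchmark/asr_fast.py`, and the preview modes may need a different definition that is not shown here.

```python
import json
from urllib.parse import urlencode


def build_fast_transcription_request(endpoint, key, locale,
                                     api_version="2024-11-15"):
    """Return (url, headers, definition_json) for a locale-locked request."""
    query = urlencode({"api-version": api_version})
    url = f"{endpoint.rstrip('/')}/speechtotext/transcriptions:transcribe?{query}"
    headers = {"Ocp-Apim-Subscription-Key": key}
    # Pinning a single locale is what makes the request locale-locked.
    definition = {"locales": [locale]}
    return url, headers, json.dumps(definition)
```

The audio file and this JSON string would then be posted as the `audio` and `definition` multipart form fields.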
The testdata/ directory contains 30 smart speaker commands per locale:
| Locale | Language | Example commands |
|---|---|---|
| en-GB | English (UK) | "Set a timer for five minutes", "Turn off the living room lights" |
| de-DE | German | "Stell einen Timer für fünf Minuten", "Garagentor öffnen" |
| es-ES | Spanish | "Pon un temporizador de cinco minutos", "Buenas noches" |
| fr-FR | French | "Mets un minuteur de cinq minutes", "Bonne nuit" |
| it-IT | Italian | "Imposta un timer di cinque minuti", "Buona notte" |
WAV files are generated from the trans.txt reference files using Azure TTS
(voice: en-GB-SoniaNeural, de-DE-KatjaNeural, etc.).
# One locale
python run_full.py --languages en-GB
# All locales (takes 20–60 min depending on network)
python run_full.py
# Specific services
python run_full.py --languages en-GB --services fast_default fast_mai realtime
If a run was interrupted, resume it without re-testing completed samples:
python run_full.py --languages en-GB --resume results/asr-bench_en-GB_<timestamp>.csv
Bring your own audio files in the same structure:
python run_full.py --testdata-dir /path/to/myaudio --output-dir /path/to/output --tag myproject
Your audio directory should contain `<locale>/<dataset-name>/` folders with WAV files and an optional `trans.txt`.
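To check a custom audio directory before a run, a small script along these lines can list what it contains. It assumes the `<locale>/<dataset-name>/` layout described above and lowercase `.wav` extensions. `discover_datasets` is a hypothetical helper and may not match how `benchmark/datasets_loader.py` walks the tree.

```python
from pathlib import Path


def discover_datasets(root):
    """Yield (locale, dataset_name, wav_paths, has_transcript) per dataset folder."""
    root = Path(root)
    for locale_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for dataset_dir in sorted(p for p in locale_dir.iterdir() if p.is_dir()):
            wavs = sorted(dataset_dir.glob("*.wav"))
            if not wavs:
                continue  # Skip folders that contain no audio
            has_transcript = (dataset_dir / "trans.txt").exists()
            yield locale_dir.name, dataset_dir.name, wavs, has_transcript


if __name__ == "__main__":
    for locale, name, wavs, has_ref in discover_datasets("testdata"):
        print(f"{locale}/{name}: {len(wavs)} WAV files, trans.txt present: {has_ref}")
```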
--languages Locales to test (default: all in config)
--max-per-dataset Samples per dataset (default: 30)
--workers Thread pool size (default: 4)
--no-pace Skip real-time pacing (faster, but latency unrealistic)
--services Services to benchmark (default: all 5)
--resume Path to partial CSV to resume from
--tag Output filename prefix (default: asr-bench)
--seed Random seed for reproducible sampling (default: 42)
--testdata-dir Custom audio data root (default: testdata/)
--output-dir Where to write results (default: results/)
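For readers who want to wrap or script the CLI, the documented flags and defaults map onto `argparse` roughly as follows. This is a sketch reconstructed from the option list above, not the parser in `run_full.py`. Using `None` to mean "all in config" or "all 5" is an assumption.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented run_full.py options."""
    parser = argparse.ArgumentParser(prog="run_full.py")
    # None stands in for "all locales in config" / "all 5 services" (assumption).
    parser.add_argument("--languages", nargs="+", default=None)
    parser.add_argument("--services", nargs="+", default=None)
    parser.add_argument("--max-per-dataset", type=int, default=30)
    parser.add_argument("--workers", type=int, default=4)
    parser.add_argument("--no-pace", action="store_true")
    parser.add_argument("--resume", default=None)
    parser.add_argument("--tag", default="asr-bench")
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--testdata-dir", default="testdata/")
    parser.add_argument("--output-dir", default="results/")
    return parser
```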
For each run <tag>_<locale>_<timestamp>:
| File | Description |
|---|---|
| `results/<run>.csv` | Per-sample, per-service metrics (WER, SER, INS/DEL/SUB, latency) |
| `results/<run>_report.md` | Aggregate WER/latency table with endpoint and dataset details |
| `results/<run>_error_analysis.md` | Best/worst/hallucinations with [▶] audio playback links |
| `results/<run>_words.jsonl` | Per-word timestamps for realtime samples (for latency audit) |
| `results/<run>_justification.md` | Human prose interpretation (written manually with the asr-justify skill) |
| `results/<run>.csv.strict` | Pre-normalization backup (created after running `loose_match.py`) |
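The per-sample CSV can also be summarized outside the bundled reports. The sketch below averages a WER column per service. The column names `service` and `wer` are assumptions made for illustration, so check the header of your own CSV and pass the real names.

```python
import csv
from collections import defaultdict


def mean_wer_by_service(csv_path, service_col="service", wer_col="wer"):
    """Average a per-sample WER column for each service in a results CSV."""
    totals = defaultdict(lambda: [0.0, 0])  # service -> [sum, count]
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            value = row.get(wer_col)
            if not value:
                continue  # Skip rows without a WER value
            totals[row[service_col]][0] += float(value)
            totals[row[service_col]][1] += 1
    return {service: total / count for service, (total, count) in totals.items()}
```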
Three latency metrics are computed for each sample:
- First Latency — how quickly the first partial result appears after the user starts speaking (realtime only; REST has no partial results)
- LBL (Last-final Beyond Last-chunk) — server flush time relative to the last audio chunk sent; can be negative when the server runs ahead of audio I/O
- UPL (User-Perceived Latency) — how late the final transcript arrives after the user stops speaking; anchored to the realtime SDK's word-end timestamp for fair cross-service comparison
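The three definitions above reduce to simple timestamp differences. The sketch below assumes all timestamps are in seconds on one shared clock. `SampleTimings` and its field names are hypothetical and are not taken from the project's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleTimings:
    speech_start: float      # user starts speaking
    speech_end: float        # last word ends (realtime SDK word-end timestamp)
    last_chunk_sent: float   # last audio chunk handed to the service
    final_received: float    # final transcript arrives
    first_partial: Optional[float] = None  # None for REST (no partial results)


def latency_metrics(t: SampleTimings) -> dict:
    """Compute First Latency, LBL, and UPL as timestamp differences."""
    first = None if t.first_partial is None else t.first_partial - t.speech_start
    return {
        "first_latency": first,
        # Negative when the final result arrives before the last chunk is sent.
        "lbl": t.final_received - t.last_chunk_sent,
        "upl": t.final_received - t.speech_end,
    }
```

For example, a final result at 2.25 s with the last chunk sent at 2.5 s and speech ending at 2.0 s gives an LBL of -0.25 s and a UPL of 0.25 s.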
WER is computed after NFKC + lowercase + punctuation-strip normalization. An additional loose-match pass handles systematic false positives:
# Apply loose matching after a benchmark run
python -X utf8 scripts/loose_match.py results/asr-bench_en-GB_<timestamp>.csv
# Regenerate reports after normalization
python -X utf8 -c "
from pathlib import Path
from benchmark.report import build_report
from scripts.error_analysis import main as ea_main
for csv_path in sorted(Path('results').glob('asr-bench_*.csv')):
    if '.strict' in csv_path.name: continue
    build_report(csv_path, csv_path.with_name(csv_path.stem + '_report.md'))
    ea_main(csv_path, csv_path.with_name(csv_path.stem + '_error_analysis.md'))
"
Loose matching catches: compound-word splits (`stummschalten` ↔ `stumm schalten`), alphanumeric spacing (`2D` ↔ `2 D`), and degree/percent symbol variants.
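To make the normalization and loose-match ideas concrete, here is a minimal sketch of NFKC + lowercase + punctuation-strip normalization, a word-level WER, and a spacing-insensitive comparison. It is an illustration only and not the code in `benchmark/metrics.py` or `scripts/loose_match.py`. Deleting punctuation rather than replacing it with a space is an assumption, and the degree/percent symbol handling is not covered.

```python
import re
import unicodedata


def normalize(text):
    """NFKC-normalize, lowercase, drop punctuation, and split into words."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", "", text)  # Assumption: punctuation is deleted
    return text.split()


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate of the hypothesis against the reference."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def loose_equal(reference, hypothesis):
    """Treat two texts as equal when they differ only in word spacing."""
    return "".join(normalize(reference)) == "".join(normalize(hypothesis))
```

With these definitions, `wer("Stummschalten", "stumm schalten")` is 2.0 under strict scoring, while `loose_equal` reports the pair as a match. This is the kind of systematic false positive the loose-match pass is meant to remove.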
This project is configured as a Copilot Agent (.github/copilot-instructions.md).
Copilot can run each step for you or guide you through the full pipeline.
Use the prompt files in .github/prompts/ — invoke them from the Copilot Chat panel:
| Step | Prompt file | What it does |
|---|---|---|
| 1 | `01-setup.prompt.md` | Install dependencies, configure `.env`, validate credentials |
| 2 | `02-generate-testset.prompt.md` | Synthesize WAV files with Azure TTS |
| 3 | `03-run-benchmark.prompt.md` | Run `run_full.py` with your chosen options |
| 4 | `04-loose-match.prompt.md` | Normalize WER for compound words and symbol variants |
| 5 | `05-analyze.prompt.md` | Regenerate error analysis from existing CSV |
| 6 | `06-justify.prompt.md` | Write human-readable interpretation (uses asr-justify skill) |
The .github/skills/asr-justify/SKILL.md defines a 9-step procedure for writing
a prose interpretation of benchmark results — covering where WER is misleading,
fast_llm hallucinations, genuine service disagreements, latency QA, and recommendations.
- Create a directory: `testdata/<locale>/<dataset-name>/`
- Add `trans.txt` (one reference command per line, or tab-separated `filename.wav<TAB>reference`)
- Either:
  - Place existing WAV files in the same directory
  - Run `python scripts/generate_testset.py --locale <locale>` to synthesize them
- Run: `python run_full.py --languages <locale>`
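The two `trans.txt` layouts mentioned above can be read with a small parser like the one below. It is a sketch for validating your own files. `parse_transcript_lines` is a hypothetical helper, and how plain lines are paired with WAV files is left open because the project's own loader decides that.

```python
def parse_transcript_lines(lines):
    """Return (filename_or_None, reference) pairs from trans.txt lines."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Ignore blank lines
        if "\t" in line:
            # Tab-separated form: filename.wav<TAB>reference
            filename, reference = line.split("\t", 1)
            entries.append((filename.strip(), reference.strip()))
        else:
            # Plain form: one reference command per line, no filename given
            entries.append((None, line))
    return entries
```

A typical call is `parse_transcript_lines(open(path, encoding="utf-8"))`, which makes it easy to spot empty references or malformed rows before synthesizing audio.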
.
├── .env.example # Credential template
├── .github/
│ ├── copilot-instructions.md # Copilot agent persona and capabilities
│ ├── prompts/ # Step-by-step runnable prompt files
│ │ ├── 01-setup.prompt.md
│ │ ├── 02-generate-testset.prompt.md
│ │ ├── 03-run-benchmark.prompt.md
│ │ ├── 04-loose-match.prompt.md
│ │ ├── 05-analyze.prompt.md
│ │ └── 06-justify.prompt.md
│ └── skills/asr-justify/
│ └── SKILL.md # Human result interpretation skill
├── benchmark/
│ ├── config.py # Credentials, service list, data classes
│ ├── runner.py # Benchmark orchestrator (threading, resume)
│ ├── report.py # Aggregate metrics → markdown report
│ ├── metrics.py # WER/SER computation
│ ├── datasets_loader.py # Audio loading from testdata/
│ ├── audio_utils.py # PCM/WAV conversion, resampling
│ ├── ping.py # TCP ping + IP geolocation for report metadata
│ ├── boundary_fix.py # Speech boundary heuristics for realtime
│ ├── asr_fast.py # Azure Fast Transcription (REST)
│ ├── asr_realtime.py # Azure Speech SDK continuous recognition
│ ├── asr_llm_speech.py # fast_llm and fast_mai (LLM/MAI preview)
│ └── asr_whisper.py # Whisper v3 stub (not yet implemented)
├── scripts/
│ ├── generate_testset.py # Azure TTS synthesis from trans.txt
│ ├── error_analysis.py # Per-sample error breakdown
│ └── loose_match.py # Automatic WER normalization
├── testdata/ # Reference transcriptions (WAVs git-ignored)
│ ├── en-GB/smart-commands/trans.txt
│ ├── de-DE/smart-commands/trans.txt
│ ├── es-ES/smart-commands/trans.txt
│ ├── fr-FR/smart-commands/trans.txt
│ └── it-IT/smart-commands/trans.txt
├── results/ # Benchmark outputs (git-ignored)
├── run_full.py # Main entry point
└── requirements.txt
- `fast_llm` caveat: this service has no locale set and relies on language auto-detection. It may produce text in the wrong language, especially for non-English audio. This is intentional: it is included specifically to benchmark the auto-detection behavior.
- TTS-synthesized audio: the bundled testset uses Azure TTS, which produces clean, consistent audio. WER will be lower than on real human recordings. Use your own recordings for more realistic results.
- Latency is network-dependent: absolute UPL/LBL numbers depend heavily on your distance to the Azure region. Use relative comparisons between services, not absolute values.
- Short utterances amplify WER: smart speaker commands are 3–10 words. A single word error gives 10–33% WER. Consider SER (Sentence Error Rate) alongside WER for a fuller picture.
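The arithmetic behind the last point is easy to reproduce. The sketch below takes hypothetical per-utterance word counts and error counts (not benchmark results) and computes corpus-level WER and SER, using the common definition of SER as the share of utterances with at least one word error.

```python
def corpus_wer_and_ser(samples):
    """samples: list of (reference_word_count, word_errors), one per utterance."""
    total_words = sum(words for words, _ in samples)
    total_errors = sum(errors for _, errors in samples)
    wrong_sentences = sum(1 for _, errors in samples if errors > 0)
    return total_errors / total_words, wrong_sentences / len(samples)


# Illustrative numbers: 30 five-word commands, 3 of which have one wrong word.
samples = [(5, 0)] * 27 + [(5, 1)] * 3
corpus_wer, corpus_ser = corpus_wer_and_ser(samples)
print(f"WER {corpus_wer:.0%}, SER {corpus_ser:.0%}")  # WER 2%, SER 10%
```

A single three-word command with one wrong word, `[(3, 1)]`, scores about 33% WER and 100% SER, which is why both numbers are worth reading together.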