1. Define a representative test corpus
Select recordings that represent real workflow conditions: clear speech, multi-speaker sessions with three or more participants, recordings with background noise (HVAC, crowd, traffic), technical vocabulary, non-native accents, and fast speech above 175 words per minute (WPM).
2. Create verified ground-truth transcripts
Have two independent transcriptionists produce ground-truth text for each test recording, and resolve disagreements through arbitration. Ground-truth quality determines benchmark reliability: a flawed reference transcript produces meaningless word error rate (WER) numbers.
3. Process identical files across all tools under test
Upload the same recordings to each tool without pre-processing. Record wall-clock processing time from upload completion until the transcript is available. Use default settings unless the specific goal is to benchmark custom configurations.
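As a rough illustration of the timing measurement, the sketch below polls until a transcript becomes available and reports the elapsed wall-clock time. The `fetch_transcript` callable is hypothetical: it stands in for whatever status or download call a given tool exposes, and is assumed to return `None` until the transcript is ready. The polling interval and timeout are also illustrative values, not recommendations from any vendor.

```python
import time
from typing import Callable, Optional, Tuple


def time_until_ready(
    fetch_transcript: Callable[[], Optional[str]],  # hypothetical per-tool wrapper
    poll_interval_s: float = 1.0,
    timeout_s: float = 3600.0,
) -> Tuple[str, float]:
    """Poll until a transcript is available; return (transcript, elapsed seconds).

    Call this immediately after the upload completes so the measurement
    covers upload completion through transcript availability.
    """
    start = time.monotonic()  # monotonic clock is unaffected by system clock changes
    while True:
        transcript = fetch_transcript()
        elapsed = time.monotonic() - start
        if transcript is not None:
            return transcript, elapsed
        if elapsed >= timeout_s:
            raise TimeoutError(f"no transcript after {elapsed:.0f} s")
        time.sleep(poll_interval_s)
```

The measured time is only accurate to within one polling interval, so a short interval matters more for tools that finish quickly.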
4. Score raw output and measure cleanup time
Calculate WER and speaker attribution accuracy from raw output before any editing. Then time how long a single editor takes to bring each raw transcript to delivery-ready quality. Use the same editor and the same criteria throughout, and measure each tool's output independently.
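The WER scoring in step 4 can be sketched as follows. WER is conventionally the number of word-level substitutions, deletions, and insertions needed to turn the reference into the tool output, divided by the number of words in the reference. This minimal version assumes a simple normalization (lowercasing and whitespace tokenization, with punctuation left in place); the function name `word_error_rate` is illustrative, and a real benchmark would likely need a stricter, documented normalization applied identically to every tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("the quick brown fox", "the quick brown dog"))  # 0.25
```

Because insertions count as errors, WER can exceed 1.0 when a tool outputs many more words than the reference contains.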