What happens when two speakers talk over each other?
Overlapping speech longer than about 2 seconds typically causes speaker diarization errors: the model attributes the overlapping content to the wrong speaker, or collapses both voices into a single speaker segment. Overlaps shorter than 1 second are often handled correctly. For recordings with frequent crosstalk (panel discussions, heated interviews, group meetings), plan a manual review of speaker labels after transcription. The transcript text itself is usually accurate; crosstalk mainly causes errors in speaker attribution.
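If your diarization output exposes per-turn timestamps, a rough way to prioritize the manual review is to list the places where two different speakers overlap for longer than the threshold above. The sketch below is illustrative only: the `Turn` structure and `find_long_overlaps` function are hypothetical names, not part of any tool's API, and it only helps when the engine reports overlapping turns rather than collapsing them.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str
    start: float  # seconds from the start of the recording
    end: float


def find_long_overlaps(turns, min_overlap=2.0):
    """Return (start, end, speaker_a, speaker_b) for every pair of turns
    from different speakers that overlap for at least min_overlap seconds."""
    ordered = sorted(turns, key=lambda t: t.start)
    flagged = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # turns are sorted by start, so nothing later overlaps a
            if a.speaker == b.speaker:
                continue
            start = max(a.start, b.start)
            end = min(a.end, b.end)
            if end - start >= min_overlap:
                flagged.append((start, end, a.speaker, b.speaker))
    return flagged
```

The 2-second default mirrors the figure in the answer above; lower it if you also want to check shorter overlaps.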
Why does my long transcript have strange sentence breaks every 30 minutes?
Transcription engines that process audio in 30-minute chunks for computational efficiency sometimes introduce artifacts at chunk boundaries: the last sentence of one chunk and the first sentence of the next may duplicate words, or a sentence may be split at an unnatural point. Review the transitions at the 30-, 60-, 90-, and 120-minute marks in your transcript. These sections typically need a manual cleanup pass: read the 2–3 minutes on either side of each boundary to verify continuity.
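To turn the boundary check into a concrete review list, you can compute the time windows to read for a recording of a given length. This is a minimal sketch that assumes fixed 30-minute chunks and a 3-minute margin on each side; the function names and both defaults are illustrative, and your engine's actual chunk size may differ.

```python
def boundary_review_windows(duration_s, chunk_s=30 * 60, margin_s=3 * 60):
    """Return (start, end) windows in seconds around each chunk boundary."""
    windows = []
    boundary = chunk_s
    while boundary < duration_s:
        start = max(0, boundary - margin_s)
        end = min(duration_s, boundary + margin_s)
        windows.append((start, end))
        boundary += chunk_s
    return windows


def fmt(seconds):
    """Format a number of seconds as HH:MM:SS."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    # A 100-minute recording has boundaries at 30, 60, and 90 minutes.
    for start, end in boundary_review_windows(100 * 60):
        print(fmt(start), "to", fmt(end))
    # 00:27:00 to 00:33:00
    # 00:57:00 to 01:03:00
    # 01:27:00 to 01:33:00
```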
How do I handle a recording where audio quality degrades partway through?
Network dropouts in Zoom recordings and variable microphone distance in in-person recordings both degrade audio quality, and degraded audio produces transcription errors. Badly degraded sections show up in the transcript either as silent gaps or as runs of plausible-sounding but incorrect words. Identify them by looking for unusually short speaker turns, repeated words, or text that does not match the topic under discussion. These sections require manual review against audio playback, because no AI transcription tool handles severe audio degradation reliably.
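Two of the warning signs above, very short turns and repeated words, can be screened for automatically before the playback review. The following is a rough heuristic sketch, not a feature of any transcription tool: the function names and thresholds are assumptions, it will flag legitimate short answers such as "Yes.", and it does not attempt the topic-mismatch check.

```python
import re


def consecutive_repeats(text):
    """Count places where a word is immediately followed by the same word."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(1 for prev, cur in zip(words, words[1:]) if prev == cur)


def flag_suspect_turns(turns, min_words=3, max_repeats=1):
    """turns is a list of (speaker, text) pairs.
    Return the indices of turns worth checking against the audio."""
    suspects = []
    for index, (_speaker, text) in enumerate(turns):
        too_short = len(text.split()) < min_words
        too_repetitive = consecutive_repeats(text) > max_repeats
        if too_short or too_repetitive:
            suspects.append(index)
    return suspects
```

Treat the result as a list of places to listen to, not as a list of confirmed errors.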
What is the difference between clean verbatim and full verbatim for client delivery?
Clean verbatim removes filler words (um, uh, you know, like), false starts, and word repetitions for readability while preserving the speaker's meaning. Full verbatim preserves all spoken content exactly as delivered, including fillers, stutters, and restarts. Clean verbatim is the standard for corporate, academic, and journalistic transcription. Full verbatim is required for legal depositions, court transcription, and some research applications where the exact spoken form matters. Choose the level before transcription: delivering clean verbatim where full verbatim was required means re-processing the source audio, because fillers and restarts that were removed cannot be restored through text editing.
Can I rename speaker labels after transcription?
Yes. Speaker labels generated by diarization (Speaker 1, Speaker 2, etc.) can be renamed inline in the VideoText transcript editor. Renaming a label propagates consistently to all instances of that speaker throughout the transcript. Speaker names are preserved across all export formats — DOCX, PDF, SRT, VTT, and JSON all carry the updated names rather than the generic numbered labels.
Can I transcribe YouTube videos, Zoom recordings, and podcast audio?
Yes. VideoText accepts video file uploads (MP4, MOV, WebM, MKV) and audio files (MP3, M4A, WAV, AAC, FLAC). URL-based ingestion works for public YouTube videos and direct media URLs. Zoom cloud recordings can be downloaded and uploaded as MP4 files. Google Meet recordings export to Google Drive as MP4 and can be downloaded for upload. Audio-only podcast files process with the same accuracy as video files — VideoText uses only the audio track regardless of whether a video track is present.
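Before uploading a batch of files, a quick local check against the formats listed above can catch unsupported files early. This sketch only inspects the file extension and is not a VideoText API; the function name is hypothetical, and a matching extension does not guarantee that the file's contents are valid.

```python
from pathlib import Path

# Extensions taken from the accepted formats listed in the answer above.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".webm", ".mkv"}
AUDIO_EXTENSIONS = {".mp3", ".m4a", ".wav", ".aac", ".flac"}


def classify_upload(filename):
    """Return "video", "audio", or None if the extension is not listed."""
    extension = Path(filename).suffix.lower()
    if extension in VIDEO_EXTENSIONS:
        return "video"
    if extension in AUDIO_EXTENSIONS:
        return "audio"
    return None
```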