Transcription
When you submit audio to Veronese, it goes through an automatic pipeline that produces an editable transcript. Here’s what happens.
The pipeline
- Normalize: Your audio is converted to a clean format regardless of the original file type (MP3, M4A, video files, etc.).
- Transcribe: The audio is processed by an AI speech-to-text model.
- Store: Two versions of the transcript are saved:
  - Raw text: the unedited machine output, preserved exactly as produced.
  - Editable content: your working copy, seeded from the raw text and ready to edit.
- Notify: You receive an email when the transcript is ready.
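The relationship between the two stored versions can be pictured with a small sketch. This is an illustration only, not Veronese's actual data model: the `Transcript` class, its field names, and the `from_machine_output` and `edit` helpers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical model of the two stored transcript versions."""

    raw_text: str          # unedited machine output, kept exactly as produced
    editable_content: str  # the user's working copy

    @classmethod
    def from_machine_output(cls, text: str) -> "Transcript":
        # The working copy is seeded from the raw text at creation time.
        return cls(raw_text=text, editable_content=text)

    def edit(self, new_content: str) -> None:
        # Edits change only the working copy; raw_text is left untouched.
        self.editable_content = new_content


transcript = Transcript.from_machine_output("helo and welcome to the show")
transcript.edit("Hello and welcome to the show.")
```

After the edit, `transcript.raw_text` still holds the original machine output, which is the point of storing both versions.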
Episode states
You can track progress in real time on your dashboard or via the API:
| State | What’s happening |
|---|---|
| `ingesting` | Audio is being downloaded or prepared |
| `transcribing` | AI model is processing the audio |
| `ready` | Transcript is complete; open it to start editing |
| `failed` | Something went wrong; check the episode for details |
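If you track progress through the API, a polling loop over these four states might look like the sketch below. The API endpoints are not described here, so the sketch takes a caller-supplied `fetch_state` function (hypothetical) that returns the current state string. The `EpisodeState` enum, the `wait_for_transcript` name, and the default interval and timeout are likewise assumptions.

```python
import time
from enum import Enum
from typing import Callable


class EpisodeState(str, Enum):
    INGESTING = "ingesting"
    TRANSCRIBING = "transcribing"
    READY = "ready"
    FAILED = "failed"


# States after which no further progress is expected.
TERMINAL_STATES = {EpisodeState.READY, EpisodeState.FAILED}


def wait_for_transcript(
    fetch_state: Callable[[], str],
    poll_interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> EpisodeState:
    """Poll until the episode is ready or failed, or until the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        # EpisodeState(...) raises ValueError if the state string is unknown.
        state = EpisodeState(fetch_state())
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"episode still {state.value} after {timeout} s")
        sleep(poll_interval)
```

Passing `fetch_state` in as a parameter keeps the loop independent of any particular HTTP client, and lets you substitute a stub when testing.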
Accuracy
Transcription accuracy depends on audio quality. For best results:
- Use a microphone close to the speaker
- Minimize background noise
- Avoid very low bitrate audio (< 64 kbps)
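To check the bitrate guidance before uploading, you can estimate a file's average bitrate from its size and duration. The sketch below is a rough check, not a Veronese feature, and the function names are hypothetical. For video files the estimate includes the video stream, so it overstates the audio bitrate.

```python
MIN_RECOMMENDED_KBPS = 64  # threshold from the guidance above


def average_bitrate_kbps(file_size_bytes: int, duration_seconds: float) -> float:
    """Average bitrate in kilobits per second (1 kbps = 1000 bits per second)."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return file_size_bytes * 8 / duration_seconds / 1000


def is_low_bitrate(file_size_bytes: int, duration_seconds: float) -> bool:
    """True if the average bitrate falls below the recommended minimum."""
    return average_bitrate_kbps(file_size_bytes, duration_seconds) < MIN_RECOMMENDED_KBPS


# A 10-minute (600 s) file of 3,600,000 bytes averages 48 kbps.
print(average_bitrate_kbps(3_600_000, 600))  # 48.0
print(is_low_bitrate(3_600_000, 600))        # True
```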
Supported formats
Any audio or video format is accepted. Veronese normalizes it automatically before transcribing. Common formats: MP3, M4A, WAV, OGG, FLAC, MP4, MOV, WebM.
Processing time
Short clips typically take 1–3× their own duration to transcribe. Longer recordings finish proportionally faster: a 10-minute recording usually completes in 1–3 minutes, which is roughly 0.1–0.3× its duration.
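When scripting against these timings, one cautious option is to budget for the upper end of the short-clip range, 3× the audio duration, whatever the length of the recording. The helper below is a hypothetical sketch of that rule of thumb. The figures above are typical values, not guarantees, so treat the result as a planning number.

```python
def max_typical_wait_seconds(audio_seconds: float, factor: float = 3.0) -> float:
    """Upper end of the typical processing time, assuming the 3x short-clip figure."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds * factor


# A 30-second clip: budget up to 90 seconds.
print(max_typical_wait_seconds(30))   # 90.0
# A 10-minute recording usually finishes in 1-3 minutes, so 1800 s is generous.
print(max_typical_wait_seconds(600))  # 1800.0
```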