Transcription

When you submit audio to Veronese, it goes through an automatic pipeline that produces an editable transcript. Here’s what happens.

The pipeline

1. Normalize. Your audio is converted to a clean format regardless of the original file type (MP3, M4A, video files, etc.).
2. Transcribe. The audio is processed by an AI speech-to-text model.
3. Store. Two versions of the transcript are saved:
   - Raw text: the unedited machine output, preserved exactly as produced.
   - Editable content: your working copy, seeded from the raw text and ready to edit.
4. Notify. You receive an email when the transcript is ready.
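
The relationship between the two stored versions can be pictured with a small sketch. This is an illustration only, not Veronese's actual data model: the `Transcript` class and its field and method names are hypothetical, chosen to mirror the "raw text" and "editable content" described above.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical model of the two transcript versions saved in the Store step."""

    raw_text: str          # machine output, kept exactly as produced
    editable_content: str  # working copy that the user edits

    @classmethod
    def from_machine_output(cls, text: str) -> "Transcript":
        # The working copy starts out identical to the raw text.
        return cls(raw_text=text, editable_content=text)

    def edit(self, new_content: str) -> None:
        # Edits change only the working copy; raw_text is not touched here.
        self.editable_content = new_content

    def has_edits(self) -> bool:
        # True once the working copy differs from the machine output.
        return self.editable_content != self.raw_text
```

Because the raw text is kept separately, the original machine output stays available for comparison however much the working copy changes.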

Episode states

You can track progress in real time on your dashboard or via the API:

State	What’s happening
`ingesting`	Audio is being downloaded or prepared
`transcribing`	AI model is processing the audio
`ready`	Transcript is complete — open it to start editing
`failed`	Something went wrong — check the episode for details
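
If you track progress through the API, a polling loop along these lines may be useful. It is a sketch under stated assumptions: this page does not describe the API's endpoints or response shape, so `fetch_state` is a hypothetical callable that you supply and that returns the episode's current state string. The poll interval and timeout defaults are arbitrary. Only the four state names come from the table above.

```python
import time
from typing import Callable

# State names taken from the table above.
TERMINAL_STATES = {"ready", "failed"}
KNOWN_STATES = {"ingesting", "transcribing"} | TERMINAL_STATES


def wait_for_transcript(
    fetch_state: Callable[[], str],   # hypothetical: returns the current episode state
    poll_interval: float = 5.0,       # seconds between checks (arbitrary default)
    timeout: float = 1800.0,          # give up after this many seconds (arbitrary default)
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the episode reaches `ready` or `failed`, then return that state."""
    deadline = time.monotonic() + timeout
    while True:
        state = fetch_state()
        if state not in KNOWN_STATES:
            raise ValueError(f"unexpected episode state: {state!r}")
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"episode still {state!r} after {timeout} seconds")
        sleep(poll_interval)
```

The function returns on `failed` rather than raising, so the caller can decide whether to inspect the episode for details or resubmit.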

Accuracy

Transcription accuracy depends on audio quality. For best results:

- Use a microphone close to the speaker
- Minimize background noise
- Avoid very low-bitrate audio (below 64 kbps)
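
To check a file against the 64 kbps guideline before uploading, you can estimate its average bitrate from its size and duration. The sketch below is plain arithmetic and is not part of Veronese. The function names are illustrative, and the estimate is only meaningful for audio-only files, since a video file's size also includes its video track.

```python
MIN_RECOMMENDED_KBPS = 64  # threshold from the guideline above


def average_bitrate_kbps(file_size_bytes: int, duration_seconds: float) -> float:
    """Estimate average bitrate in kbps (1 kbps = 1000 bits per second)."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return file_size_bytes * 8 / duration_seconds / 1000


def is_low_bitrate(file_size_bytes: int, duration_seconds: float) -> bool:
    """True if the estimated bitrate falls below the recommended minimum."""
    return average_bitrate_kbps(file_size_bytes, duration_seconds) < MIN_RECOMMENDED_KBPS
```

For example, a 10-minute (600-second) audio file of 3,600,000 bytes works out to 48 kbps, which is below the guideline, while the same duration at 9,600,000 bytes is 128 kbps.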

Supported formats

Any audio or video format is accepted — Veronese normalizes it automatically before transcribing. Common formats: MP3, M4A, WAV, OGG, FLAC, MP4, MOV, WebM.

Processing time

Short clips typically take 1–3× their audio duration to transcribe; longer recordings finish proportionally faster. For example, a 10-minute recording usually completes in 1–3 minutes, or about 0.1–0.3× its duration.