Transcription

When you submit audio to Veronese, it goes through an automatic pipeline that produces an editable transcript. Here’s what happens.

  1. Normalize — Your audio is converted to a clean format regardless of the original file type (MP3, M4A, video files, etc.).
  2. Transcribe — The audio is processed by an AI speech-to-text model.
  3. Store — Two versions of the transcript are saved:
    • Raw text — the unedited machine output, preserved exactly as produced.
    • Editable content — your working copy, seeded from the raw text and ready to edit.
  4. Notify — You receive an email when the transcript is ready.
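
The Store step can be pictured with a small data model. The sketch below is only an illustration of the idea, not Veronese's actual schema: the class name `Transcript`, its attributes, and the `is_edited` helper are all hypothetical. It shows a raw text that cannot be reassigned after creation and an editable copy that starts out identical to it.

```python
class Transcript:
    """Hypothetical model of the two stored transcript versions."""

    def __init__(self, machine_output: str) -> None:
        # Raw text: the unedited machine output, kept exactly as produced.
        self._raw_text = machine_output
        # Editable content: the working copy, seeded from the raw text.
        self.editable_content = machine_output

    @property
    def raw_text(self) -> str:
        # Read-only: there is no setter, so assigning to raw_text
        # raises AttributeError.
        return self._raw_text

    @property
    def is_edited(self) -> bool:
        # True once the working copy has diverged from the machine output.
        return self.editable_content != self._raw_text


transcript = Transcript("helo and welcome to the show")
transcript.editable_content = "Hello, and welcome to the show."
```

After the edit, `transcript.raw_text` still holds the original machine output, so the two versions can always be compared.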

You can track progress in real time on your dashboard or via the API:

State          What’s happening
ingesting      Audio is being downloaded or prepared
transcribing   AI model is processing the audio
ready          Transcript is complete; open it to start editing
failed         Something went wrong; check the episode for details
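
If you track progress through the API, a polling loop over these states might look like the sketch below. The API's endpoints and response shape are not described here, so the code takes a caller-supplied `fetch_state` function (hypothetical) that returns the current state string. The function name, the 5-second interval, and the 30-minute timeout are likewise assumptions.

```python
import time
from typing import Callable

# States taken from the table above; "ready" and "failed" end the wait.
TERMINAL_STATES = {"ready", "failed"}
KNOWN_STATES = {"ingesting", "transcribing"} | TERMINAL_STATES


def wait_for_transcript(
    fetch_state: Callable[[], str],
    poll_seconds: float = 5.0,        # assumed polling interval
    timeout_seconds: float = 1800.0,  # assumed upper bound on waiting
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the episode reaches 'ready' or 'failed', then return that state."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = fetch_state()
        if state not in KNOWN_STATES:
            raise ValueError(f"unexpected state: {state!r}")
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {state!r} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Usage with a stand-in for a real API call:
fake_states = iter(["ingesting", "transcribing", "transcribing", "ready"])
final_state = wait_for_transcript(lambda: next(fake_states), poll_seconds=0)
```

In real use, `fetch_state` would wrap whatever API request returns the episode's state. A `failed` result is returned rather than raised, so the caller can decide how to inspect the episode for details.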

Transcription accuracy depends on audio quality. For best results:

  • Use a microphone close to the speaker
  • Minimize background noise
  • Avoid very low bitrate audio (< 64 kbps)
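
To check a file against the 64 kbps guideline before uploading, you can estimate its average bitrate from file size and duration. This is a rough sketch and the function names are illustrative. The estimate includes container overhead, and for video files it also includes the video stream, so it is only meaningful for audio-only files.

```python
LOW_BITRATE_KBPS = 64  # threshold from the guideline above


def average_bitrate_kbps(file_size_bytes: int, duration_seconds: float) -> float:
    """Approximate average bitrate: total bits divided by duration, in kbps."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return file_size_bytes * 8 / duration_seconds / 1000


def is_low_bitrate(file_size_bytes: int, duration_seconds: float) -> bool:
    """True if the estimated bitrate falls below the 64 kbps guideline."""
    return average_bitrate_kbps(file_size_bytes, duration_seconds) < LOW_BITRATE_KBPS


# A 10-minute (600 s) audio file of 3,000,000 bytes averages 40 kbps.
low = is_low_bitrate(3_000_000, 600)   # True
# The same duration at 9,600,000 bytes averages 128 kbps.
ok = is_low_bitrate(9_600_000, 600)    # False
```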

Any audio or video format is accepted — Veronese normalizes it automatically before transcribing. Common formats: MP3, M4A, WAV, OGG, FLAC, MP4, MOV, WebM.

Short clips typically take 1–3× their own duration to transcribe. Longer recordings are proportionally faster: a 10-minute recording usually completes in 1–3 minutes, or roughly 0.1–0.3× its duration.