Whisper Cheat Sheet
Last updated: April 2026
Quick Facts
Pricing
Open-source and free to use, but requires your own computational resources (CPU/GPU) for hosting and running the model.
Free Plan
Yes. Includes full access to all model sizes (tiny to large), multilingual transcription, and translation capabilities, with no API call limits.
Rating
4.6/5
Best For
Developers, researchers, and tech-savvy businesses who need a highly accurate, customizable, and cost-effective transcription engine they can run and control themselves.
Key Features
- ✓ Open-Source & Free
I downloaded and ran the model locally for free. No API keys, usage quotas, or per-minute fees; the only cost is my own compute power.
- ✓ Multilingual Transcription
In my tests, it handled over 50 languages impressively well. I threw obscure dialects and accented English at it, and the accuracy was consistently high.
- ✓ Speech Translation
You can translate non-English audio directly to English text. I used it for Spanish and French interviews, and it bypassed the need for a separate translation step.
- ✓ Robust to Noise
What surprised me was transcribing a recording from a busy cafe. It filtered out background chatter and music far better than many paid services I've used.
- ✓ Multiple Model Sizes
From 'tiny' (fast, lower accuracy) to 'large' (slow, best accuracy). I use 'small' or 'medium' for daily tasks—a perfect balance of speed and precision.
- ✓ Word-Level Timestamps
Crucial for video editing and analysis. The API returns precise start and end times for each word, which I've used to create accurate subtitles.
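As a sketch of what that enables: assuming the Python API's `word_timestamps=True` option, where each segment carries a `words` list of `{'word', 'start', 'end'}` dicts, a couple of small helpers (hypothetical names, not part of whisper) can turn those timings into SRT cues:

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words: list[dict]) -> str:
    """Render one SRT cue per word from {'word', 'start', 'end'} dicts,
    the shape Whisper emits with transcribe(..., word_timestamps=True)."""
    cues = []
    for i, w in enumerate(words, start=1):
        cues.append(
            f"{i}\n{srt_timestamp(w['start'])} --> {srt_timestamp(w['end'])}\n"
            f"{w['word'].strip()}\n"
        )
    return "\n".join(cues)
```

For example, `srt_timestamp(3661.042)` yields `01:01:01,042`, which is exactly the comma-millisecond layout SRT files expect.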
- ✓ No Speaker Diarization
A key limitation. It transcribes the text but won't label who said what ('Speaker 1', 'Speaker 2'). You need separate tools for that.
- ✓ Command-Line Interface (CLI)
The `whisper` command is my go-to. It's simple: `whisper audio.mp4 --model medium --language English` gets me a transcript with a single command.
- ✓ Python API
I integrated it into my Python apps. A few lines of code let me batch process hundreds of files, which is perfect for automating workflows.
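A minimal sketch of that kind of batching, assuming audio files in a `recordings/` folder (the folder and the `.txt` naming convention are my own; `whisper.load_model` and `model.transcribe` are the package's standard entry points):

```python
import glob

def txt_path(audio_path: str) -> str:
    """Swap an audio file's extension for .txt (my output naming convention)."""
    return audio_path.rsplit(".", 1)[0] + ".txt"

def batch_transcribe(pattern: str = "recordings/*.mp3") -> None:
    """Transcribe every file matching the glob with one shared model."""
    import whisper  # lazy import: only needed when actually transcribing
    model = whisper.load_model("medium")  # load once, reuse for every file
    for path in sorted(glob.glob(pattern)):
        result = model.transcribe(path, language="en")
        with open(txt_path(path), "w", encoding="utf-8") as f:
            f.write(result["text"].strip() + "\n")
```

Loading the model once and reusing it across files is where the time savings over repeated CLI invocations come from.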
- ✓ Handles Technical Jargon
I tested it on recordings filled with medical and programming terms. With the right model, it captured complex terminology better than generic ASR.
- ✓ VAD (Voice Activity Detection)
It intelligently identifies where speech starts and stops, which helps it ignore long silences without needing manual configuration from me.
- ✓ Customizable Output Formats
I export to TXT, SRT, VTT, and JSON daily. The SRT subtitle files are production-ready for my video projects.
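For the TXT and JSON formats, the result dict that `transcribe()` returns can be written out directly; here's a minimal sketch (the `save_outputs` helper and its naming are my own, not part of whisper):

```python
import json
from pathlib import Path

def save_outputs(result: dict, stem: str) -> None:
    """Write a Whisper result dict (with 'text' and 'segments' keys)
    as plain text and pretty-printed JSON using the given path stem."""
    Path(stem + ".txt").write_text(result["text"].strip() + "\n", encoding="utf-8")
    Path(stem + ".json").write_text(
        json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```

For SRT/VTT I lean on the CLI's `--output_format` flag rather than hand-rolling the cue layout.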
Tips & Tricks
Always specify the `--language` flag even for English; it significantly improves accuracy and speed by preventing auto-detection overhead.
For long files on CPU, pass `--fp16 False`: half-precision inference is only supported on GPU, so this makes the FP32 fallback explicit and suppresses the warning. It's slower than GPU inference but stable.
Pre-process audio to 16kHz mono WAV for best results. I use `ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav`.
Use the 'tiny' or 'base' model for quick drafts and the 'large' model only for final, critical transcripts where every word matters.
Batch process files with a short Python script using the `whisper` library; loading the model once saves enormous amounts of time versus invoking the CLI one file at a time.
Common Commands
`whisper audio.mp3 --model medium --language en`
Transcribes an MP3 file using the balanced 'medium' model, pinning the language to English for speed and accuracy.
`whisper audio.wav --task translate --output_dir ./subtitles`
Translates non-English speech to English text and saves all output files (TXT, SRT, etc.) to the specified directory.
Limitations
- No built-in speaker diarization. Identifying 'who spoke when' requires running a separate, often complex, model on top of the transcript.
- Can be computationally slow, especially the 'large' model. Transcribing a 1-hour file on a CPU can take over 30 minutes.
- Requires technical know-how. Installing dependencies, managing GPU drivers, and handling memory issues is not for the non-technical user.
- Accuracy can drop on very poor-quality audio (e.g., phone calls, heavy distortion) more than some specialized commercial services.