Whisper Tutorial
Last updated: April 2026
What you'll achieve
After this tutorial, you'll be able to transcribe your first audio file using Whisper, turning any MP3, WAV, or M4A file into accurate text. I'll show you the exact commands to run on your computer, how to choose the right model for your needs, and how to save the transcription as a text file. You'll understand the core workflow so you can start converting interviews, meetings, or personal memos into searchable, editable documents without paying for a subscription service. This is the foundational skill for leveraging one of the most powerful free AI tools available.
Prerequisites
- A computer with Python installed (version 3.8 or higher)
- A terminal or command prompt you can use (like Terminal on Mac or Command Prompt/PowerShell on Windows)
- A short audio file (e.g., an MP3) to test with, ideally 1-2 minutes long
Step-by-Step Guide
Step 1: Install Python and Pip (The Package Manager)
Before we touch Whisper, we need to set up its environment. First, check whether Python is installed. Open your terminal (Mac/Linux) or Command Prompt (Windows) and type `python3 --version` or `python --version`. If you see a version number of 3.8 or higher, you're good. If not, go to python.org, download the latest installer for your operating system, and run it. CRITICAL: during installation on Windows, check the box that says 'Add Python to PATH'. This lets your terminal find it. Once installed, verify again. Then make sure 'pip', Python's package installer, is ready: type `pip3 --version`. If it works, move on. If you get an error, on Debian/Ubuntu Linux run `sudo apt install python3-pip` (or use your distribution's package manager); on Mac, pip ships with the python.org installer, and `python3 -m ensurepip --upgrade` will restore it if it's missing. This step is the foundation; everything else fails without it.
Use `python3` and `pip3` commands on Mac/Linux. On Windows, commands are often just `python` and `pip`.
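If you want to double-check from Python itself, this small stdlib-only snippet verifies that the interpreter you're running meets Whisper's 3.8 minimum:

```python
import sys

def python_ok(min_version=(3, 8)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= min_version

if __name__ == "__main__":
    status = "OK" if python_ok() else "too old for Whisper"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```

Save it as anything you like and run it with `python3`; if it prints "too old for Whisper", install a newer Python before continuing.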
Step 2: Install Whisper and Its Powerful Engine (FFmpeg)
Now for the main event. In your open terminal, type the exact command: `pip3 install openai-whisper` and press Enter. You'll see a flurry of text as it downloads and installs Whisper and its dependencies. Let it run—it might take a minute. What surprised me was how simple this is compared to the complex models of the past. Next, we need FFmpeg, the software that lets Whisper understand almost any audio format. On Mac, install it with Homebrew: `brew install ffmpeg`. On Ubuntu/Debian Linux, use: `sudo apt update && sudo apt install ffmpeg`. On Windows, go to ffmpeg.org, download the build, and extract the 'bin' folder. Then, you must add that folder to your system's PATH environment variable (search 'Edit environment variables' in Windows). This is the most common stumbling block, so take your time here.
After installing FFmpeg on Windows, close and reopen your Command Prompt for the PATH changes to take effect.
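A quick way to confirm both tools are reachable after installation is to ask Python where they live on your PATH. This stdlib-only sketch only checks visibility, not whether FFmpeg actually works:

```python
import shutil

def tool_available(name):
    """Return True if an executable with this name is discoverable on PATH."""
    return shutil.which(name) is not None

if __name__ == "__main__":
    for tool in ("ffmpeg", "whisper"):
        status = "found" if tool_available(tool) else "MISSING - check your PATH"
        print(f"{tool}: {status}")
```

If either line reports MISSING, revisit the PATH instructions above before moving on to the transcription steps.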
Step 3: Prepare Your First Audio File and Choose a Model
Find a short audio file. In my experience, a clear voice memo or a podcast segment under 2 minutes is perfect for testing. Save it to a known folder, like your Desktop or Documents. I recommend converting it to a common format like `.mp3` or `.wav` if it isn't already. Now, understand Whisper's models: they range from 'tiny' (fast, less accurate) to 'large' (slow, most accurate). For your first test with clear audio, `base` or `small` is ideal. The `tiny` model is incredibly fast but can mess up words. The `large` model is overkill for a test and requires significant RAM. My stance: start with `base`. It offers a great balance of speed and accuracy for most beginner tasks. Navigate to your audio file's folder in the terminal using the `cd` command (e.g., `cd Desktop`).
Type `cd ` (with a trailing space) in your terminal first, then drag and drop your audio file's folder onto the window to paste its full path, and press Enter.
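To make the speed/accuracy trade-off concrete, here's a small Python sketch that suggests a model for a given memory budget. The size and memory figures are approximate, taken from the openai-whisper README, and the thresholds in `pick_model` are my own rough guidance, not hard limits:

```python
# Approximate sizes and memory needs per the openai-whisper README.
MODELS = {
    "tiny":   {"params": "39 M",   "vram": "~1 GB"},
    "base":   {"params": "74 M",   "vram": "~1 GB"},
    "small":  {"params": "244 M",  "vram": "~2 GB"},
    "medium": {"params": "769 M",  "vram": "~5 GB"},
    "large":  {"params": "1550 M", "vram": "~10 GB"},
}

def pick_model(ram_gb):
    """Suggest the largest model that plausibly fits the given memory budget."""
    if ram_gb >= 10:
        return "large"
    if ram_gb >= 5:
        return "medium"
    if ram_gb >= 2:
        return "small"
    return "base"
```

For example, `pick_model(4)` suggests `small`, which matches the advice above: on an ordinary laptop, start with `base` or `small` and only reach for `large` when accuracy really matters.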
Step 4: Run Your First Transcription Command
This is the magic moment. With your terminal pointed to the folder containing your `test.mp3` (or whatever you named it), type the following command: `whisper test.mp3 --model base`. Press Enter. You'll see real-time output! Whisper will first show that it's downloading the chosen model (this happens only once per model), then it will print timestamps and the transcribed text as it works. What surprised me was how well it handled my mumbled speech and the faint keyboard sounds in my test recording. Once finished, Whisper automatically creates several output files in the same folder: a `.txt` file (plain text), `.vtt` (for subtitles), and `.srt` (another subtitle format); recent versions also write `.tsv` and `.json`. Your transcription is done! The text will be in the `.txt` file. Open it with any text editor to see the result.
Add `--language en` (for English) to the command to slightly speed up processing by not needing to detect the language.
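If you prefer launching Whisper from Python instead of typing the command by hand, you can build the same invocation as an argument list and hand it to `subprocess`. A minimal sketch; `test.mp3` is just the example filename from above:

```python
import subprocess

def build_whisper_cmd(audio_path, model="base", language=None):
    """Assemble the whisper CLI invocation as an argument list.

    Passing a list (rather than one shell string) means filenames
    with spaces need no extra quoting.
    """
    cmd = ["whisper", audio_path, "--model", model]
    if language:
        cmd += ["--language", language]
    return cmd

if __name__ == "__main__":
    cmd = build_whisper_cmd("test.mp3", language="en")
    print(cmd)
    # Uncomment to actually run the transcription:
    # subprocess.run(cmd, check=True)
```

The list form is also a gentle fix for the filename-with-spaces pitfall covered in Common Mistakes below.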
Step 5: Customize Output and Refine the Transcription
Whisper's power is in its options. Let's refine. To produce only the plain-text file (skipping the subtitle files), add `--output_format txt`; to get only the subtitle file, use `--output_format srt`. Note that Whisper does not label who is speaking (that's called diarization, and it needs separate tools). If the audio is in another language and you want English output, add `--task translate`; if you hit FP16 warnings or errors when running on a CPU, add `--fp16 False`. The most useful flag I use daily is `--initial_prompt "Hello, welcome to this podcast."` This primes the model with context, improving accuracy for specific jargon or names. To run it all together: `whisper interview.mp3 --model small --language en --output_format txt --initial_prompt "Discussion about machine learning"`. Experiment! Run the same file with the `tiny` and `small` models and compare the text files to see the accuracy trade-off firsthand.
The `--verbose False` flag cleans up the terminal output, showing only the progress bar and final result.
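For that tiny-vs-small comparison, you don't have to eyeball the files: Python's `difflib` can give a rough similarity score between two transcripts. A sketch, assuming you've saved the two runs' `.txt` outputs under your own names (the filenames in the usage comment are hypothetical):

```python
import difflib
from pathlib import Path

def similarity(file_a, file_b):
    """Return a 0-1 similarity ratio between two transcript text files."""
    text_a = Path(file_a).read_text(encoding="utf-8")
    text_b = Path(file_b).read_text(encoding="utf-8")
    return difflib.SequenceMatcher(None, text_a, text_b).ratio()

# Usage (hypothetical filenames):
# print(similarity("test_tiny.txt", "test_small.txt"))
```

A score near 1.0 means the models mostly agree; the lower the score, the more the cheap model diverged, which is a hint the audio deserves a bigger model.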
Step 6: Explore Advanced Integrations and Next Steps
You've mastered the command line. Now, explore interfaces. I regularly use the 'Whisper Desktop' GUI for drag-and-drop simplicity on my Mac. For power users, integrate Whisper into Python scripts: `import whisper; model = whisper.load_model("base"); result = model.transcribe("audio.mp3")` — the transcribed text is then in `result["text"]`. This lets you programmatically handle hundreds of files. Also, look at tools like 'Buzz' for a fantastic free desktop app. My honest opinion? While API services are easier, the control and zero per-minute cost of self-hosted Whisper are unbeatable for bulk processing. The next step is to automate: write a script that watches a folder for new audio and auto-transcribes it. The open-source ecosystem around Whisper is its greatest strength.
Search GitHub for "whisper-webui" to find browser-based interfaces you can run locally for a more user-friendly experience.
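As a starting point for that watch-folder automation, here is a stdlib-only sketch of one polling pass. The folder name `incoming_audio` is hypothetical, and `dry_run=True` only prints the commands so you can sanity-check paths before a real batch run:

```python
import subprocess
from pathlib import Path

AUDIO_EXTS = {".mp3", ".wav", ".m4a"}

def transcribe_new_files(folder, seen, model="base", dry_run=True):
    """Transcribe any audio file in `folder` not already recorded in `seen`.

    `seen` is a set of filenames handled so far; call this function
    repeatedly (e.g., on a timer) to process files as they arrive.
    """
    done = []
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in AUDIO_EXTS and path.name not in seen:
            cmd = ["whisper", str(path), "--model", model,
                   "--output_format", "txt"]
            if dry_run:
                print("would run:", " ".join(cmd))
            else:
                subprocess.run(cmd, check=True)
            seen.add(path.name)
            done.append(path.name)
    return done

if __name__ == "__main__":
    # One polling pass over a (hypothetical) watch folder.
    if Path("incoming_audio").exists():
        transcribe_new_files("incoming_audio", set())
```

Wrap the call in a loop with `time.sleep(30)` (and `dry_run=False`) to get a simple always-on transcription service; for anything production-grade, a file-watching library would be the more robust choice.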
Common Mistakes to Avoid
Not adding Python to PATH on Windows during installation, making the `whisper` command unfindable. Re-run the installer and check the box.
Forgetting to install FFmpeg, causing an error like `FileNotFoundError: [Errno 2] No such file or directory: 'ffprobe'`. Install FFmpeg as shown in Step 2.
Using the `large` model on a computer with less than 8GB RAM, causing freezing or crashes. Start with `tiny`, `base`, or `small`.
Having spaces in audio filenames without wrapping the path in quotes, confusing the terminal. Use quotes: `whisper "my file.mp3"`.