Whisper Tutorial
Last updated: April 2026
What you'll achieve
After this tutorial, you'll be able to transcribe your first audio file using Whisper, turning any MP3, WAV, or M4A file into accurate text. I'll show you the exact commands to run on your computer, how to choose the right model for your needs, and how to save the transcription as a text file. You'll understand the core workflow so you can start converting interviews, meetings, or personal memos into searchable, editable documents without paying for a subscription service. This is the foundational skill for leveraging one of the most powerful free AI tools available.
Prerequisites
- A computer with Python installed (version 3.8 or higher)
- A terminal or command prompt you can use (like Terminal on Mac or Command Prompt/PowerShell on Windows)
- A short audio file (e.g., an MP3) to test with, ideally 1-2 minutes long
Step-by-Step Guide
Step 1: Install Python and Pip (The Package Manager)
Before we touch Whisper, we need to set up its environment. First, check whether Python is installed. Open your terminal (Mac/Linux) or Command Prompt (Windows) and type `python3 --version` or `python --version`. If you see a version number of 3.8 or higher, you're good. If not, go to python.org, download the latest installer for your operating system, and run it. CRITICAL: during installation on Windows, check the box that says 'Add Python to PATH'. This lets your terminal find it. Once installed, verify again. Then make sure 'pip', Python's package installer, is ready: type `pip3 --version`. If it works, move on. If you get an error, on Debian/Ubuntu Linux run `sudo apt install python3-pip` (or use your distribution's package manager); on Mac, pip ships with the python.org installer, and `python3 -m ensurepip --upgrade` will restore it if it's missing. This step is the foundation; everything else fails without it.
Use `python3` and `pip3` commands on Mac/Linux. On Windows, commands are often just `python` and `pip`.
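If you want to double-check from Python itself, this small stdlib-only snippet verifies that the interpreter you're running meets Whisper's 3.8 minimum:

```python
import sys

def python_ok(min_version=(3, 8)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= min_version

if __name__ == "__main__":
    status = "OK" if python_ok() else "too old for Whisper"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```

Save it as anything you like and run it with `python3`; if it prints "too old for Whisper", install a newer Python before continuing.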
Step 2: Install Whisper and Its Powerful Engine (FFmpeg)
Now for the main event. In your open terminal, type the exact command: `pip3 install openai-whisper` and press Enter. You'll see a flurry of text as it downloads and installs Whisper and its dependencies. Let it run—it might take a minute. What surprised me was how simple this is compared to the complex models of the past. Next, we need FFmpeg, the software that lets Whisper understand almost any audio format. On Mac, install it with Homebrew: `brew install ffmpeg`. On Ubuntu/Debian Linux, use: `sudo apt update && sudo apt install ffmpeg`. On Windows, go to ffmpeg.org, download the build, and extract the 'bin' folder. Then, you must add that folder to your system's PATH environment variable (search 'Edit environment variables' in Windows). This is the most common stumbling block, so take your time here.
After installing FFmpeg on Windows, close and reopen your Command Prompt for the PATH changes to take effect.
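A quick way to confirm both tools are reachable after installation is to ask Python where they live on your PATH. This stdlib-only sketch only checks visibility, not whether FFmpeg actually works:

```python
import shutil

def tool_available(name):
    """Return True if an executable with this name is discoverable on PATH."""
    return shutil.which(name) is not None

if __name__ == "__main__":
    for tool in ("ffmpeg", "whisper"):
        status = "found" if tool_available(tool) else "MISSING - check your PATH"
        print(f"{tool}: {status}")
```

If either line reports MISSING, revisit the PATH instructions above before moving on to the transcription steps.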
Step 3: Prepare Your First Audio File and Choose a Model
Find a short audio file. In my experience, a clear voice memo or a podcast segment under 2 minutes is perfect for testing. Save it to a known folder, like your Desktop or Documents. I recommend converting it to a common format like `.mp3` or `.wav` if it isn't already. Now, understand Whisper's models: they range from 'tiny' (fast, less accurate) to 'large' (slow, most accurate). For your first test with clear audio, `base` or `small` is ideal. The `tiny` model is incredibly fast but can mess up words. The `large` model is overkill for a test and requires significant RAM. My stance: start with `base`. It offers a great balance of speed and accuracy for most beginner tasks. Navigate to your audio file's folder in the terminal using the `cd` command (e.g., `cd Desktop`).
Type `cd ` (with a trailing space) in your terminal first, then drag and drop your audio file's folder onto the window to paste its full path, and press Enter.
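To make the speed/accuracy trade-off concrete, here's a small Python sketch that suggests a model for a given memory budget. The size and memory figures are approximate, taken from the openai-whisper README, and the thresholds in `pick_model` are my own rough guidance, not hard limits:

```python
# Approximate sizes and memory needs per the openai-whisper README.
MODELS = {
    "tiny":   {"params": "39 M",   "vram": "~1 GB"},
    "base":   {"params": "74 M",   "vram": "~1 GB"},
    "small":  {"params": "244 M",  "vram": "~2 GB"},
    "medium": {"params": "769 M",  "vram": "~5 GB"},
    "large":  {"params": "1550 M", "vram": "~10 GB"},
}

def pick_model(ram_gb):
    """Suggest the largest model that plausibly fits the given memory budget."""
    if ram_gb >= 10:
        return "large"
    if ram_gb >= 5:
        return "medium"
    if ram_gb >= 2:
        return "small"
    return "base"
```

For example, `pick_model(4)` suggests `small`, which matches the advice above: on an ordinary laptop, start with `base` or `small` and only reach for `large` when accuracy really matters.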
Step 4: Run Your First Transcription Command
This is the magic moment. With your terminal pointed to the folder containing your `test.mp3` (or whatever you named it), type the following command: `whisper test.mp3 --model base`. Press Enter. You'll see real-time output! Whisper will first show that it's downloading the chosen model (this happens only once per model), then it will print timestamps and the transcribed text as it works. What surprised me was how well it handled my mumbled speech and the faint keyboard sounds in my test recording. Once finished, Whisper automatically creates several output files in the same folder: a `.txt` file (plain text), `.vtt` (for subtitles), and `.srt` (another subtitle format); recent versions also write `.tsv` and `.json`. Your transcription is done! The text will be in the `.txt` file. Open it with any text editor to see the result.
Add `--language en` (for English) to the command to slightly speed up processing by not needing to detect the language.
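If you prefer launching Whisper from Python instead of typing the command by hand, you can build the same invocation as an argument list and hand it to `subprocess`. A minimal sketch; `test.mp3` is just the example filename from above:

```python
import subprocess

def build_whisper_cmd(audio_path, model="base", language=None):
    """Assemble the whisper CLI invocation as an argument list.

    Passing a list (rather than one shell string) means filenames
    with spaces need no extra quoting.
    """
    cmd = ["whisper", audio_path, "--model", model]
    if language:
        cmd += ["--language", language]
    return cmd

if __name__ == "__main__":
    cmd = build_whisper_cmd("test.mp3", language="en")
    print(cmd)
    # Uncomment to actually run the transcription:
    # subprocess.run(cmd, check=True)
```

The list form is also a gentle fix for the filename-with-spaces pitfall covered in Common Mistakes below.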
Step 5: Customize Output and Refine the Transcription
Whisper's power is in its options. Let's refine. To produce only the plain-text file (skipping the subtitle files), add `--output_format txt`; to get only the subtitle file, use `--output_format srt`. Note that Whisper does not label who is speaking (that's called diarization, and it needs separate tools). If the audio is in another language and you want English output, add `--task translate`; if you hit FP16 warnings or errors when running on a CPU, add `--fp16 False`. The most useful flag I use daily is `--initial_prompt "Hello, welcome to this podcast."` This primes the model with context, improving accuracy for specific jargon or names. To run it all together: `whisper interview.mp3 --model small --language en --output_format txt --initial_prompt "Discussion about machine learning"`. Experiment! Run the same file with the `tiny` and `small` models and compare the text files to see the accuracy trade-off firsthand.
The `--verbose False` flag cleans up the terminal output, showing only the progress bar and final result.
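For that tiny-vs-small comparison, you don't have to eyeball the files: Python's `difflib` can give a rough similarity score between two transcripts. A sketch, assuming you've saved the two runs' `.txt` outputs under your own names (the filenames in the usage comment are hypothetical):

```python
import difflib
from pathlib import Path

def similarity(file_a, file_b):
    """Return a 0-1 similarity ratio between two transcript text files."""
    text_a = Path(file_a).read_text(encoding="utf-8")
    text_b = Path(file_b).read_text(encoding="utf-8")
    return difflib.SequenceMatcher(None, text_a, text_b).ratio()

# Usage (hypothetical filenames):
# print(similarity("test_tiny.txt", "test_small.txt"))
```

A score near 1.0 means the models mostly agree; the lower the score, the more the cheap model diverged, which is a hint the audio deserves a bigger model.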
Step 6: Explore Advanced Integrations and Next Steps
You've mastered the command line. Now, explore interfaces. I regularly use the 'Whisper Desktop' GUI for drag-and-drop simplicity on my Mac. For power users, integrate Whisper into Python scripts: `import whisper; model = whisper.load_model("base"); result = model.transcribe("audio.mp3")` — the transcribed text is then in `result["text"]`. This lets you programmatically handle hundreds of files. Also, look at tools like 'Buzz' for a fantastic free desktop app. My honest opinion? While API services are easier, the control and zero per-minute cost of self-hosted Whisper are unbeatable for bulk processing. The next step is to automate: write a script that watches a folder for new audio and auto-transcribes it. The open-source ecosystem around Whisper is its greatest strength.
Search GitHub for "whisper-webui" to find browser-based interfaces you can run locally for a more user-friendly experience.
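As a starting point for that watch-folder automation, here is a stdlib-only sketch of one polling pass. The folder name `incoming_audio` is hypothetical, and `dry_run=True` only prints the commands so you can sanity-check paths before a real batch run:

```python
import subprocess
from pathlib import Path

AUDIO_EXTS = {".mp3", ".wav", ".m4a"}

def transcribe_new_files(folder, seen, model="base", dry_run=True):
    """Transcribe any audio file in `folder` not already recorded in `seen`.

    `seen` is a set of filenames handled so far; call this function
    repeatedly (e.g., on a timer) to process files as they arrive.
    """
    done = []
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in AUDIO_EXTS and path.name not in seen:
            cmd = ["whisper", str(path), "--model", model,
                   "--output_format", "txt"]
            if dry_run:
                print("would run:", " ".join(cmd))
            else:
                subprocess.run(cmd, check=True)
            seen.add(path.name)
            done.append(path.name)
    return done

if __name__ == "__main__":
    # One polling pass over a (hypothetical) watch folder.
    if Path("incoming_audio").exists():
        transcribe_new_files("incoming_audio", set())
```

Wrap the call in a loop with `time.sleep(30)` (and `dry_run=False`) to get a simple always-on transcription service; for anything production-grade, a file-watching library would be the more robust choice.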
Common Mistakes to Avoid
Not adding Python to PATH on Windows during installation, making the `whisper` command unfindable. Re-run the installer and check the box.
Forgetting to install FFmpeg, causing an error like `FileNotFoundError: [Errno 2] No such file or directory: 'ffprobe'`. Install FFmpeg as shown in Step 2.
Using the `large` model on a computer with less than 8GB RAM, causing freezing or crashes. Start with `tiny`, `base`, or `small`.
Having spaces in audio filenames without wrapping the path in quotes, confusing the terminal. Use quotes: `whisper "my file.mp3"`.