Introduction
The quality of your voice clone depends heavily on the quality of your audio sample. A $5,000 microphone used with bad technique produces worse clones than a $30 USB mic used with good technique in a treated room.
This guide covers everything you need to know about recording voice samples that produce excellent AI clones.
Duration Requirements by Platform
| Platform | Minimum | Recommended | Maximum Benefit |
|---|---|---|---|
| ElevenLabs Instant | 30 seconds | 3-5 minutes | 5 minutes |
| ElevenLabs Professional | 5 minutes | 30-60 minutes | 3 hours |
| PlayHT | 30 seconds | 3-5 minutes | 10 minutes |
| Resemble AI | 3 minutes | 10-25 minutes | 30 minutes |
| Coqui TTS (open source) | 5 minutes | 30-60 minutes | 5+ hours |
| RVC | 10 minutes | 30-60 minutes | 2+ hours |
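For instant cloning, the sweet spot on most platforms is 3-5 minutes. As a rule of thumb, you get about 90% of the quality improvement in the first few minutes, and returns diminish rapidly beyond 5 minutes. Professional and open-source pipelines (ElevenLabs Professional, Coqui TTS, RVC) are the exception: as the table shows, they keep benefiting from 30 minutes or more.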
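To check a recording against the minimums in the table above, a short script can measure its length. This is a minimal sketch using Python's standard `wave` module, so it only reads uncompressed PCM WAV files. The function names and the `MIN_SECONDS` dictionary are illustrative, and the numbers are simply copied from the table.

```python
import wave

# Minimum sample lengths in seconds, copied from the table above.
MIN_SECONDS = {
    "ElevenLabs Instant": 30,
    "ElevenLabs Professional": 5 * 60,
    "PlayHT": 30,
    "Resemble AI": 3 * 60,
    "Coqui TTS": 5 * 60,
    "RVC": 10 * 60,
}


def wav_duration_seconds(path):
    """Return the length of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def platforms_satisfied(duration_seconds):
    """List the platforms whose minimum length this duration meets."""
    return [
        name
        for name, minimum in MIN_SECONDS.items()
        if duration_seconds >= minimum
    ]
```

For example, `platforms_satisfied(wav_duration_seconds("sample.wav"))` lists the platforms a hypothetical `sample.wav` is long enough for. Meeting the minimum is not the same as reaching the recommended length.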
Microphone Recommendations
Budget ($20-50)
- Fifine K669 — USB condenser, surprisingly good for the price. Best budget option for voice cloning.
- Phone (iPhone 12+) — The built-in mic on modern smartphones is adequate for testing. Hold it 6 inches from your mouth.
Mid-Range ($50-150)
- Audio-Technica ATR2100x — Dynamic USB/XLR mic. Rejects room noise naturally, great for untreated rooms.
- Blue Yeti — USB condenser with multiple patterns. Set it to cardioid mode and position it correctly.
- Samson Q2U — Dynamic USB/XLR. Similar to the ATR2100x, slightly cheaper.
Professional ($150-400)
- Shure SM7B — The podcast industry standard. Dynamic mic that sounds incredible but needs a preamp or Cloudlifter.
- Rode NT1 — Condenser mic with very low self-noise. Best in a treated room.
- Elgato Wave:3 — USB condenser designed for streaming. Clean signal with easy software control.
Recommendation: The Audio-Technica ATR2100x at $80 is the best value. It is a dynamic mic (rejects room noise) with a USB connection (no extra gear needed). This is what we use for most voice cloning samples.
Room Setup
Room acoustics matter more than mic quality for voice cloning. The AI needs to hear your voice, not your room.
The closet method: Recording in a small closet full of clothes is the cheapest acoustic treatment. The clothes absorb reflections, which removes most of the echo and boxiness a small bare space would otherwise add.
DIY treatment on a budget:
- Hang heavy blankets on the walls nearest to your mic
- Place a folded towel on the desk in front of you
- Close windows and turn off fans, AC, and any appliances
- Record when the house is quiet (no washing machine, no TV in the next room)
What to avoid:
- Large, empty rooms with hard surfaces (echo/reverb)
- Rooms with a computer fan pointed at the mic
- Outdoor recording (wind, traffic, birds)
- Recording near windows (traffic noise bleeds through)
Recording Settings
| Setting | Value | Why |
|---|---|---|
| Sample rate | 44,100 Hz or 48,000 Hz | Standard quality; anything higher is wasted |
| Bit depth | 16-bit or 24-bit | 24-bit gives more headroom; 16-bit is fine |
| Format | WAV (preferred) or MP3 at 320 kbps | WAV preserves all detail; high-bitrate MP3 is acceptable |
| Mono vs Stereo | Mono | Voice is mono content; stereo doubles file size for no benefit |
| Gain level | Peaks between -6 dBFS and -3 dBFS | Leaves headroom to avoid clipping |
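The sample rate, bit depth, and channel settings in the table can be checked automatically before upload. The sketch below uses only Python's standard `wave` module. The function name `check_wav_settings` and the two constants are illustrative, and the accepted values come straight from the table.

```python
import wave

ACCEPTED_SAMPLE_RATES = {44100, 48000}  # Hz, from the table above
ACCEPTED_SAMPLE_WIDTHS = {2, 3}         # bytes per sample: 16-bit or 24-bit


def check_wav_settings(path):
    """Return a list of problems; an empty list means the settings match."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()
    if rate not in ACCEPTED_SAMPLE_RATES:
        problems.append(f"sample rate is {rate} Hz")
    if width not in ACCEPTED_SAMPLE_WIDTHS:
        problems.append(f"bit depth is {8 * width}-bit")
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    return problems
```

The `wave` module reads integer PCM WAV files only. MP3 files and 32-bit float WAV exports would need a different reader, so treat a failure to open the file as a prompt to re-export it, not as a verdict on its quality.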
What to Say in Your Sample
The content of your recording matters for clone quality.
Do read:
- News articles (varied vocabulary and sentence structures)
- Book passages with dialogue (teaches the AI different intonations)
- Your own writing (if the clone will be used for your content)
- Lists, questions, and exclamations (gives the AI a range of speech patterns)
Do not read:
- Poetry (unusual rhythm patterns can confuse the model)
- Nothing but highly technical jargon (too narrow a range of words and intonation)
- The same sentence repeatedly (the AI needs variety)
Quality Checklist Before Uploading
Before submitting your sample, verify:
- [ ] No background noise audible when you pause speaking
- [ ] No clipping (distortion on loud syllables)
- [ ] Consistent volume throughout
- [ ] No plosive pops on P and B sounds
- [ ] No sibilance hiss on S sounds
- [ ] At least 30 seconds of continuous speech (for instant cloning)
- [ ] File is WAV or high-quality MP3
- [ ] You sound natural, not stiff or overly careful
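The clipping item on the checklist, and the peak target from the recording settings, can be measured as well as judged by ear. The following is a rough sketch for 16-bit PCM WAV files only. The helper names are illustrative, and counting any sample at or above 32767 as clipped is an assumed heuristic, not a platform rule.

```python
import math
import struct
import wave

FULL_SCALE = 32768  # magnitude of the largest 16-bit sample


def read_samples_16bit(path):
    """Read every sample of a 16-bit PCM WAV file as a tuple of ints."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frames = wav.readframes(wav.getnframes())
    # WAV stores samples as little-endian signed 16-bit integers.
    return struct.unpack(f"<{len(frames) // 2}h", frames)


def peak_dbfs(samples):
    """Return the loudest sample relative to full scale, in dB."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return float("-inf")  # digital silence
    return 20 * math.log10(peak / FULL_SCALE)


def count_clipped(samples, threshold=32767):
    """Count samples at or beyond the assumed clipping threshold."""
    return sum(1 for s in samples if abs(s) >= threshold)
```

A peak between -6 and -3 dB with zero clipped samples is consistent with the gain guidance above. Treat a nonzero clipped count as a hint to listen for distortion on loud syllables, since a single full-scale sample is not always audible.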
Frequently Asked Questions
Can I use audio from a video call or podcast?
Yes, if the audio quality is decent. Extract the audio track and check it against the quality checklist above. Be aware that Zoom and Teams recordings often have compression artifacts that reduce clone quality.
Does the language of the sample matter?
For ElevenLabs Multilingual v2, you can record in one language and generate in another. However, cloning works best when the sample language matches the output language.
Should I record in one session or multiple?
One session is preferred for consistency (same room, same mic position, same energy level). If you must split sessions, keep the setup identical.
For the cloning process itself, follow our voice cloning tutorial. For tool selection, see best voice cloning tools.