Voice Cloning Audio Requirements — Recording Guide

Introduction

The quality of your voice clone depends heavily on the quality of your audio sample. A $5,000 microphone used with bad technique produces worse clones than a $30 USB mic used with good technique in a treated room.

This guide covers everything you need to know about recording voice samples that produce excellent AI clones.

Duration Requirements by Platform

Platform	Minimum	Recommended	Maximum Benefit
ElevenLabs Instant	30 seconds	3-5 minutes	5 minutes
ElevenLabs Professional	5 minutes	30-60 minutes	3 hours
PlayHT	30 seconds	3-5 minutes	10 minutes
Resemble AI	3 minutes	10-25 minutes	30 minutes
Coqui TTS (open source)	5 minutes	30-60 minutes	5+ hours
RVC	10 minutes	30-60 minutes	2+ hours

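For instant cloning, the sweet spot is 3-5 minutes. You get roughly 90% of the quality improvement in the first few minutes, and returns diminish rapidly beyond 5 minutes. Professional and open-source training (ElevenLabs Professional, Coqui TTS, RVC) keeps benefiting from longer samples, as the table shows.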
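The table above can be turned into a quick lookup for checking a recording's length before upload. This is a minimal sketch: the figures are copied from the table, while the names REQUIREMENTS and assess_duration and the wording of the returned labels are illustrative assumptions, not part of any platform's API.

```python
from typing import NamedTuple

MINUTE = 60  # seconds


class Requirement(NamedTuple):
    minimum: int           # shortest sample the platform accepts, in seconds
    recommended_low: int   # lower end of the recommended range, in seconds
    recommended_high: int  # upper end of the recommended range, in seconds


# Values transcribed from the duration table (hypothetical structure).
REQUIREMENTS = {
    "ElevenLabs Instant": Requirement(30, 3 * MINUTE, 5 * MINUTE),
    "ElevenLabs Professional": Requirement(5 * MINUTE, 30 * MINUTE, 60 * MINUTE),
    "PlayHT": Requirement(30, 3 * MINUTE, 5 * MINUTE),
    "Resemble AI": Requirement(3 * MINUTE, 10 * MINUTE, 25 * MINUTE),
    "Coqui TTS": Requirement(5 * MINUTE, 30 * MINUTE, 60 * MINUTE),
    "RVC": Requirement(10 * MINUTE, 30 * MINUTE, 60 * MINUTE),
}


def assess_duration(platform: str, seconds: float) -> str:
    """Classify a sample length against the table row for `platform`."""
    req = REQUIREMENTS[platform]
    if seconds < req.minimum:
        return "too short"
    if seconds < req.recommended_low:
        return "usable, below recommended"
    if seconds <= req.recommended_high:
        return "recommended"
    return "above recommended"
```

For example, assess_duration("PlayHT", 60) reports a one-minute sample as usable but below the recommended range.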

Microphone Recommendations

Budget ($20-50)

Fifine K669 — USB condenser, surprisingly good for the price. Best budget option for voice cloning.
Phone (iPhone 12+) — The built-in mic on modern smartphones is adequate for testing. Hold it 6 inches from your mouth.

Mid-Range ($50-150)

Audio-Technica ATR2100x — Dynamic USB/XLR mic. Rejects room noise naturally, great for untreated rooms.
Blue Yeti — USB condenser with multiple patterns. Set it to cardioid mode and position it correctly.
Samson Q2U — Dynamic USB/XLR. Similar to the ATR2100x, slightly cheaper.

Professional ($150-400)

Shure SM7B — The podcast industry standard. Dynamic mic that sounds incredible but needs a preamp or Cloudlifter.
Rode NT1 — Condenser mic with very low self-noise. Best in a treated room.
Elgato Wave:3 — USB condenser designed for streaming. Clean signal with easy software control.

Recommendation: The Audio-Technica ATR2100x at $80 is the best value. It is a dynamic mic (rejects room noise) with USB connection (no extra gear needed). This is what we use for most voice cloning samples.

Room Setup

Room acoustics matter more than mic quality for voice cloning. The AI needs to hear your voice, not your room.

The closet method: Recording in a small closet full of clothes is the cheapest acoustic treatment. The clothes absorb reflections, so very little echo reaches the mic.

DIY treatment on a budget:

Hang heavy blankets on the walls nearest to your mic
Place a folded towel on the desk in front of you
Close windows and turn off fans, AC, and any appliances
Record when the house is quiet (no washing machine, no TV in the next room)

What to avoid:

Large, empty rooms with hard surfaces (echo/reverb)
Rooms with a computer fan pointed at the mic
Outdoor recording (wind, traffic, birds)
Recording near windows (traffic noise bleeds through)

Recording Settings

Setting	Value	Why
Sample rate	44,100 Hz or 48,000 Hz	Standard quality; anything higher is wasted
Bit depth	16-bit or 24-bit	24-bit gives more headroom, but 16-bit is fine
Format	WAV (preferred) or MP3 320kbps	WAV preserves all detail; high-bitrate MP3 is acceptable
Mono vs Stereo	Mono	Voice is mono content; stereo doubles file size for no benefit
Gain level	Peaks at -6 dBFS to -3 dBFS	Leaves headroom to avoid clipping
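The settings in the table can be checked on a finished WAV file with Python's standard wave module. The sketch below assumes uncompressed 16-bit or 24-bit PCM and measures the peak in dB relative to full scale, where 0 dB is the loudest level the file can store. The function names and the wording of the messages are illustrative assumptions.

```python
import math
import wave


def peak_dbfs(frames: bytes, sample_width: int) -> float:
    """Return the highest absolute sample level in dB relative to full scale."""
    full_scale = 2 ** (8 * sample_width - 1)
    peak = 0
    # WAV PCM samples are little-endian signed integers for 16- and 24-bit audio.
    for i in range(0, len(frames) - sample_width + 1, sample_width):
        sample = int.from_bytes(frames[i:i + sample_width], "little", signed=True)
        peak = max(peak, abs(sample))
    return 20 * math.log10(peak / full_scale) if peak else float("-inf")


def check_settings(path: str) -> list:
    """Return a list of deviations from the recommended recording settings."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        width = wf.getsampwidth()
        frames = wf.readframes(wf.getnframes())

    problems = []
    if rate not in (44100, 48000):
        problems.append(f"sample rate is {rate} Hz, expected 44100 or 48000")
    if channels != 1:
        problems.append(f"{channels} channels found, expected mono")
    if width not in (2, 3):
        problems.append(f"bit depth is {8 * width}-bit, expected 16 or 24")
        return problems  # peak measurement below only handles 16/24-bit

    peak = peak_dbfs(frames, width)
    if peak > -3.0:
        problems.append(f"peak is {peak:.1f} dB, above -3 dB (risk of clipping)")
    elif peak < -6.0:
        problems.append(f"peak is {peak:.1f} dB, below -6 dB (gain may be too low)")
    return problems
```

An empty list means the file matches the table. The sample-by-sample loop is slow on long recordings; it is written for clarity, not speed.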

What to Say in Your Sample

The content of your recording matters for clone quality.

Do read:

News articles (varied vocabulary and sentence structures)
Book passages with dialogue (teaches the AI different intonations)
Your own writing (if the clone will be used for your content)
Lists, questions, and exclamations (gives the AI a range of speech patterns)

Do not read:

Poetry (unusual rhythm patterns can confuse the model)
Nothing but highly technical jargon (too narrow a range of vocabulary and intonation)
The same sentence repeatedly (the AI needs variety)

Quality Checklist Before Uploading

Before submitting your sample, verify:

[ ] No background noise audible when you pause speaking
[ ] No clipping (distortion on loud syllables)
[ ] Consistent volume throughout
[ ] No plosive pops on P and B sounds
[ ] No sibilance hiss on S sounds
[ ] At least 30 seconds of continuous speech (for instant cloning)
[ ] File is WAV or high-quality MP3
[ ] You sound natural, not stiff or overly careful
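Two items on the checklist, background noise in pauses and clipping, can be estimated from the samples themselves. The sketch below assumes the audio is already loaded as floats between -1.0 and 1.0. The 50 ms window, the 10th-percentile rule for the noise floor, and the 0.999 clipping threshold are assumptions chosen for illustration, and listening to the file is still the final check.

```python
import math


def window_levels_db(samples, rate, window_s=0.05):
    """Split the audio into short windows and return each window's RMS level in dB."""
    size = max(1, int(rate * window_s))
    levels = []
    for start in range(0, len(samples) - size + 1, size):
        window = samples[start:start + size]
        rms = math.sqrt(sum(s * s for s in window) / size)
        # Treat digital silence as a very low level instead of taking log(0).
        levels.append(20 * math.log10(rms) if rms > 0 else -120.0)
    return levels


def estimate_noise_floor(levels):
    """Use the 10th-percentile window level as a rough stand-in for the pauses."""
    if not levels:
        raise ValueError("no complete windows to analyse")
    ordered = sorted(levels)
    return ordered[len(ordered) // 10]


def count_clipped(samples, threshold=0.999):
    """Count samples at or beyond the threshold, which suggests clipping."""
    return sum(1 for s in samples if abs(s) >= threshold)
```

A large gap between the loudest windows and the estimated noise floor suggests quiet pauses, and a nonzero clipped-sample count is a reason to re-record at lower gain.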

Frequently Asked Questions

Can I use audio from a video call or podcast?

Yes, if the audio quality is decent. Extract the audio track and check it against the quality checklist above. Be aware that Zoom and Teams recordings often have compression artifacts that reduce clone quality.
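One common way to extract the audio track is the ffmpeg command-line tool, called here from Python. This sketch assumes ffmpeg is installed and on your PATH; the function names and file names are hypothetical. It writes mono 16-bit WAV at 44,100 Hz. Converting to WAV does not undo compression artifacts already present in the call recording.

```python
import subprocess


def build_extract_command(source: str, target: str, sample_rate: int = 44100) -> list:
    """Build an ffmpeg command that saves the audio track as mono 16-bit WAV."""
    return [
        "ffmpeg",
        "-i", source,             # input video or audio file
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample to the target rate
        "-c:a", "pcm_s16le",      # uncompressed 16-bit PCM
        target,
    ]


def extract_audio(source: str, target: str) -> None:
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_extract_command(source, target), check=True)
```

Calling extract_audio("call.mp4", "call.wav") would produce a file you can then run through the checklist.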

Does the language of the sample matter?

For ElevenLabs Multilingual v2, you can record in one language and generate in another. However, cloning works best when the sample language matches the output language.

Should I record in one session or multiple?

One session is preferred for consistency (same room, same mic position, same energy level). If you must split sessions, keep the setup identical.

For the cloning process itself, follow our voice cloning tutorial. For tool selection, see best voice cloning tools.
