Introduction
Training a custom AI singing voice model lets you create a digital vocalist that sounds like a specific person — yourself, a character, or an original voice. The two main tools are RVC (Retrieval-based Voice Conversion) and So-VITS-SVC (SoftVC VITS Singing Voice Conversion).
This guide is for intermediate to advanced users comfortable with Python, command-line tools, and GPU-accelerated computing.
RVC vs So-VITS-SVC
| Feature | RVC | So-VITS-SVC |
|---|---|---|
| Quality | Very Good | Excellent |
| Training time | 20-60 min | 2-8 hours |
| VRAM required | 4GB minimum | 6GB minimum |
| Ease of setup | Easy (Web UI) | Complex |
| Real-time inference | Yes | Limited |
| Community support | Very Active | Active |
| Best for | Quick covers, experimentation | High-quality production |
Recommendation: Start with RVC. It is easier to set up and produces great results. Move to So-VITS-SVC when you need maximum quality for production use.
Dataset Preparation
The dataset is the most important factor in model quality.
Collecting Audio
For your own voice: Record 20-60 minutes of singing. Cover different styles, keys, and tempos. Include soft and loud passages.
For an existing voice: Collect clean vocal recordings. The best sources:
- Isolated vocal tracks (from stems or vocal removal)
- A cappella recordings
- Live performance recordings with minimal backing
Avoid:
- Heavy reverb or effects on the vocals
- Multiple voices singing simultaneously
- Poor quality recordings (lo-fi, phone recordings from far away)
Processing the Dataset
- Isolate vocals using Ultimate Vocal Remover (UVR5) if the audio has instrumentals
- Split into clips of 5-15 seconds each. RVC handles short clips better than one long file.
- Remove silence — trim dead air from the beginning and end of each clip
- Normalize volume — all clips should be roughly the same loudness
- Save as WAV at 44.1kHz or 48kHz, 16-bit, mono (RVC resamples to your chosen training rate during preprocessing, so the source rate just needs to be clean and consistent)
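The processing steps above can be sketched in a few lines of Python. This is a minimal illustration using numpy for silence trimming, peak normalization, and clip splitting — the amplitude threshold and clip length are illustrative assumptions, and a real pipeline would typically work from UVR5's output with a library like librosa or pydub:

```python
import numpy as np

SAMPLE_RATE = 44100  # source rate from the guidelines above

def trim_silence(audio: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    """Drop leading/trailing samples whose amplitude is below the threshold."""
    loud = np.flatnonzero(np.abs(audio) > threshold)
    if loud.size == 0:
        return audio[:0]  # clip is entirely silent
    return audio[loud[0] : loud[-1] + 1]

def peak_normalize(audio: np.ndarray, peak: float = 0.9) -> np.ndarray:
    """Scale the clip so its loudest sample sits at `peak`."""
    if audio.size == 0:
        return audio
    max_amp = np.max(np.abs(audio))
    return audio if max_amp == 0 else audio * (peak / max_amp)

def split_into_clips(audio: np.ndarray, clip_seconds: float = 10.0) -> list:
    """Cut a long take into fixed-length clips (5-15 s works well for RVC)."""
    clip_len = int(clip_seconds * SAMPLE_RATE)
    return [audio[i : i + clip_len] for i in range(0, len(audio), clip_len)]

# Synthetic demo: 30 s of tone padded with 1 s of silence on each side
t = np.linspace(0, 30, 30 * SAMPLE_RATE, endpoint=False)
take = np.concatenate([np.zeros(SAMPLE_RATE),
                       0.3 * np.sin(2 * np.pi * 220 * t),
                       np.zeros(SAMPLE_RATE)])
clips = [peak_normalize(trim_silence(c)) for c in split_into_clips(take)]
```

Splitting before trimming (as here) means each clip loses only its own dead air; a smarter splitter would cut at natural pauses rather than fixed intervals.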
Dataset Size Guidelines
| Duration | Expected Quality | Training Time (RVC) |
|---|---|---|
| 10 min | Decent | 20 min |
| 30 min | Good | 30-40 min |
| 60 min | Very Good | 40-60 min |
| 2+ hours | Excellent | 1-2 hours |
Training with RVC
Setup
- Clone the RVC Web UI repository from GitHub
- Run the one-click installer (Windows) or follow manual setup
- Launch the Web UI
Training Parameters
| Parameter | Recommended Value | Notes |
|---|---|---|
| Epochs | 200-500 | More is not always better; watch for overfitting |
| Batch size | 4-8 | Depends on VRAM (lower for less VRAM) |
| Sample rate | 40000 or 48000 | Match your dataset |
| Version | V2 | Always use the latest |
| Save frequency | Every 50 epochs | So you can pick the best checkpoint |
The Training Process
- Go to the Train tab in RVC Web UI
- Enter an experiment name
- Upload your processed audio dataset
- Set the parameters above
- Click "Process Data" (preprocesses audio, 1-5 minutes)
- Click "Feature Extraction" (extracts vocal features, 2-10 minutes)
- Click "Train" (the main training loop, 20-60 minutes)
- Monitor the training loss — it should decrease and then plateau
- When training finishes, test each saved checkpoint
Finding the Best Checkpoint
RVC saves model checkpoints at your specified interval. Not all checkpoints are equal:
- Too few epochs: Voice is generic, not enough learning
- Sweet spot: Voice sounds accurate, natural, and handles different pitches well
- Too many epochs: Overfitting — voice sounds accurate on training-like input but artifacts appear on new content
Test each checkpoint with the same reference audio to find the sweet spot.
Training with So-VITS-SVC
So-VITS-SVC produces higher quality results but requires more effort:
- Install via GitHub (requires Python 3.8, CUDA, and specific library versions)
- Prepare your dataset (same as RVC but requires longer clips, 10-30 seconds each)
- Preprocess: `python preprocess_flist_config.py`
- Extract features: `python preprocess_hubert_f0.py`
- Train: `python train.py -c configs/config.json` (training takes 2-8 hours depending on dataset and GPU)
- Inference: `python inference_main.py`
The quality improvement over RVC is noticeable but the setup complexity is significantly higher.
Optimization Tips
Augmentation: If your dataset is small (under 20 minutes), apply slight pitch shifts and time stretching to create variations. This helps the model generalize.
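One cheap way to generate variations is a small speed change via resampling. The sketch below uses plain numpy interpolation; note that this couples pitch and tempo (a 5% speed-up also raises pitch by ~5%), whereas tools like librosa offer independent `pitch_shift` and `time_stretch`, which is what you would use in practice. The 0.95/1.05 factors are illustrative:

```python
import numpy as np

def speed_change(audio: np.ndarray, factor: float) -> np.ndarray:
    """Resample by `factor` via linear interpolation.
    factor > 1 plays faster and higher-pitched; < 1 slower and lower.
    Crude: pitch and tempo shift together, unlike librosa's separate
    pitch_shift / time_stretch."""
    n_out = int(round(len(audio) / factor))
    positions = np.linspace(0, len(audio) - 1, n_out)
    return np.interp(positions, np.arange(len(audio)), audio)

def augment(clip: np.ndarray, factors=(0.95, 1.05)) -> list:
    """Return the original clip plus slightly slowed/sped-up variants."""
    return [clip] + [speed_change(clip, f) for f in factors]

clip = np.sin(2 * np.pi * 220 * np.linspace(0, 1, 44100, endpoint=False))
variants = augment(clip)
```

Keep the shifts subtle (a few percent); aggressive augmentation teaches the model artifacts instead of the voice.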
Mixed speaking and singing: Including some spoken audio (30% speaking, 70% singing) helps the model handle consonants and transitions better.
Pitch range training: Ensure your dataset covers the full pitch range the model will be used for. If you train only on mid-range notes, high and low notes will sound bad.
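You can sanity-check your dataset's pitch coverage before training. The sketch below estimates f0 with a zero-crossing count — a deliberately crude method that only works on clean monophonic vocals; real pipelines use proper pitch extractors (e.g. harvest or crepe). The function and thresholds are illustrative, not part of any toolchain:

```python
import numpy as np

def estimate_f0(audio: np.ndarray, sr: int) -> float:
    """Crude f0 estimate: a periodic signal crosses zero twice per cycle,
    so f0 ~= (sign changes / 2) / duration. Use a real pitch extractor
    (harvest, crepe) for anything beyond a quick sanity check."""
    crossings = np.count_nonzero(np.diff(np.signbit(audio)))
    return crossings / 2 * sr / len(audio)

sr = 44100
# Two synthetic "clips": A3 (220 Hz) and A4 (440 Hz)
a3 = np.sin(2 * np.pi * 220 * np.linspace(0, 1, sr, endpoint=False))
a4 = np.sin(2 * np.pi * 440 * np.linspace(0, 1, sr, endpoint=False))
f0s = [estimate_f0(c, sr) for c in (a3, a4)]
covered = (min(f0s), max(f0s))  # the dataset's estimated pitch range
```

If the covered range is narrower than the songs you plan to convert, record or collect more material at the missing extremes before training.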
Post-processing: Apply subtle reverb and EQ to the output to smooth any artifacts. A touch of compression helps blend the AI vocals with instrumentals.
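To make the post-processing idea concrete, here are toy numpy versions of the two effects: a one-pole low-pass filter (a stand-in for gentle EQ that softens harsh artifacts) and a single feedback delay line (the simplest possible "reverb"). These are illustrations of the signal flow only — in practice you would do this in a DAW with proper plugins:

```python
import numpy as np

def one_pole_lowpass(audio: np.ndarray, alpha: float = 0.2) -> np.ndarray:
    """Gentle high-frequency roll-off; higher alpha = less filtering."""
    out = np.empty_like(audio)
    acc = 0.0
    for i, x in enumerate(audio):
        acc += alpha * (x - acc)  # move toward the input a little each sample
        out[i] = acc
    return out

def simple_reverb(audio: np.ndarray, sr: int,
                  delay_ms: float = 80, decay: float = 0.3) -> np.ndarray:
    """Single feedback delay line: each echo is `decay` times the last."""
    d = int(sr * delay_ms / 1000)
    out = audio.astype(float).copy()
    for i in range(d, len(out)):
        out[i] += decay * out[i - d]
    return out

# Impulse response demo at a small sample rate for clarity
sr = 1000
impulse = np.zeros(1000)
impulse[0] = 1.0
wet = simple_reverb(one_pole_lowpass(impulse), sr)
```

The per-sample Python loops are fine for short demos; for real audio you would reach for `scipy.signal.lfilter` or a convolution reverb.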
Frequently Asked Questions
How much VRAM do I need?
RVC: 4GB minimum, 8GB recommended. So-VITS-SVC: 6GB minimum, 12GB recommended. More VRAM allows larger batch sizes and faster training.
Can I train on a CPU?
Technically yes, but it is impractical. Training that takes 30 minutes on a GPU takes 10+ hours on a CPU.
How do I prevent overfitting?
Stop training when the loss plateaus. Use the checkpoint saving feature and test multiple checkpoints. If late checkpoints sound worse than earlier ones, you have overtrained.
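"Stop when the loss plateaus" can be made concrete with a small helper that scans the loss history for the point where a moving average stops improving. The window size and 1% improvement threshold are illustrative assumptions — in practice you would also eyeball the TensorBoard curves and, above all, listen to the checkpoints:

```python
def plateau_epoch(losses, window=5, min_improvement=0.01):
    """Return the epoch index where the `window`-epoch mean loss first
    improves by less than `min_improvement` (relative), or None if the
    loss was still falling at the end of training."""
    if len(losses) < window + 1:
        return None
    means = [sum(losses[i:i + window]) / window
             for i in range(len(losses) - window + 1)]
    for i in range(1, len(means)):
        if means[i - 1] - means[i] < min_improvement * means[i - 1]:
            return i + window - 1  # epoch where the plateau first shows
    return None
```

A checkpoint saved shortly before the plateau epoch is usually a good starting candidate; ears make the final call.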
Can I share my trained models?
Yes, many people share RVC models on community platforms. Be aware: sharing a model trained on a real person's voice without their consent may have legal implications.
For a beginner-friendly approach, start with our guide on how to make AI sing. For the complete ecosystem, see our AI voice generator guide.