Introduction
Training a custom AI singing voice model lets you create a digital vocalist that sounds like a specific person — yourself, a character, or an original voice. The two main tools are RVC (Retrieval-based Voice Conversion) and So-VITS-SVC (SoftVC VITS Singing Voice Conversion).
This guide is for intermediate to advanced users comfortable with Python, command-line tools, and GPU-accelerated computing.
RVC vs So-VITS-SVC
| Feature | RVC | So-VITS-SVC |
|---|---|---|
| Quality | Very Good | Excellent |
| Training time | 20-60 min | 2-8 hours |
| VRAM required | 4GB minimum | 6GB minimum |
| Ease of setup | Easy (Web UI) | Complex |
| Real-time inference | Yes | Limited |
| Community support | Very Active | Active |
| Best for | Quick covers, experimentation | High-quality production |
Recommendation: Start with RVC. It is easier to set up and produces great results. Move to So-VITS-SVC when you need maximum quality for production use.
Dataset Preparation
The dataset is the most important factor in model quality.
Collecting Audio
For your own voice: Record 20-60 minutes of singing. Cover different styles, keys, and tempos. Include soft and loud passages.
For an existing voice: Collect clean vocal recordings. The best sources:
- Isolated vocal tracks (from stems or vocal removal)
- A cappella recordings
- Live performance recordings with minimal backing
Avoid:
- Heavy reverb or effects on the vocals
- Multiple voices singing simultaneously
- Poor quality recordings (lo-fi, phone recordings from far away)
Processing the Dataset
- Isolate vocals using Ultimate Vocal Remover (UVR5) if the audio has instrumentals
- Split into clips of 5-15 seconds each. RVC handles short clips better than one long file.
- Remove silence — trim dead air from the beginning and end of each clip
- Normalize volume — all clips should be roughly the same loudness
- Save as WAV at 44.1kHz or 48kHz, 16-bit, mono (RVC resamples to your chosen training rate during preprocessing, so the source rate just needs to be clean and consistent)
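The processing steps above can be sketched in a few lines of Python. This is a minimal illustration using numpy for silence trimming, peak normalization, and clip splitting — the amplitude threshold and clip length are illustrative assumptions, and a real pipeline would typically work from UVR5's output with a library like librosa or pydub:

```python
import numpy as np

SAMPLE_RATE = 44100  # source rate from the guidelines above

def trim_silence(audio: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    """Drop leading/trailing samples whose amplitude is below the threshold."""
    loud = np.flatnonzero(np.abs(audio) > threshold)
    if loud.size == 0:
        return audio[:0]  # clip is entirely silent
    return audio[loud[0] : loud[-1] + 1]

def peak_normalize(audio: np.ndarray, peak: float = 0.9) -> np.ndarray:
    """Scale the clip so its loudest sample sits at `peak`."""
    if audio.size == 0:
        return audio
    max_amp = np.max(np.abs(audio))
    return audio if max_amp == 0 else audio * (peak / max_amp)

def split_into_clips(audio: np.ndarray, clip_seconds: float = 10.0) -> list:
    """Cut a long take into fixed-length clips (5-15 s works well for RVC)."""
    clip_len = int(clip_seconds * SAMPLE_RATE)
    return [audio[i : i + clip_len] for i in range(0, len(audio), clip_len)]

# Synthetic demo: 30 s of tone padded with 1 s of silence on each side
t = np.linspace(0, 30, 30 * SAMPLE_RATE, endpoint=False)
take = np.concatenate([np.zeros(SAMPLE_RATE),
                       0.3 * np.sin(2 * np.pi * 220 * t),
                       np.zeros(SAMPLE_RATE)])
clips = [peak_normalize(trim_silence(c)) for c in split_into_clips(take)]
```

Splitting before trimming (as here) means each clip loses only its own dead air; a smarter splitter would cut at natural pauses rather than fixed intervals.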
Dataset Size Guidelines
| Duration | Expected Quality | Training Time (RVC) |
|---|---|---|
| 10 min | Decent | 20 min |
| 30 min | Good | 30-40 min |
| 60 min | Very Good | 40-60 min |
| 2+ hours | Excellent | 1-2 hours |
Training with RVC
Setup
- Clone the RVC Web UI repository from GitHub
- Run the one-click installer (Windows) or follow manual setup
- Launch the Web UI
Training Parameters
| Parameter | Recommended Value | Notes |
|---|---|---|
| Epochs | 200-500 | More is not always better; watch for overfitting |
| Batch size | 4-8 | Depends on VRAM (lower for less VRAM) |
| Sample rate | 40000 or 48000 | Match your dataset |
| Version | V2 | Always use the latest |
| Save frequency | Every 50 epochs | So you can pick the best checkpoint |
The Training Process
- Go to the Train tab in RVC Web UI
- Enter an experiment name
- Upload your processed audio dataset
- Set the parameters above
- Click "Process Data" (preprocesses audio, 1-5 minutes)
- Click "Feature Extraction" (extracts vocal features, 2-10 minutes)
- Click "Train" (the main training loop, 20-60 minutes)
- Monitor the training loss — it should decrease and then plateau
- When training finishes, test each saved checkpoint
Finding the Best Checkpoint
RVC saves model checkpoints at your specified interval. Not all checkpoints are equal:
- Too few epochs: Voice is generic, not enough learning
- Sweet spot: Voice sounds accurate, natural, and handles different pitches well
- Too many epochs: Overfitting — voice sounds accurate on training-like input but artifacts appear on new content
Test each checkpoint with the same reference audio to find the sweet spot.
Training with So-VITS-SVC
So-VITS-SVC produces higher quality results but requires more effort:
- Install via GitHub (requires Python 3.8, CUDA, and specific library versions)
- Prepare your dataset (same as RVC but requires longer clips, 10-30 seconds each)
- Preprocess: `python preprocess_flist_config.py`
- Extract features: `python preprocess_hubert_f0.py`
- Train: `python train.py -c configs/config.json` (training takes 2-8 hours depending on dataset and GPU)
- Inference: `python inference_main.py`
The quality improvement over RVC is noticeable but the setup complexity is significantly higher.
Optimization Tips
Augmentation: If your dataset is small (under 20 minutes), apply slight pitch shifts and time stretching to create variations. This helps the model generalize.
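One cheap way to generate variations is a small speed change via resampling. The sketch below uses plain numpy interpolation; note that this couples pitch and tempo (a 5% speed-up also raises pitch by ~5%), whereas tools like librosa offer independent `pitch_shift` and `time_stretch`, which is what you would use in practice. The 0.95/1.05 factors are illustrative:

```python
import numpy as np

def speed_change(audio: np.ndarray, factor: float) -> np.ndarray:
    """Resample by `factor` via linear interpolation.
    factor > 1 plays faster and higher-pitched; < 1 slower and lower.
    Crude: pitch and tempo shift together, unlike librosa's separate
    pitch_shift / time_stretch."""
    n_out = int(round(len(audio) / factor))
    positions = np.linspace(0, len(audio) - 1, n_out)
    return np.interp(positions, np.arange(len(audio)), audio)

def augment(clip: np.ndarray, factors=(0.95, 1.05)) -> list:
    """Return the original clip plus slightly slowed/sped-up variants."""
    return [clip] + [speed_change(clip, f) for f in factors]

clip = np.sin(2 * np.pi * 220 * np.linspace(0, 1, 44100, endpoint=False))
variants = augment(clip)
```

Keep the shifts subtle (a few percent); aggressive augmentation teaches the model artifacts instead of the voice.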
Mixed speaking and singing: Including some spoken audio (30% speaking, 70% singing) helps the model handle consonants and transitions better.
Pitch range training: Ensure your dataset covers the full pitch range the model will be used for. If you train only on mid-range notes, high and low notes will sound bad.
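You can sanity-check your dataset's pitch coverage before training. The sketch below estimates f0 with a zero-crossing count — a deliberately crude method that only works on clean monophonic vocals; real pipelines use proper pitch extractors (e.g. harvest or crepe). The function and thresholds are illustrative, not part of any toolchain:

```python
import numpy as np

def estimate_f0(audio: np.ndarray, sr: int) -> float:
    """Crude f0 estimate: a periodic signal crosses zero twice per cycle,
    so f0 ~= (sign changes / 2) / duration. Use a real pitch extractor
    (harvest, crepe) for anything beyond a quick sanity check."""
    crossings = np.count_nonzero(np.diff(np.signbit(audio)))
    return crossings / 2 * sr / len(audio)

sr = 44100
# Two synthetic "clips": A3 (220 Hz) and A4 (440 Hz)
a3 = np.sin(2 * np.pi * 220 * np.linspace(0, 1, sr, endpoint=False))
a4 = np.sin(2 * np.pi * 440 * np.linspace(0, 1, sr, endpoint=False))
f0s = [estimate_f0(c, sr) for c in (a3, a4)]
covered = (min(f0s), max(f0s))  # the dataset's estimated pitch range
```

If the covered range is narrower than the songs you plan to convert, record or collect more material at the missing extremes before training.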
Post-processing: Apply subtle reverb and EQ to the output to smooth any artifacts. A touch of compression helps blend the AI vocals with instrumentals.
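To make the post-processing idea concrete, here are toy numpy versions of the two effects: a one-pole low-pass filter (a stand-in for gentle EQ that softens harsh artifacts) and a single feedback delay line (the simplest possible "reverb"). These are illustrations of the signal flow only — in practice you would do this in a DAW with proper plugins:

```python
import numpy as np

def one_pole_lowpass(audio: np.ndarray, alpha: float = 0.2) -> np.ndarray:
    """Gentle high-frequency roll-off; higher alpha = less filtering."""
    out = np.empty_like(audio)
    acc = 0.0
    for i, x in enumerate(audio):
        acc += alpha * (x - acc)  # move toward the input a little each sample
        out[i] = acc
    return out

def simple_reverb(audio: np.ndarray, sr: int,
                  delay_ms: float = 80, decay: float = 0.3) -> np.ndarray:
    """Single feedback delay line: each echo is `decay` times the last."""
    d = int(sr * delay_ms / 1000)
    out = audio.astype(float).copy()
    for i in range(d, len(out)):
        out[i] += decay * out[i - d]
    return out

# Impulse response demo at a small sample rate for clarity
sr = 1000
impulse = np.zeros(1000)
impulse[0] = 1.0
wet = simple_reverb(one_pole_lowpass(impulse), sr)
```

The per-sample Python loops are fine for short demos; for real audio you would reach for `scipy.signal.lfilter` or a convolution reverb.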
Frequently Asked Questions
How much VRAM do I need?
RVC: 4GB minimum, 8GB recommended. So-VITS-SVC: 6GB minimum, 12GB recommended. More VRAM allows larger batch sizes and faster training.
Can I train on a CPU?
Technically yes, but it is impractical. Training that takes 30 minutes on a GPU takes 10+ hours on a CPU.
How do I prevent overfitting?
Stop training when the loss plateaus. Use the checkpoint saving feature and test multiple checkpoints. If late checkpoints sound worse than earlier ones, you have overtrained.
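"Stop when the loss plateaus" can be made concrete with a small helper that scans the loss history for the point where a moving average stops improving. The window size and 1% improvement threshold are illustrative assumptions — in practice you would also eyeball the TensorBoard curves and, above all, listen to the checkpoints:

```python
def plateau_epoch(losses, window=5, min_improvement=0.01):
    """Return the epoch index where the `window`-epoch mean loss first
    improves by less than `min_improvement` (relative), or None if the
    loss was still falling at the end of training."""
    if len(losses) < window + 1:
        return None
    means = [sum(losses[i:i + window]) / window
             for i in range(len(losses) - window + 1)]
    for i in range(1, len(means)):
        if means[i - 1] - means[i] < min_improvement * means[i - 1]:
            return i + window - 1  # epoch where the plateau first shows
    return None
```

A checkpoint saved shortly before the plateau epoch is usually a good starting candidate; ears make the final call.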
Can I share my trained models?
Yes, many people share RVC models on community platforms. Be aware: sharing a model trained on a real person's voice without their consent may have legal implications.
For a beginner-friendly approach, start with our guide on how to make AI sing. For the complete ecosystem, see our AI voice generator guide.