Introduction

Training a custom AI singing voice model lets you create a digital vocalist that sounds like a specific person — yourself, a character, or an original voice. The two main tools are RVC (Retrieval-based Voice Conversion) and So-VITS-SVC (SoftVC VITS Singing Voice Conversion).

This guide is for intermediate to advanced users comfortable with Python, command-line tools, and GPU-accelerated computing.

RVC vs So-VITS-SVC

| Feature | RVC | So-VITS-SVC |
| --- | --- | --- |
| Quality | Very Good | Excellent |
| Training time | 20-60 min | 2-8 hours |
| VRAM required | 4GB minimum | 6GB minimum |
| Ease of setup | Easy (Web UI) | Complex |
| Real-time inference | Yes | Limited |
| Community support | Very Active | Active |
| Best for | Quick covers, experimentation | High-quality production |

Recommendation: Start with RVC. It is easier to set up and produces great results. Move to So-VITS-SVC when you need maximum quality for production use.

Dataset Preparation

The dataset is the most important factor in model quality.

Collecting Audio

For your own voice: Record 20-60 minutes of singing. Cover different styles, keys, and tempos. Include soft and loud passages.

For an existing voice: Collect clean vocal recordings. The best sources:

  • Isolated vocal tracks (from stems or vocal removal)
  • A cappella recordings
  • Live performance recordings with minimal backing

Avoid:

  • Heavy reverb or effects on the vocals
  • Multiple voices singing simultaneously
  • Poor quality recordings (lo-fi, phone recordings from far away)

Processing the Dataset

  1. Isolate vocals using Ultimate Vocal Remover (UVR5) if the audio has instrumentals
  2. Split into clips of 5-15 seconds each. RVC handles short clips better than one long file.
  3. Remove silence — trim dead air from the beginning and end of each clip
  4. Normalize volume — all clips should be roughly the same loudness
  5. Save as WAV — mono, 16-bit, at 44.1kHz or 48kHz (ideally matching the sample rate you will train at)
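Steps 2-4 above can be sketched in Python. This is a minimal illustration using plain NumPy arrays in place of a real audio I/O library; the silence threshold and the fixed 10-second clip length are assumptions you should tune for your material.

```python
import numpy as np

SR = 44_100             # sample rate (Hz)
CLIP_SEC = 10           # target clip length, within the 5-15 s range
SILENCE_THRESH = 0.01   # amplitude below this counts as silence (assumption)

def trim_silence(audio: np.ndarray, thresh: float = SILENCE_THRESH) -> np.ndarray:
    """Trim leading and trailing samples quieter than `thresh`."""
    loud = np.flatnonzero(np.abs(audio) > thresh)
    if loud.size == 0:
        return audio[:0]  # entirely silent clip
    return audio[loud[0] : loud[-1] + 1]

def normalize(audio: np.ndarray, peak: float = 0.95) -> np.ndarray:
    """Simple peak normalization: scale the clip so its peak sits at `peak`."""
    m = np.max(np.abs(audio))
    return audio if m == 0 else audio * (peak / m)

def split_clips(audio: np.ndarray, sr: int = SR, clip_sec: int = CLIP_SEC):
    """Cut one long take into fixed-length clips, dropping a too-short tail."""
    n = sr * clip_sec
    return [audio[i : i + n] for i in range(0, len(audio) - n + 1, n)]
```

In practice you would load each file with an audio library, run `split_clips`, then `trim_silence` and `normalize` on each clip before writing the WAVs back out.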

Dataset Size Guidelines

| Duration | Expected Quality | Training Time (RVC) |
| --- | --- | --- |
| 10 min | Decent | 20 min |
| 30 min | Good | 30-40 min |
| 60 min | Very Good | 40-60 min |
| 2+ hours | Excellent | 1-2 hours |

Training with RVC

Setup

  1. Clone the RVC Web UI repository from GitHub
  2. Run the one-click installer (Windows) or follow manual setup
  3. Launch the Web UI

Training Parameters

| Parameter | Recommended Value | Notes |
| --- | --- | --- |
| Epochs | 200-500 | More is not always better; watch for overfitting |
| Batch size | 4-8 | Depends on VRAM (lower for less VRAM) |
| Sample rate | 40000 or 48000 | Match your dataset |
| Version | V2 | Always use the latest |
| Save frequency | Every 50 epochs | So you can pick the best checkpoint |
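For reference, the recommendations above can be written down as a plain dictionary. Note that these key names are illustrative only, not RVC's actual configuration schema.

```python
# Illustrative only: key names do NOT match RVC's real config schema.
rvc_training_params = {
    "epochs": 300,               # within the 200-500 range; watch for overfitting
    "batch_size": 8,             # lower this if you run out of VRAM
    "sample_rate": 48_000,       # match your dataset (40000 or 48000)
    "version": "v2",             # always use the latest
    "save_every_n_epochs": 50,   # keep intermediate checkpoints to compare
}
```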

The Training Process

  1. Go to the Train tab in RVC Web UI
  2. Enter an experiment name
  3. Upload your processed audio dataset
  4. Set the parameters above
  5. Click "Process Data" (preprocesses audio, 1-5 minutes)
  6. Click "Feature Extraction" (extracts vocal features, 2-10 minutes)
  7. Click "Train" (the main training loop, 20-60 minutes)
  8. Monitor the training loss — it should decrease and then plateau
  9. When training finishes, test each saved checkpoint

Finding the Best Checkpoint

RVC saves model checkpoints at your specified interval. Not all checkpoints are equal:

  • Too few epochs: Voice is generic, not enough learning
  • Sweet spot: Voice sounds accurate, natural, and handles different pitches well
  • Too many epochs: Overfitting — voice sounds accurate on training-like input but artifacts appear on new content

Test each checkpoint with the same reference audio to find the sweet spot.
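A comparison like this is easiest to keep honest if you score each checkpoint the same way and record the results. The helper below is hypothetical (RVC has no such API): the scores stand in for your own listening ratings, or an objective metric, after converting the same reference clip with each checkpoint.

```python
def pick_best_checkpoint(scores: dict) -> str:
    """Return the checkpoint name with the highest score.

    `scores` maps checkpoint filenames to a quality rating you assign,
    e.g. after converting the same reference clip with each checkpoint
    and rating the output.
    """
    return max(scores, key=scores.get)
```

With ratings on a 1-10 scale, a mid-training checkpoint often wins — the sweet spot described above.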

Training with So-VITS-SVC

So-VITS-SVC produces higher quality results but requires more effort:

  1. Install via GitHub (requires Python 3.8, CUDA, and specific library versions)
  2. Prepare your dataset (same as RVC but requires longer clips, 10-30 seconds each)
  3. Preprocess: python preprocess_flist_config.py
  4. Extract features: python preprocess_hubert_f0.py
  5. Train: python train.py -c configs/config.json
  6. Training takes 2-8 hours depending on dataset and GPU
  7. Inference: python inference_main.py
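Since the preprocess-extract-train stages must run in order, a small driver script can chain them. A sketch using `subprocess`, with a dry-run flag for inspecting the commands first; the commands are the ones from the steps above, and the error handling is deliberately minimal.

```python
import subprocess

# The So-VITS-SVC stages, in order (commands from the steps above).
PIPELINE = [
    ["python", "preprocess_flist_config.py"],
    ["python", "preprocess_hubert_f0.py"],
    ["python", "train.py", "-c", "configs/config.json"],
]

def run_pipeline(dry_run: bool = False):
    """Run each stage in order; `check=True` stops on the first failure."""
    executed = []
    for cmd in PIPELINE:
        executed.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return executed
```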

The quality improvement over RVC is noticeable, but the setup complexity is significantly higher.

Optimization Tips

Augmentation: If your dataset is small (under 20 minutes), apply slight pitch shifts and time stretching to create variations. This helps the model generalize.
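One cheap way to create such variations is resampling, which shifts pitch and duration together. A NumPy sketch of the idea follows; note that a proper pitch shift that preserves duration needs a phase vocoder (e.g. a dedicated audio library), so treat this as an illustration, not production augmentation.

```python
import numpy as np

def resample_shift(audio: np.ndarray, semitones: float) -> np.ndarray:
    """Pitch-shift by resampling: raising pitch by `semitones` also
    shortens the clip by the same factor (pitch and tempo change together)."""
    factor = 2 ** (semitones / 12)   # frequency ratio per semitone
    n_out = int(round(len(audio) / factor))
    old_idx = np.arange(len(audio))
    new_idx = np.linspace(0, len(audio) - 1, n_out)
    # Linear interpolation at the new sample positions.
    return np.interp(new_idx, old_idx, audio)

def augment(clips, steps=(-2, -1, 1, 2)):
    """Return the original clips plus slightly pitch-shifted copies."""
    out = list(clips)
    for c in clips:
        out.extend(resample_shift(c, s) for s in steps)
    return out
```

Keep the shifts small (±1-2 semitones); large shifts produce unnatural formants that hurt more than they help.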

Mixed speaking and singing: Including some spoken audio (30% speaking, 70% singing) helps the model handle consonants and transitions better.

Pitch range training: Ensure your dataset covers the full pitch range the model will be used for. If you train only on mid-range notes, high and low notes will sound bad.

Post-processing: Apply subtle reverb and EQ to the output to smooth any artifacts. A touch of compression helps blend the AI vocals with instrumentals.

Frequently Asked Questions

How much VRAM do I need?

RVC: 4GB minimum, 8GB recommended. So-VITS-SVC: 6GB minimum, 12GB recommended. More VRAM allows larger batch sizes and faster training.

Can I train on a CPU?

Technically yes, but it is impractical. Training that takes 30 minutes on a GPU takes 10+ hours on a CPU.

How do I prevent overfitting?

Stop training when the loss plateaus. Use the checkpoint saving feature and test multiple checkpoints. If late checkpoints sound worse than earlier ones, you have overtrained.
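If you log the loss per epoch, plateau detection can be automated with a simple relative-improvement check. The window size and threshold below are assumptions to tune, not values from RVC or So-VITS-SVC.

```python
def has_plateaued(losses, window=5, min_rel_improvement=0.01):
    """True if the mean loss over the last `window` epochs improved by less
    than `min_rel_improvement` relative to the preceding window."""
    if len(losses) < 2 * window:
        return False  # not enough history to judge
    prev = sum(losses[-2 * window : -window]) / window
    last = sum(losses[-window:]) / window
    return (prev - last) / prev < min_rel_improvement
```

Even with an automated check, keep saving checkpoints and trust your ears: the lowest loss is not always the best-sounding model.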

Can I share my trained models?

Yes, many people share RVC models on community platforms. Be aware: sharing a model trained on a real person's voice without their consent may have legal implications.

For the beginner's approach, start with how to make AI sing. For the complete ecosystem, see our AI voice generator guide.