An any-to-many voice conversion model built on the Grad-TTS architecture and phonetic posteriorgrams (PPGs) from an SSL-based phoneme recognizer.
Requires Python >= 3.6 and <= 3.9:
- follow the instructions at https://pytorch.org/get-started/locally/ to install PyTorch for your setup,
- then run
    pip install -r requirements.txt
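Before installing the requirements, it can help to confirm that the interpreter falls inside the stated range. The following is a minimal sketch, not part of this repository: the helper name `python_version_supported` is hypothetical, and it assumes that "python<=3.9" covers every 3.9.x patch release.

```python
import sys

# Bounds taken from the requirement stated above (python>=3.6, python<=3.9).
MIN_VERSION = (3, 6)
MAX_VERSION = (3, 9)


def python_version_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version lies in the supported range.

    Hypothetical helper; only major and minor are compared, so any 3.9.x
    patch release counts as supported (an assumption).
    """
    return MIN_VERSION <= tuple(version_info[:2]) <= MAX_VERSION


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if python_version_supported():
        print("Python {} is within the supported range.".format(current))
    else:
        print("Python {} is outside the supported range 3.6 to 3.9.".format(current))
```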
- Prepare your multilingual corpora, then fill the filelists. Each line is formatted as <wavfile_path>|<speaker_num>|
- A pretrained HiFi-GAN vocoder is located at ./hifigan/g_00875000. You can continue training from it or train a new one.
- Extract PPGs:
    python preprocess_ppg.py --sr 16000 --in_dir /your/dataset/flacs --out_dir /your/dataset/ppgs
- Set the parameters defined in params.py.
- Run train_ppg.py to start training.
- Run inference, for example:
    python inference.py -f infer_for_test.txt -c ./logs/grad_300.pt -t 100
- This is NOT my original research. The code was mostly copied from work by Li Jingyi et al. at NERCMS, Wuhan University. I greatly appreciate their efforts.
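The filelist format used in the training steps above (<wavfile_path>|<speaker_num>| per line) can be generated with a short script. This is a sketch under stated assumptions, not the repository's own tooling: it assumes one subdirectory per speaker under the corpus root, assigns speaker numbers by sorted directory name starting at 0, and uses hypothetical function names (`build_filelist_lines`, `parse_filelist_line`). Check params.py and the training code for the numbering and paths they actually expect.

```python
import tempfile
from pathlib import Path


def build_filelist_lines(corpus_root, extension=".wav"):
    """Return '<wavfile_path>|<speaker_num>|' lines for a corpus.

    Assumed layout (not confirmed by the repository):
        <corpus_root>/<speaker_dir>/<audio files>
    Speaker numbers follow the sorted order of the speaker directories.
    """
    root = Path(corpus_root)
    speaker_dirs = sorted(p for p in root.iterdir() if p.is_dir())
    lines = []
    for speaker_num, speaker_dir in enumerate(speaker_dirs):
        for audio_path in sorted(speaker_dir.rglob("*" + extension)):
            lines.append("{}|{}|".format(audio_path.as_posix(), speaker_num))
    return lines


def parse_filelist_line(line):
    """Split one filelist line into (path, speaker_num), raising on bad input."""
    fields = line.strip().split("|")
    if len(fields) < 2 or not fields[0]:
        raise ValueError("malformed filelist line: {!r}".format(line))
    return fields[0], int(fields[1])


if __name__ == "__main__":
    # Demo on a throwaway directory so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        for speaker, name in [("spk_a", "a1.wav"), ("spk_b", "b1.wav")]:
            (Path(tmp) / speaker).mkdir()
            (Path(tmp) / speaker / name).touch()
        for line in build_filelist_lines(tmp):
            print(line)
```

For a real corpus, call `build_filelist_lines` with your dataset directory (passing `extension=".flac"` if your audio is stored as FLAC) and write the returned lines to the filelist, one per line. `parse_filelist_line` is useful for validating an existing filelist before training.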