This is the official PyTorch implementation of the following paper:
BDDM: BILATERAL DENOISING DIFFUSION MODELS FOR FAST AND HIGH-QUALITY SPEECH SYNTHESIS
Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu
Abstract: Diffusion probabilistic models (DPMs) and their extensions have emerged as competitive generative models yet confront challenges of efficient sampling. We propose a new bilateral denoising diffusion model (BDDM) that parameterizes both the forward and reverse processes with a schedule network and a score network, which can train with a novel bilateral modeling objective. We show that the new surrogate objective can achieve a lower bound of the log marginal likelihood tighter than a conventional surrogate. We also find that BDDM allows inheriting pre-trained score network parameters from any DPMs and consequently enables speedy and stable learning of the schedule network and optimization of a noise schedule for sampling. Our experiments demonstrate that BDDMs can generate high-fidelity audio samples with as few as three sampling steps. Moreover, compared to other state-of-the-art diffusion-based neural vocoders, BDDMs produce comparable or higher quality samples indistinguishable from human speech, notably with only seven sampling steps (143x faster than WaveGrad and 28.6x faster than DiffWave).
Paper: Published at ICLR 2022 on OpenReview
This implementation supports model training and audio generation, and also provides pre-trained models for the benchmark LJSpeech and VCTK datasets.
Visit our demo page for audio samples.
- May 20, 2021: Released our follow-up work FastDiff on GitHub, where we further optimized the speed-and-quality trade-off.
- May 10, 2021: Added the experiment configurations and model checkpoints for the VCTK dataset.
- May 9, 2021: Added the searched noise schedules for the LJSpeech and VCTK datasets.
- March 20, 2021: Released the PyTorch implementation of BDDM with pre-trained models for the LJSpeech dataset.
- (Option 1) To train the BDDM scheduling network yourself, you can download the pre-trained score network from philsyn/DiffWave-Vocoder (provided at egs/lj/DiffWave.pkl), and follow the training steps below. (Start from Step I.)
- (Option 2) To search for noise schedules using BDDM, we provide a pre-trained BDDM for LJSpeech at egs/lj/DiffWave-GALR.pkl and for VCTK at egs/vctk/DiffWave-GALR.pkl. (Start from Step III.)
- (Option 3) To directly generate samples using BDDM, we provide the searched schedules for LJSpeech at egs/lj/noise_schedules and for VCTK at egs/vctk/noise_schedules (check conf.yml for the respective configurations). (Start from Step IV.)
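As a rough aid for choosing among the three options, the sketch below checks which of the LJSpeech assets named above exist in a local clone and reports the most advanced step one could start from. The function name `latest_start_step` and the `repo_root` argument are hypothetical and not part of this repository; only the relative paths are taken from the list above.

```python
from pathlib import Path

# (relative asset path from the options above, step to start from if present)
# Ordered from the most advanced starting point to the least.
LJ_ASSETS = [
    ("egs/lj/noise_schedules", "Step IV"),     # Option 3: searched schedules
    ("egs/lj/DiffWave-GALR.pkl", "Step III"),  # Option 2: pre-trained BDDM
    ("egs/lj/DiffWave.pkl", "Step I"),         # Option 1: pre-trained score network
]


def latest_start_step(repo_root):
    """Return (asset, step) for the most advanced asset found, or None."""
    root = Path(repo_root)
    for rel_path, step in LJ_ASSETS:
        if (root / rel_path).exists():
            return rel_path, step
    return None
```

Calling `latest_start_step(".")` from the repository root returns `None` when none of the three assets has been downloaded yet.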
We provide an example of how you can generate high-fidelity samples using BDDMs.
To try BDDM on your own dataset, clone this repo on a local machine with an NVIDIA GPU, CUDA, and cuDNN, and follow the instructions below.
Download the LJSpeech dataset.
For training, we first need to set up a conf.yml file that configures the data loader, the score and schedule networks, the training procedure, and the noise scheduling and sampling parameters.
Note: Modify the paths in "train_data_dir" and "valid_data_dir" for training, and the path in "gen_data_dir" for sampling. Each path should point to a directory that stores the waveform audio files (in .wav) or the Mel-spectrogram files (in .mel).
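Before launching a run, it can help to confirm that each configured directory exists and actually contains .wav or .mel files. The following is a minimal sketch, not part of this repository: it assumes the three keys have already been read from conf.yml into a flat Python dict, and the helper names `count_data_files` and `check_data_dirs` are hypothetical. It only looks at files directly inside each directory; whether the repository's data loader also searches subdirectories is not stated here.

```python
from pathlib import Path

# Extensions named in the note above: waveforms (.wav) or Mel-spectrograms (.mel).
AUDIO_EXTS = {".wav", ".mel"}


def count_data_files(data_dir, exts=AUDIO_EXTS):
    """Count files directly inside data_dir, grouped by extension in exts."""
    counts = {ext: 0 for ext in sorted(exts)}
    for path in Path(data_dir).iterdir():
        suffix = path.suffix.lower()
        if path.is_file() and suffix in counts:
            counts[suffix] += 1
    return counts


def check_data_dirs(conf):
    """Return a list of problems for the three data-dir keys in conf (a dict)."""
    problems = []
    for key in ("train_data_dir", "valid_data_dir", "gen_data_dir"):
        value = conf.get(key)
        if not value or not Path(value).is_dir():
            problems.append(f"{key}: not a directory: {value!r}")
        elif sum(count_data_files(value).values()) == 0:
            problems.append(f"{key}: no .wav or .mel files in {value!r}")
    return problems
```

An empty list from `check_data_dirs` means all three paths point at directories with at least one usable file.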
Suppose that a well-trained score network (theta) is stored at $theta_path. We start by setting "load": $theta_path in conf.yml.
After modifying the relevant hyperparameters for the schedule network (especially "tau"), we can train the schedule network (f_phi in the paper) using:
# Training on device 0 (supports multi-GPU training)
sh train.sh 0 conf.yml
Note: In practice, we found that 10K training steps are enough to obtain a promising schedule network. This normally takes no more than half an hour on one GPU.
Given a well-trained BDDM (theta, phi), we can now run the noise scheduling algorithm to find the best schedule (optimizing the trade-off between quality and speed).
First, we set "load" in conf.yml to the path of the trained BDDM.
After setting the maximum number of sampling steps in scheduling ("N"), we run:
# Scheduling on device 0 (only supports single-GPU scheduling)
sh schedule.sh 0 conf.yml
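To make the role of "N" concrete, the sketch below checks that a candidate schedule has at most N steps and computes how much signal survives after each step. It uses the common DDPM bookkeeping (alpha_t = 1 - beta_t, with a running product), which is an assumption here: the repository may store or parameterize its searched schedules differently, and this is not the BDDM scheduling algorithm itself. The function name and the example values are hypothetical.

```python
def summarize_schedule(betas, max_steps):
    """Validate a noise schedule and return its cumulative signal levels.

    Assumes the common DDPM convention alpha_t = 1 - beta_t and
    alpha_bar_t = product of alpha_s for s <= t.
    """
    if not 0 < len(betas) <= max_steps:
        raise ValueError(f"expected 1..{max_steps} steps, got {len(betas)}")
    if any(not 0.0 < b < 1.0 for b in betas):
        raise ValueError("every beta must lie strictly between 0 and 1")
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta  # fraction of signal kept after this step
        alpha_bars.append(running)
    return alpha_bars


# Hypothetical 3-step schedule, for illustration only (not a searched result).
example_betas = [1e-4, 1e-2, 0.5]
levels = summarize_schedule(example_betas, max_steps=7)
print([round(a, 4) for a in levels])
```

A shorter schedule means fewer network evaluations at sampling time, which is the speed side of the quality and speed trade-off mentioned above.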
For evaluation, we set "gen_data_dir" in conf.yml to the path of a directory that stores the test set of audio files (in .wav).
For generation, we set "gen_data_dir" in conf.yml to the path of a directory that stores the Mel-spectrograms (by default in .mel, generated by TacotronSTFT or by our dataset loader bddm/loader/dataset.py).
Then, we run:
# Generation/evaluation on device 0 (only supports single-GPU generation)
sh generate.sh 0 conf.yml
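The training, scheduling, and generation commands in this README all share one shape: a script name, a device index, and a config path. For driving them from Python, a minimal wrapper could look like the sketch below. The helper names `build_command` and `run_stage` are hypothetical and not part of this repository; the script names and argument order are taken from the commands shown in this README.

```python
import subprocess

# Script names as given in this README; each takes a device index and a config path.
SCRIPTS = {
    "train": "train.sh",
    "schedule": "schedule.sh",
    "generate": "generate.sh",
}


def build_command(stage, device=0, conf="conf.yml"):
    """Return the argv list for one stage, e.g. ['sh', 'train.sh', '0', 'conf.yml']."""
    if stage not in SCRIPTS:
        raise ValueError(f"unknown stage {stage!r}; expected one of {sorted(SCRIPTS)}")
    return ["sh", SCRIPTS[stage], str(device), conf]


def run_stage(stage, device=0, conf="conf.yml"):
    """Run one stage from the repository root and raise if the script fails."""
    subprocess.run(build_command(stage, device, conf), check=True)
```

Note that "load" (and, for generation, "gen_data_dir") in conf.yml still has to be updated between stages as described above, so the three stages are not meant to be run back to back with an unchanged config.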
This implementation uses parts of the code from the following GitHub repos, as described in our code:
- Tacotron2
- DiffWave-Vocoder
@inproceedings{lam2022bddm,
title={BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis},
author={Lam, Max WY and Wang, Jun and Su, Dan and Yu, Dong},
booktitle={International Conference on Learning Representations},
year={2022}
}
Copyright 2022 Tencent
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
This is not an officially supported Tencent product.