The objective of speech enhancement is to transform a noisy audio signal so as to improve its quality by removing the frequency components that external noise added to the original clean signal. Over the years, the trend has shifted towards end-to-end deep-learning approaches, where the noisy signal is fed directly into a multi-layer neural network (in either the waveform or the frequency domain) and an enhanced version of the signal is produced as output. Although these approaches achieve state-of-the-art results, they fail to take into account that, depending on the kind of noise, not every speech frame of the input signal actually needs enhancement: some frames are not affected by noise at all, or only by a negligible amount. Applying enhancement to such frames often degrades their quality instead of improving it. This project addresses the issue with SF2Net, an independent add-on to existing end-to-end models. SF2Net classifies each frame as to whether it requires enhancement, and can then optionally post-process the output of an end-to-end model by replacing the enhanced frames that it detected as not requiring enhancement with the original ones (with some smoothing applied), thereby preserving the original quality of the signal in those sections.
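The post-processing idea can be sketched as follows. This is only an illustration, not the actual SF2Net implementation: the function name, the per-frame boolean mask, and the linear crossfade used for smoothing are all assumptions.

```python
import numpy as np

def patch_enhanced(noisy, enhanced, keep_noisy, frame_len, fade_len=32):
    """Copy back the noisy frames flagged as not needing enhancement,
    crossfading at decision boundaries to avoid audible clicks.
    `keep_noisy[i]` is True when frame i should keep the original signal."""
    out = enhanced.astype(float).copy()
    for i, keep in enumerate(keep_noisy):
        if keep:
            s, e = i * frame_len, min((i + 1) * frame_len, len(out))
            out[s:e] = noisy[s:e]
    # Linear crossfade around each boundary where the decision flips
    for i in range(1, len(keep_noisy)):
        if keep_noisy[i] != keep_noisy[i - 1]:
            b = i * frame_len
            lo, hi = max(b - fade_len, 0), min(b + fade_len, len(out))
            w = np.linspace(0.0, 1.0, hi - lo)
            left = noisy if keep_noisy[i - 1] else enhanced
            right = noisy if keep_noisy[i] else enhanced
            out[lo:hi] = (1.0 - w) * left[lo:hi] + w * right[lo:hi]
    return out
```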
The dataset used for this project is the Microsoft Scalable Noisy Speech Dataset (MS-SNSD), along with the corresponding enhanced files produced by the pretrained models of FacebookResearch's denoiser. The code follows these file-naming conventions:
| Category | File Name |
| --- | --- |
| Clean sample | `clnsp<id>.wav` (MS-SNSD convention) |
| Noisy sample | `noisy<id>_SNRdb_<snr_level>_clnsp<id>.wav` (MS-SNSD convention) |
| Enhanced sample | `noisy<id>_SNRdb_<snr_level>_clnsp<id>_enhanced.wav` (denoiser's convention) |
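The naming scheme above can be parsed with a small regular expression. This is a hypothetical helper, not part of the project's code; it assumes `<snr_level>` is a plain (possibly negative or decimal) number.

```python
import re

# Matches both noisy and enhanced file names from the table above
SAMPLE_RE = re.compile(
    r"noisy(?P<noisy_id>\d+)_SNRdb_(?P<snr>-?\d+(?:\.\d+)?)"
    r"_clnsp(?P<clean_id>\d+)(?P<enh>_enhanced)?\.wav$"
)

def parse_sample_name(fname):
    """Return the ids, SNR level, and enhanced flag, or None on mismatch."""
    m = SAMPLE_RE.search(fname)
    if m is None:
        return None
    return {
        "noisy_id": int(m.group("noisy_id")),
        "snr_db": float(m.group("snr")),
        "clean_id": int(m.group("clean_id")),
        "enhanced": m.group("enh") is not None,
    }
```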
The directory structure for the dataset is as follows:

```
.
├── ...
├── data                    # Dataset directory
│   ├── train               # Training data
│   │   ├── clean           # The clean audio samples (.wav format)
│   │   ├── enhanced        # The enhanced samples (can be left empty)
│   │   └── noisy           # The noisy audio samples (.wav format)
│   ├── validation          # Validation data
│   │   └── ...             # (Same as train directory)
│   └── test                # Testing data
│       └── ...             # (Same as train directory)
└── ...
```
Make sure all the files are placed appropriately inside the `data` directory (some sample files are included for reference).
(It is recommended to set up a virtual environment before proceeding further.)

- Install the dependencies:
  ```
  pip3 install -r requirement.txt
  ```
- Make sure the dataset directory is structured exactly as shown above. Create any missing directories.
- (Optional) Change the parameters in the `config.json` file (model hyper-parameters, optimizer, epochs, etc.).
- (Optional) In the `start.sh` script, change the model that will be used and whether to train/test that model.
- Execute `start.sh`.
Periodically, models (wrapped inside a subclass of `models.base.BaseModel`) will be saved to the `./pretrained/` directory as `*.pkl` files. You can change the destination directory in `config.json`.
NOTE: The directory is recreated every time training starts, and its previous contents are deleted. So make sure you start testing only after training, or after you have put the appropriate `*.pkl` file in that directory.
- The project structure and layout are adapted (and modified) from the template generated by this repository
- Microsoft Scalable Noisy Speech Dataset (MS-SNSD)
- Pretrained models of FacebookResearch's denoiser