# Speech Frame Filtering for Effective Speech Enhancement

## Abstract

Speech enhancement aims to transform a noisy audio signal so as to improve its quality by removing the frequency components that external noise added to the original clean signal. Over the years, the trend has shifted towards end-to-end deep-learning approaches, where the noisy signal (in either the waveform or the frequency domain) is fed directly into a multi-layer neural network, which outputs an enhanced version of the signal. Although these approaches achieve state-of-the-art results, they fail to account for the fact that, depending on the kind of noise, not every speech frame of the input signal actually needs enhancement: some frames are unaffected by noise, or affected only by a negligible amount. Applying enhancement to such frames often degrades their quality instead of improving it. This project addresses the issue with SF2Net, an independent add-on to existing end-to-end models, which classifies each frame by whether or not it requires enhancement. Optionally, it can then post-process the output of those end-to-end models by replacing the enhanced frames that SF2Net flagged as not requiring enhancement with the original frames (with some smoothing applied), thereby preserving the original signal quality in those sections.
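The frame-replacement idea described above can be sketched in a few lines. This is an illustrative sketch only: the function name `merge_frames`, the frame length, and the linear crossfade are assumptions, not taken from the actual SF2Net code.

```python
import numpy as np

def merge_frames(noisy, enhanced, keep_mask, frame_len=512, fade=64):
    """Replace enhanced frames flagged as not needing enhancement with the
    original (noisy) frames, crossfading at the left boundary to avoid clicks.

    keep_mask[i] is True when frame i should keep the original signal.
    """
    out = enhanced.copy()
    ramp = np.linspace(0.0, 1.0, fade)  # linear fade from enhanced to original
    for i, keep in enumerate(keep_mask):
        if not keep:
            continue
        s, e = i * frame_len, min((i + 1) * frame_len, len(out))
        out[s:e] = noisy[s:e]
        if s >= fade:
            # blend the `fade` samples before the frame boundary
            out[s - fade:s] = (1 - ramp) * enhanced[s - fade:s] + ramp * noisy[s - fade:s]
    return out
```

A symmetric fade could also be applied at the right edge of each kept frame; it is omitted here to keep the sketch short.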

## Dataset

The dataset used for this project is the Microsoft Scalable Noisy Speech Dataset (MS-SNSD), along with the corresponding enhanced files produced by the pretrained models of FacebookResearch's denoiser. The code expects the following file name conventions:

| Category | File Name |
| --- | --- |
| Clean Sample | `clnsp<id>.wav` (MS-SNSD convention) |
| Noisy Sample | `noisy<id>_SNRdb_<snr_level>_clnsp<id>.wav` (MS-SNSD convention) |
| Enhanced Sample | `noisy<id>_SNRdb_<snr_level>_clnsp<id>_enhanced.wav` (denoiser's convention) |
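The conventions above can be parsed with a small regular expression. The pattern and the `parse_sample` helper below are inferred from the table, not taken from the repository's code; adjust the pattern if your file names differ (for example, if SNR levels are written without decimals).

```python
import re

# Matches both noisy and enhanced file names from the convention table.
SAMPLE_RE = re.compile(
    r"noisy(?P<noisy_id>\d+)_SNRdb_(?P<snr>[\d.]+)_clnsp(?P<clean_id>\d+)"
    r"(?P<enhanced>_enhanced)?\.wav$"
)

def parse_sample(name):
    """Return the ids and SNR encoded in a file name, or None for clean files."""
    m = SAMPLE_RE.search(name)
    if m is None:
        return None
    return {
        "noisy_id": int(m.group("noisy_id")),
        "snr_db": float(m.group("snr")),
        "clean_id": int(m.group("clean_id")),
        "enhanced": m.group("enhanced") is not None,
    }
```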

The directory structure for the dataset is as follows:

```
.
├── ...
├── data                    # Dataset directory
│   ├── train               # Training data
│   │   ├── clean           # The clean audio samples (.wav format)
│   │   ├── enhanced        # The enhanced samples (can be left empty)
│   │   └── noisy           # The noisy audio samples (.wav format)
│   ├── validation          # Validation data
│   │   └── ...             # (Same as train directory)
│   └── test                # Testing data
│       └── ...             # (Same as train directory)
└── ...
```

Make sure all the files are placed appropriately inside the `data` directory (some sample files are included for reference).
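If any of the directories above are missing, a few lines suffice to create the whole layout. The helper name `ensure_dataset_dirs` is illustrative; it is not part of the repository.

```python
from pathlib import Path

def ensure_dataset_dirs(root="data"):
    """Create the data/{train,validation,test}/{clean,enhanced,noisy} layout."""
    for split in ("train", "validation", "test"):
        for kind in ("clean", "enhanced", "noisy"):
            Path(root, split, kind).mkdir(parents=True, exist_ok=True)
```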

## Usage

(It is recommended to set up a virtual environment before proceeding further.)

- Install the dependencies with `pip3 install -r requirement.txt`
- Make sure the dataset directory is structured exactly as shown above. Create any missing directories.
- (Optional) Change the parameters in the `config.json` file (model hyper-parameters, optimizer, epochs, etc.)
- (Optional) Change the model that will be used, and whether to train/test that model, in the `start.sh` script
- Execute `start.sh`

Periodically, models (wrapped inside a subclass of `models.base.BaseModel`) will be saved to the `./pretrained/` directory as `*.pkl` files. You can change the destination directory in `config.json`.

NOTE: The directory is recreated every time training starts, and its previous contents are deleted. So make sure you start testing only after training, or after placing the appropriate `*.pkl` file in that directory.
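If you want to inspect a saved checkpoint outside of `start.sh`, a minimal loader might look like the following. This assumes the `*.pkl` files are plain pickles of the `BaseModel` subclass; the repository's own loading path may differ, and `load_latest_model` is a hypothetical helper.

```python
import pickle
from pathlib import Path

def load_latest_model(pretrained_dir="./pretrained"):
    """Load the most recently written *.pkl checkpoint from the directory."""
    pkls = sorted(Path(pretrained_dir).glob("*.pkl"), key=lambda p: p.stat().st_mtime)
    if not pkls:
        raise FileNotFoundError(f"no *.pkl checkpoints in {pretrained_dir}")
    with open(pkls[-1], "rb") as f:
        return pickle.load(f)
```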

## Credits