# Speech Frame Filtering for Effective Speech Enhancement

## Abstract

Speech enhancement aims to transform a noisy audio signal so as to improve its quality by removing the frequency components that external noise added to the original clean signal. Over the years, the trend has shifted towards end-to-end deep-learning approaches, where the noisy signal (in either the waveform or the frequency domain) is fed directly into a multi-layer neural network, which outputs an enhanced version of the signal. Although these approaches achieve state-of-the-art results, they fail to account for the fact that, depending on the kind of noise, not every speech frame of the input signal actually needs enhancement: some frames are unaffected by noise, or affected only by a negligible amount. Applying enhancement to such frames often degrades their quality instead of improving it. This project addresses the issue with SF2Net, an independent add-on to existing end-to-end models, which classifies each frame by whether or not it requires enhancement. Optionally, it can then post-process the output of those end-to-end models by replacing the enhanced frames that SF2Net flagged as not requiring enhancement with the original frames (with some smoothing applied), thereby preserving the original signal quality in those sections.
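The frame-replacement idea described above can be sketched in a few lines. This is an illustrative sketch only: the function name `merge_frames`, the frame length, and the linear crossfade are assumptions, not taken from the actual SF2Net code.

```python
import numpy as np

def merge_frames(noisy, enhanced, keep_mask, frame_len=512, fade=64):
    """Replace enhanced frames flagged as not needing enhancement with the
    original (noisy) frames, crossfading at the left boundary to avoid clicks.

    keep_mask[i] is True when frame i should keep the original signal.
    """
    out = enhanced.copy()
    ramp = np.linspace(0.0, 1.0, fade)  # linear fade from enhanced to original
    for i, keep in enumerate(keep_mask):
        if not keep:
            continue
        s, e = i * frame_len, min((i + 1) * frame_len, len(out))
        out[s:e] = noisy[s:e]
        if s >= fade:
            # blend the `fade` samples before the frame boundary
            out[s - fade:s] = (1 - ramp) * enhanced[s - fade:s] + ramp * noisy[s - fade:s]
    return out
```

A symmetric fade could also be applied at the right edge of each kept frame; it is omitted here to keep the sketch short.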

## Dataset

The dataset used for this project is the Microsoft Scalable Noisy Speech Dataset (MS-SNSD), along with the corresponding enhanced files produced by the pretrained models of FacebookResearch's denoiser. The code expects the following file name conventions:

| Category | File Name |
| --- | --- |
| Clean Sample | `clnsp<id>.wav` (MS-SNSD convention) |
| Noisy Sample | `noisy<id>_SNRdb_<snr_level>_clnsp<id>.wav` (MS-SNSD convention) |
| Enhanced Sample | `noisy<id>_SNRdb_<snr_level>_clnsp<id>_enhanced.wav` (denoiser's convention) |
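The conventions above can be parsed with a small regular expression. The pattern and the `parse_sample` helper below are inferred from the table, not taken from the repository's code; adjust the pattern if your file names differ (for example, if SNR levels are written without decimals).

```python
import re

# Matches both noisy and enhanced file names from the convention table.
SAMPLE_RE = re.compile(
    r"noisy(?P<noisy_id>\d+)_SNRdb_(?P<snr>[\d.]+)_clnsp(?P<clean_id>\d+)"
    r"(?P<enhanced>_enhanced)?\.wav$"
)

def parse_sample(name):
    """Return the ids and SNR encoded in a file name, or None for clean files."""
    m = SAMPLE_RE.search(name)
    if m is None:
        return None
    return {
        "noisy_id": int(m.group("noisy_id")),
        "snr_db": float(m.group("snr")),
        "clean_id": int(m.group("clean_id")),
        "enhanced": m.group("enhanced") is not None,
    }
```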

The directory structure for the dataset is as follows:

```
.
├── ...
├── data                    # Dataset directory
│   ├── train               # Training data
│   │   ├── clean           # The clean audio samples (.wav format)
│   │   ├── enhanced        # The enhanced samples (can be left empty)
│   │   └── noisy           # The noisy audio samples (.wav format)
│   ├── validation          # Validation data
│   │   └── ...             # (Same as train directory)
│   └── test                # Testing data
│       └── ...             # (Same as train directory)
└── ...
```

Make sure all the files are placed appropriately inside the `data` directory (some sample files are included for reference).
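If any of the directories above are missing, a few lines suffice to create the whole layout. The helper name `ensure_dataset_dirs` is illustrative; it is not part of the repository.

```python
from pathlib import Path

def ensure_dataset_dirs(root="data"):
    """Create the data/{train,validation,test}/{clean,enhanced,noisy} layout."""
    for split in ("train", "validation", "test"):
        for kind in ("clean", "enhanced", "noisy"):
            Path(root, split, kind).mkdir(parents=True, exist_ok=True)
```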

## Usage

(It is recommended to set up a virtual environment before proceeding further.)

- Install the dependencies with `pip3 install -r requirement.txt`
- Make sure the dataset directory is structured exactly as shown above. Create any missing directories.
- (Optional) Change the parameters in the `config.json` file (model hyper-parameters, optimizer, epochs, etc.)
- (Optional) Change the model that will be used, and whether to train/test that model, in the `start.sh` script
- Execute `start.sh`

Periodically, models (wrapped inside a subclass of `models.base.BaseModel`) will be saved to the `./pretrained/` directory as `*.pkl` files. You can change the destination directory in `config.json`.

NOTE: The directory is recreated every time training starts, and its previous contents are deleted. So make sure you start testing only after training, or after placing the appropriate `*.pkl` file in that directory.
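If you want to inspect a saved checkpoint outside of `start.sh`, a minimal loader might look like the following. This assumes the `*.pkl` files are plain pickles of the `BaseModel` subclass; the repository's own loading path may differ, and `load_latest_model` is a hypothetical helper.

```python
import pickle
from pathlib import Path

def load_latest_model(pretrained_dir="./pretrained"):
    """Load the most recently written *.pkl checkpoint from the directory."""
    pkls = sorted(Path(pretrained_dir).glob("*.pkl"), key=lambda p: p.stat().st_mtime)
    if not pkls:
        raise FileNotFoundError(f"no *.pkl checkpoints in {pretrained_dir}")
    with open(pkls[-1], "rb") as f:
        return pickle.load(f)
```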

## Credits