This is the report of the first Kaggle inClass competition of NYCU-IAIS-DL2024, by Heng-Tse Chou (NTHU STAT). The goal is to perform Taiwanese speech recognition using the ESPnet toolkit and self-supervised pre-trained models.
Final WER scores:
- Task 1
- Public: 0.658
- Private: 0.62739
- Task 2
- Public: 0.57022
- Private: 0.56481
- Task 3
- Public: 0.26092
- Private: 0.26828
Basic environments:
- OS information: Linux 6.5.0-27-generic #28~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 15 10:51:06 UTC 2 x86_64
- python version: `3.9.19 (main, Mar 21 2024, 17:11:28) [GCC 11.2.0]`
- espnet version: `espnet 202402`
- pytorch version: `pytorch 2.1.0`
- Git hash: `f6f011d328fb877b098321975280cadf8c64247a`
- Commit date: `Tue Apr 9 01:44:27 2024 +0000`
Environments from `torch.utils.collect_env`:
PyTorch version: 2.1.0
Is debug build: False
CUDA used to build PyTorch: 12.1
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35
Python version: 3.9.19 (main, Mar 21 2024, 17:11:28) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.5.0-27-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.3.107
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4080
Nvidia driver version: 535.161.07
Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] pytorch-ranger==0.1.1
[pip3] torch==2.1.0
[pip3] torch-complex==0.4.3
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==2.1.0
[pip3] triton==2.0.0
[conda] blas 1.0 mkl
[conda] mkl 2023.1.0 h213fc3f_46344
[conda] mkl-service 2.4.0 py39h5eee18b_1
[conda] mkl_fft 1.3.8 py39h5eee18b_0
[conda] mkl_random 1.2.4 py39hdb19cb5_0
[conda] numpy 1.23.5 py39hf6e8229_1
[conda] numpy-base 1.23.5 py39h060ed82_1
[conda] pytorch 2.1.0 py3.9_cuda12.1_cudnn8.9.2_0 pytorch
[conda] pytorch-cuda 12.1 ha16c6d3_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pytorch-ranger 0.1.1 pypi_0 pypi
[conda] torch-complex 0.4.3 pypi_0 pypi
[conda] torch-optimizer 0.3.0 pypi_0 pypi
[conda] torchaudio 2.1.0 py39_cu121 pytorch
[conda] triton 2.0.0 pypi_0 pypi
Running `setup.sh` under this directory will do the following things sequentially:
- Create a directory `nycu-iais-dl2024-taiwanese-asr/` at the parent directory.
- Clone espnet into `nycu-iais-dl2024-taiwanese-asr/`.
- Set up a conda environment and install espnet.
- Download and unzip the dataset.
- Remove the unneeded recipes under `espnet/egs2`, and use `TEMPLATE` to generate `my-receipe`.
- Move the dataset into `my-receipe/asr1/downloads`.
- Copy the data preparation scripts, config yamls, and run scripts for each task into `my-receipe`.
- Install additional dependencies, such as s3prl, whisper, and loralib.
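A minimal sketch of what these steps might look like in shell, assuming the directory layout above; the dataset URL, ESPnet installation commands, and exact file locations are placeholders rather than the actual script:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of setup.sh; paths and the dataset URL are placeholders.
set -euo pipefail

mkdir -p ../nycu-iais-dl2024-taiwanese-asr
cd ../nycu-iais-dl2024-taiwanese-asr

# Clone ESPnet and install it into a fresh conda environment.
git clone https://github.com/espnet/espnet.git
(cd espnet/tools && ./setup_anaconda.sh miniconda espnet 3.9 && make)

# Download and unzip the dataset (URL is a placeholder).
wget -O dataset.zip "https://example.com/nycu-iais-dl2024-taiwanese-asr.zip"
unzip -q dataset.zip -d dataset

# Keep only TEMPLATE under egs2 and generate the custom recipe from it.
cd espnet/egs2
find . -maxdepth 1 -type d ! -name . ! -name TEMPLATE -exec rm -rf {} +
TEMPLATE/asr1/setup.sh my-receipe/asr1

# Move the dataset and the per-task scripts into the recipe.
mv ../../dataset my-receipe/asr1/downloads
cp -r ../../scripts/* my-receipe/asr1/   # data_prep.py, data.sh, conf/*.yaml, run scripts

# Extra dependencies for tasks 2 and 3.
pip install s3prl openai-whisper loralib
```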
Two files are copied into `my-receipe/asr1/local`.
The first script, `data_prep.py`, has two purposes:
- Randomly split the original training dataset into a train set and a valid set.
- Create `text`, `wav.scp`, and `utt2spk` files for the train, valid, and test sets in Kaldi format under `my-receipe/asr1/data`.
The other script is used in `asr.sh` to:
- Invoke `data_prep.py`.
- Sort the lines of the generated files.
- Create `spk2utt` files for each set.
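A minimal sketch of how this second script (assumed here to be `local/data.sh`) could wire these steps together; the `data_prep.py` option names and the 9:1 split are illustrative assumptions:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of local/data.sh; option names for data_prep.py are assumptions.
set -euo pipefail

# 1. Invoke data_prep.py: split train/valid and write text, wav.scp and utt2spk
#    in Kaldi format under data/{train,valid,test}.
python3 local/data_prep.py --downloads downloads --outdir data --valid-ratio 0.1

# 2. Sort the generated files to meet the sorted-order requirement of Kaldi/ESPnet.
# 3. Create spk2utt from utt2spk for each set.
for dset in train valid test; do
  for f in text wav.scp utt2spk; do
    sort -o "data/${dset}/${f}" "data/${dset}/${f}"
  done
  utils/utt2spk_to_spk2utt.pl "data/${dset}/utt2spk" > "data/${dset}/spk2utt"
done
```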
In task 1, we choose a Branchformer as the encoder and a Transformer as the decoder. A Branchformer encoder handles long audio sequences efficiently, which enables better feature extraction from the audio signals, while a Transformer decoder leverages its language-modeling capability for accurate transcription and offers flexibility in modeling language and context.
The training config is modified from `egs2/aishell/asr1/conf/tuning/train_asr_branchformer_e24_amp.yaml`. It is almost identical to the original, except that `batch_bins` is set to 3000000 to accommodate the memory limit of the graphics card.
Adam is adopted as the optimizer, with warmuplr as the learning rate scheduler. The initial learning rate is set to 1.0e-3.
In the run script, speed perturbation is specified as data augmentation, and the token type is set to char. A character tokenizer operates at the character level and treats each character in the text as a separate token, which is more suitable for the Taiwanese language.
Finally, the maximum number of epochs is set to 60 to keep the training from being too time-consuming.
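As a hedged illustration, the task 1 run script could be an `asr.sh` invocation along these lines; the config file name, language tag, and dataset split names are assumptions, while the options mirror the settings described above:

```bash
#!/usr/bin/env bash
# Hypothetical run script for task 1; names are placeholders.
# conf/train_asr_branchformer.yaml is assumed to be the aishell Branchformer
# config with batch_bins: 3000000 and max_epoch: 60.
./asr.sh \
    --lang tw \
    --ngpu 1 \
    --feats_type raw \
    --token_type char \
    --use_lm false \
    --speed_perturb_factors "0.9 1.0 1.1" \
    --asr_config conf/train_asr_branchformer.yaml \
    --train_set train \
    --valid_set valid \
    --test_sets "test" \
    --asr_tag task1_branchformer "$@"
```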
The train and valid accuracies over the training process are shown below.
We can see that, although it diverges for a while, the valid accuracy stays quite close to the train accuracy, and both grow over the course of training.
The loss decreases with some fluctuations, and its trend corresponds with the accuracy as well.
The trend in WER is similar. All three plots suggest that our training in task 1 is successful.
The final WER scores on the test data:
- Public: 0.658
- Private: 0.62739
In task 2, we are asked to combine an SSL (self-supervised learning) pre-trained model with ASR, using the S3PRL toolkit.
The training config is modified from `egs2/librispeech/asr1/conf/tuning/train_asr_conformer7_wav2vec2_960hr_large.yaml`. The upstream model adopted here is WavLM, a self-supervised model designed to operate directly on raw audio waveforms rather than on text representations. Here we use `wavlm_base`, and `batch_bins` is set to 2000000.
Besides the pre-trained frontend, the encoder adopted in this task is Conformer, which leverages the strengths of both convolutional neural networks and transformers to achieve efficient feature extraction.
The optimizer and the learning rate scheduler are still adam and warmuplr. The initial learning rate is set to 2.5e-3.
The run script has the same configuration as task 1. The maximum number of epochs is set to 35.
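Since the run script is shared with task 1, only the training config changes; a hedged sketch follows (the config file name is an assumption, and the commented keys summarize the S3PRL/WavLM frontend settings described above, following the layout of the LibriSpeech reference config):

```bash
#!/usr/bin/env bash
# Hypothetical run script for task 2; only --asr_config differs from task 1.
# conf/train_asr_conformer_wavlm.yaml (name assumed) would contain, among others:
#   frontend: s3prl
#   frontend_conf:
#     frontend_conf:
#       upstream: wavlm_base   # SSL upstream loaded through S3PRL
#     multilayer_feature: true
#   encoder: conformer
#   optim: adam                # initial lr 2.5e-3, scheduler: warmuplr
#   batch_bins: 2000000
#   max_epoch: 35
./asr.sh \
    --lang tw \
    --ngpu 1 \
    --feats_type raw \
    --token_type char \
    --use_lm false \
    --speed_perturb_factors "0.9 1.0 1.1" \
    --asr_config conf/train_asr_conformer_wavlm.yaml \
    --train_set train \
    --valid_set valid \
    --test_sets "test" \
    --asr_tag task2_wavlm_conformer "$@"
```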
The accuracy grows faster over epochs than in task 1. The improvement is not significant after the 24th epoch, so we could probably set a lower maximum number of epochs to save time.
The improvement can also be noticed in the loss.
The convergence of the WER in task 2 is faster and more stable.
The final WER scores on the test data:
- Public: 0.57022
- Private: 0.56481
Both scores are better than in task 1, showing that combining a pre-trained model can improve the performance of an ASR task.
In task 3, we are asked to complete the ASR task with OpenAI Whisper, which is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
The training config is modified from `egs2/aishell/asr1/conf/tuning/train_asr_whisper_medium_lora_finetune.yaml`. The model size adopted here for both the encoder and the decoder is the small model. `batch_bins` is set to 3000000.
Moreover, to fine-tune the Whisper model, the LoRA adapter is specified. An adapter is a layer added to a pre-trained model to adapt it to specific tasks or domains without extensively retraining the entire model. Adapters are designed to introduce task-specific knowledge into a pre-trained model by learning only a small number of parameters, thus preserving the general knowledge learned during pre-training.
The optimizer is set to adamw, and the learning rate scheduler is still warmuplr. The initial learning rate is set to 1.25e-05, as recommended in this README.
For the run script, we also specify the tokenizer as `whisper_multilingual`.
In this task, each training epoch is more time-consuming, and the valid accuracy did not improve much over epochs, so we opt for a maximum of 10 epochs.
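A hedged sketch of the corresponding task 3 invocation; the config file name is an assumption, and the comment summarizes the Whisper/LoRA settings described above:

```bash
#!/usr/bin/env bash
# Hypothetical run script for task 3; names are placeholders.
# conf/train_asr_whisper_small_lora.yaml (name assumed) would select the "small"
# Whisper encoder and decoder, enable the LoRA adapter, use adamw with an
# initial lr of 1.25e-5 and the warmuplr scheduler, and set
# batch_bins: 3000000 and max_epoch: 10.
./asr.sh \
    --lang tw \
    --ngpu 1 \
    --feats_type raw \
    --token_type whisper_multilingual \
    --use_lm false \
    --speed_perturb_factors "0.9 1.0 1.1" \
    --asr_config conf/train_asr_whisper_small_lora.yaml \
    --train_set train \
    --valid_set valid \
    --test_sets "test" \
    --asr_tag task3_whisper_small_lora "$@"
```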
The train accuracy is already at 0.65 when the first epoch finishes, and the valid accuracy is even at 0.95; both outperform the previous two tasks. After 10 epochs, the train accuracy steadily grew to 0.95, while the valid accuracy only improved by 0.025.
We may notice a slight growth in the valid loss, which suggests some overfitting. However, since both the train and valid losses are much smaller than those from the previous tasks, this is acceptable.
The WER after the first epoch finishes is very small compared to the previous two tasks. After 10 epochs, the difference becomes less pronounced, but the result is still better.
The final WER scores on the test data:
- Public: 0.26092
- Private: 0.26828
These scores are the best among the three tasks, showing that Whisper performs better than the SSL pre-trained model on this ASR task.
The key finding in this assignment is that by "standing upon the shoulders of giants", the fine-tuned models are more powerful and robust than the model we trained from scratch using a transformer-based architecture. Among the approaches tried here, OpenAI Whisper is by far the most powerful pre-trained model for the ASR task.
On the other hand, after reading the reports from the classmates with the highest scores, I realized there are several improvements I could make:
- Add noise to the training data: I forgot that the test data on Kaggle are modified with noise. To equip the model with the ability to handle noise, we should also introduce noise into the training process by modifying the training data with noise; the amount of noise can be adjusted as well (see the sketch after this list).
- Try a different split ratio: Adjusting the train/valid ratio and using a bigger valid set may help prevent the model from overfitting.
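As a rough illustration of the first point, white noise could be mixed into the training wavs with SoX before data preparation; the paths, sample rate, and noise level (`vol 0.02`) below are all assumptions to be tuned, not the pipeline actually used:

```bash
#!/usr/bin/env bash
# Hypothetical noise-augmentation sketch; not part of the submitted pipeline.
mkdir -p downloads/train_noisy
for wav in downloads/train/*.wav; do
  len=$(soxi -D "$wav")                                                   # clip duration in seconds
  sox -n -r 16000 -c 1 /tmp/noise.wav synth "$len" whitenoise vol 0.02    # white noise of equal length
  sox -m "$wav" /tmp/noise.wav "downloads/train_noisy/$(basename "$wav")" # mix speech and noise
done
```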