We present an automatic speech recognition system developed using end-to-end deep learning. Traditional speech systems rely on laboriously engineered processing pipelines and tend to perform poorly in noisy environments. Our architecture is much simpler: it directly learns a function that is robust to background noise, reverberation, and speaker variation, so we need neither hand-designed components nor a phoneme dictionary. CNNs, RNNs, and DNNs are complementary in their modeling capabilities: CNNs are good at reducing frequency variation, RNNs are good at modeling temporal dependencies, and DNNs are appropriate for mapping features to a more separable space. In this project, we take advantage of this complementarity by combining the three into one unified CRNN architecture. Our system achieves state-of-the-art results on the widely studied TIMIT corpus, in noisy environments as well as clean ones.
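To make the architecture concrete, below is a minimal sketch of how such a unified CRNN could be assembled in PyTorch. The layer sizes, the 62-class output (61 TIMIT phone labels plus a blank), and the module layout are illustrative assumptions, not the exact configuration of our system.

# Minimal CRNN sketch in PyTorch. Illustrative only: layer sizes and
# the 62-class output are assumptions, not the project's configuration.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_mels=40, n_classes=62, rnn_hidden=128):
        super().__init__()
        # CNN front end: reduces frequency variation in the spectrogram.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),  # pool along frequency only
        )
        # RNN: models temporal dependencies across frames.
        self.rnn = nn.LSTM(
            input_size=32 * (n_mels // 2),
            hidden_size=rnn_hidden,
            num_layers=2,
            batch_first=True,
            bidirectional=True,
        )
        # DNN head: maps RNN features to a more separable output space.
        self.fc = nn.Sequential(
            nn.Linear(2 * rnn_hidden, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),  # per-frame scores (e.g. phone labels)
        )

    def forward(self, x):
        # x: (batch, 1, n_mels, time) log-mel spectrogram
        h = self.conv(x)                                 # (batch, 32, n_mels//2, time)
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)   # (batch, time, features)
        h, _ = self.rnn(h)
        return self.fc(h)                                # (batch, time, n_classes)

if __name__ == "__main__":
    model = CRNN()
    dummy = torch.randn(2, 1, 40, 100)  # 2 utterances, 40 mel bands, 100 frames
    print(model(dummy).shape)           # torch.Size([2, 100, 62])

Pooling only along the frequency axis preserves the time resolution, so the per-frame outputs can be trained with a frame-level or CTC-style loss.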
Final report: [LINK].
Taslima Akter (takter@iu.edu)
Khandokar Md. Nayem (knayem@iu.edu)
Indiana University, Bloomington