Speech Lab, IIT Madras announces an Automatic Speech Recognition (ASR) Challenge in three Indian languages: Hindi, Tamil and Indian-English. This is the third in a planned series of ASR challenges. In this installment, approximately 490 hours of transcribed speech data in the three languages will be made open source. This data subsumes the data released in the previous challenges. The details of the first and the second challenges can be found here and here. These challenges are a part of the National Language Translation Mission funded by MeitY. They aim to help and encourage the advancement of ASR in Indian languages. We plan to hold a series of challenges of increasing difficulty in different Indian languages, and to release appropriate data with each challenge. In the first two challenges, we released everything, including source code, so that start-ups, universities and research labs without previous experience in ASR could also participate and become familiar with it.
Recent advancements in speech technology have shown that ASR systems can work on par with humans. Building a good ASR system requires large amounts of training data and high-end computational resources.
However, when it comes to Indian languages, not everyone, especially academic institutions and startups, has access to these resources. As a part of this challenge, we will be releasing speech data in Hindi, Tamil and Indian-English. Everyone who participates in this challenge will then be free to use this data for research purposes.
The data set comprises Hindi, Tamil and Indian-English read and conversational speech data along with the corresponding transcriptions. This speech data was collected by Speech Lab IITM and several startups. We will be releasing approximately 490 hours of speech data in this challenge round. The details of the data sets released for this challenge are as follows:
Language | Train set | Development set | Evaluation set | Total duration |
---|---|---|---|---|
HINDI | 178.4 hours | 4.8 hours | 4.9 hours | 188.1 hours |
TAMIL | 104.5 hours | 3.9 hours | 3.8 hours | 112.2 hours |
INDIAN ENGLISH | 179.5 hours | 5.4 hours | 5.4 hours | 190.3 hours |
A lexicon has also been made available. It was generated using the Unified-parser (Hindi and Tamil) and the CMU Lexicon tool (Indian-English). The Hindi data released in this challenge includes the Hindi data from the first challenge, and the English data includes the "IITM" English data from the second challenge. In total, approximately 490 hours + 200 hours (NPTEL data from the second challenge) = 690 hours of transcribed speech data have been released through these three challenges.
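As a quick sanity check on the figures above, the following minimal Python sketch re-adds the durations from the table and the combined total across the three challenges. The variable names are illustrative only, and the 200-hour NPTEL figure is the approximate value quoted in the text, not an exact measurement.

```python
# Durations in hours, copied from the table above.
durations = {
    "Hindi": {"train": 178.4, "dev": 4.8, "eval": 4.9},
    "Tamil": {"train": 104.5, "dev": 3.9, "eval": 3.8},
    "Indian English": {"train": 179.5, "dev": 5.4, "eval": 5.4},
}

# Total per language, rounded to one decimal place like the table.
per_language = {
    lang: round(sum(parts.values()), 1) for lang, parts in durations.items()
}

# Total released in this challenge, plus the approximate NPTEL hours
# from the second challenge (assumed to be 200, as quoted in the text).
challenge_total = round(sum(per_language.values()), 1)
nptel_hours = 200
overall_total = round(challenge_total + nptel_hours, 1)

for lang, hours in per_language.items():
    print(f"{lang}: {hours} hours")
print(f"This challenge: {challenge_total} hours")
print(f"All three challenges: {overall_total} hours")
```

The per-language sums match the "Total duration" column, and the overall figures agree with the rounded values of 490 and 690 hours quoted in the text.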
- Release of training data, development data and lexicon: May 13, 2021
- Evaluation data release and opening of submission site: July 14th, 2021 (originally July 7th, 2021)
- Closing of submission site: July 21st, 2021 (originally July 14th, 2021; midnight anywhere in the world, i.e., 12pm UTC on July 21st, 2021)
- Announcement of results: July 22nd, 2021
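For participants who want the submission deadline in their own time zone, here is a small Python sketch that converts the stated closing time. It assumes that "12pm UTC" means 12:00 noon UTC and that "anywhere in the world" refers to the UTC-12 "Anywhere on Earth" convention; the IST offset is included purely as an illustration.

```python
from datetime import datetime, timedelta, timezone

# Closing time as stated in the schedule: 12:00 UTC on July 21st, 2021.
deadline_utc = datetime(2021, 7, 21, 12, 0, tzinfo=timezone.utc)

# Fixed-offset zones (assumptions for illustration):
# "Anywhere on Earth" is UTC-12, Indian Standard Time is UTC+5:30.
aoe = timezone(timedelta(hours=-12), "AoE")
ist = timezone(timedelta(hours=5, minutes=30), "IST")

deadline_aoe = deadline_utc.astimezone(aoe)
deadline_ist = deadline_utc.astimezone(ist)

print("UTC:", deadline_utc.isoformat())
print("AoE:", deadline_aoe.isoformat())
print("IST:", deadline_ist.isoformat())
```

Under these assumptions, the stated UTC time corresponds to 00:00 on July 21st in the UTC-12 zone and 17:30 on July 21st in IST.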
The models for Indian English, Hindi and Tamil have been uploaded to Google Drive. They can be downloaded using the links below.