Berom_Speech_Dataset

This repo is a work in progress. It contains Berom speech data for machine-learning speech applications.

Downloading

Open a terminal and enter:

git clone https://github.com/mandeebot/Berom_Speech_Data.git

This creates a folder called "Berom_Speech_Data" in your current directory, containing the repository files.

Statistics

212 recordings, averaging 20 words per recording
total recording hours
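
The recording count and total duration can be recomputed from the audio files themselves. The following is a minimal sketch, not the author's script: it assumes the recordings are uncompressed PCM WAV files sitting directly in one folder, and the function names `wav_duration_seconds` and `dataset_stats` are made up for illustration.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def dataset_stats(wav_dir):
    """Count the .wav files directly under wav_dir and sum their durations."""
    paths = sorted(Path(wav_dir).glob("*.wav"))
    total_seconds = sum(wav_duration_seconds(p) for p in paths)
    return {"recordings": len(paths), "total_hours": total_seconds / 3600}
```

Calling `dataset_stats("Berom_Speech_Data/wav")` from the directory where the repo was cloned would report both figures, assuming that path matches the actual layout. Files that are not plain PCM WAV make the standard `wave` module raise `wave.Error`.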

Data Collection

Recordings and text data were collected from a single male Berom speaker via WhatsApp. This is intended as a baseline for Berom speech data. As the project grows, the Lig-Aikuma Android app will be used to crowd-source more Berom data. It is an easy-to-use app with a good interface for recording and elicitation. It offers 6 modes of usage:

Recording
Respeaking
Translating
Elicitation
Check
Share

Data Preprocessing

Preprocessing involved:

Validating the data for errors and removing corrupt files

One main dataset directory with subdirectories:

wav contains the unprocessed recorded files and metadata
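
The validation step described above could look roughly like the sketch below. It is an illustration under assumptions, not the procedure actually used for this dataset: it treats a file as corrupt if Python's standard `wave` module cannot open it or if it contains no audio frames, and the helper names `is_readable_wav` and `find_corrupt_wavs` are hypothetical.

```python
import wave
from pathlib import Path


def is_readable_wav(path):
    """Return True if the file opens as PCM WAV and holds at least one frame."""
    try:
        with wave.open(str(path), "rb") as wav_file:
            return wav_file.getnframes() > 0
    except (wave.Error, EOFError, OSError):
        # wave.Error: bad or unsupported header; EOFError: truncated or empty file.
        return False


def find_corrupt_wavs(wav_dir):
    """List the .wav files directly under wav_dir that fail the check."""
    return [p for p in sorted(Path(wav_dir).glob("*.wav")) if not is_readable_wav(p)]
```

The function only reports suspect files. Reviewing the list before deleting anything keeps a recording from being lost to a false positive, for example a valid file in a WAV variant the `wave` module does not support.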

Application

The dataset can be used mainly for low-resource speech model experiments or for cross-lingual ASR.

Problems Encountered

- Berom is a low-resource language, meaning very few resources are available online to supplement this dataset. I am actively working towards generating more Berom speech data to add to this repo.
- So far, the text transcriptions collected do NOT have their tones represented (diacritics). This puts the dataset at a disadvantage, because it is very common in the Berom language for one word to have several meanings, and the intended meaning is often indicated by the tone of the word. I am working towards updating this repo with data that has tone marked.
- Recording speech takes time and can quickly become tedious.
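
Once tone-marked transcriptions start to arrive, it may help to tell them apart from the unmarked ones automatically. The sketch below is one possible approach, not part of this repo. It assumes tone is written with Unicode combining diacritics, and it will also flag any non-tonal combining mark. The function names are hypothetical, and the strings in the usage note are placeholders rather than real Berom words.

```python
import unicodedata


def has_combining_marks(text):
    """Return True if text has a character with a nonzero combining class after NFD."""
    decomposed = unicodedata.normalize("NFD", text)
    return any(unicodedata.combining(ch) for ch in decomposed)


def strip_combining_marks(text):
    """Return text with those combining characters removed, recomposed to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)
```

For example, `has_combining_marks("bá")` is true and `has_combining_marks("ba")` is false, while `strip_combining_marks("bá")` gives `"ba"`. Stripping marks would let tone-marked and unmarked transcriptions be compared on the same footing.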

Contributing

If you would like to contribute to this project by recording more audio files and transcriptions, you can make a pull request and I will be happy to add you to the project.

Original Author

Mandieng Bot

