rahul1990gupta/indic-nlp-datasets

Overview

This library provides Indian regional language datasets through an easy-to-use API modeled on sklearn.datasets. You are free to use it in commercial applications.

Installation

You can install this library with pip:

pip install indic-nlp-datasets

To install the latest version directly from the master branch on GitHub, use:

pip install git+https://github.com/rahul1990gupta/indic-nlp-datasets.git@master

Datasets Available

These are the datasets available in the library:

Name                Size     Submodule         Language
Wikipedia           275 MB   load_wikipedia    hi
Oscar Common Crawl  17 GB    load_occ          hi
News Crawl          472 MB   load_news_crawl   hi
Monolingual         2.45 GB  load_monolingual  hi
Tweet Corpus        875 MB   load_tweets       hi
Hinglish Corpus     18 MB    load_hinglish     hi
Devdas              300 KB   load_devdas       hi

Getting started

After installation, import a dataset loader and call it:

from idatasets import load_devdas
devdas = load_devdas()
print(devdas.desc) # prints description of the data
print(devdas.created_at) # date/year when dataset was created
for sent in devdas.data:
    # process text chunks; a loop body cannot be only a comment, so print each chunk
    print(sent)
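
As one possible way to fill in the "process text chunks" step, the sketch below counts whitespace-separated tokens across all chunks. It assumes that devdas.data is an iterable of plain strings, which the snippet above suggests but does not confirm. The helper count_tokens and the sample_chunks list are hypothetical and not part of the library; the sample list stands in for devdas.data so the example runs without downloading anything.

```python
from collections import Counter
from typing import Iterable


def count_tokens(chunks: Iterable[str]) -> Counter:
    """Count whitespace-separated tokens across an iterable of text chunks."""
    counts: Counter = Counter()
    for chunk in chunks:
        # str.split() with no arguments splits on any run of whitespace.
        counts.update(chunk.split())
    return counts


# Hypothetical stand-in for `devdas.data`. With the library installed you
# could pass `load_devdas().data` instead (assuming it yields strings).
sample_chunks = [
    "देवदास घर आया",
    "पारो घर गई",
]

counts = count_tokens(sample_chunks)
print(counts.most_common(1))  # the single most frequent token and its count
```

Whitespace splitting is only a rough tokenizer; for real Hindi text you would likely also want to strip punctuation such as the danda before counting.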