GitHub - rahul1990gupta/indic-nlp-datasets: Indic NLP datasets library for commercial uses

Overview

This library provides Indian regional language datasets through an easy-to-use API modeled on sklearn.datasets. You are free to use it in applications intended for commercial use.

Installation

You can install this library with pip:

pip install indic-nlp-datasets

To install the latest version directly from the master branch, use:

pip install git+https://github.com/rahul1990gupta/indic-nlp-datasets.git@master
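To confirm which version ended up in your environment, a small check with the standard library can help. This is a minimal sketch: `installed_version` is a hypothetical helper, not part of this library, and it only assumes that the distribution name matches the one used in the pip command.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "indic-nlp-datasets" is the distribution name from the pip command.
print(installed_version("indic-nlp-datasets"))
```

The helper returns None instead of raising, so it can be used in a setup script without a try/except at the call site.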

Datasets Available

The following datasets are available in the library:

Name	Size	Loader function	Language
Wikipedia	275 MB	`load_wikipedia`	hi
Oscar Common Crawl	17 GB	`load_occ`	hi
News Crawl	472 MB	`load_news_crawl`	hi
Monolingual	2.45 GB	`load_monolingual`	hi
Tweet Corpus	875 MB	`load_tweets`	hi
Hinglish Corpus	18 MB	`load_hinglish`	hi
Devdas	300 KB	`load_devdas`	hi
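The sizes vary by several orders of magnitude, so it can be useful to decide up front which datasets fit a given size budget. The sketch below copies the sizes from the table into a plain dictionary. `SIZES_MB` and `loaders_within` are hypothetical names for illustration, not part of the library, and the conversion assumes 1 GB = 1000 MB.

```python
from typing import List

# Sizes copied from the table, converted to megabytes.
# Assumption: 1 GB = 1000 MB and 300 KB = 0.3 MB.
SIZES_MB = {
    "load_wikipedia": 275,
    "load_occ": 17_000,
    "load_news_crawl": 472,
    "load_monolingual": 2_450,
    "load_tweets": 875,
    "load_hinglish": 18,
    "load_devdas": 0.3,
}


def loaders_within(budget_mb: float) -> List[str]:
    """Return the loader names whose listed size fits the budget, smallest first."""
    fits = [(size, name) for name, size in SIZES_MB.items() if size <= budget_mb]
    return [name for _, name in sorted(fits)]


print(loaders_within(500))
```

Starting with the smallest entry, `load_devdas`, is a reasonable way to try the API before moving to the larger corpora.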

Getting started

After installation, you can start by importing a dataset loader and calling it:

from idatasets import load_devdas

devdas = load_devdas()
print(devdas.desc)  # prints a description of the data
print(devdas.created_at)  # date/year when the dataset was created
for sent in devdas.data:
    # process each text chunk here; printing is a placeholder
    print(sent)
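As one example of what "processing text chunks" might look like, the sketch below counts whitespace-separated tokens. It assumes each item yielded by `devdas.data` is a plain string, which the README does not state explicitly. `token_counts` is a hypothetical helper, and the two sample sentences are made-up stand-ins so the code runs without the dataset.

```python
from collections import Counter
from typing import Iterable


def token_counts(chunks: Iterable[str]) -> Counter:
    """Count whitespace-separated tokens across an iterable of text chunks."""
    counts: Counter = Counter()
    for chunk in chunks:
        counts.update(chunk.split())
    return counts


# Made-up stand-in sentences, not taken from the dataset.
# With the package installed, pass `load_devdas().data` instead.
sample = ["देवदास पारो से मिला", "पारो देवदास से दूर गई"]
counts = token_counts(sample)
print(counts.most_common(3))
```

Splitting on whitespace is a rough tokenization for Hindi text; a proper tokenizer would handle punctuation such as the danda (।) separately.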
