The datasets contain article-headline pairs obtained from Japanese Wikinews. Articles and headlines are segmented into words with MeCab using the mecab-ipadic dictionary.
This repository provides the following three versions of the dataset, which differ in how much of each article is kept:
- full-articles: full articles longer than 10 tokens, with their headlines;
- long version: articles truncated to the first five sentences or 256 tokens, with their headlines;
- short version: articles truncated to the first three sentences or 128 tokens, with their headlines.
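The long and short versions above apply a two-part cap (a sentence limit and a token limit). The repository's actual extraction scripts are not shown here, so the following is only a minimal sketch of one plausible interpretation: keep sentences in order until either the sentence limit or the token limit is reached, truncating the last sentence if needed. The function name and the assumption that the token limit cuts mid-sentence are hypothetical.

```python
def truncate_article(sentences, max_sentences, max_tokens):
    """Hypothetical sketch: keep up to `max_sentences` tokenized sentences,
    never exceeding `max_tokens` tokens in total; the sentence that would
    cross the token limit is cut short."""
    kept = []
    total = 0
    for sent in sentences[:max_sentences]:
        if total + len(sent) > max_tokens:
            kept.append(sent[: max_tokens - total])  # partial last sentence
            break
        kept.append(sent)
        total += len(sent)
    return kept

# Toy example with dummy tokens (100 tokens per sentence):
article = [["tok"] * 100, ["tok"] * 100, ["tok"] * 100, ["tok"] * 100]
long_ver = truncate_article(article, max_sentences=5, max_tokens=256)
short_ver = truncate_article(article, max_sentences=3, max_tokens=128)
```

Under this reading, `long_ver` keeps 256 tokens (two full sentences plus a 56-token fragment) and `short_ver` keeps 128 tokens; when sentences are short, the sentence limit binds first instead.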
Table 1: Number of documents
Table 2: N-gram overlaps in headlines