The datasets contain article-headline pairs obtained from Japanese Wikinews. Articles and headlines are segmented into words with MeCab using the mecab-ipadic dictionary.
This repository provides the following three versions of the dataset, which differ in how much of each article is kept:
- full-articles: full articles longer than 10 tokens, with their headlines;
- long version: articles truncated to the first five sentences or 256 tokens, with their headlines;
- short version: articles truncated to the first three sentences or 128 tokens, with their headlines.
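The long and short versions above apply a two-part cap (a sentence limit and a token limit). The repository's actual extraction scripts are not shown here, so the following is only a minimal sketch of one plausible interpretation: keep sentences in order until either the sentence limit or the token limit is reached, truncating the last sentence if needed. The function name and the assumption that the token limit cuts mid-sentence are hypothetical.

```python
def truncate_article(sentences, max_sentences, max_tokens):
    """Hypothetical sketch: keep up to `max_sentences` tokenized sentences,
    never exceeding `max_tokens` tokens in total; the sentence that would
    cross the token limit is cut short."""
    kept = []
    total = 0
    for sent in sentences[:max_sentences]:
        if total + len(sent) > max_tokens:
            kept.append(sent[: max_tokens - total])  # partial last sentence
            break
        kept.append(sent)
        total += len(sent)
    return kept

# Toy example with dummy tokens (100 tokens per sentence):
article = [["tok"] * 100, ["tok"] * 100, ["tok"] * 100, ["tok"] * 100]
long_ver = truncate_article(article, max_sentences=5, max_tokens=256)
short_ver = truncate_article(article, max_sentences=3, max_tokens=128)
```

Under this reading, `long_ver` keeps 256 tokens (two full sentences plus a 56-token fragment) and `short_ver` keeps 128 tokens; when sentences are short, the sentence limit binds first instead.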
Table 1: Number of documents
Table 2: N-gram overlaps in headlines