A repository listing important datasets for multimodal or multi-domain recommender systems.
Name | Scene | Tasks | Information | URL |
---|---|---|---|---|
PixelRec | Streaming media | Seq Rec/CF Rec | PixelRec is a large dataset of cover images collected from a short-video recommender system, comprising approximately 200 million user-image interactions, 30 million users, and 400,000 video cover images. Video text and other aggregated attributes are also included. | link |
NineRec | News, Video, Ads, Images | Seq Rec/CF Rec/Cross-domain Rec | NineRec is a large multimodal recommendation dataset collected from five well-known feed platforms, comprising one source dataset for pre-training and nine diverse target datasets. Both text and high-resolution images are included. | link |
MicroLens | Short videos | Seq Rec/CF Rec | MicroLens is a large short-video recommendation dataset collected from a short-video platform, comprising 1 billion interactions, 3 million users, and 1 million short videos. Text, images, audio, and videos are all included. | link |
Amazon Review | Commerce | Seq Rec/CF Rec | A large crawl of product reviews from Amazon. Ratings: 82.83 million, Users: 20.98 million, Items: 9.35 million, Timespan: May 1996 - July 2014. | link |
Steam | Game | Seq Rec/CF Rec | Reviews offer a good opportunity to break down the factors behind satisfaction and dissatisfaction with games. Reviews: 7,793,069, Users: 2,567,538, Items: 15,474, Bundles: 615. | link |
MovieLens | Movie | Rating Prediction | The dataset should not be used for sequential recommendation or for several other top-N recommendation tasks; see https://arxiv.org/pdf/2307.09985.pdf. | link |
Yelp | Commerce | General | Contains 6,990,280 reviews, 150,346 businesses, 200,100 pictures, 11 metropolitan areas, and 908,915 tips by 1,987,897 users, plus over 1.2 million business attributes such as hours, parking, and availability. | link |
MIND | News | General | MIND contains about 160k English news articles and more than 15 million impression logs generated by 1 million users. Every news article contains textual content including title, abstract, body, category, and entities. | link |
U-NEED | Commerce | Conversation Rec | U-NEED consists of 7,698 fine-grained annotated pre-sales dialogues, 333,879 user behaviors, and 332,148 product knowledge tuples. | link |
KuaiSAR | Video | Search and Rec | KuaiSAR contains genuine search and recommendation behaviors of 25,877 users, 6,890,707 items, 453,667 queries, and 19,664,885 actions over a span of 19 days on the Kuaishou app. | link |
Tenrec | Video, Article | General | Tenrec is a large-scale benchmark dataset for recommender systems. It contains around 5 million users and 140 million interactions, and it covers four recommendation scenarios. | link |
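
The headline counts in the table can be turned into rough comparison figures such as average interactions per user and interaction-matrix density. The snippet below is a minimal sketch of that arithmetic, not part of any dataset's official tooling: the `DatasetStats` class and `STATS` list are hypothetical names, the numbers are the rounded figures quoted in the table, and treating Amazon ratings and Steam reviews as "interactions" is an assumption made only for comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetStats:
    """Headline counts for one dataset (hypothetical helper, not an official API)."""
    name: str
    interactions: int
    users: int
    items: int

    @property
    def interactions_per_user(self) -> float:
        # Average number of logged interactions per user.
        return self.interactions / self.users

    @property
    def density(self) -> float:
        # Fraction of the user-item matrix that is observed.
        return self.interactions / (self.users * self.items)


# Figures copied from the table above; several are approximate or rounded.
STATS = [
    DatasetStats("PixelRec", 200_000_000, 30_000_000, 400_000),
    DatasetStats("MicroLens", 1_000_000_000, 3_000_000, 1_000_000),
    DatasetStats("Amazon Review", 82_830_000, 20_980_000, 9_350_000),
    DatasetStats("Steam", 7_793_069, 2_567_538, 15_474),
]

for s in STATS:
    print(f"{s.name:<14} {s.interactions_per_user:8.1f} per user, "
          f"density {s.density:.2e}")
```

Because the inputs are rounded, the results are only order-of-magnitude indicators; exact values depend on the released files and on any filtering applied before training.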