A curated list of awesome papers on pre-trained models for information retrieval (a.k.a. pre-training for IR). If I have missed any papers, please let me know! Any feedback and contributions are welcome!
For readers who want to acquire basic and advanced knowledge about neural models for information retrieval and try some of these models by hand, we recommend the awesome NeuIR surveys below and the text-matching toolkit MatchZoo-py:
- A Deep Look into Neural Ranking Models for Information Retrieval. Jiafeng Guo et.al. IPM 2020
- Pre-training Methods in Information Retrieval. Yixing Fan, Xiaohui Xie et.al. 2021
- Pretrained Transformers for Text Ranking: BERT and Beyond. Jimmy Lin et.al. 2020
- Semantic Models for the First-stage Retrieval: A Comprehensive Review. Jiafeng Guo et.al. TOIS 2021
- Learning to Reweight Terms with Distributed Representations. Guoqing Zheng, Jamie Callan SIGIR 2015. (DeepTR)
- Context-Aware Term Weighting For First Stage Passage Retrieval. Zhuyun Dai et.al. SIGIR 2020 short. [code] (DeepCT)
- Context-Aware Document Term Weighting for Ad-Hoc Search. Zhuyun Dai et.al. WWW 2020. [code] (HDCT)
- Learning Term Discrimination. Jibril Frej et.al. SIGIR 2020. (IDF-reweighting)
- COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List. Luyu Gao et.al. NAACL 2021. [code] (COIL)
- Learning Passage Impacts for Inverted Indexes. Antonio Mallia et.al. SIGIR 2021 short. [code] (DeepImpact; a toy impact-scored inverted index sketch follows this group of papers)
- Document Expansion by Query Prediction. Rodrigo Nogueira et.al. Arxiv 2019. [doc2query code, docTTTTTquery code] (doc2query, docTTTTTquery)
- Generation-Augmented Retrieval for Open-Domain Question Answering. Yuning Mao et.al. ACL 2021. [code] (query expansion with BART)
- SparTerm: Learning Term-based Sparse Representation for Fast Text Retrieval. Yang Bai, Xiaoguang Li et.al. Arxiv 2020. (SparTerm: Term importance distribution from MLM+Binary Term Gating)
- Contextualized Sparse Representations for Real-Time Open-Domain Question Answering. Jinhyuk Lee, Minjoon Seo et.al. ACL 2020. [code] (SPARC, sparse vectors)
- SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking (and SPLADE v2). Thibault Formal et.al. SIGIR 2021. [code] (SPLADE)
- Ultra-High Dimensional Sparse Representations with Binarization for Efficient Text Retrieval. Kyoung-Rok Jang et.al. EMNLP 2021. (UHD)
- Efficient Passage Retrieval with Hashing for Open-domain Question Answering. Ikuya Yamada et.al. ACL 2021. [code] (BPR, convert embedding vector to binary codes)
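The term re-weighting and expansion papers above (DeepCT, DeepImpact, SparTerm, SPLADE) share one serving-time idea: store a learned per-term impact in an ordinary inverted index and score a document by summing the impacts of its terms that match the query. Below is a minimal, purely illustrative Python sketch of that idea; the `predict_impacts` stub is hypothetical and just counts term frequency where a real system would use a contextual model such as BERT.

```python
from collections import defaultdict

def predict_impacts(tokens):
    # Hypothetical stand-in for a learned term-importance model (DeepCT-style);
    # here it simply counts term occurrences.
    impacts = defaultdict(float)
    for t in tokens:
        impacts[t] += 1.0
    return impacts

# Impact-scored inverted index: term -> list of (doc_id, impact).
docs = {0: "deep learning for first stage retrieval",
        1: "classic bm25 term weighting for retrieval"}
index = defaultdict(list)
for doc_id, text in docs.items():
    for term, w in predict_impacts(text.split()).items():
        index[term].append((doc_id, w))

def search(query, k=10):
    # Document score = sum of stored impacts over matched query terms.
    scores = defaultdict(float)
    for term in query.split():
        for doc_id, w in index.get(term, []):
            scores[doc_id] += w
    return sorted(scores.items(), key=lambda x: -x[1])[:k]

print(search("term weighting for retrieval"))
```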
- Dense Passage Retrieval for Open-Domain Question Answering. Vladimir Karpukhin, Barlas Oguz et.al. EMNLP 2020. [code] (DPR, in-batch negatives; see the training-loss sketch after this group of papers)
- RepBERT: Contextualized Text Embeddings for First-Stage Retrieval. Jingtao Zhan et.al. Arxiv 2020. [code] (RepBERT)
- Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. Lee Xiong, Chenyan Xiong et.al. ICLR 2021. [code] (ANCE, refresh the ANN index during training)
- RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. Yingqi Qu et.al. NAACL 2021. (RocketQA: cross-batch negatives, denoised hard negatives, and data augmentation)
- Optimizing Dense Retrieval Model Training with Hard Negatives. Jingtao Zhan et.al. SIGIR 2021. [code] (ADORE & STAR, query-side fine-tuning built on pre-trained document encoders)
- Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. Sebastian Hofstätter et.al. SIGIR 2021. [code] (TAS-Balanced, sample from query clusters and distill from a BERT ensemble)
- PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval. Ruiyang Ren et.al. EMNLP Findings 2021. [code] (PAIR)
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. Omar Khattab et.al. SIGIR 2020. [code] (ColBERT)
- Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. Samuel Humeau, Kurt Shuster et.al. ICLR 2020. [code] (Poly-encoders)
- Sparse, Dense, and Attentional Representations for Text Retrieval. Yi Luan, Jacob Eisenstein et.al. TACL 2021. (ME-BERT, multi-vectors)
- Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval. Hongyin Tang, Xingwu Sun et.al. ACL 2021.
- Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index. Minjoon Seo, Jinhyuk Lee et.al. ACL 2019. [code] (DENSPI)
- Learning Dense Representations of Phrases at Scale. Jinhyuk Lee, Danqi Chen et.al. ACL 2021. [code] (DensePhrases)
- Multi-View Document Representation Learning for Open-Domain Dense Retrieval. Shunyu Zhang et.al. ACL 2022. (MVR)
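Most of the bi-encoder papers above (DPR, RocketQA, ANCE, TAS-Balanced, ...) train with a contrastive loss in which the positives of the other queries in the batch serve as negatives. The PyTorch sketch below shows only that in-batch negative loss, with random tensors standing in for the query and passage encoder outputs; it is an illustration under those assumptions, not any paper's exact training code.

```python
import torch
import torch.nn.functional as F

batch_size, dim = 8, 128
# Stand-ins for encoder outputs (e.g. BERT [CLS] vectors) of a batch of
# queries and their matched positive passages.
q_emb = torch.randn(batch_size, dim, requires_grad=True)
p_emb = torch.randn(batch_size, dim, requires_grad=True)

# Similarity matrix: entry (i, j) = sim(query_i, passage_j).
scores = q_emb @ p_emb.t()

# In-batch negatives: for query i, passage i is the positive and the other
# batch_size - 1 passages act as negatives, so the target label is simply i.
labels = torch.arange(batch_size)
loss = F.cross_entropy(scores, labels)
loss.backward()
print(loss.item())
```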
- Distilling Knowledge from Reader to Retriever for Question Answering. Gautier Izacard, Edouard Grave. ICLR 2021. [unofficial code] (distill the reader's cross-attention scores into the retriever)
- Distilling Knowledge for Fast Retrieval-based Chat-bots. Amir Vakili Tahami et.al. SIGIR 2020. [code] (Distill from cross-encoders to bi-encoders)
- Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation. Sebastian Hofstätter et.al. Arxiv 2020. [code] (distill from a BERT ensemble with a Margin-MSE loss; see the sketch after this group of papers)
- Distilling Dense Representations for Ranking using Tightly-Coupled Teachers. Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin. Arxiv 2020. [code] (TCTColBERT: distill from ColBERT)
- Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. Sebastian Hofstätter et.al. SIGIR 2021. [code] (TAS-Balanced, sample from query clusters and distill from a BERT ensemble)
- RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking. Ruiyang Ren, Yingqi Qu et.al. EMNLP 2021. [code] (RocketQAv2, joint learning by distillation)
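Several of the distillation papers above (notably the cross-architecture distillation and TAS-Balanced work) use a Margin-MSE objective: the student bi-encoder learns to reproduce the teacher cross-encoder's score margin between a positive and a negative passage rather than its absolute scores. A minimal PyTorch sketch, with random placeholders for all four score vectors:

```python
import torch
import torch.nn.functional as F

n = 32  # number of (query, positive, negative) triples in the batch
teacher_pos = torch.randn(n)                      # cross-encoder scores for positives
teacher_neg = torch.randn(n)                      # cross-encoder scores for negatives
student_pos = torch.randn(n, requires_grad=True)  # bi-encoder dot products for positives
student_neg = torch.randn(n, requires_grad=True)  # bi-encoder dot products for negatives

# Margin-MSE: match the student's margin to the teacher's margin, which avoids
# forcing two very different architectures onto the same absolute score scale.
loss = F.mse_loss(student_pos - student_neg, teacher_pos - teacher_neg)
loss.backward()
print(loss.item())
```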
- Joint Learning of Deep Retrieval Model and Product Quantization based Embedding Index. Han Zhang et.al. SIGIR 2021 short. [code] (Poeem)
- Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance. Jingtao Zhan et.al. CIKM 2021. [code] (JPQ)
- Efficient Passage Retrieval with Hashing for Open-domain Question Answering. Ikuya Yamada et.al. ACL 2021. [code] (BPR, convert embedding vectors to binary codes; a toy hashing sketch follows this group of papers)
- Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval. Jingtao Zhan et.al. WSDM 2022. [code] (RepCONC)
- Multi-Task Retrieval for Knowledge-Intensive Tasks. Jean Maillard, Vladimir Karpukhin et.al. ACL 2021. (Multi-task learning)
- Evaluating Extrapolation Performance of Dense Retrieval. Jingtao Zhan et.al. Arxiv 2022. [code]
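The hashing and quantization papers above (BPR, Poeem, JPQ, RepCONC) trade a little accuracy for large memory savings by storing compressed codes instead of full float vectors. The numpy sketch below shows only the simplest BPR-flavored variant, binarizing embeddings by their sign and ranking by Hamming distance; the actual methods learn the codes jointly with the encoder, which this sketch does not attempt.

```python
import numpy as np

rng = np.random.default_rng(0)
doc_emb = rng.standard_normal((1000, 128)).astype(np.float32)  # passage embeddings
query_emb = rng.standard_normal(128).astype(np.float32)        # one query embedding

# Binarize: keep only the sign of each dimension and pack 8 bits per byte,
# shrinking 128 float32 values (512 bytes) down to 16 bytes per passage.
doc_codes = np.packbits(doc_emb > 0, axis=1)
query_code = np.packbits(query_emb > 0)

# Hamming distance = popcount of the XOR between packed codes.
xor = np.bitwise_xor(doc_codes, query_code)
hamming = np.unpackbits(xor, axis=1).sum(axis=1)

top10 = np.argsort(hamming)[:10]
print(top10)
```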
- Latent Retrieval for Weakly Supervised Open Domain Question Answering. Kenton Lee et.al. ACL 2019. [code] (ORQA, ICT)
- Pre-training Tasks for Embedding-based Large-scale Retrieval. Wei-Cheng Chang et.al. ICLR 2020. (ICT, BFS and WLP)
- REALM: Retrieval-Augmented Language Model Pre-Training. Kelvin Guu, Kenton Lee et.al. ICML 2020. [code] (REALM)
- Less is More: Pre-train a Strong Text Encoder for Dense Retrieval Using a Weak Decoder. Shuqi Lu, Di He, Chenyan Xiong et.al. EMNLP 2021. [code] (Seed)
- Condenser: a Pre-training Architecture for Dense Retrieval. Luyu Gao et.al. EMNLP 2021. [code] (Condenser)
- Unsupervised Context Aware Sentence Representation Pretraining for Multi-lingual Dense Retrieval. Ning Wu et.al. IJCAI 2022. [code] (CCP, cross-lingual pre-training)
- Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval. Luyu Gao et.al. ACL 2022. [code] (coCondenser)
- LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval. Canwen Xu, Daya Guo et.al. ACL 2022. [code] (LaPraDoR, ICT+dropout)
- Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction. Xinyu Ma et.al. SIGIR 2022. [code] (COSTA)
- Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index. Minjoon Seo, Jinhyuk Lee et.al. ACL 2019. [code] (DENSPI)
- Dense Passage Retrieval for Open-Domain Question Answering. Vladimir Karpukhin, Barlas Oguz et.al. EMNLP 2020. [code] (DPR)
- Contextualized Sparse Representations for Real-Time Open-Domain Question Answering. Jinhyuk Lee, Minjoon Seo et.al. ACL 2020. [code] (SPARC, sparse vectors)
- DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding. Yuyu Zhang, Ping Nie et.al. SIGIR 2020 short. (DC-BERT)
- Learning Dense Representations of Phrases at Scale. Jinhyuk Lee, Danqi Chen et.al. ACL 2021. [code] (DensePhrases)
- Multi-Task Retrieval for Knowledge-Intensive Tasks. Jean Maillard, Vladimir Karpukhin^ et.al. ACL 2021. (Multi-task learning)
- Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index. Minjoon Seo, Jinhyuk Lee et.al. ACL 2019. [code] (DENSPI)
- Complement Lexical Retrieval Model with Semantic Residual Embeddings. Luyu Gao et.al. ECIR 2021. (CLEAR)
- BERT-based Dense Retrievers Require Interpolation with BM25 for Effective Passage Retrieval. Shuai Wang et.al. ICTIR 2021. (sparse-dense score interpolation; see the sketch after this group of papers)
- Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval. Shitao Xiao et.al. WWW 2022. [code]
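The hybrid papers just above combine a lexical score (e.g. BM25) with a dense score. The simplest recipe is a linear combination of the two; the sketch below also min-max normalizes each score list first, which is a choice of this sketch rather than something prescribed by the papers, and the scores and the weight `alpha` are made up (in practice `alpha` is tuned on a dev set).

```python
def minmax(scores):
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo + 1e-9) for d, s in scores.items()}

def interpolate(bm25_scores, dense_scores, alpha=0.5):
    # Linear interpolation of normalized lexical and dense scores per document;
    # documents missing from one ranking simply contribute 0 from that side.
    bm25_n, dense_n = minmax(bm25_scores), minmax(dense_scores)
    docs = set(bm25_n) | set(dense_n)
    return {d: alpha * bm25_n.get(d, 0.0) + (1 - alpha) * dense_n.get(d, 0.0)
            for d in docs}

bm25 = {"d1": 12.3, "d2": 7.1, "d3": 4.4}
dense = {"d1": 0.62, "d2": 0.71, "d4": 0.55}
print(sorted(interpolate(bm25, dense).items(), key=lambda x: -x[1]))
```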
- Understanding the Behaviors of BERT in Ranking. Yifan Qiao et.al. Arxiv 2019. (Representation-focused and Interaction-focused)
- Passage Re-ranking with BERT. Rodrigo Nogueira et.al. Arxiv 2019. [code] (monoBERT: perhaps the first work applying BERT to IR)
- Multi-Stage Document Ranking with BERT and The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models. Rodrigo Nogueira et.al. Arxiv 2020. (Expando-Mono-Duo: doc2query + pointwise + pairwise)
- CEDR: Contextualized Embeddings for Document Ranking. Sean MacAvaney et.al. SIGIR 2019 short. [code] (CEDR: BERT + neuIR model)
- Beyond [CLS] through Ranking by Generation. Cicero Nogueira dos Santos et.al. EMNLP 2020 short. (Query generation using GPT and BART)
- Document Ranking with a Pretrained Sequence-to-Sequence Model. Rodrigo Nogueira, Zhiying Jiang et.al. EMNLP 2020. [code] (Relevance token generation using T5)
- Generalizing Discriminative Retrieval Models using Generative Tasks. Bingsheng Liu, Hamed Zamani et.al. WWW 2021. (GDMTL, joint discriminative and generative model with multitask learning)
- Deeper Text Understanding for IR with Contextual Neural Language Modeling. Zhuyun Dai et.al. SIGIR 2019 short. [code] (BERT-MaxP, BERT-firstP, BERT-sumP: passage-level scoring; see the aggregation sketch after this group of papers)
- Simple Applications of BERT for Ad Hoc Document Retrieval (Wei Yang, Haotian Zhang et.al. Arxiv 2019); Applying BERT to Document Retrieval with Birch and Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval (Zeynep Akkalyoncu Yilmaz et.al. EMNLP 2019 short). [code] (Birch: sentence-level evidence)
- Intra-Document Cascading: Learning to Select Passages for Neural Document Ranking. Sebastian Hofstätter et.al. SIGIR 2021. [code] (Distill a ranking model to conv-knrm to select top-k passages)
- PARADE: Passage Representation Aggregation for Document Reranking. Canjia Li et.al. Arxiv 2020. [code] (An extensive comparison of various Passage Representation Aggregation methods)
- Leveraging Passage-level Cumulative Gain for Document Ranking. Zhijing Wu et.al. WWW 2020. (PCGM)
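The passage-level entries above (BERT-MaxP/firstP/sumP, PARADE and related work) follow the same recipe for long documents: split the document into passages, score each passage against the query with a re-ranker, then aggregate the passage scores. The sketch below illustrates only that recipe; `score_passage` is a hypothetical stub (token overlap) standing in for a BERT cross-encoder.

```python
def score_passage(query, passage):
    # Hypothetical stand-in for a BERT cross-encoder relevance score.
    q, p = set(query.split()), set(passage.split())
    return len(q & p) / (len(q) or 1)

def split_passages(doc, size=100, stride=50):
    # Overlapping windows of `size` tokens, shifted by `stride` tokens.
    tokens = doc.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), stride)]

def rank_document(query, doc, agg="max"):
    scores = [score_passage(query, p) for p in split_passages(doc)]
    if agg == "max":      # BERT-MaxP: the best passage decides
        return max(scores)
    if agg == "first":    # BERT-firstP: the lead passage decides
        return scores[0]
    return sum(scores)    # BERT-sumP: accumulate evidence

doc = "a long document about neural ranking models for ad hoc retrieval " * 30
print(rank_document("neural ranking models", doc))
```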
- Local Self-Attention over Long Text for Efficient Document Retrieval. Sebastian Hofstätter et.al. SIGIR 2020 short. [code] (TKL:Transformer-Kernel for long text)
- Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching. Liu Yang et.al. CIKM 2020. [code] (SMITH for doc2doc matching)
- Socialformer: Social Network Inspired Long Document Modeling for Document Ranking. Yujia Zhou et.al. WWW 2022. (Socialformer)
- DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding. Yuyu Zhang, Ping Nie et.al. SIGIR 2020 short. (DC-BERT)
- Efficient Document Re-Ranking for Transformers by Precomputing Term Representations. Sean MacAvaney et.al. SIGIR 2020. [code] (PreTTR)
- Modularized Transfomer-based Ranking Framework. Luyu Gao et.al. EMNLP 2020. [code] (MORES, similar to PreTTR)
- TILDE: Term Independent Likelihood moDEl for Passage Re-ranking. Shengyao Zhuang, Guido Zuccon SIGIR 2021. [code] (TILDE)
- Fast Forward Indexes for Efficient Document Ranking. Jurek Leonhardt et.al. WWW 2022. (Fast forward index)
- Understanding BERT Rankers Under Distillation. Luyu Gao et.al. ICTIR 2020. (LM Distill + Ranker Distill)
- Simplified TinyBERT: Knowledge Distillation for Document Retrieval. Xuanang Chen et.al. ECIR 2021. [code] (TinyBERT+knowledge distillation)
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection. Luca Soldaini et.al. ACL 2020. [code] (Cascade Transformer: prune candidates by layer)
- Early Exiting BERT for Efficient Document Ranking. Ji Xin et.al. EMNLP 2020 SustaiNLP Workshop. [code] (Early exit)
- Training Curricula for Open Domain Answer Re-Ranking. Sean MacAvaney et.al. SIGIR 2020. [code] (curriculum learning based on BM25)
- Not All Relevance Scores are Equal: Efficient Uncertainty and Calibration Modeling for Deep Retrieval Models. Daniel Cohen et.al. SIGIR 2021.
- BERT-QE: Contextualized Query Expansion for Document Re-ranking. Zhi Zheng et.al. EMNLP 2020 Findings. [code] (BERT-QE)
- Semi-Siamese Bi-encoder Neural Ranking Model Using Lightweight Fine-Tuning. Euna Jung, Jaekeol Choi et.al. WWW 2022. [code] (Lightweight Fine-Tuning)
- MarkedBERT: Integrating Traditional IR Cues in Pre-trained Language Models for Passage Retrieval. Lila Boualili et.al. SIGIR 2020 short. [code] (MarkedBERT)
- Selective Weak Supervision for Neural Information Retrieval. Kaitao Zhang et.al. WWW 2020. [code] (ReInfoSelect)
- PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval. Xinyu Ma et.al. WSDM 2021. [code] (PROP)
- Cross-lingual Language Model Pretraining for Retrieval. Puxuan Yu et.al. WWW 2021.
- B-PROP: Bootstrapped Pre-training with Representative Words Prediction for Ad-hoc Retrieval. Xinyu Ma et.al. SIGIR 2021. [code] (B-PROP)
- Pre-training for Ad-hoc Retrieval: Hyperlink is Also You Need. Zhengyi Ma et.al. CIKM 2021. [code] (HARP)
- Contrastive Learning of User Behavior Sequence for Context-Aware Document Ranking. Yutao Zhu et.al. CIKM 2021. [code] (COCA)
- Pre-trained Language Model based Ranking in Baidu Search. Lixin Zou et.al. KDD 2021.
- A Unified Pretraining Framework for Passage Ranking and Expansion. Ming Yan et.al. AAAI 2021. (UED, jointly training ranking and query generation)
- Axiomatically Regularized Pre-training for Ad hoc Search. Jia Chen et.al. SIGIR 2022. [code] (ARES)
- Cross-lingual Retrieval for Iterative Self-Supervised Training. Chau Tran et.al. NeurIPS 2020. [code] (CRISS)
- CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval. Shuo Sun et.al. EMNLP 2020. [code] (Multilingual dataset-CLIRMatrix and multilingual BERT)
- RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking. Ruiyang Ren, Yingqi Qu et.al. EMNLP 2021. [code] (RocketQAv2)
- Adversarial Retriever-Ranker for dense text retrieval. Hang Zhang et.al. ICLR 2022. [code] (AR2)
- Rethinking Search: Making Domain Experts out of Dilettantes. Donald Metzler et.al. SIGIR Forum 2021. (Envisions a model-based IR system)
- Transformer Memory as a Differentiable Search Index. Yi Tay et.al. Arxiv 2022. (DSI)
- DynamicRetriever: A Pre-training Model-based IR System with Neither Sparse nor Dense Index. Yujia Zhou et.al. Arxiv 2022. (DynamicRetriever)
- A Neural Corpus Indexer for Document Retrieval. Yujing Wang et.al. Arxiv 2022. (NCI)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers. Michele Bevilacqua et.al. Arxiv 2022. [code] (SEAL)
- Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. Gen Li, Nan Duan et.al. AAAI 2020. [code] (Unicoder-VL)
- XGPT: Cross-modal Generative Pre-Training for Image Captioning. Qiaolin Xia, Haoyang Huang, Nan Duan et.al. Arxiv 2020. [code] (XGPT)
- UNITER: UNiversal Image-TExt Representation Learning. Yen-Chun Chen, Linjie Li et.al. ECCV 2020. [code] (UNITER)
- Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. Xiujun Li, Xi Yin et.al. ECCV 2020. [code] (Oscar)
- VinVL: Making Visual Representations Matter in Vision-Language Models. Pengchuan Zhang, Xiujun Li et.al. CVPR 2021. [code] (VinVL)
- Dynamic Modality Interaction Modeling for Image-Text Retrieval. Leigang Qu et.al. SIGIR 2021 Best student paper. [code] (DIME)
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. Jiasen Lu, Dhruv Batra et.al. NeurIPS 2019. [code] (ViLBERT)
- 12-in-1: Multi-Task Vision and Language Representation Learning. Jiasen Lu, Dhruv Batra et.al. CVPR 2020. [code] (A multi-task model based on ViLBERT)
- Learning Transferable Visual Models From Natural Language Supervision. Alec Radford et.al. ICML 2021. [code] (CLIP, from OpenAI)
- ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph. Fei Yu, Jiji Tang et.al. Arxiv 2020. [code] (ERNIE-ViL,1st place on the VCR leaderboard)
- M6-v0: Vision-and-Language Interaction for Multi-modal Pretraining. Junyang Lin, An Yang et.al. KDD 2020. (M6-v0/InterBERT)
- M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training. Haoyang Huang, Lin Su et.al. CVPR 2021. [code] (M3P, MILD dataset)
- Faiss: a library for efficient similarity search and clustering of dense vectors (a minimal usage example follows this list)
- Pyserini: a Python Toolkit to Support Sparse and Dense Representations
- MatchZoo: a library consisting of many popular neural text matching models
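As a small companion to the dense retrievers listed above, here is a minimal Faiss usage example: build an exact inner-product index over passage embeddings and search it with query embeddings. Random vectors keep the snippet self-contained; in a real pipeline they would come from the encoders in this list.

```python
import numpy as np
import faiss

d = 128                                    # embedding dimension
rng = np.random.default_rng(0)
passage_emb = rng.standard_normal((10000, d)).astype("float32")
query_emb = rng.standard_normal((4, d)).astype("float32")

index = faiss.IndexFlatIP(d)               # exact (brute-force) inner-product search
index.add(passage_emb)                     # add all passage vectors
scores, ids = index.search(query_emb, 10)  # top-10 passage ids per query
print(ids[0], scores[0])
```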
- Pre-trained Models for Natural Language Processing: A Survey. Xipeng Qiu et.al. 2020.
- BERT-related-papers
- Pre-trained Language Model Papers from THU-NLP
- Efficient Transformers: A Survey. Yi Tay, Mostafa Dehghani et.al. Arxiv 2020.