This repository provides a list of papers on knowledge-enhanced multimodal learning, inspired by Awesome Vision-and-Language.
- A survey on knowledge-enhanced multimodal learning (2022): https://arxiv.org/abs/2211.12328
- KB-VQA: Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources https://arxiv.org/abs/1511.06973
- Factual VQA (FVQA): FVQA: Fact-Based Visual Question Answering https://arxiv.org/abs/1606.05433
- Knowledge-aware VQA (KVQA): KVQA: Knowledge-Aware Visual Question Answering https://ojs.aaai.org/index.php/AAAI/article/view/4915
- Outside-knowledge VQA (OK-VQA): OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge https://arxiv.org/abs/1906.00067
- Text-KVQA: From Strings to Things: Knowledge-Enabled VQA Model That Can Read and Reason https://openaccess.thecvf.com/content_ICCV_2019/papers/Singh_From_Strings_to_Things_Knowledge-Enabled_VQA_Model_That_Can_Read_ICCV_2019_paper.pdf
- Visual7W+KB: Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering https://arxiv.org/abs/2009.00145
- S3VQA: Select, Substitute, Search: A New Benchmark for Knowledge-Augmented Visual Question Answering https://arxiv.org/abs/2103.05568
- Zero-shot Fact VQA (ZS-F-VQA): Zero-shot Visual Question Answering Using Knowledge Graph https://arxiv.org/abs/2107.05348
- High-order Visual Question Reasoning (HVQR): Explainable High-order Visual Question Reasoning: A New Benchmark and Knowledge-routed Network https://arxiv.org/abs/1909.10128
- Building a Large-scale Multimodal Knowledge Base System for Answering Visual Queries https://arxiv.org/abs/1507.05670
- Image Captioning and Visual Question Answering Based on Attributes and External Knowledge https://ieeexplore.ieee.org/document/7934440
- Explicit Knowledge-based Reasoning for Visual Question Answering https://arxiv.org/abs/1511.02570
- FVQA: Fact-based Visual Question Answering https://arxiv.org/abs/1606.05433
- Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question Answering https://arxiv.org/abs/1809.01124
- Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering https://arxiv.org/abs/1811.00538
- KVQA: Knowledge-Aware Visual Question Answering https://ojs.aaai.org/index.php/AAAI/article/view/4915
- From Strings to Things: Knowledge-Enabled VQA Model That Can Read and Reason https://ieeexplore.ieee.org/abstract/document/9010987
- Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering https://arxiv.org/abs/2009.00145
- Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering https://arxiv.org/abs/2006.09073
- Boosting Visual Question Answering with Context-aware Knowledge Aggregation https://dl.acm.org/doi/pdf/10.1145/3394171.3413943
- Zero-shot Visual Question Answering Using Knowledge Graph https://arxiv.org/abs/2107.05348
- Towards Knowledge-Augmented Visual Question Answering https://aclanthology.org/2020.coling-main.169/
- An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA https://arxiv.org/abs/2109.05014
- Image Captioning for Effective Use of Language Models in Knowledge-Based Visual Question Answering https://arxiv.org/abs/2109.08029
- A Dataset and Baselines for Visual Question Answering on Art https://arxiv.org/abs/2008.12520
- Knowledge is Power: Hierarchical-Knowledge Embedded Meta-Learning for Visual Reasoning in Artistic Domains https://dl.acm.org/doi/pdf/10.1145/3447548.3467285
- ConceptBert: Concept-Aware Representation for Visual Question Answering https://aclanthology.org/2020.findings-emnlp.44/
- Weakly-Supervised Visual-Retriever-Reader for Knowledge-based Question Answering https://aclanthology.org/2021.emnlp-main.517.pdf
- KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA https://arxiv.org/abs/2012.11014
- EKTVQA: Generalized use of External Knowledge to empower Scene Text in Text-VQA https://arxiv.org/abs/2108.09717
- Multi-Modal Answer Validation for Knowledge-Based VQA https://arxiv.org/abs/2103.12248
- Passage Retrieval for Outside-Knowledge Visual Question Answering https://arxiv.org/abs/2105.03938
- Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection https://arxiv.org/abs/2112.06888
- Select, Substitute, Search: A New Benchmark for Knowledge-Augmented Visual Question Answering https://arxiv.org/abs/2103.05568
- KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning https://arxiv.org/abs/2012.07000
- Vision–Language–Knowledge Co-Embedding for Visual Commonsense Reasoning https://www.mdpi.com/1424-8220/21/9/2911
- Multi-Level Knowledge Injecting for Visual Commonsense Reasoning https://ieeexplore.ieee.org/abstract/document/9083951
- Explainable High-order Visual Question Reasoning: A New Benchmark and Knowledge-routed Network https://arxiv.org/abs/1909.10128
- Explainable and Explicit Visual Reasoning over Scene Graphs https://arxiv.org/abs/1812.01855
- Improving Image Captioning by Leveraging Knowledge Graphs https://arxiv.org/abs/1901.08942
- Relational Reasoning using Prior Knowledge for Visual Captioning https://arxiv.org/abs/1906.01290
- Image Captioning with Internal and External Knowledge https://dl.acm.org/doi/pdf/10.1145/3340531.3411948
- Integrating Image Captioning with Rule-based Entity Masking https://arxiv.org/abs/2007.11690
- Joint Commonsense and Relation Reasoning for Image and Video Captioning https://ojs.aaai.org/index.php/AAAI/article/view/6731
- Auto-Encoding Scene Graphs for Image Captioning https://arxiv.org/abs/1812.02378
- Injecting Prior Knowledge into Image Caption Generation https://arxiv.org/abs/1911.10082
- Boosting Entity-aware Image Captioning with Multi-modal Knowledge Graph https://arxiv.org/abs/2107.11970
- KM-BART: Knowledge Enhanced Multimodal BART for Visual Commonsense Generation https://arxiv.org/abs/2101.00419
- Unified Vision-Language Pre-Training for Image Captioning and VQA https://arxiv.org/abs/1909.11059
- Knowledgeable Storyteller: A Commonsense-Driven Generative Model for Visual Storytelling https://www.ijcai.org/proceedings/2019/744
- Knowledge-Enriched Visual Storytelling https://arxiv.org/abs/1912.01496
- Imagine, Reason and Write: Visual Storytelling with Graph Knowledge and Relational Reasoning https://ojs.aaai.org/index.php/AAAI/article/view/16410
- Commonsense Knowledge Aware Concept Selection For Diverse and Informative Visual Storytelling https://arxiv.org/abs/2102.02963
- KG-GAN: Knowledge-Guided Generative Adversarial Networks https://arxiv.org/abs/1905.12261
- Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization https://arxiv.org/abs/2110.10834
- StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation https://arxiv.org/abs/2209.06192
- Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog https://arxiv.org/abs/2204.04680
- Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs https://arxiv.org/abs/2010.07526
- Grounded Situation Recognition https://arxiv.org/abs/2003.12058
- Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge https://arxiv.org/abs/2101.06013
- KB-VLP: Knowledge Based Vision and Language Pretraining https://www.microsoft.com/en-us/research/uploads/prod/2021/10/kb_vlp_ICML2021.pdf
- Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks https://arxiv.org/abs/2004.06165
- ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph https://arxiv.org/abs/2006.16934
- ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration https://arxiv.org/abs/2108.07073