Abstract
Multimodal Large Language Models (MLLMs) are the natural progression of traditional LLMs, extending their capabilities beyond text to multiple modalities. While much ongoing research focuses on new architectures and vision-language adapters, this work centers on equipping MLLMs with the ability to answer questions that require external knowledge. Despite their impressive capabilities, LLMs are prone to factual inaccuracies because they rely on internal parametric knowledge, which limits their ability to provide accurate, up-to-date information, especially in fast-changing contexts. Retrieval-Augmented Generation (RAG) mitigates these issues by supplementing LLMs with relevant external knowledge, improving both the quality and factual accuracy of their outputs. Building on this idea, we propose Visual Retrieval-Augmented Generation (Visual RAG), which has significant potential to enhance the dynamic and contextual accuracy of these models. Our approach integrates an external knowledge source composed of multimodal documents, from which relevant information is retrieved and used as additional context. We propose a framework that combines visual retrieval components with knowledge-enhanced reranking to accurately filter the top-k retrieved images. We conduct experiments on a knowledge-based visual question answering dataset, evaluating both the zero-shot capabilities of this strategy and its adaptability through fine-tuning. Extensive studies are conducted with LLaVA models, focusing on retrieving images and integrating their descriptions to answer specific queries.
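To make the described pipeline concrete, the following is a minimal sketch of the retrieve-rerank-generate flow outlined above. It is not the thesis implementation: the callables `embed_text`, `rerank`, `caption_image`, and `generate_answer` are hypothetical stand-ins for the actual query encoder, knowledge-enhanced reranker, image captioner, and LLaVA-based generator, and the index layout is assumed.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_top_k(query_emb, image_index, k=5):
    """Rank indexed (image, embedding) pairs by similarity to the query embedding."""
    scored = [(cosine_sim(query_emb, emb), img) for img, emb in image_index]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [img for _, img in scored[:k]]

def visual_rag_answer(question, image_index, embed_text, rerank,
                      caption_image, generate_answer, k=5):
    """Sketch of the Visual RAG flow: retrieve, rerank, caption, then generate."""
    query_emb = query_embedding = embed_text(question)          # 1. encode the query
    candidates = retrieve_top_k(query_embedding, image_index, k)  # 2. visual retrieval
    reranked = rerank(question, candidates)                     # 3. knowledge-enhanced reranking
    context = "\n".join(caption_image(img) for img in reranked)  # 4. image descriptions as context
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate_answer(prompt)                              # 5. LLaVA-style answer generation
```

Under these assumptions, the key design choice is that retrieved images are converted to textual descriptions before being prepended to the query, so the generator consumes them as ordinary context.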