Abstract
Multimodal Large Language Models (MLLMs) are the natural progression of traditional LLMs, extending their capabilities beyond text to multiple modalities. While much ongoing research focuses on new architectures and vision-language adapters, this work centers on equipping MLLMs with the ability to answer questions that require external knowledge. Despite their impressive capabilities, LLMs are prone to factual inaccuracies because they rely on internal parametric knowledge, which limits their ability to provide accurate, up-to-date information, especially in fast-changing contexts. Retrieval-Augmented Generation (RAG) mitigates these issues by supplementing LLMs with relevant external knowledge, improving both the quality and factual accuracy of their outputs. Building on this idea, we propose Visual Retrieval-Augmented Generation (Visual RAG), which has significant potential to enhance the dynamic and contextual accuracy of these models. Our approach integrates an external knowledge source composed of multimodal documents, from which relevant information is retrieved and used as additional context. We propose a framework that combines visual retrieval components with knowledge-enhanced reranking to accurately filter the top-k retrieved images. We conduct experiments on a knowledge-based visual question answering dataset, evaluating both the zero-shot capabilities of this strategy and its adaptability through fine-tuning. Extensive studies are conducted with LLaVA models, focusing on retrieving images and integrating their descriptions to answer specific queries.
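To make the described pipeline concrete, the following is a minimal sketch of the retrieve-rerank-generate flow outlined above. It is not the thesis implementation: the callables `embed_text`, `rerank`, `caption_image`, and `generate_answer` are hypothetical stand-ins for the actual query encoder, knowledge-enhanced reranker, image captioner, and LLaVA-based generator, and the index layout is assumed.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_top_k(query_emb, image_index, k=5):
    """Rank indexed (image, embedding) pairs by similarity to the query embedding."""
    scored = [(cosine_sim(query_emb, emb), img) for img, emb in image_index]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [img for _, img in scored[:k]]

def visual_rag_answer(question, image_index, embed_text, rerank,
                      caption_image, generate_answer, k=5):
    """Sketch of the Visual RAG flow: retrieve, rerank, caption, then generate."""
    query_emb = query_embedding = embed_text(question)          # 1. encode the query
    candidates = retrieve_top_k(query_embedding, image_index, k)  # 2. visual retrieval
    reranked = rerank(question, candidates)                     # 3. knowledge-enhanced reranking
    context = "\n".join(caption_image(img) for img in reranked)  # 4. image descriptions as context
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate_answer(prompt)                              # 5. LLaVA-style answer generation
```

Under these assumptions, the key design choice is that retrieved images are converted to textual descriptions before being prepended to the query, so the generator consumes them as ordinary context.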