Thesis etd-09232023-174306

Thesis type

Master's thesis

Author

PUGNALONI, FRANCESCO

URN

etd-09232023-174306

Title

Large-scale Table-Matching through Graph-Embeddings

Title in English

Large-scale Table-Matching through Graph-Embeddings

Department

Department of Engineering "Enzo Ferrari"

Degree programme

Computer Engineering (D.M.270/04)

Committee

Committee member	Role
SIMONINI GIOVANNI	Primary supervisor
NAUMANN FELIX	Co-supervisor
BERGAMASCHI SONIA	Co-supervisor

Keywords

Big Data
Entity Resolution
Graph Neural Network
Table-Embeddings
Table-matching

Session start date

2023-10-19

Availability

Accessible via web (all thesis files are accessible)

Analytical summary

The goal of “Table Matching” is to find pairs of matching tables in a collection of relational tables, where two tables are considered a “match” if they are highly similar according to a given similarity measure. Performing this operation on collections of millions of heterogeneous tables is hard, because obtaining an exact solution is a quadratic problem. One way to reduce its complexity is to use “blocking” techniques, such as LSH, to reduce the number of comparisons to perform. Unfortunately, current blocking methods are not suited to working with tables, and algorithms such as LSH for cosine similarity work only with embeddings.
In this thesis, we address the problem of generating “table embeddings” that preserve properties useful for finding matching tables, proposing two frameworks that use different techniques to compute them. Both rely on intermediate graph representations: the first uses a node2vec-based approach, the second Graph Neural Networks (GNNs).
Our experiments suggest that node2vec tends not to be the best tool as the number of tables grows. By contrast, the framework based on Graph Neural Networks gave promising results on large amounts of previously unseen data, scaling well in terms of both execution time and embedding quality.
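
As a rough illustration of why the exact solution is quadratic: an exhaustive search compares every unordered pair of tables. The collection size n = 10^6 below is an assumed round figure standing in for "millions of tables", not a number reported in the thesis.

```latex
\[
  \binom{n}{2} = \frac{n(n-1)}{2},
  \qquad
  n = 10^{6} \;\Rightarrow\; \frac{10^{6}\,(10^{6}-1)}{2} \approx 5 \times 10^{11}
  \text{ pairwise comparisons.}
\]
```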

Abstract

The purpose of “Table Matching” is to find matching pairs within a collection of relational tables, where two tables are a “match” if they are highly similar under a given similarity measure. Performing this operation on collections of millions of heterogeneous tables is hard, because an exact solution requires comparing every pair of tables, which is quadratic in the number of tables. One way to reduce this complexity is to employ blocking techniques, such as locality-sensitive hashing (LSH), to cut the number of comparisons. Unfortunately, existing blocking methods are not suited to tables, and algorithms such as “LSH for cosine similarity” work only with embeddings. In this thesis, we address the problem of generating table embeddings that preserve properties useful for discovering matching tables, and we propose two frameworks that use different techniques to generate such embeddings. Both use intermediate graph representations: the first implements a node2vec-based approach, and the second exploits Graph Neural Networks (GNNs). Our experiments suggest that node2vec tends to struggle as the number of tables to embed increases. In contrast, the GNN-based framework gave promising results when processing large amounts of previously unseen data, scaling well in terms of both execution time and embedding quality.
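
The blocking step mentioned in the abstract can be illustrated with a minimal sketch of random-hyperplane LSH for cosine similarity, assuming each table has already been mapped to an embedding vector. This is not the thesis's implementation: the function name `lsh_candidate_pairs`, its parameters, and the toy vectors are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np


def lsh_candidate_pairs(embeddings, n_bits=8, n_tables=4, seed=0):
    """Return pairs (i, j), i < j, that share a bucket in at least one hash table.

    Hypothetical helper: random-hyperplane LSH for cosine similarity.
    """
    rng = np.random.default_rng(seed)
    _, dim = embeddings.shape
    candidates = set()
    for _ in range(n_tables):
        # Each row is the normal vector of a random hyperplane through the origin.
        planes = rng.standard_normal((n_bits, dim))
        # The sign pattern of the projections is the hash key of each embedding.
        bits = (embeddings @ planes.T) >= 0
        buckets = defaultdict(list)
        for idx, key in enumerate(bits):
            buckets[key.tobytes()].append(idx)
        # Only embeddings that fall in the same bucket are compared later.
        for members in buckets.values():
            candidates.update(combinations(members, 2))
    return candidates


# Toy embeddings (assumed data): rows 0 and 1 are identical, row 2 points the opposite way.
toy = np.array([
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0],
    [-1.0, -2.0, -3.0],
    [0.5, -1.0, 2.0],
])
pairs = lsh_candidate_pairs(toy)
print(sorted(pairs))
```

Vectors separated by a small angle agree on most hyperplane signs and so tend to share a bucket, while dissimilar vectors rarely do. Only the candidate pairs are then compared exactly. More bits per key makes buckets more selective, and more hash tables recovers pairs that a single table would miss.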

File

File name	Size	Estimated download time (hours:minutes:seconds)
		28.8 Modem	56K Modem	ISDN (64 Kb)	ISDN (128 Kb)	More than 128 Kb
PugnaloniFrancesco_Thesis.pdf	5.13 Mb	00:23:46	00:12:13	00:10:41	00:05:20	00:00:27