Abstract
Entity Matching (EM) aims to identify records referring to the same real-world entity, a critical task in data integration. Deep learning models have demonstrated performance superior to traditional rule-based approaches, particularly in handling textual and noisy data. However, these models require extensive labeled training data, limiting their practical applicability. To address this challenge, this thesis investigates the use of Large Language Models (LLMs) for generating high-quality, hard-to-match synthetic records in a zero-shot setting. The proposed approach employs an LLM to produce alternative entity descriptions, followed by a two-step verification process: an LLM-based classifier ensures semantic consistency, while a retriever component evaluates distinctiveness. If a generated record fails either criterion, the pipeline iteratively refines the output based on feedback from previous attempts. We experiment with open-source LLMs and evaluate the impact of the generated matching pairs on EM model generalization across multiple benchmark datasets. Our findings reveal that training with LLM-augmented data yields results competitive with, and sometimes superior to, training with the original data.