Abstract
Entity Resolution (ER) is a critical task in data integration: it identifies and merges records that refer to the same real-world entity across datasets. This research introduces a similarity function that estimates, from a limited set of labeled data, how similarly an ER model will perform on two datasets. The main application is selecting the source dataset on which to train an ER model so that it performs best on a target dataset that has no labels. Using a semi-supervised approach, we train a state-of-the-art entity matching (EM) model, such as Ditto, on a labeled dataset and apply it to both labeled and unlabeled datasets to produce entity clusters. From these clusters we extract key features, including the number of entities, entity sizes, and intra- and inter-entity similarity, alongside features computed directly on the original datasets. Ranking metrics are also computed to refine the similarity assessment between datasets. Together, these features and metrics define a similarity measure of ER model transferability, which is then used to choose the source dataset whose trained model generalizes best to the unlabeled target dataset.
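As an illustration of the cluster-level features named above, the following is a minimal sketch, not the method used in this work. It assumes that each entity cluster is a list of record strings and uses token-level Jaccard similarity as a stand-in for the record similarity; the names `jaccard` and `cluster_features` are hypothetical and introduced only for this example.

```python
from itertools import combinations
from statistics import mean


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (an assumed stand-in for the real record similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty records are treated as identical
    return len(ta & tb) / len(ta | tb)


def cluster_features(clusters: list[list[str]]) -> dict[str, float]:
    """Summarize a clustering by entity count, entity size, and intra-/inter-entity similarity."""
    sizes = [len(c) for c in clusters]
    # Similarity between every pair of records placed in the same entity.
    intra = [jaccard(a, b) for c in clusters for a, b in combinations(c, 2)]
    # Similarity between every pair of records placed in different entities.
    inter = [
        jaccard(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    ]
    return {
        "num_entities": float(len(clusters)),
        "mean_entity_size": float(mean(sizes)) if sizes else 0.0,
        "mean_intra_similarity": float(mean(intra)) if intra else 0.0,
        "mean_inter_similarity": float(mean(inter)) if inter else 0.0,
    }


if __name__ == "__main__":
    # Toy clusters, as an EM model might produce them on a product dataset.
    example = [
        ["apple iphone 12", "apple iphone 12 black"],
        ["samsung galaxy s21"],
    ]
    print(cluster_features(example))
```

Feature vectors computed this way for a source and a target dataset could then be compared with any vector distance. The all-pairs inter-entity computation is quadratic in the number of records, so a real pipeline would likely sample pairs or use blocking.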