Abstract
Data Science aims to extract useful insights from raw datasets. Since data largely impact the performance, fairness, robustness, safety, and scalability of the developed models, their quality is a key aspect of most Data Science projects. Data Quality is critical in high-stakes domains, where it affects the health of the individuals involved. To ensure this property, data industry professionals almost universally agree that one of the most difficult and time-consuming tasks is Data Preparation, the Data Science field that deals with messy data, cleaning and organizing them for subsequent Machine Learning (ML) models.
Data Quality is usually achieved through a pipeline made up of a sequence of atomic operations, each focused on solving a specific problem in the input dataset. However, a new data extraction can reveal data mutations related to a different schema or to different data characteristics. These variations generate errors that are classifiable as syntactic, if they break the pipeline, or semantic, if they do not break the pipeline but nonetheless cause it to produce incorrect results.
Fully automated, out-of-the-box solutions to such a complex problem do not exist; for this reason, a new approach that goes beyond classical ones is required. This thesis focuses on this specific problem, which we denote as Incremental Data Preparation, and proposes a solution where mutability is addressed through a semi-automatic approach that gives users the opportunity to inject new operations between or in place of previous ones. These operations can be tuned through specific configurations that the user can easily define thanks to a friendly Graphical User Interface (GUI).
Besides Data Profiling and Cleaning, a novel approach based on Provenance has been included in the developed system. It collects information on the data before and after each operation is applied. The recorded information is shown graphically to the user, compared with the information derived from a previous error-free batch, in order to help him/her identify possible inconsistencies.
The procedures illustrated above have been used to support physicians in the management of Covid-19 patients, providing them with forecasts about their patients' medical conditions, such as a respiratory crisis in the following 48 hours.