Date: 9 December 2024, 14:00 - Type: Thesis - Roxane JOUSEAU - Amphi B - IUT
Evaluation de la qualité des données numériques pour les tâches de classification (Assessing the quality of numerical data for classification tasks)
Data quality has become a focus in a wide array of industrial and research applications. This is especially true for machine learning, as most models rely heavily on data. However, assessing data quality before and after repair is not an easy task. Data quality is usually assessed through dimensions that describe the data, but most of these dimensions have no unified definition, and there is no unified metric for data quality as a whole. First, this thesis investigates the question "Is it always better to repair data?" by studying the impact of data deterioration on classification performance before and after repair. It also proposes a method to decompose repairing approaches, paired with a metric that evaluates how difficult different data repairing methods are to use. Second, this thesis presents a unified metric to measure data quality for classification tasks. This metric is designed so that no reference data or expertise is needed. It is based on two main terms: the first evaluates classification performance across a wide range of models, and the second assesses variations in performance when a small number of errors is injected into datasets. Finally, this thesis proposes a metric to measure the degree of data repairability, combined with a metric of the difficulty of using a repairing pipeline.
The jury is composed of:
Sébastien Salva, Université Clermont-Auvergne, Thesis supervisor
Chafik Samir, Université Clermont-Auvergne, Thesis supervisor
Laurent d’Orazio, Université de Rennes, Reviewer
Nicoleta Rogovschi, Université Paris 5, Reviewer
Mireille Batton-Hubert, Ecole nationale supérieure des mines de Saint Etienne, Examiner