Introduction to Data-Centric AI

Traditional approaches to machine learning commonly follow a model-centric paradigm, in which most effort is spent improving model architectures, tuning hyperparameters, modifying loss functions, or introducing regularisation. In an academic setting, this is often done under the assumption of a fixed, clean dataset. However, popular academic datasets have been shown to contain labelling errors (see the website by Northcutt et al.), so experimental results may be biased and may not reflect performance on real-world data.

A data-centric paradigm emphasises systematic methods for improving the quality and quantity of the data available during training and testing, with the aim of producing better outcomes (e.g., better generalisation, consistent performance across subpopulations) independently of model design. Within this series, there are two overarching data-centric approaches:

  1. Development of techniques to better understand existing data and leverage this information to train better models. For example, Curriculum Learning[1] and coreset selection[2].
  2. Development of techniques to modify existing data to train better models. For example, Confident Learning[3] and data augmentation[4] (a minimal augmentation sketch follows this list).
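
To make the second category concrete, the sketch below applies simple word-level augmentations (random deletion and random swap, in the spirit of [4]) to a review; the example sentence and the augmentation parameters are illustrative only and are not taken from the lab.

```python
import random

def random_deletion(tokens, p=0.1):
    """Drop each token with probability p, keeping at least one token."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

def random_swap(tokens, n_swaps=1):
    """Swap the positions of two randomly chosen tokens, n_swaps times."""
    tokens = tokens.copy()
    for _ in range(n_swaps):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

review = "the film was surprisingly good and the acting felt genuine"
tokens = review.split()
augmented = [" ".join(random_deletion(tokens)), " ".join(random_swap(tokens))]
print(augmented)  # two lightly perturbed copies that keep the original label
```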

In summary, a model-centric approach centres on taking any dataset and producing the best possible model, while a data-centric approach uses systematic methods to produce the best possible dataset, which in turn can be used to train “better” models.

Lab

The lab assignment for the first lecture in the series is focussed on simple text classification (“good” or “bad” review sentiment). Using a pre-trained sentence transformer without additional fine-tuning is sufficient to achieve above 90% accuracy on the task (see notebook).
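
As a rough sketch of that baseline (not the notebook's exact code), reviews can be embedded with a pre-trained sentence transformer and the frozen embeddings fed to a simple linear classifier. The model name, file names, and column names below are assumptions.

```python
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical file and column names; the lab notebook defines its own.
train = pd.read_csv("train.csv")  # columns: "review", "label" ("good"/"bad")
test = pd.read_csv("test.csv")

# Encode raw review text into fixed-length embeddings (no fine-tuning).
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
X_train = encoder.encode(train["review"].tolist())
X_test = encoder.encode(test["review"].tolist())

# Fit a simple linear classifier on top of the frozen embeddings.
clf = LogisticRegression(max_iter=1000).fit(X_train, train["label"])
print("accuracy:", accuracy_score(test["label"], clf.predict(X_test)))
```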

Visual inspection of the training data reveals a subpopulation of reviews that are largely mislabelled and are characterised by the presence of markdown in the text. Simply removing these samples increases accuracy on the validation set (a minimal version of this filter is sketched below). The next lecture introduces more sophisticated techniques for identifying label errors.
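
A minimal version of that clean-up, assuming the data sits in a pandas DataFrame with a "review" column and that markdown- or HTML-style markup is the tell-tale pattern (both the column name and the exact pattern are assumptions), might look like this:

```python
import re
import pandas as pd

# The exact markup that flags the suspect subpopulation is an assumption:
# here, HTML-style tags or markdown emphasis markers.
markup = re.compile(r"<[^>]+>|\*\*|__")

def has_markup(text: str) -> bool:
    """Return True if the review contains markdown/HTML-like formatting."""
    return bool(markup.search(text))

train = pd.read_csv("train.csv")                     # columns: "review", "label"
cleaned = train[~train["review"].apply(has_markup)]  # drop the suspect subpopulation
print(f"removed {len(train) - len(cleaned)} of {len(train)} training reviews")
# Re-train the classifier on `cleaned` and re-evaluate on the validation set.
```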

References

  1. Bengio, Y., Louradour, J., Collobert, R. and Weston, J., 2009, June. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 41-48).
  2. Mirzasoleiman, B., Bilmes, J. and Leskovec, J., 2020, November. Coresets for data-efficient training of machine learning models. In International Conference on Machine Learning (pp. 6950-6960). PMLR.
  3. Northcutt, C., Jiang, L. and Chuang, I., 2021. Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70, pp.1373-1411.
  4. Shorten, C., Khoshgoftaar, T.M. and Furht, B., 2021. Text data augmentation for deep learning. Journal of Big Data, 8, pp.1-34.