ABSTRACT

Deep learning algorithms have brought about a paradigm shift in automated medical image analysis, including segmentation. While state-of-the-art models can achieve near human-level performance on many tasks, these same algorithms can be remarkably brittle and often fail to generalize across datasets and institutions. A key component of training robust algorithms and evaluating their generalizability is the curation of large quantities of heterogeneous data from diverse sources. In this chapter, the key challenges of data curation are discussed, with a focus on the complexity of medical data, patient privacy protection, data quality issues, and data annotation. Solutions to these challenges are also detailed. Methods for protecting patient privacy include automated anonymization and distributed deep learning techniques. Algorithms can be utilized to detect and correct data quality issues. Natural language processing, crowdsourcing, and weakly supervised learning can be implemented to reduce the annotation burden. Lastly, machine learning competitions can be an effective framework for constructing large, high-quality, multi-institutional datasets.
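As a concrete illustration of the automated anonymization mentioned above, the following is a minimal sketch. It operates on a plain dictionary of DICOM-like header fields; the `PHI_TAGS` set and `anonymize` helper are hypothetical simplifications, and a production tool (e.g. one built on pydicom) would handle full DICOM headers, private tags, and burned-in pixel data.

```python
import hashlib

# Header fields assumed, for this sketch, to carry protected health
# information (PHI) that must be stripped before data sharing.
PHI_TAGS = {"PatientName", "PatientID", "PatientBirthDate", "InstitutionName"}

def anonymize(record: dict) -> dict:
    """Return a copy of the record with PHI fields removed.

    The patient ID is replaced by a truncated one-way hash so that scans
    from the same patient remain linkable without exposing the identifier.
    """
    out = {k: v for k, v in record.items() if k not in PHI_TAGS}
    if "PatientID" in record:
        out["PseudonymousID"] = hashlib.sha256(
            record["PatientID"].encode()
        ).hexdigest()[:12]
    return out

# Example record with a mix of PHI and clinically relevant metadata.
scan = {"PatientName": "DOE^JANE", "PatientID": "12345",
        "Modality": "CT", "StudyDate": "20230101"}
print(anonymize(scan))
```

Hashing rather than deleting the patient ID is a common design choice: longitudinal studies need to group scans by patient even after direct identifiers are removed.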