The success of AI will depend on validated public data sets

December 09, 2019
by John W. Mitchell, Senior Correspondent
A panel presentation last Sunday at RSNA offered insights on how to access and assemble publicly available data sets for use in generating AI algorithms. The group also touched on points of caution around bias, validation, reference diagnosis, sample size, and other variables that can skew data to the detriment of good AI development.

John Freymann, informatics manager at the National Cancer Institute, discussed The Cancer Imaging Archive (TCIA), which de-identifies and hosts more than 100 data set collections for use by researchers. He reviewed new data sets, including data generated by NCI/NIH grants, challenge competitions, and publication data sharing requests.

“More and more data sets are being built for AI performance,” Freymann told the audience.

TCIA has more than 15,000 active users per month, and nearly 900 peer-reviewed articles have been published based on TCIA data. The data sets are widely used, said Freymann, because of their permissive Creative Commons licensing agreements. He also praised the growing number of challenge competitions with curation from radiologists as valuable sources of new data for AI applications.

The second speaker, Dr. Laura Coombs, vice president of data science and informatics at the American College of Radiology, walked through the processes and challenges in several data quality areas. These included anonymization of data (removing individual patient information), ground truth (the reliability of data collected on-site), and federation (bringing the algorithm to the data securely, rather than exporting data for development and exposing it to security risks).
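
To make the anonymization step concrete, here is a minimal sketch using the open-source pydicom library; the tag list and file names are illustrative assumptions, not the ACR's actual de-identification pipeline.

```python
# Minimal sketch of DICOM header anonymization with pydicom.
# The tag list below is a small illustrative subset; real
# de-identification profiles cover many more attributes.
import pydicom

IDENTIFYING_TAGS = ["PatientName", "PatientID", "PatientBirthDate", "PatientAddress"]

def anonymize(in_path: str, out_path: str) -> None:
    ds = pydicom.dcmread(in_path)
    for tag in IDENTIFYING_TAGS:
        if tag in ds:
            setattr(ds, tag, "")   # blank the value but keep the element
    ds.remove_private_tags()       # private tags often hide identifiers
    ds.save_as(out_path)

anonymize("chest_ct.dcm", "chest_ct_anon.dcm")  # hypothetical file names
```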

Dr. Jayashree Kalpathy-Cramer, director of QTIM Lab at the Center for Machine Learning, MGH & BWH Center of Clinical Data Science, made a case for the expansion of public data sets by citing several points:

– There is, arguably, a reproducibility crisis in research.
– Very few published models have been validated using external data sets.
– Multi-institutional data sets are needed to build and validate robust AI tools.
– Radiomics and machine learning methods can be "brittle," in that performance degrades when a model is applied to data sets other than the one it was trained on (see the sketch after this list).
– Models built on limited internal databases can encode and propagate historic biases.
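
That brittleness is easy to reproduce in miniature. The sketch below is a hypothetical demonstration on synthetic data, not any panelist's experiment: a model that latches onto a site-specific artifact looks strong on an internal hold-out set but degrades at an external site where the artifact is absent.

```python
# Synthetic demonstration of "brittleness": feature 0 carries a weak true
# signal, feature 1 is a site-specific artifact correlated with the label
# only at the internal site.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_site(n, shortcut_strength):
    y = rng.integers(0, 2, size=n)
    signal = y + rng.normal(scale=1.0, size=n)               # weak true signal
    artifact = shortcut_strength * y + rng.normal(scale=0.5, size=n)
    noise = rng.normal(size=(n, 3))                          # uninformative features
    return np.column_stack([signal, artifact, noise]), y

X_train, y_train = make_site(2000, shortcut_strength=2.0)    # internal training data
X_int, y_int = make_site(500, shortcut_strength=2.0)         # internal hold-out
X_ext, y_ext = make_site(500, shortcut_strength=0.0)         # external institution

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("internal AUC:", roc_auc_score(y_int, model.predict_proba(X_int)[:, 1]))
print("external AUC:", roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1]))
# Internal AUC is near-perfect; external AUC drops sharply.
```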

She added that building robust machine learning models requires large volumes of well-annotated data, and public data sets can improve reproducibility. However, annotations need to be incorporated into the models using common standards, because human annotations can be biased.
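
One common guard against individual annotator bias is to collect labels from several readers and form a consensus. The snippet below is a simple majority-vote sketch with invented labels, offered only as an illustration of the idea, not a method the panel endorsed.

```python
# Majority-vote consensus across readers; invented labels for illustration.
import numpy as np

# Rows = cases, columns = annotators (1 = finding present, 0 = absent).
labels = np.array([
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
])

consensus = (labels.mean(axis=1) >= 0.5).astype(int)    # [1 0 1 0]
agreement = (labels == consensus[:, None]).mean(axis=0)  # flags outlier readers
print("consensus labels:", consensus)
print("per-annotator agreement with consensus:", agreement)
```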

Dr. George Shih, associate vice chair for informatics at Weill Cornell Medicine Radiology, noted that there is a shortage of physicians worldwide, and lifesaving medical exams go unread as a result. AI, Shih said, is transforming medicine. He characterized the current AI imaging environment as an academic and industry gold rush, measured by interest in AI, conference sessions, and competitions.

Data and privacy concerns related to AI data sharing need to be top of mind, as does rooting out data bias, which Shih said is prevalent in data sets. He noted that healthcare systems worry about the legal risk of sharing data, and some even consider not sharing to be a competitive advantage. He cited the federated learning model, in which an algorithm in development is brought to the in-house data set rather than the data being shipped to an external AI developer.
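
In outline, federated learning looks something like the sketch below: each site trains a copy of the model on its own data, and only the model weights, never patient records, leave the institution. This is a minimal federated-averaging toy on synthetic data; production systems add secure aggregation, auditing, and much more.

```python
# Toy federated averaging: three "hospitals" train locally on private data;
# only weights are shared and averaged centrally.
import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0, 0.5])             # ground-truth linear model

def make_site_data(n):
    X = rng.normal(size=(n, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    return X, y

def local_train(w, X, y, lr=0.1, steps=20):
    """Gradient descent on this site's data only; the data never leaves."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

sites = [make_site_data(200) for _ in range(3)]  # three sites' private data
w = np.zeros(3)                                  # shared global model

for _ in range(10):
    local_ws = [local_train(w.copy(), X, y) for X, y in sites]
    w = np.mean(local_ws, axis=0)                # server averages weights only

print("learned weights:", np.round(w, 2))        # approaches true_w
```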

Another panel member, Dr. Laila Poisson, a biostatistician at Henry Ford Health System, noted that it’s always best to design a data study with the end in mind. She cited the 22-item Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) checklist, which aims to improve reporting for studies that develop, validate, or update a prediction model. AI developers also need to consider whether an algorithm can work under both ideal and unusual conditions.

The final speaker, Dr. Arie Meir, product manager for Google Cloud Healthcare and Life Sciences, reviewed the company’s efforts to pioneer de-identification (de-ID) methods that remove patient identifiers. The goal, he maintained, is a meaningful balance between exposing raw data and deleting it entirely.

For example, if a patient texts a primary care doctor with a complaint of fever and includes her phone number and Social Security Number (SSN), the only item that should be de-identified is the SSN, not her phone number or temperature. The algorithm leaves the phone number visible so the clinician can call the patient as requested, but blocks the SSN in case the message is intercepted or hacked.
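
A toy version of that selective redaction can be written with a single pattern match, as in the sketch below. It uses a plain regular expression and an invented message purely for illustration; Google’s production de-ID tooling is far more sophisticated than this.

```python
# Toy selective de-identification: redact SSN-shaped tokens, leave the
# phone number and temperature untouched. Purely illustrative.
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def deidentify(message: str) -> str:
    """Replace SSNs; other numbers (phone, vitals) are left intact."""
    return SSN_PATTERN.sub("[SSN REDACTED]", message)

msg = ("Running a fever of 101.3F, please call me at 555-867-5309. "
       "My SSN is 123-45-6789.")
print(deidentify(msg))
# -> Running a fever of 101.3F, please call me at 555-867-5309.
#    My SSN is [SSN REDACTED].
```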

Imaging, he said, poses similar challenges in deciding what information needs to be de-identified while keeping the data useful. He cited Google's work with a large veterinary company that used an AI algorithm to organize a 90-second image sequence for its animal radiologists. The platform saved more than $1 million annually in staff time and improved job satisfaction.