As medical treatment continues to advance, so does the need for data to propel that progress. In new research from the University of Maryland’s Robert H. Smith School of Business, Associate Professor Margrét Bjarnadóttir and her co-authors explore “how recent advances in large language models (LLMs) present new opportunities for progress, as well as new risks, in synthetic health data generation (SHDG).”
Bjarnadóttir’s research, “Large Language Models and Synthetic Health Data: Progress and Prospects,” published in JAMIA Open in December 2024, looks at the “growing interest in the application of machine learning models and advanced analytics to various healthcare processes and operations,” according to the summary. Bjarnadóttir worked with her doctoral student Daniel Smolyak of the University of Maryland Department of Computer Science, Kenyon Crowley of Accenture Federal Services, and Ritu Agarwal of Johns Hopkins University.
This is an area Bjarnadóttir has been studying for two decades. “When I started, nobody was thinking about algorithmic bias or algorithmic fairness,” she said, noting a 2018 paper that pointed out algorithmic biases in healthcare. “Different groups of the population have different access to the healthcare system,” she said, which could create bias in the machine learning models, “against or for some populations.”
Bjarnadóttir and her co-authors decided to pursue the topic because, looking back at healthcare data, “patient groups are not equally represented.”
“So when we have these large databases where we are starting to build models, there are patient groups that are underrepresented. So then how can we augment that data for our models to work better for groups that are not well represented in the data?” she said, adding that the ultimate goal is to build the best models possible.
One challenge is that the sharing of health care data is strictly limited, and for good reason, Bjarnadóttir said. To make data available without revealing patient information, the authors examine how to create high-quality synthetic data that can replace or augment the real data, she added.
The paper identifies six directions for further research: evaluation metrics, LLM adoption, data efficiency, generalization, health equity and regulatory challenges. Bjarnadóttir is working on a follow-up paper in the area of health equity, studying whether LLMs can be used to “generate synthetic data to improve predictions for underrepresented groups in our data.”
Bjarnadóttir said that in the future, an LLM may conduct initial health evaluations. To deliver effective, personalized medicine, she said, prediction models in health care need to be built from high-quality data. “And one of the barriers to high-quality models is that it is hard to access high-quality clinical data. The hope is that if we are able to generate high-quality synthetic data, then we can start creating and building these, or at least testing these, more creative ways of building models on the synthetic data,” which she said could create better healthcare models.
For a diagnostic algorithm to help identify whether a patient has cancer, for example, it needs to take into account data from all of the different patient groups. “Just based on ‘do no harm,’ you would never make the prediction worse for some groups to match the other,” she explained.
“So then that makes algorithmic fairness in the health care context very unique,” Bjarnadóttir said, adding that the model needs to maximize the algorithm’s abilities for each group independently.
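To make the idea of judging a model group by group concrete, here is a minimal Python sketch. It is not taken from the paper: the function name, the group labels and the toy records are all made up for illustration, and recall (the share of truly positive patients the model flags) is only one of many metrics that could be used. The point is simply that performance is computed separately for each group rather than averaged over everyone.

```python
from collections import defaultdict


def recall_by_group(records):
    """Return {group: recall} for (group, actual, predicted) triples.

    Hypothetical helper for illustration only. Recall is the share of
    truly positive patients in a group that the model also flags.
    """
    positives = defaultdict(int)  # actual positives seen per group
    caught = defaultdict(int)     # actual positives also predicted positive
    for group, actual, predicted in records:
        if actual == 1:
            positives[group] += 1
            if predicted == 1:
                caught[group] += 1
    # Groups with no actual positives are left out (recall is undefined).
    return {g: caught[g] / positives[g] for g in positives}


# Made-up toy records: (group, actual label, predicted label).
toy_records = [
    ("A", 1, 1), ("A", 1, 1), ("A", 1, 0), ("A", 0, 0),
    ("B", 1, 1), ("B", 1, 0), ("B", 0, 0), ("B", 0, 1),
]

per_group = recall_by_group(toy_records)
# The group the model currently serves worst on this metric.
worst_group = min(per_group, key=per_group.get)

print(per_group)
print(worst_group)
```

In this toy data the model catches a smaller share of true cases in group B than in group A. Under the “do no harm” view described above, the response would be to improve the model for group B, for example with more or better data for that group, rather than to lower its performance for group A.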
Read Bjarnadóttir’s recent work, “Large Language Models and Synthetic Health Data: Progress and Prospects,” published in JAMIA Open.