AI's Synthetic Data Boosts Biomedical Research

A new review highlights how large language models are creating artificial data to solve real-world medical challenges.

A recent scoping review examines how large language models (LLMs) generate synthetic data for biomedical research. This approach helps overcome data scarcity, especially with sensitive patient information. The review identifies key methods and challenges in this rapidly evolving field.

By Sarah Kline

February 19, 2026

4 min read

Key Facts

  • A scoping review analyzed 59 studies on synthetic data generation by LLMs in biomedical research.
  • The review covered literature published between 2020 and 2025, following PRISMA-ScR guidelines.
  • Unstructured texts were the predominant data modality, accounting for 78.0% of studies.
  • LLM prompting was the most common generation method at 74.6%.
  • Human-in-the-loop assessments were the most frequent evaluation method (44.1%).

Why You Care

Ever worried about your sensitive medical data falling into the wrong hands? Or wondered how medical research progresses without enough patient information? Imagine a world where AI can create realistic, yet entirely artificial, patient data. This new review reveals how large language models (LLMs) are doing just that, helping to accelerate biomedical research without compromising privacy. What if this approach could speed up cures for diseases that affect you or your loved ones?

What Actually Happened

A recent scoping review, published in the Journal of Healthcare Informatics Research, delved into the use of large language models (LLMs) for generating synthetic data in biomedical research. The team, led by Hanshu Rao, systematically reviewed studies from 2020 to 2025. Their goal was to understand how LLMs address data scarcity, utility, and quality issues. This comprehensive analysis included 59 relevant studies, according to the announcement. The research shows a growing adoption of synthetic data generation. This is particularly true for clinical research applications.

"Synthetic data" refers to artificially generated data that mimics the statistical properties of real data but contains no actual patient information. "Large language models" are AI programs trained on vast amounts of text; they can understand, generate, and translate human language, and are now being adapted to create medical datasets.
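To make the idea concrete, here is a minimal sketch of the prompting approach the review identifies as the most common generation method. The template, field names, and instructions are illustrative assumptions of mine, not taken from any study in the review; a real pipeline would send this prompt to an LLM API.

```python
# Hypothetical sketch: building a prompt that asks an LLM to generate a
# synthetic (entirely artificial) clinical note. The template and field
# names are illustrative assumptions, not drawn from the review.

def build_synthetic_note_prompt(age_range: str, condition: str) -> str:
    """Return a prompt instructing an LLM to produce a fictional patient note."""
    return (
        "Generate a fictional clinical note for research use only. "
        "The patient must be entirely invented and contain no real data.\n"
        f"- Age range: {age_range}\n"
        f"- Primary condition: {condition}\n"
        "Include: chief complaint, history, assessment, and plan."
    )

prompt = build_synthetic_note_prompt("60-70", "type 2 diabetes")
print(prompt)
```

In practice, such prompts are varied across demographics and conditions to produce a diverse corpus, which is then screened before use.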

Why This Matters to You

This development is crucial for anyone concerned with medical privacy and research progress. Synthetic data allows scientists to test new hypotheses without accessing real patient records, protecting your personal health information. The study finds that LLMs are increasingly vital for overcoming data limitations in healthcare, which means faster development of new treatments and better diagnostic tools. Imagine a pharmaceutical company developing a new drug: it needs vast amounts of patient data for testing, but real patient data is often hard to obtain due to privacy concerns. Synthetic data provides a safe alternative, allowing rigorous testing and potentially quicker drug approvals. How might this impact the speed of finding cures for complex diseases?

Here’s a breakdown of the data modalities and generation methods:

| Data Modality | Percentage of Studies |
| --- | --- |
| Unstructured texts | 78.0% |
| Tabular data | 13.6% |
| Multimodal sources | 8.4% |

According to the review, “Synthetic data generation using large language models (LLMs) demonstrates substantial promise in addressing biomedical data challenges and shows increasing adoption in biomedical research.” This highlights the significant role AI plays, offering a secure path forward for medical innovation. Your data remains protected while research advances.

The Surprising Finding

One surprising aspect of the review was the heterogeneity in evaluation methods. While LLMs generate diverse synthetic data, there is no single, standardized way to assess its quality. The study finds that 44.1% of evaluations used human-in-the-loop assessments, meaning human experts still play a significant role. This contrasts with intrinsic metrics at 27.1% and LLM-based evaluations at 13.6%. The mix shows that fully automated quality checks are not yet the norm, challenging the assumption that AI can self-regulate its output in critical fields. The team also noted persistent limitations in data modalities and resource accessibility, and standardized evaluation protocols are still lacking. This suggests that human oversight remains essential for ensuring data reliability.
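As a toy illustration of what an "intrinsic metric" can look like, the sketch below compares two simple statistics between a real and a synthetic text corpus: average document length and vocabulary overlap. The metrics and the tiny example corpora are my assumptions for illustration; the review does not prescribe specific metrics.

```python
# Illustrative sketch (an assumption, not the review's method) of intrinsic
# metrics: comparing simple statistics of real vs. synthetic text corpora.

def avg_length(corpus):
    """Mean token count per document, using whitespace tokenization."""
    return sum(len(doc.split()) for doc in corpus) / len(corpus)

def vocab_overlap(real, synthetic):
    """Jaccard overlap between the vocabularies of two corpora."""
    v_real = {w for doc in real for w in doc.lower().split()}
    v_syn = {w for doc in synthetic for w in doc.lower().split()}
    return len(v_real & v_syn) / len(v_real | v_syn)

# Tiny invented corpora, for demonstration only.
real = ["patient reports chest pain", "patient denies fever"]
synthetic = ["patient reports mild fever", "patient denies chest pain"]

print(avg_length(real), avg_length(synthetic))       # 3.5 4.0
print(round(vocab_overlap(real, synthetic), 2))      # 0.86
```

A high overlap and similar length distribution suggest the synthetic corpus is statistically plausible, but such checks say nothing about clinical accuracy, which is one reason human-in-the-loop review still dominates.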

What Happens Next

Future efforts will likely focus on developing consistent evaluation frameworks. The paper states that these frameworks need to be transparent and widely accessible. This will help ensure the reliability of synthetic data. Think of it as creating a universal grading system for AI-generated medical information. This could take 12-18 months to develop and gain widespread adoption. For example, a new consortium might form to establish global standards for synthetic data quality. This would ensure that data generated by different LLMs is consistently high quality.

For readers, this means staying informed about these evolving standards. If you work in healthcare or research, understanding these developments is key. The industry implications are vast: this technology could democratize access to research data and accelerate drug discovery and personalized medicine. The team emphasized that expanding accessibility is crucial to supporting effective applications in biomedical research. This ongoing work promises to reshape how medical studies are conducted, making them faster and more secure.
