AI Learns from Synthetic Data, Outperforming Human Language

New research shows how AI models can improve by training on non-linguistic data first.

Scientists are exploring new ways to train large language models (LLMs) without relying solely on natural language. By using Neural Cellular Automata (NCA) to create synthetic data, LLMs can learn more efficiently. This method could lead to more robust and less biased AI.

By Mark Ellison

March 12, 2026

3 min read

Key Facts

  • Researchers propose using Neural Cellular Automata (NCA) for 'pre-pre-training' large language models (LLMs).
  • NCA data is synthetic, non-linguistic, controllable, and cheap to generate.
  • Pre-pre-training on 164 million NCA tokens improved language modeling by up to 6% and accelerated convergence by 1.6x.
  • NCA pre-training outperformed training on 1.6 billion natural language tokens from Common Crawl.
  • Benefits of NCA pre-training transfer to reasoning benchmarks like GSM8K, HumanEval, and BigBench-Lite.

Why You Care

What if the best way for AI to learn isn’t by reading everything we’ve ever written? For content creators and AI enthusiasts, this new approach could change how you interact with future AI tools. A recent study suggests that training large language models (LLMs) on synthetic data first offers significant advantages. That could mean more robust and less biased AI for your projects.

What Actually Happened

A team of researchers, including Dan Lee and Seungwook Han, has introduced a novel method for training large language models, as detailed in their paper. They propose using Neural Cellular Automata (NCA) to generate synthetic, non-linguistic data. This ‘pre-pre-training’ phase happens before the models encounter natural language. The goal is to address common issues with traditional pre-training, such as the finite supply of high-quality text and inherent human biases. NCA data, according to the announcement, offers rich spatiotemporal structure and shares statistical similarities with natural language. Crucially, it is controllable and cheap to generate at scale, the research shows.
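To make the idea concrete, here is a minimal sketch of what generating NCA-style synthetic token data could look like. It is illustrative only: the grid size, rollout length, vocabulary size, quantization scheme, and the fixed random update rule below are assumptions for the sketch, not the architecture or tokenization used in the paper.

```python
# Minimal sketch of NCA-style synthetic data generation (illustrative only).
# The paper's actual NCA architecture and tokenization are not specified here;
# GRID, STEPS, VOCAB, and the update rule are assumptions.
import numpy as np

rng = np.random.default_rng(0)

GRID = 32    # side length of the cell grid (assumed)
STEPS = 64   # number of update steps rolled out per sample (assumed)
VOCAB = 256  # size of the discrete "token" alphabet (assumed)

# Random per-cell update parameters stand in for a learned neural update rule.
W = rng.normal(scale=0.1, size=(9, 1))  # weights over the 3x3 neighborhood

def step(state: np.ndarray) -> np.ndarray:
    """One automaton update: every cell looks at its 3x3 neighborhood and
    applies the same small rule (here a linear map plus tanh nonlinearity)."""
    padded = np.pad(state, 1, mode="wrap")
    patches = np.stack(
        [padded[i:i + GRID, j:j + GRID] for i in range(3) for j in range(3)],
        axis=-1,
    )  # shape (GRID, GRID, 9)
    return np.tanh(patches @ W).squeeze(-1)

def sample_token_sequence() -> np.ndarray:
    """Roll the automaton forward and quantize each step's cell states into
    discrete tokens, yielding a sequence with spatiotemporal structure."""
    state = rng.uniform(-1.0, 1.0, size=(GRID, GRID))
    tokens = []
    for _ in range(STEPS):
        state = step(state)
        # Map continuous cell states in [-1, 1] to integer token ids.
        tokens.append(((state + 1.0) / 2.0 * (VOCAB - 1)).astype(int).ravel())
    return np.concatenate(tokens)

if __name__ == "__main__":
    seq = sample_token_sequence()
    print(seq.shape, seq[:10])
```

The point of the sketch is the shape of the data: the same local rule applied everywhere, step after step, produces sequences with the kind of spatiotemporal regularity the researchers highlight, without containing any human-written text.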

Why This Matters to You

This new training method has practical implications for anyone working with or relying on AI. Imagine an AI assistant that understands complex concepts more deeply because its foundational learning wasn’t limited by human language imperfections. The study found that pre-pre-training on just 164 million NCA tokens improved downstream language modeling by up to 6%. What’s more, it accelerated convergence by up to 1.6 times, the paper states. This means AI models could learn faster and perform better.

Consider this comparison:

Training Data Type              | Token Count | Performance Impact
Synthetic NCA                   | 164 million | Up to 6% improvement, 1.6x faster convergence
Natural language (Common Crawl) | 1.6 billion | Outperformed by the NCA-pre-trained model

This is a significant finding. “Pre-pre-training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x,” the team revealed. This gain even surpassed models trained on 1.6 billion tokens of natural language from Common Crawl, according to the research. How might this shift in training impact the reliability and fairness of the AI tools you use daily?

The Surprising Finding

The most surprising revelation from this research challenges a core assumption about AI training. We generally assume that more natural language data always leads to better language models. However, the study found that a smaller amount of synthetic NCA data yielded superior results compared to a much larger dataset of natural language. This held even when the natural language training used more computational resources, the researchers report. The outcome suggests that the quality and structure of the initial training data may matter more than sheer volume. The gains from NCA pre-training also transferred to reasoning benchmarks. These include GSM8K (a math problem dataset), HumanEval (code generation), and BigBench-Lite (a broad reasoning benchmark), the paper indicates. This shows that the benefits extend beyond basic language understanding.

What Happens Next

This research opens a path toward more efficient and potentially less biased AI models. The team revealed that optimal NCA complexity varies by domain. For example, code benefits from simpler dynamics, while math and web text favor more complex ones. This allows for systematic tuning of the synthetic distribution for target domains, as sketched below. We might see specialized AI models emerge in the next 12-18 months, tailored for specific tasks. These models could be pre-trained on custom synthetic data. Imagine an AI specifically trained for legal document analysis. It could be pre-trained on synthetic data reflecting legal structures, not just general text. This could lead to highly accurate and domain-specific AI assistants. For you, this means potentially more specialized and reliable AI tools in the near future. The long-term vision is fully synthetic pre-training, leading to more robust and ethical AI systems.
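As a rough illustration of what "tuning the synthetic distribution per domain" could look like in practice, here is a small hypothetical configuration sketch. The domain names, knobs, and values are assumptions made for this example, not settings reported by the researchers.

```python
# Illustrative only: a hypothetical way to expose "NCA complexity" as a
# per-domain knob. The paper's actual parameterization is not given here;
# the presets below are made up, but their direction mirrors the reported
# finding (simpler dynamics for code, richer dynamics for math and web text).
from dataclasses import dataclass

@dataclass
class NCAConfig:
    hidden_channels: int  # capacity of the per-cell update rule
    rollout_steps: int    # how long each automaton is run

DOMAIN_PRESETS = {
    "code": NCAConfig(hidden_channels=8,  rollout_steps=32),
    "math": NCAConfig(hidden_channels=32, rollout_steps=128),
    "web":  NCAConfig(hidden_channels=32, rollout_steps=96),
}

def config_for(domain: str) -> NCAConfig:
    """Pick a synthetic-data preset for the target downstream domain."""
    return DOMAIN_PRESETS.get(domain, NCAConfig(hidden_channels=16, rollout_steps=64))

if __name__ == "__main__":
    print(config_for("code"))
```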
