AI Revives Endangered Languages with Synthetic Data

New research shows how AI can create crucial datasets for low-resource languages, starting with Ladin.

A recent paper details a novel approach to generate high-quality NLP datasets for languages with very limited digital resources. By translating and filtering existing data, researchers successfully created synthetic datasets for Ladin, an endangered Romance language. This method significantly improves machine translation and opens new avenues for language preservation.

By Sarah Kline

September 10, 2025

5 min read

Key Facts

Large Language Models (LLMs) struggle with extremely low-resource languages due to lack of labeled data.
Researchers created synthetic datasets for Ladin, an endangered Romance language, using Italian data.
The method involved translating monolingual Italian data and applying rigorous filtering and back-translation.
Incorporating these synthetic datasets significantly improved Italian-Ladin machine translation baselines.
The study produced the first publicly available sentiment analysis and MCQA datasets for Ladin.

Why You Care

Have you ever wondered if your language, or a language you love, could disappear? Many indigenous and minority languages face this threat daily. Now, imagine a world where artificial intelligence (AI) actively helps preserve these linguistic treasures. That’s precisely what new research from Ulin Nuha and Adam Jatowt reveals. Their work offers a practical path to developing language technologies for extremely low-resource languages. This directly impacts language preservation efforts globally, ensuring more voices are heard in the digital age. Your heritage languages could gain a new lease on life.

What Actually Happened

Researchers Ulin Nuha and Adam Jatowt have addressed a significant challenge in natural language processing (NLP): building language technologies for languages with very limited data. According to the paper, Large Language Models (LLMs) struggle with these “extremely low-resource languages,” primarily due to a lack of labeled data. The team specifically targeted Ladin, an endangered Romance language, focusing on its Val Badia variant. Starting from a small set of existing Ladin-Italian sentence pairs, they created synthetic datasets for sentiment analysis and multiple-choice question answering (MCQA) by translating monolingual Italian data into Ladin. To protect the linguistic quality of these new datasets, they applied rigorous filtering and back-translation procedures, as detailed in the paper. This process is designed to screen out unreliable translations before they reach the final datasets.
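The translate-then-filter idea described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the translator callables, the toy stand-ins used in the demo, the similarity measure (difflib's character ratio in place of a proper translation metric), and the 0.8 threshold are all hypothetical choices made for the example.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, List, Tuple

# A translator is any function mapping a sentence to a sentence.
# Real systems would wrap an MT model here (assumption: not specified in the article).
Translator = Callable[[str], str]


def round_trip_similarity(source: str, back_translated: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a real MT quality metric."""
    return SequenceMatcher(None, source.lower(), back_translated.lower()).ratio()


def build_synthetic_dataset(
    labeled_italian: Iterable[Tuple[str, str]],
    it_to_ladin: Translator,
    ladin_to_it: Translator,
    min_similarity: float = 0.8,  # hypothetical threshold
) -> List[Tuple[str, str]]:
    """Translate labeled Italian text to Ladin and keep only examples whose
    back-translation stays close to the Italian original."""
    kept = []
    for italian_text, label in labeled_italian:
        ladin_text = it_to_ladin(italian_text)
        if not ladin_text.strip():
            continue  # drop empty translations
        round_trip = ladin_to_it(ladin_text)
        if round_trip_similarity(italian_text, round_trip) >= min_similarity:
            kept.append((ladin_text, label))  # the label carries over unchanged
    return kept


# Toy stand-ins so the sketch runs end to end. They do NOT produce real Ladin.
def toy_it_to_ladin(text: str) -> str:
    return "[ladin] " + text


def toy_ladin_to_it(text: str) -> str:
    restored = text.replace("[ladin] ", "", 1)
    # Simulate one poor translation that the filter should reject.
    return "???" if "pessimo" in restored else restored


sample = [
    ("Il film era bellissimo.", "positive"),
    ("Servizio pessimo.", "negative"),
]
synthetic = build_synthetic_dataset(sample, toy_it_to_ladin, toy_ladin_to_it)
print(synthetic)
```

In this toy run only the first example survives, because the second one's simulated back-translation no longer resembles its Italian source. A real setup would replace the toy functions with trained Italian-Ladin models and a stronger similarity metric.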

Why This Matters to You

This research has practical implications for anyone interested in language diversity or digital inclusivity. It shows how even a small amount of existing data can be expanded significantly. You might be a language enthusiast or a developer looking to build tools for a less common language. This method provides a blueprint for generating the necessary datasets. The study finds that incorporating these synthetic datasets into machine translation training leads to substantial improvements over Italian-Ladin baselines. This is a big step forward for languages previously overlooked by AI development.

Think of it as creating a digital bridge for languages. For example, if you wanted to build a voice assistant for a regional dialect, this approach could provide the foundational data. The researchers state, “Leveraging a small set of parallel Ladin-Italian sentence pairs, we create synthetic datasets for sentiment analysis and multiple-choice question answering (MCQA) by translating monolingual Italian data.” This highlights the power of creative data generation. How might this method be applied to other endangered languages you know?

Here’s how the synthetic data benefits language technology:

Improved Machine Translation: More accurate translations between low-resource languages and major languages.
New NLP Applications: Enables creation of sentiment analysis and Q&A tools for these languages.
Reduced Data Dependency: Less reliance on vast amounts of human-labeled data, which is often unavailable.
Enhanced Language Preservation: Provides digital tools that can help keep endangered languages alive and thriving.

This work offers the first publicly available sentiment analysis and MCQA datasets for Ladin. This establishes foundational resources that can support broader NLP research. It also opens doors for downstream applications for this underrepresented language. Your ability to interact with diverse languages through technology will only grow.

The Surprising Finding

What’s truly surprising about this research is the effectiveness of synthetic data in an extremely low-resource setting. Common assumptions suggest that AI models require massive amounts of real-world, human-labeled data. However, the study demonstrates a different reality. The team reports that even with a “small set of parallel Ladin-Italian sentence pairs,” they could generate high-quality synthetic datasets. This allowed them to achieve significant improvements in machine translation. The paper explains that this was possible through meticulous filtering and back-translation procedures. This challenges the notion that only abundant, naturally occurring data can yield meaningful results. It suggests that careful data augmentation techniques can help overcome severe data scarcity. This is particularly relevant for the thousands of languages with very limited digital footprints. It means that language preservation efforts might not need to wait for decades of data collection.

What Happens Next

This research paves the way for developments in language technology over the next few years. A natural next step is applying this synthetic data generation method to other endangered languages. We could see similar datasets emerging for various indigenous languages within the next 12 to 18 months. This would enable the creation of basic NLP tools for these languages. For example, imagine a mobile app that offers real-time translation for a tribal language, or educational software that teaches children in their native tongue. The authors report that these foundational resources can support broader NLP research. This means more researchers will be able to build upon this work. Actionable advice for developers and linguists is to explore existing bilingual resources, no matter how small. These can serve as seeds for generating larger synthetic datasets. The industry implications are broad, potentially widening access to AI across linguistic diversity. This approach could lead to a renaissance for many languages currently struggling to survive in the digital age. The team states that their contributions include the first publicly available sentiment analysis and MCQA datasets for Ladin. This sets a precedent for future efforts.
