New 'Code2Doc' Dataset Boosts AI for Code Documentation

A new dataset promises to significantly improve how AI generates software documentation by focusing on quality.

Researchers have introduced Code2Doc, a high-quality dataset designed to train AI models for generating code documentation. This dataset addresses common issues like noise and AI-generated content, leading to much better performance in AI-powered documentation tools.

By Mark Ellison

December 23, 2025

3 min read

Key Facts

  • Code2Doc is a quality-first curated dataset for function-level code documentation generation.
  • It contains 13,358 high-quality function-documentation pairs from open-source repositories.
  • The dataset covers Python, Java, TypeScript, JavaScript, and C++.
  • Only 25.6% of initial candidates satisfied all quality constraints during curation.
  • Fine-tuning a large language model on Code2Doc resulted in relative improvements of 29.47% in BLEU and 24.04% in ROUGE-L.

Why You Care

Ever struggled to understand complex code without clear explanations? What if AI could write documentation for you? A new dataset, Code2Doc, could make this a reality. It promises to dramatically improve how AI models generate software documentation. This means clearer code, fewer bugs, and more efficient development on your projects.

What Actually Happened

Researchers Recep Kaan Karaman and Meftun Akarsu have unveiled Code2Doc, a new dataset for generating function-level code documentation. This dataset focuses on quality, which is crucial for training effective AI models, according to the announcement. Most existing datasets are often messy, filled with duplicate information, or even AI-generated content. These issues weaken the learning process for AI models, the paper states. Code2Doc, however, is meticulously curated. It includes 13,358 high-quality function-documentation pairs. These pairs come from widely used open-source projects. The dataset covers five popular programming languages: Python, Java, TypeScript, JavaScript, and C++.
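The paper does not publish the exact record schema, so the following is a minimal sketch, assuming each training example pairs a function's source with its human-written documentation and a language tag. The field names here are illustrative:

```python
from dataclasses import dataclass

@dataclass
class FunctionDocPair:
    """One hypothetical training example: a function paired with its docs.

    Field names are illustrative; the paper does not publish the schema.
    """
    language: str  # one of: python, java, typescript, javascript, cpp
    code: str      # the function source, ideally with type annotations
    doc: str       # the human-written documentation string

pair = FunctionDocPair(
    language="python",
    code="def area(radius: float) -> float:\n    return 3.14159 * radius ** 2",
    doc="Compute the area of a circle given its radius.",
)
print(pair.language)
```

Note that the example function carries explicit type annotations, which, as discussed below, most samples in the dataset include.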

Why This Matters to You

Think of it as giving an AI model a diet of gourmet food instead of junk food. When AI learns from high-quality data, its output is far superior. This directly impacts your work if you’re a developer, a project manager, or even someone who relies on well-documented software. Better documentation means less time spent deciphering code and more time building new features. For example, imagine a new developer joining your team. With AI-generated, high-quality documentation, they could onboard much faster. How much time could your team save with consistently clear code explanations?

“The performance of automatic code documentation generation models depends critically on the quality of the training data used for supervision,” as mentioned in the release. The researchers carefully filtered their data. They started with 52,069 candidate entries. Only 25.6 percent of these candidates met their strict quality standards. This rigorous process ensures that every piece of data helps the AI learn effectively. The dataset also boasts a high percentage of explicit type annotations, with 86.9% of samples containing them. This detail is vital for understanding code structure and functionality.
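The paper's exact quality constraints are not reproduced here, but the filtering step can be sketched with a few plausible stand-in checks (minimum documentation length, non-trivial function bodies, exact-duplicate removal). All thresholds below are assumptions for illustration:

```python
# Hypothetical quality filters in the spirit of Code2Doc's curation.
# The thresholds and checks are illustrative, not the paper's actual rules.
def passes_quality_filters(code: str, doc: str, seen: set) -> bool:
    if len(doc.split()) < 5:        # reject trivial one-word docs
        return False
    if len(code.splitlines()) < 2:  # reject stub functions
        return False
    key = (code, doc)
    if key in seen:                 # reject exact duplicates
        return False
    seen.add(key)
    return True

candidates = [
    ("def f(x: int) -> int:\n    return x + 1",
     "Add one to x and return the result."),
    ("def g(): pass", "ok"),  # fails both length checks
]
seen: set = set()
kept = [c for c in candidates if passes_quality_filters(*c, seen)]
print(len(kept))  # 1 of 2 candidates survives
```

In the actual curation, an even stricter cascade of constraints reduced 52,069 candidates to 13,358 pairs (25.6%).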

The Surprising Finding

Here’s an interesting twist: despite its modest size, Code2Doc significantly boosts AI performance. Most people might assume you need massive datasets for AI to learn effectively. However, the research shows that fine-tuning a large language model on Code2Doc led to impressive gains. It achieved relative improvements of 29.47% in BLEU scores and 24.04% in ROUGE-L scores over zero-shot performance. These metrics measure the quality and relevance of generated text. This finding challenges the assumption that ‘more data’ always means ‘better AI.’ Instead, it highlights that ‘better data’ is often the true driver of improvement. The dataset also showed minimal AI contamination, with only 2.9% flagged as potentially AI generated.
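"Relative improvement" here means the gain divided by the baseline score. The paper reports only the relative figures, so the absolute scores in this sketch are made up purely to show the arithmetic:

```python
# Relative improvement = (fine_tuned - baseline) / baseline, as a percentage.
# The absolute scores below are invented for illustration; only the
# relative gains (29.47% BLEU, 24.04% ROUGE-L) come from the paper.
def relative_improvement(baseline: float, fine_tuned: float) -> float:
    return (fine_tuned - baseline) / baseline * 100

# e.g. a hypothetical zero-shot BLEU of 20.0 rising to 25.894:
print(round(relative_improvement(20.0, 25.894), 2))  # 29.47
```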

What Happens Next

The release of Code2Doc and its full curation pipeline means developers and researchers can now build better documentation tools. We can expect to see new AI-powered features in integrated development environments (IDEs) within the next 12-18 months. These tools will automatically generate more accurate and helpful documentation. For example, a future IDE might suggest documentation for a newly written function with remarkable clarity. Your development workflow could become much smoother. The team revealed they are releasing both the dataset and the pipeline. This supports reproducible research in automatic code documentation generation, paving the way for rapid advancements in the field. This initiative provides actionable takeaways for anyone working with AI in software engineering, emphasizing the importance of data quality over sheer volume.
