New Research Warns Against ChatGPT for Academic Citations

A recent study reveals major flaws in large language models' ability to accurately cite academic papers, particularly impacting factual integrity.

New research introduces 'ArXivBench,' a benchmark evaluating LLMs' academic citation accuracy. It found that models, including ChatGPT, frequently generate incorrect or non-existent arXiv links, undermining their utility for academic writing and proper attribution. Claude-3.5-Sonnet showed a notable advantage.

August 9, 2025

5 min read

Key Facts

  • LLMs frequently generate incorrect or non-existent arXiv links.
  • The 'ArXivBench' benchmark was developed to assess LLM academic citation accuracy.
  • Claude-3.5-Sonnet showed a 'substantial advantage' over other LLMs in accuracy.
  • LLMs perform 'significantly better' in AI subfields than other academic subjects.
  • The study highlights critical academic risks, as citation errors undermine LLMs' ability to properly attribute research.

Why You Care

If you're a content creator, podcaster, or AI enthusiast who relies on AI tools for research, a new study delivers a crucial warning: don't trust large language models like ChatGPT for academic citations. This isn't just about minor errors; it's about fundamental factual integrity and the very foundations of credible research.

What Actually Happened

A recent paper, "ArXivBench: When You Should Avoid Using ChatGPT for Academic Writing," published on arXiv by Ning Li, Jingran Zhang, and Justin Cui, dives deep into a significant limitation of large language models (LLMs). The researchers set out to evaluate how well both proprietary and open-source LLMs perform when asked to generate relevant research papers with accurate arXiv links. Their findings, as stated in the abstract, reveal "critical academic risks: LLMs frequently generate incorrect arXiv links or references to non-existent papers, fundamentally undermining their ability to properly attribute research contributions to the actual authors."

To conduct this evaluation, the team introduced 'ArXivBench,' a specialized benchmark designed to assess LLM performance across eight major subject categories on arXiv, along with five subfields within computer science, a particularly popular category. The study's core objective was to provide a standardized tool for evaluating LLM reliability in scientific contexts, aiming to promote more dependable academic use in research environments. This isn't just a theoretical exercise; it's a direct assessment of a common use case for AI in content creation and research.

Why This Matters to You

For anyone in the content creation space, from podcasters to AI researchers, the implications are immediate and significant. Imagine using an LLM to quickly gather sources for a script or a research paper, only to find that half the citations lead to dead ends or, worse, to papers that don't exist. This directly impacts your credibility. As the study authors point out, the LLMs' tendency to "generate factually incorrect content remains a critical challenge." This isn't merely about convenience; it's about the integrity of your work. If you're building a podcast episode on a new AI advancement and cite a non-existent paper, your audience's trust erodes.

Furthermore, for AI enthusiasts and developers, this research underscores the ongoing need for human oversight and verification, especially when dealing with factual information and attribution. While LLMs excel at generating coherent text, their 'hallucination' problem, particularly concerning specific data like URLs and citations, remains a significant hurdle. The study's findings directly challenge the notion that LLMs can fully automate the research phase of content creation, especially for topics requiring rigorous factual backing. This means that while AI can accelerate the drafting process, the essential task of fact-checking and source verification remains firmly in human hands.

The Surprising Finding

One of the more surprising findings from the ArXivBench study concerns the performance variations across different LLMs and subject categories. While the general trend indicated unreliability, the research found that Claude-3.5-Sonnet exhibited "a substantial advantage" in generating both relevant and accurate responses. This suggests that not all LLMs are equally prone to these citation errors, and some models are already demonstrating better capabilities in this specific, crucial area.

Even more notably, the study reported that "most LLMs perform significantly better in Artificial Intelligence than other subfields." This counterintuitive result suggests a potential bias or stronger training data in the AI domain itself, leading to more accurate responses when querying about AI-related papers and citations. For content creators focused specifically on AI topics, this might offer a slight glimmer of hope, though the overall message remains one of caution. It implies that while LLMs might be 'smarter' about their own field, their general academic knowledge for citation purposes is still underdeveloped.

What Happens Next

This research provides a crucial benchmark for evaluating LLM reliability in scientific contexts, which will likely spur further work on these critical accuracy issues. Expect to see AI developers and researchers focusing more intently on improving LLMs' ability to handle factual recall and accurate citation, potentially through more specialized training datasets or retrieval-augmented generation (RAG) techniques specifically tuned for academic databases, along the lines of the sketch below. The ArXivBench itself, as a "standardized tool," will likely become a reference point for future model evaluations, pushing LLM providers to demonstrate measurable improvement in this area.
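To make the retrieval-augmented idea concrete, here is a minimal sketch of what grounding citations in a real academic database might look like, assuming arXiv's public Atom API (http://export.arxiv.org/api/query) as the retrieval source. The function name and the prompt-assembly step are illustrative assumptions on our part, not something described in the paper.

```python
# Sketch: ground citation requests in real arXiv records before asking an LLM.
# Assumes arXiv's public Atom API; the prompt-building step is illustrative only.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def fetch_arxiv_papers(query: str, max_results: int = 5) -> list[dict]:
    """Return title/link/author records for real arXiv papers matching a query."""
    url = "http://export.arxiv.org/api/query?" + urllib.parse.urlencode(
        {"search_query": f"all:{query}", "start": 0, "max_results": max_results}
    )
    with urllib.request.urlopen(url, timeout=10) as resp:
        feed = ET.fromstring(resp.read())
    papers = []
    for entry in feed.findall(f"{ATOM}entry"):
        papers.append({
            "title": " ".join(entry.findtext(f"{ATOM}title", "").split()),
            "link": entry.findtext(f"{ATOM}id", ""),
            "authors": [a.findtext(f"{ATOM}name", "")
                        for a in entry.findall(f"{ATOM}author")],
        })
    return papers

if __name__ == "__main__":
    # Retrieved records (not the model's memory) become the only citable sources.
    sources = fetch_arxiv_papers("retrieval augmented generation")
    context = "\n".join(f"- {p['title']} ({p['link']})" for p in sources)
    print("Cite only from these verified papers:\n" + context)
```

The design point is simple: the model is only allowed to cite records that were actually retrieved, rather than recalling links from its training data.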

For content creators and AI users, the immediate takeaway is clear: while LLMs are capable tools for brainstorming and drafting, they are not yet substitutes for diligent human research and fact-checking, especially when it comes to academic citations. In the short term, always double-check any links or references generated by an LLM. In the longer term, as models evolve, we might see more reliable AI assistants for academic work, but for now, the human element of verification remains indispensable for maintaining credibility and factual accuracy in your content. The journey towards truly reliable AI for academic tasks is ongoing, and this study marks a significant step in identifying where the work is most urgently needed.
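As a practical footnote to the "double-check any links" advice, the short sketch below shows one way to automate a first pass. It assumes arXiv's Atom API and uses a crude title-overlap heuristic of our own; anything it flags as unverified still deserves a human look.

```python
# Sketch: verify an LLM-generated arXiv citation against the arXiv API.
# The title-overlap heuristic is an illustrative assumption, not from the study.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def verify_citation(arxiv_id: str, claimed_title: str) -> bool:
    """Return True only if the ID resolves and its real title resembles the claim."""
    url = "http://export.arxiv.org/api/query?" + urllib.parse.urlencode(
        {"id_list": arxiv_id, "max_results": 1}
    )
    with urllib.request.urlopen(url, timeout=10) as resp:
        feed = ET.fromstring(resp.read())
    entry = feed.find(f"{ATOM}entry")
    if entry is None:
        return False  # nothing came back for this ID: treat as unverified
    real_title = " ".join(entry.findtext(f"{ATOM}title", "").split()).lower()
    # Crude overlap check: most substantive words in the claimed title
    # should appear in the real title returned by arXiv.
    words = [w for w in claimed_title.lower().split() if len(w) > 3]
    hits = sum(w in real_title for w in words)
    return bool(words) and hits / len(words) >= 0.8

if __name__ == "__main__":
    ok = verify_citation("1706.03762", "Attention Is All You Need")
    print("verified" if ok else "unverified -- check by hand")
```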