LLMs Struggle with Thermodynamics: Not Yet Tutor-Ready

New research benchmarks large language models against undergraduate physics questions, revealing significant limitations.

A recent study tested leading 2025-era large language models (LLMs) on undergraduate thermodynamics questions. The findings indicate that current LLMs are not yet suitable for unsupervised tutoring, particularly in complex reasoning and visual interpretation tasks.

By Sarah Kline

September 3, 2025

4 min read

Key Facts

  • Leading 2025-era LLMs were tested on undergraduate thermodynamics questions.
  • No LLM exceeded the 95% competence threshold, with the best achieving 82% accuracy.
  • LLMs performed better on text-only questions than on image reasoning tasks.
  • The study used a 50-item benchmark called UTQA.
  • Current LLMs are not suitable for unsupervised tutoring in thermodynamics.

Why You Care

Imagine you’re struggling with a tough physics problem. Could an AI tutor truly help you understand it? A new study suggests that while large language models (LLMs) show promise, they aren’t ready for unsupervised teaching roles. This research directly impacts anyone considering AI as a primary learning tool. It highlights crucial gaps in AI’s ability to reason consistently. Are we overestimating AI’s current educational capabilities?

What Actually Happened

A new paper, “From Canonical to Complex: Benchmarking LLM Capabilities in Undergraduate Thermodynamics,” has been published. The research, led by Anna Geißler and her team, evaluated leading 2025-era LLMs to assess their readiness as tutoring aids in science education. The team created UTQA, a 50-item undergraduate thermodynamics question-answering benchmark covering essential topics such as ideal-gas processes and diagram interpretation. The authors chose thermodynamics as a testbed precisely because it demands consistent, principle-grounded reasoning.
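The paper’s grading pipeline isn’t reproduced here, but the setup it describes, scoring a fixed set of items against a competence threshold, is easy to sketch. The following Python snippet is illustrative only: the item schema, the `modality` field, and the `ask_model` stub are assumptions, not the authors’ code.

```python
# Illustrative harness for scoring a 50-item benchmark like UTQA.
# The item format and ask_model() are hypothetical placeholders;
# the study's actual prompts and grading rules are not shown here.

COMPETENCE_THRESHOLD = 0.95  # the 95% bar used in the study


def ask_model(question: str, choices: list[str]) -> str:
    """Placeholder: send one item to an LLM, return its chosen answer key."""
    raise NotImplementedError("wire up your model API here")


def score_benchmark(items: list[dict]) -> dict:
    """Compute overall and per-modality accuracy.

    Each item is assumed to look like:
      {"question": ..., "choices": [...], "answer": "B",
       "modality": "text" or "image"}
    """
    totals: dict[str, int] = {}
    correct: dict[str, int] = {}
    for item in items:
        mod = item["modality"]
        totals[mod] = totals.get(mod, 0) + 1
        if ask_model(item["question"], item["choices"]) == item["answer"]:
            correct[mod] = correct.get(mod, 0) + 1
    overall = sum(correct.values()) / sum(totals.values())
    return {
        "overall": overall,
        "per_modality": {m: correct.get(m, 0) / totals[m] for m in totals},
        "meets_threshold": overall >= COMPETENCE_THRESHOLD,
    }
```

Under a scheme like this, a model at 82% overall reports `meets_threshold: False`, which is the paper’s headline result in miniature.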

No leading 2025-era model met the researchers’ 95% competence threshold. The best-performing models achieved only 82% accuracy on the 50-item benchmark, about 41 items, well short of the 48 needed to clear 95%. Models performed better on text-only items than on those requiring image reasoning; image-based tasks often fell to chance levels, the research shows. Prompt phrasing and syntactic complexity showed little correlation with performance, the paper states, which points to a deeper issue than how the questions are worded.

Why This Matters to You

This study has significant implications for how you might use AI in education. If you’re a student, relying solely on an LLM for complex subjects could be problematic. For educators, it means AI tools still require human oversight. The research points to specific areas where LLMs fall short. Understanding these limitations is key to effective AI integration.

Consider this: An LLM might perfectly recall definitions. However, it struggles with applying those definitions to new, complex scenarios. Think of it as knowing all the rules of chess but failing to strategize effectively. What specific challenges do you face where an AI might not be enough?

As the authors state, “reliable teaching requires more than fluent recall: it demands consistent, principle-grounded reasoning.” This is the difference between memorization and genuine understanding. According to the paper, the gap is most evident in finite-rate/irreversible scenarios and in binding visual features to their thermodynamic meaning.

Here’s a breakdown of LLM performance:

Task Type                          | LLM Performance
Text-only items                    | Better than image reasoning
Image reasoning                    | Often fell to chance levels
Finite-rate/irreversible scenarios | Significant gap in understanding
Visual-thermodynamic binding       | Significant gap in understanding
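To see why finite-rate/irreversible scenarios demand more than recall, consider a standard textbook contrast (a generic worked example, not an item from the UTQA benchmark): isothermal expansion of an ideal gas carried out reversibly versus suddenly, against a fixed external pressure.

```latex
% Reversible isothermal expansion of n moles of ideal gas, V_1 -> V_2:
W_{\mathrm{rev}} = \int_{V_1}^{V_2} p\,\mathrm{d}V = nRT \ln\frac{V_2}{V_1}

% Sudden expansion against a constant external pressure p_ext:
W_{\mathrm{irr}} = p_{\mathrm{ext}}\,(V_2 - V_1) \;<\; W_{\mathrm{rev}}

% The gas's entropy change is identical (entropy is a state function),
% but the irreversible path generates entropy overall:
\Delta S_{\mathrm{gas}} = nR \ln\frac{V_2}{V_1},
\qquad
\sigma = \Delta S_{\mathrm{gas}} - \frac{Q_{\mathrm{irr}}}{T} \;>\; 0
```

Answering such items correctly means tracking which quantities are state functions and which are path-dependent, exactly the kind of principle-grounded bookkeeping the quote above calls for.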

The Surprising Finding

Perhaps the most surprising finding is the consistent struggle with image-reasoning tasks, which often performed at chance levels, according to the research. You might expect modern LLMs to handle visual information better; after all, many can generate images or describe them. The study, however, reveals a critical weakness: LLMs struggle to connect visual data with complex thermodynamic principles. This is not just about seeing an image. It is about interpreting a diagram to understand the underlying physical process. The team found that this gap concentrates in finite-rate/irreversible scenarios. The result challenges the assumption that LLMs can simply ‘learn’ such skills from vast datasets, and it suggests a fundamental limitation in their current reasoning architecture. It is one thing to describe a picture; it is another to derive meaning from a complex scientific diagram.
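“Chance level” is a checkable claim: on multiple-choice items, you can test whether a model’s score on the image subset is statistically distinguishable from random guessing. Here is a minimal sketch using SciPy, with made-up counts, since the paper’s exact image-item split and answer format are not reproduced here.

```python
from scipy.stats import binomtest

# Hypothetical illustration: 20 image items with 5 answer choices each,
# so random guessing scores about 20% on average.
n_image_items = 20
n_correct = 5
chance_rate = 1 / 5

# One-sided test: is the model's accuracy above chance?
result = binomtest(n_correct, n_image_items, p=chance_rate,
                   alternative="greater")
print(f"accuracy = {n_correct / n_image_items:.0%}, "
      f"p-value vs. chance = {result.pvalue:.3f}")
# A large p-value (about 0.37 here) means 5/20 is statistically
# indistinguishable from guessing, which is what "fell to chance
# levels" describes.
```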

What Happens Next

Moving forward, developers will need to address these specific weaknesses. We can expect more specialized LLMs designed for scientific reasoning, possibly incorporating dedicated modules for visual interpretation. Future iterations could focus on better modeling of complex physical processes, perhaps within the next 12-18 months. For example, new models might be trained on large datasets of scientific diagrams paired with detailed explanations of their physical meaning. For you, this means future AI tutors could become more reliable. For now, however, human instructors remain crucial. The industry implication is clear: AI in education is a useful assistant, but not yet autonomous. The paper indicates that current LLMs are not suitable for unsupervised tutoring in this domain, which points to a need for continued research and development.
