AI Can Now Grade Critical Thinking: What It Means for Education

New research explores using Large Language Models to assess complex student essays.

A recent study investigates how Large Language Models (LLMs) can automatically assess critical thinking subskills in student essays. Researchers found that advanced LLMs like GPT-5, particularly with few-shot prompting, show promise in evaluating complex reasoning, highlighting trade-offs between cost and accuracy.

By Katie Rowan

October 16, 2025

4 min read

Key Facts

  • Researchers investigated the feasibility of using Large Language Models (LLMs) to automatically assess critical thinking subskills.
  • The study focused on evaluating argumentative essays written by students.
  • GPT-5 with few-shot prompting achieved the strongest results among the tested models and methods.
  • Proprietary LLMs offer superior reliability but come at a higher cost.
  • LLMs performed better on critical thinking subskills with clear, frequent categories than on those requiring subtle distinctions.

Why You Care

Imagine a future where your essays are graded not just for grammar, but for how well you truly think critically. What if AI could accurately assess your deeper reasoning skills? New research suggests this future is closer than you might think. This development could fundamentally change how educators provide feedback and how students develop critical thinking abilities.

What Actually Happened

Researchers have been exploring how Large Language Models (LLMs)—the AI behind tools like ChatGPT—can automatically assess critical thinking subskills. According to the announcement, this work focuses on evaluating complex reasoning in student-written argumentative essays. The team developed a coding rubric based on established skill progressions, then used human coders to evaluate a large collection of student essays, providing a baseline for comparison. The study evaluated three distinct automated scoring approaches: zero-shot prompting, few-shot prompting, and supervised fine-tuning. These methods were tested across three different LLMs: GPT-5, GPT-5-mini, and ModernBERT. The goal was to see if AI could reliably identify and measure the individual components of critical thinking.
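To make the prompting approaches concrete, here is a minimal sketch of what few-shot scoring of a single subskill might look like with the OpenAI Python client. The paper does not publish its prompts, rubric labels, or settings, so the subskill, score levels, and example excerpts below are hypothetical; dropping the example loop turns the same call into zero-shot prompting.

```python
# Illustrative sketch only: the paper does not publish its prompts or rubric.
# The subskill, score levels, and example excerpts here are hypothetical.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "Subskill: using evidence to support a claim.\n"
    "Score 0: no evidence offered.\n"
    "Score 1: evidence offered but not connected to the claim.\n"
    "Score 2: evidence offered and explicitly connected to the claim."
)

# Few-shot prompting: show the model a handful of already-scored excerpts
# before asking it to score the target essay.
FEW_SHOT_EXAMPLES = [
    ("School uniforms are good because I like them.", "Score: 0"),
    ("A 2019 district survey found fewer dress-code incidents after uniforms "
     "were introduced, which supports the claim that they reduce conflict.", "Score: 2"),
]

def score_excerpt(excerpt: str) -> str:
    """Return the model's rubric score for one essay excerpt."""
    messages = [{"role": "system",
                 "content": "Score the excerpt on one critical thinking subskill.\n" + RUBRIC}]
    for example, label in FEW_SHOT_EXAMPLES:  # drop this loop for zero-shot prompting
        messages.append({"role": "user", "content": example})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": excerpt})
    response = client.chat.completions.create(model="gpt-5", messages=messages)
    return response.choices[0].message.content

print(score_excerpt("Homework should be banned because studies show it raises stress."))
```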

Why This Matters to You

This research holds significant implications for education and beyond. It could offer educators a tool for timely and consistent feedback. Think of it as a personalized tutor that can analyze your thought process. This could help you refine your arguments more effectively. The study highlights that GPT-5 with few-shot prompting achieved the strongest results in this assessment. This specific approach involves giving the AI a few examples to learn from, which helps it understand the task better. The team revealed that proprietary models like GPT-5 offer superior reliability. However, they come at a higher cost. Open-source alternatives, while more affordable, might be less sensitive to subtle distinctions. How might automated critical thinking assessment change your learning experience?

Consider these potential benefits:

  • Faster Feedback Cycles: Students could receive detailed critical thinking feedback much quicker than traditional grading allows.
  • Consistent Evaluation: AI can apply rubrics uniformly, reducing human bias in grading complex skills.
  • Targeted Skill Development: Educators can identify specific critical thinking subskills where students need the most support.
  • Scalability: Automated assessment allows for evaluating critical thinking across larger student populations without increasing educator workload.

As the paper states, “proprietary models offer superior reliability at higher cost, while open-source alternatives provide practical accuracy with reduced sensitivity to minority categories.” This means educators will need to weigh their options carefully. Your institution might choose a more expensive, highly accurate model. Or they might opt for a more accessible, open-source approach.

The Surprising Finding

Here’s an interesting twist: the research shows that LLMs performed exceptionally well on critical thinking subskills with “separable, frequent categories.” This means if a skill can be clearly defined and appears often, AI is great at spotting it. However, the team found lower performance for subskills requiring detection of subtle distinctions or rare categories. This challenges the common assumption that AI can universally grasp all nuances. For example, an AI might easily identify whether you used evidence to support a claim (a frequent category). But it might struggle to detect a subtle, abstract logical fallacy that appears infrequently. This suggests that while AI is capable, human expertise remains crucial for the most complex, nuanced evaluations of critical thinking.
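A small, hypothetical illustration of this effect: when one rubric label is rare, per-category metrics for that label can collapse even while overall agreement looks healthy. The scores below are invented, and scikit-learn’s per-label F1 is just one way to surface the gap the researchers describe.

```python
# Hypothetical numbers to illustrate the effect: when a rubric label is rare,
# its per-label F1 can collapse even though overall agreement looks fine.
from sklearn.metrics import f1_score

# Human vs. model scores for one subskill; label 2 ("subtle fallacy detected") is rare.
human_scores = [0, 1, 1, 0, 1, 0, 1, 1, 0, 2]
model_scores = [0, 1, 1, 0, 1, 0, 1, 0, 0, 0]

per_label_f1 = f1_score(human_scores, model_scores, average=None, labels=[0, 1, 2])
print(per_label_f1)  # roughly [0.80, 0.89, 0.00] -- the rare label is missed entirely
```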

What Happens Next

This work represents an initial step toward automated assessment of higher-order reasoning skills. The researchers report that further development is needed. We might see pilot programs implementing these tools in educational institutions within the next 12-18 months. Imagine a future where your university uses an LLM to provide preliminary feedback on your research papers. This would allow professors to focus on deeper, personalized guidance. For you, this means potentially more frequent and detailed feedback on your critical thinking. The implications are vast, extending beyond essays to areas like professional development and training. The paper indicates that continuous refinement of these models will be key. This includes improving their ability to handle those subtle, less frequent critical thinking elements. The team revealed this will involve more training data and potentially hybrid human-AI approaches.
