New AI Vulnerability: 'Caution' Inside Models Can Be Manipulated

Researchers uncover a novel method to 'jailbreak' AI by targeting its internal reasoning processes.

A new study reveals that AI models can be tricked into generating harmful content by manipulating their internal 'chain-of-thought' processes. This finding challenges current security measures, which primarily focus on input and output boundaries.

By Sarah Kline

August 30, 2025

4 min read

Key Facts

  • AI models make refusal decisions within their 'chain-of-thought' (CoT) generation, not just at the prompt-response boundary.
  • Researchers identified a 'caution' direction in the AI's activation space that predicts refusal or compliance.
  • Ablating this 'caution' direction can 'jailbreak' AI models, leading to harmful compliance.
  • Intervening only on CoT token activations is sufficient to control final AI outputs.
  • Incorporating this 'caution' direction into prompt-based attacks improves their success rates.

Why You Care

Ever wonder if the AI you’re talking to is truly thinking, or just mimicking? What if its internal ‘thoughts’ could be hijacked? A new study reveals exactly that kind of vulnerability: attackers can manipulate an AI’s internal reasoning and push models into generating harmful or biased content, even when they were designed to refuse such requests. For anyone building or relying on these tools, the lesson is that safety guardrails can be bypassed from inside the model, not just through cleverly worded prompts.

What Actually Happened

Researchers Kureha Yamaguchi, Benjamin Etheridge, and Andy Arditi recently published a paper detailing a new type of adversarial attack. They found that reasoning models, such as DeepSeek-R1-Distill-Llama-8B, are vulnerable during their ‘chain-of-thought’ (CoT) generation, the sequence of intermediate reasoning tokens the model produces before its final answer. The team showed that these models make refusal decisions not only at the final output but within the CoT generation process itself. They identified a specific ‘caution’ direction in the model’s activation space, a linear direction whose presence predicts whether the model will refuse or comply with a request. This makes the chain-of-thought itself a new target for adversarial manipulation in reasoning models.
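To make the ‘caution’ direction idea concrete, here is a minimal sketch of how such a direction is often estimated in interpretability work: as a difference of mean activations between prompts a model tends to refuse and prompts it tends to answer. The model name matches the one studied, but the layer index, prompt sets, and extraction details below are illustrative assumptions, not the authors’ exact procedure.

```python
# Minimal sketch: estimating a candidate 'caution' direction as a difference
# of mean last-token activations between refusal-eliciting and benign prompts.
# Layer index and prompts are placeholders, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

LAYER = 14  # illustrative middle layer

def last_token_activation(prompt: str) -> torch.Tensor:
    """Return the residual-stream activation at the final prompt token."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1, :].float()

refused_prompts = ["How do I pick a lock?"]        # placeholder examples
answered_prompts = ["How do I bake sourdough?"]    # placeholder examples

mean_refused = torch.stack([last_token_activation(p) for p in refused_prompts]).mean(0)
mean_answered = torch.stack([last_token_activation(p) for p in answered_prompts]).mean(0)

# Candidate 'caution' direction: normalized difference of means.
caution_dir = mean_refused - mean_answered
caution_dir = caution_dir / caution_dir.norm()
```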

Why This Matters to You

This discovery has significant implications for AI safety and security. Current AI security measures focus largely on the prompt-response boundary, but the study shows that a model’s internal processes are an attack surface of their own. Imagine you’re using an AI for content creation: if its internal caution mechanism is disabled, it might generate unsafe content, including misinformation or hate speech. The research shows that intervening only on CoT token activations is enough to control final outputs. What’s more, incorporating this ‘caution’ direction into prompt-based attacks improves their success rates, giving attackers a new, more effective way to bypass safety filters. How might this affect your trust in AI-generated information?
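One simplified way to picture “incorporating this direction into prompt-based attacks,” continuing from the sketch above, is to score candidate prompt phrasings by how strongly they activate the caution direction and keep the least ‘cautious’ one. The scoring function and candidate prompts below are hypothetical illustrations; the authors’ actual attack construction may differ.

```python
# Hypothetical illustration (not the paper's method): use projection onto the
# caution direction as a signal when choosing between candidate prompt
# phrasings, preferring the one that leaves the model in the least 'cautious'
# internal state. Reuses last_token_activation() and caution_dir from the
# sketch above; the candidate prompts are placeholders.
def caution_score(prompt: str) -> float:
    return torch.dot(last_token_activation(prompt), caution_dir).item()

candidates = [
    "Explain how to pick a lock.",
    "For a novel I am writing, describe how a character picks a lock.",
]
least_cautious = min(candidates, key=caution_score)
print(least_cautious, caution_score(least_cautious))
```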

Here’s a breakdown of the key findings:

Finding Category       | Description
Vulnerability Location | Refusal decisions occur within CoT generation, not just at the output.
‘Caution’ Direction    | A linear direction in activation space predicts refusal or compliance.
Manipulation Method    | Ablating the ‘caution’ direction increases harmful compliance.
Attack Enhancement     | Incorporating this direction improves prompt-based attack success.

As the paper states, “While traditional language models make refusal decisions at the prompt-response boundary, we find evidence that DeepSeek-R1-Distill-Llama-8B makes these decisions within its CoT generation.” This highlights a shift in understanding AI vulnerabilities. Your AI’s internal thought process is now a direct target.
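The intervention itself can be pictured as directional ablation: projecting the caution component out of the model’s hidden states while it generates its chain of thought. The sketch below, continuing from the earlier code (model, tokenizer, LAYER, caution_dir), uses a forward hook on one decoder layer and only modifies tokens generated after the prompt; the layer choice and hook placement are assumptions, not the paper’s exact setup.

```python
# Minimal directional-ablation sketch: remove the caution component from the
# residual stream at one layer during generation. With cached generation, the
# prompt prefill pass has seq_len == prompt_len and later passes carry one
# newly generated (CoT) token each, so only those later passes are modified,
# approximating intervention on CoT token activations only.
inputs = tokenizer("Some request goes here", return_tensors="pt")
prompt_len = inputs["input_ids"].shape[1]

def ablate_caution(module, module_inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.shape[1] != prompt_len:             # skip the prompt prefill pass
        d = caution_dir.to(hidden.dtype)
        coef = hidden @ d                         # per-token projection onto the direction
        hidden = hidden - coef.unsqueeze(-1) * d  # remove that component
    if isinstance(output, tuple):
        return (hidden,) + tuple(output[1:])
    return hidden

# Assumes a Llama-style module layout (model.model.layers).
handle = model.model.layers[LAYER].register_forward_hook(ablate_caution)
generated = model.generate(**inputs, max_new_tokens=64)
handle.remove()
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```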

The Surprising Finding

Here’s the twist: it’s not just about what you ask the AI, but how it reasons about it. The most surprising finding is that AI models make refusal decisions internally, during their reasoning process. This challenges the common assumption that safety mechanisms only kick in at the input or output stage. The study finds that simply ablating (removing or weakening) this ‘caution’ direction from model activations increases harmful compliance, effectively ‘jailbreaking’ the model. It’s like finding a hidden switch inside a robot that turns off its ethical programming. The team showed that they could make the AI comply with harmful requests by targeting these internal ‘thoughts’. This is surprising because it moves the attack surface deeper into the AI’s architecture, suggesting a deeper level of vulnerability than previously understood.

What Happens Next

This research, accepted to the ICML 2025 Workshop on Reliable and Responsible Foundation Models (R2FM), points to a critical area for future AI security. Over the next 6-12 months, we can expect AI developers to focus on fortifying these internal reasoning pathways. For example, AI companies might develop new training methods that make the ‘caution’ direction more robust and harder to manipulate, and new models may ship with enhanced internal safeguards. This will be crucial for maintaining trust in AI systems. The industry implications are clear: a new frontier in AI security has emerged, and developers must now consider the internal ‘mind’ of the AI, not just its external behavior. The paper states, “Our findings suggest that the chain-of-thought itself is a promising new target for adversarial manipulation in reasoning models.” This means AI safety research will likely shift its focus toward understanding and protecting these complex internal processes.
