Why You Care
Ever wonder how the AI powering ChatGPT really works? What if a core part of its intelligence wasn’t as ‘smart’ as we thought? A new study suggests that a fundamental component of large language models (LLMs) might be surprisingly simple. This research could change how we build future AI, and it might even make your favorite AI tools faster and more accessible. How much of AI’s performance is truly learned, and how much is baked in?
What Actually Happened
Researchers have been digging into the inner workings of the Transformer architecture, which is central to modern LLMs and known for handling complex tasks such as mathematical reasoning and memory recall. A key component is the self-attention mechanism, yet the new study, as detailed in the blog post, questions its exact contribution. The team compared standard Transformers to modified variants in which either the Multi-Layer Perceptron (MLP) layers or the attention weights were frozen. Freezing means those weights were fixed at their initial random values and never updated during training. The experiment aimed to isolate what each part does, and the surprising outcome suggests a re-evaluation of attention’s role.
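To make the setup concrete, here is a minimal PyTorch sketch of what freezing one component could look like. It assumes a model whose parameter names contain “.attn.” or “.mlp.”; those names and the helper are invented for illustration, not taken from the paper’s code.

```python
# Hypothetical sketch of freezing one Transformer component, assuming a
# model whose parameter names contain ".attn." or ".mlp." (illustrative
# names, not the authors' code).
import torch.nn as nn

def freeze_component(model: nn.Module, tag: str) -> None:
    """Keep the matching weights at their random initialization."""
    for name, param in model.named_parameters():
        if tag in name:                   # e.g. ".attn." or ".mlp."
            param.requires_grad_(False)   # no gradient updates for these weights

# Only the parameters left trainable are handed to the optimizer, e.g.:
# freeze_component(model, ".attn.")   # frozen-attention variant
# freeze_component(model, ".mlp.")    # frozen-MLP variant
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=3e-4)
```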
Why This Matters to You
This research has practical implications for anyone using or developing AI. Imagine if you could train AI models with less computational power; this study suggests that might be possible. The findings indicate that even with random attention scores, Transformers can still function effectively, which could lead to more efficient and more accessible AI. For example, think of a small startup building a custom chatbot: if it needs less computing power, it can innovate faster, lowering the cost of entry for new AI applications. The study introduces ‘MixiT’, an architecture with entirely random attention scores. As the technical report explains, this model showed provably stable signal propagation and overcame prior depth-wise scaling challenges in random transformers, which means it can handle deeper, more complex networks. How might this change the AI tools you use daily?
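To give a feel for what “entirely random attention scores” could mean in practice, here is a minimal PyTorch sketch of a single attention layer whose score matrix is sampled once and never trained. This is an illustration of the general idea only, not the MixiT architecture itself; the class and parameter names are invented for the example.

```python
# Minimal illustration of attention whose scores are random and fixed,
# loosely in the spirit of the MixiT idea described above. This is NOT
# the authors' architecture; names and details are invented for clarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomScoreAttention(nn.Module):
    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        # Score matrix sampled once at init; registered as a buffer so it
        # is saved with the model but never receives gradient updates.
        self.register_buffer("scores", torch.randn(max_len, max_len))
        self.v_proj = nn.Linear(d_model, d_model)    # still learnable
        self.out_proj = nn.Linear(d_model, d_model)  # still learnable

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        s = self.scores[:seq_len, :seq_len]
        # Causal mask: each position mixes only with itself and earlier tokens.
        mask = torch.triu(torch.ones_like(s, dtype=torch.bool), diagonal=1)
        attn = F.softmax(s.masked_fill(mask, float("-inf")), dim=-1)
        # Random mixing weights applied to learned value projections.
        return self.out_proj(attn @ self.v_proj(x))

# Usage sketch:
# layer = RandomScoreAttention(d_model=64, max_len=128)
# y = layer(torch.randn(2, 32, 64))   # (batch=2, seq_len=32, d_model=64)
```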
Key Findings on Transformer Components:
| Component | Primary Role (Traditional View) | New Finding (Study’s Suggestion) |
| --- | --- | --- |
| Attention | Core learning mechanism | “largely responsible for in-context reasoning” even when random |
| MLP Layers | Knowledge storage | “responsible for, but collaborates with attention, on knowledge storage” |
This research suggests that the Transformer architecture has a built-in bias: it tends to form specialized circuits even without learnable attention weights, according to the announcement. This inherent design might contribute more than previously understood. It means your AI tools might be smarter by design, not just by training.
The Surprising Finding
Here’s the twist: the study found that attention with frozen key and query weights can still form ‘induction heads,’ circuits that are crucial for sequence modeling, and can perform competitively on language modeling, the research shows. This is genuinely surprising. It challenges the common assumption that learnable attention is absolutely essential. We typically think of attention as the ‘brain’ of the Transformer, actively learning relationships, yet the paper states that even random attention can achieve strong results. It suggests that the Transformer architecture itself has a built-in inductive bias that allows it to create specialized circuits even when attention weights are not learned. In other words, the model’s inherent structure contributes significantly to its capabilities, not just its training.
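An induction head is a circuit that notices a repeated token and attends back to whatever followed its earlier occurrence, which lets the model continue repeated patterns. The toy probe below, a sketch rather than the paper’s evaluation protocol, scores how strongly a single attention head shows that pattern.

```python
# Toy probe for the induction-head pattern: for a repeated token at
# position i, an induction head attends to the position just after the
# previous occurrence of the same token. Illustrative only; this is not
# the paper's evaluation protocol.
import torch

def induction_score(attn: torch.Tensor, tokens: torch.Tensor) -> float:
    """attn: (seq_len, seq_len) attention weights of one head (rows = queries);
    tokens: (seq_len,) token ids. Returns the mean attention mass a repeated
    token places on the position following its most recent earlier copy."""
    scores = []
    for i in range(1, tokens.size(0)):
        prev = (tokens[:i] == tokens[i]).nonzero(as_tuple=True)[0]
        if len(prev) > 0:
            j = int(prev[-1]) + 1          # position after the last earlier copy
            scores.append(float(attn[i, j]))
    return sum(scores) / len(scores) if scores else 0.0

# Example: a repeated random sequence; an induction-like head would score
# high here, while uniform causal attention scores low.
half = torch.randint(0, 50, (16,))
tokens = torch.cat([half, half])
uniform_attn = torch.tril(torch.ones(32, 32))
uniform_attn = uniform_attn / uniform_attn.sum(dim=-1, keepdim=True)
print(induction_score(uniform_attn, tokens))  # low for uniform attention
```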
What Happens Next
These findings could pave the way for a new generation of more efficient LLMs. We might see models that require less training data or computational power, possibly within the next 12-18 months. Developers might start experimenting with fixed or semi-random attention mechanisms. For example, imagine a future where you can run an LLM on your smartphone; that is currently challenging due to model size, and this research could make such local processing more feasible. The industry implications are significant. It could democratize AI creation further, letting smaller companies compete with larger players. The team revealed that their results suggest the Transformer architecture has a built-in inductive bias towards forming specialized circuits, even without learnable attention weights. This knowledge could guide future AI design and lead to simpler, yet equally capable, AI models. This could truly change the landscape of AI accessibility and creation for everyone.
