LLMs Reveal English Has Hidden Long-Range Patterns

New research shows large language models uncover deep structural connections in language, impacting AI and linguistics.

A recent paper by Colin Scheibner and colleagues demonstrates that large language models (LLMs) can detect long-range structure in English texts. This finding suggests that the entropy, or unpredictability, of language continues to decrease over much longer contexts than previously thought, offering new insights into how language works and how AI processes it.


By Katie Rowan

January 3, 2026

4 min read


Key Facts

  • Large language models (LLMs) were used to analyze English texts.
  • The study focused on uncovering long-ranged structure in English.
  • Conditional entropy continued to decrease with context lengths up to N ~ 10^4.
  • The research challenges previous assumptions about language predictability.
  • The paper was authored by Colin Scheibner, Lindsay M. Smith, and William Bialek.

Why You Care

Ever wonder how your favorite AI chatbot seems to understand your complex thoughts, even across many sentences? How does it predict the next word so accurately? This isn’t just about simple grammar. A new study suggests something much deeper is at play. It reveals that large language models (LLMs) are uncovering hidden patterns in English that span vast distances within a text. This matters because it changes our understanding of language itself and how AI processes information. What if the true complexity of human language is far greater than we imagined, and LLMs are just beginning to show us its secrets?

What Actually Happened

Researchers Colin Scheibner, Lindsay M. Smith, and William Bialek have published a paper exploring the “entropy of English” using large language models. Entropy, in this context, refers to the unpredictability or randomness of language. The team used LLMs to analyze English texts, looking for long-ranged structure: statistical connections between words or phrases that are far apart. They found that the conditional entropy—essentially, how much new information each word carries given its context—continues to decrease as the context grows. This reduction persists out to context lengths of N ~ 10^4, meaning patterns extend over thousands of words.
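The core quantity here, conditional entropy as a function of context length, can be illustrated with a toy sketch. The snippet below is not the authors’ method (they use LLM code lengths on real corpora); it estimates H(next character | previous k characters) from simple n-gram counts on a repetitive sample string, just to show the entropy falling as the context k grows:

```python
from collections import Counter
import math

# Toy corpus: highly repetitive, so longer contexts make the next
# character much more predictable.
text = "the quick brown fox jumps over the lazy dog " * 200

def conditional_entropy(text, k):
    """Plug-in estimate of H(next char | previous k chars), in bits."""
    ctx_counts = Counter()    # how often each k-character context occurs
    pair_counts = Counter()   # how often each (context, next char) pair occurs
    for i in range(k, len(text)):
        ctx = text[i - k:i]
        ctx_counts[ctx] += 1
        pair_counts[(ctx, text[i])] += 1
    total = sum(pair_counts.values())
    h = 0.0
    for (ctx, ch), n in pair_counts.items():
        p_pair = n / total            # joint probability p(context, next)
        p_next = n / ctx_counts[ctx]  # conditional probability p(next | context)
        h -= p_pair * math.log2(p_next)
    return h

# Entropy per character shrinks as the context window k grows.
for k in (0, 1, 2, 4):
    print(k, round(conditional_entropy(text, k), 3))
```

In the paper the same curve is traced with LLM predictions instead of n-gram counts, and the striking result is that on natural English it keeps falling out to contexts of order 10^4.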

Why This Matters to You

This research has direct implications for how you interact with AI and how AI understands your communication. It suggests that LLMs aren’t just good at predicting the next word locally; they are grasping much broader linguistic relationships. Imagine you’re writing a long email or a complex story. Your choice of words in one paragraph might subtly influence the best word choice several paragraphs later. This is the kind of long-range dependency the study highlights. As the team reports, the conditional entropy, or code length, in many cases continues to decrease with context length at least to N ~ 10^4. This means AI models are seeing connections you might not even consciously realize are there.

Here are some impacts:

  • Improved AI Comprehension: LLMs can better understand nuanced meaning in lengthy documents.
  • More Coherent AI Generation: AI-generated text will likely become more consistent and logical over long passages.
  • Deeper Linguistic Insights: Researchers gain new tools to study the fundamental structure of human language.
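The phrase “code length” above comes from a standard equivalence: a better predictive model assigns shorter codes to text, so compressed size is a proxy for entropy. As a rough illustration (using `zlib` as a stand-in for an LLM’s code length, which is not what the paper does), the bits per character of a compressed document drop as the compressor sees more context and can exploit long-range repetition:

```python
import zlib

def bits_per_char(s):
    """Compressed size in bits per character: a crude code-length proxy."""
    return 8 * len(zlib.compress(s.encode(), 9)) / len(s)

# A document with long-range repetition (the same paragraph recurring).
paragraph = "Call me Ishmael. Some years ago, never mind how long precisely. "
doc = paragraph * 200

# With more preceding context available, the code length per character falls.
for n in (64, 512, 4096):
    print(n, round(bits_per_char(doc[:n]), 2))
```

A general-purpose compressor only exploits repeats within its window; the paper’s point is that LLM-based code lengths keep shrinking over far longer spans than such local redundancy would predict.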

Think of it as understanding a symphony. It’s not just about the notes in a single bar. It’s about how themes develop and intertwine across the entire piece. How might this deeper understanding of language affect your future interactions with AI assistants or content creation tools?

The Surprising Finding

Here’s the twist: common assumptions about language predictability might be incomplete. Many believe that the predictability of language mostly depends on local context—the words just before and after. However, the research shows that LLMs uncover patterns that extend far beyond this. The conditional entropy continues to decrease even after conditioning on thousands of preceding words, as detailed in the paper. This suggests that English possesses a much more intricate, long-range structure than previously acknowledged, and it challenges the idea that language becomes effectively random beyond a short context length. The finding is surprising because it implies that the ‘memory’ of language—its statistical dependencies—is incredibly long. It’s not just about the last few words; it’s about the entire narrative or argument you’re building.
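One common way to make “long-range statistical dependence” concrete is mutual information between symbols separated by a distance d. The toy sketch below (not the paper’s analysis) estimates I(X_i; X_{i+d}) for a structured sequence and for a shuffled copy of it; structure keeps the mutual information well above zero even at large d, while shuffling destroys it:

```python
import math, random
from collections import Counter

def mutual_info(seq, d):
    """Plug-in estimate of I(X_i; X_{i+d}) in bits, from symbol-pair counts."""
    pairs = [(seq[i], seq[i + d]) for i in range(len(seq) - d)]
    n = len(pairs)
    pa = Counter(a for a, _ in pairs)   # marginal counts of the first symbol
    pb = Counter(b for _, b in pairs)   # marginal counts of the second symbol
    pab = Counter(pairs)                # joint counts
    mi = 0.0
    for (a, b), c in pab.items():
        p = c / n
        # p(a,b) * log2( p(a,b) / (p(a) p(b)) ), with counts substituted
        mi += p * math.log2(c * n / (pa[a] * pb[b]))
    return mi

# Structured sequence (repeating motif) vs. a shuffled version of it.
text = list("abracadabra " * 500)
random.seed(0)
shuffled = text[:]
random.shuffle(shuffled)

for d in (1, 100, 1000):
    print(d, round(mutual_info(text, d), 3), round(mutual_info(shuffled, d), 3))
```

In real English the dependence at large d is much subtler than in this periodic toy, which is why sensitive model-based estimators like LLM code lengths are needed to detect it.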

What Happens Next

This discovery will likely influence the creation of future LLMs over the next 12-24 months. Expect to see models designed to explicitly use these long-range dependencies. For example, imagine an AI writing assistant that maintains thematic consistency throughout a 50-page novel. This research provides a foundation for such advancements. For you, this means potentially more AI tools for writing, research, and even creative tasks. The industry implications are significant, pushing AI closer to human-like understanding of complex texts. Actionable advice for readers includes staying informed about advancements in LLM context window capabilities. What’s more, consider experimenting with new AI tools as they emerge, to see how they handle longer, more intricate tasks. This field is evolving rapidly, and your understanding of its nuances will only grow.
