Unmasking AI-Generated Websites: Are You Browsing Bots?

New research reveals a surge in LLM-dominant web content and a powerful detection method.

A new study highlights the growing prevalence of websites primarily written by large language models (LLMs). These 'LLM-dominant' sites often contain unreliable information. Researchers have developed a highly accurate system to detect them, raising important questions about the future of online content.


By Katie Rowan

October 14, 2025

4 min read


Key Facts

  • Web content increasingly generated by Large Language Models (LLMs) is termed "LLM-dominant."
  • LLM-dominant content can be unreliable and unethical due to plagiarism and hallucination.
  • Existing LLM detectors are inaccurate on web content because they are optimized for clean, prose-like data.
  • Researchers developed a new pipeline that classifies entire websites, achieving 100% accuracy in testing.
  • LLM-dominant sites are growing in prevalence and rank highly in search engine results.

Why You Care

Have you ever wondered if the article you’re reading was written by a human or a machine? It’s becoming harder to tell. New research reveals a significant increase in websites primarily generated by large language models (LLMs). This content, dubbed ‘LLM-dominant,’ can be unreliable and unethical. Why should you care? Because it impacts the trustworthiness of the information you consume daily. Your ability to discern fact from AI-generated fiction is at stake.

What Actually Happened

Researchers Sichang Steven He, Ramesh Govindan, and Harsha V. Madhyastha have published a preprint detailing their findings on LLM-dominant web content. According to the announcement, web content is increasingly being created by LLMs with minimal human input. The team refers to this as “LLM-dominant” content. This trend is problematic because LLMs can “plagiarize and hallucinate,” as the paper states, leading to untrustworthy information. The study finds that current LLM detectors struggle with web content due to its complexity and diverse formats. To address this, the researchers developed a new detection pipeline. This system classifies entire websites rather than individual pages, aggregating detector outputs from multiple prose-like pages on a site to improve accuracy. This site-level approach significantly boosts detection reliability.
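The core idea of site-level classification can be illustrated with a minimal sketch. This is a hypothetical simplification, not the authors' actual pipeline: the function names, thresholds, and aggregation rule below are assumptions for illustration. It assumes some per-page detector has already produced a score in [0, 1] for each prose-like page, and labels the site by the fraction of pages the detector flags.

```python
def classify_site(page_scores, page_threshold=0.5, site_threshold=0.5):
    """Label a whole site as LLM-dominant by aggregating per-page scores.

    page_scores: list of floats in [0, 1], each a hypothetical per-page
    LLM-detector score for one prose-like page sampled from the site.
    Thresholds are illustrative, not taken from the paper.
    """
    if not page_scores:
        raise ValueError("need at least one prose-like page")
    # Flag each page whose detector score crosses the page threshold...
    flagged = [score >= page_threshold for score in page_scores]
    # ...then label the site by the fraction of flagged pages.
    llm_fraction = sum(flagged) / len(flagged)
    return llm_fraction >= site_threshold

# Example: 3 of 4 sampled pages look LLM-generated, so the site is flagged.
print(classify_site([0.9, 0.8, 0.7, 0.2]))  # True
```

The intuition is that any single page score is noisy, but averaging a decision over many pages from the same site smooths out per-page errors, which is why a site-level verdict can be far more reliable than page-by-page detection.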

Why This Matters to You

This research has practical implications for your online experience. Imagine you’re researching a critical health issue. If the information comes from an LLM-dominant site, its accuracy could be questionable. The study’s findings suggest that many websites you encounter could be AI-generated. “Since LLMs plagiarize and hallucinate, LLM-dominant content can be unreliable and unethical,” the team revealed. This directly affects the quality of information available to you.

Consider these potential impacts on your digital life:

  • Misinformation Spread: AI-generated content can quickly disseminate false or misleading information.
  • Erosion of Trust: Your trust in online sources may decline if you can’t distinguish human from AI content.
  • SEO Manipulation: LLM-dominant sites are ranking highly in search results, potentially obscuring quality human-created content.

How much of your daily online browsing is already influenced by AI-generated content? It’s a pertinent question for every internet user. The researchers found that LLM-dominant sites are growing in prevalence. They also rank highly in search results, raising questions about their impact on end users and the overall Web environment.

The Surprising Finding

Here’s the twist: despite the challenges, the researchers achieved remarkable accuracy. While LLM detectors are often inaccurate on web content, this new pipeline boasts exceptional performance. The team revealed that they obtained 100% accuracy when testing their detector across two distinct ground-truth datasets totaling 120 sites. This is surprising because web content presents unique difficulties: low positive rates, complex markup, and diverse genres, unlike the clean, prose-like data most detectors are optimized for. This level of accuracy challenges the common assumption that detecting AI-generated web content is an insurmountable task. It suggests that a targeted, site-level approach is far more effective than page-by-page analysis.

What Happens Next

The implications of this research will unfold over the coming months and years. We can expect to see increased efforts from search engines to identify and potentially downrank LLM-dominant content. For example, imagine Google integrating a similar detection system into its ranking algorithms by late 2025. This could significantly reshape search results. Content creators and website owners will need to be more transparent about their use of AI. Your favorite news sites might soon display disclosures about AI assistance. The industry implications are vast, pushing for higher standards of content authenticity. The team’s work, presented at the ACM Internet Measurement Conference 2025, provides a crucial tool. It will help us navigate an increasingly AI-driven web. “We find LLM-dominant sites are growing in prevalence and rank highly in search results,” the paper indicates, highlighting the urgency of this issue. You, as a user, will benefit from more reliable information online.
