Why You Care
Imagine your company’s data is under attack. Can an AI quickly find the culprit and explain what happened? A new benchmark, ExCyTIn-Bench, aims to answer that question. It evaluates how well large language model (LLM) agents perform cyber threat investigation. This directly impacts your digital security and the future of automated defense.
What Actually Happened
Researchers have unveiled ExCyTIn-Bench, a novel benchmark for evaluating LLM agents in cyber threat investigation, according to the announcement. It is the first benchmark of its kind, and it focuses on a challenge real-world security analysts face daily: sifting through diverse security signals, following multi-hop chains of evidence, and then compiling a detailed incident report.
ExCyTIn-Bench constructs a specialized dataset from a controlled Azure tenant. It covers 8 simulated real-world multi-step attacks, incorporates 57 log tables from Microsoft Sentinel and related services, and includes 589 automatically generated questions for assessing LLM agents. The team leverages expert-crafted detection logic to build threat investigation graphs from the security logs, then generates questions from paired nodes on those graphs. This process ensures automatic and explainable ground-truth answers, as detailed in the blog post.
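To make the paired-node idea concrete, here is a minimal sketch of how a question and a verifiable answer could be read off an investigation graph. The graph structure, node names, and question template below are invented for illustration; they are not the benchmark's actual schema or generation logic.

```python
# Toy illustration: derive a question, a ground-truth answer, and an
# explainable evidence chain from two nodes of an investigation graph.
# Node names and attributes are hypothetical.
import networkx as nx

def build_toy_graph() -> nx.DiGraph:
    g = nx.DiGraph()
    # Alerts and entities observed during a simulated multi-step attack.
    g.add_node("alert:phishing_email", kind="alert")
    g.add_node("entity:user@contoso.com", kind="account")
    g.add_node("alert:suspicious_signin", kind="alert")
    g.add_node("entity:203.0.113.7", kind="ip")
    # Edges follow the detection logic linking the evidence together.
    g.add_edge("alert:phishing_email", "entity:user@contoso.com")
    g.add_edge("entity:user@contoso.com", "alert:suspicious_signin")
    g.add_edge("alert:suspicious_signin", "entity:203.0.113.7")
    return g

def question_from_pair(g: nx.DiGraph, start: str, end: str) -> dict:
    """Turn a pair of graph nodes into a question with a verifiable answer."""
    hops = nx.shortest_path(g, start, end)  # the evidence chain between them
    return {
        "question": f"Starting from {start}, which entity does the "
                    f"investigation ultimately lead to?",
        "answer": end,            # ground truth read directly off the graph
        "evidence_chain": hops,   # makes the answer explainable
    }

g = build_toy_graph()
print(question_from_pair(g, "alert:phishing_email", "entity:203.0.113.7"))
```

Because the answer and the evidence chain both come straight from the graph, grading an agent's response needs no human labeling, which is what makes the ground truth automatic and explainable.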
Why This Matters to You
This benchmark is crucial for anyone interested in cybersecurity and AI because it highlights the current limitations of AI in complex security tasks. Think of it as a report card for AI security agents: your digital assets could eventually be protected by these systems, so their accuracy is incredibly important.
For example, imagine a phishing attack. An LLM agent would need to trace emails, login attempts, and network traffic, connect these disparate pieces of evidence, and then compile a coherent report. This is exactly what ExCyTIn-Bench tests. “We present ExCyTIn-Bench, the first benchmark to Evaluate an LLM agent x on the task of Cyber Threat Investigation through security questions derived from investigation graphs,” the paper states. In other words, we now have a standard way to measure these agents’ effectiveness. How confident are you in AI protecting your sensitive information?
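The sketch below shows, in miniature, the kind of multi-hop tracing such an agent has to perform. The table names, fields, and join logic are invented for the example; real Sentinel log tables have far richer schemas, and the benchmark's agents query them interactively rather than through hard-coded joins.

```python
# Toy multi-hop trace: phishing email -> compromised sign-in -> outbound traffic.
# All data and field names below are hypothetical.
email_events = [
    {"recipient": "alice@contoso.com", "subject": "Reset your password",
     "verdict": "phish"},
]
signin_logs = [
    {"user": "alice@contoso.com", "source_ip": "203.0.113.7",
     "result": "success"},
]
network_logs = [
    {"source_ip": "203.0.113.7", "dest": "attacker-c2.example", "bytes": 48213},
]

def trace_phish(recipient: str) -> list[dict]:
    """Follow the evidence chain across three separate log tables."""
    chain = [e for e in email_events
             if e["recipient"] == recipient and e["verdict"] == "phish"]
    for email in list(chain):
        for signin in signin_logs:
            if signin["user"] == email["recipient"]:
                chain.append(signin)
                chain.extend(n for n in network_logs
                             if n["source_ip"] == signin["source_ip"])
    return chain

for step in trace_phish("alice@contoso.com"):
    print(step)
```

Even this toy version needs three hops across heterogeneous tables; the benchmark's scenarios spread the evidence across dozens of tables and many more steps.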
Here’s a breakdown of the benchmark’s components:
- Simulated Attacks: 8 multi-step, real-world attack scenarios.
- Log Data: 57 log tables from Microsoft Sentinel and other services.
- Evaluation Questions: 589 automatically generated questions for assessment.
- Investigation Graphs: Built from security logs with expert detection logic.
This approach provides verifiable rewards for procedural tasks, and it can extend to training agents via reinforcement learning. That could lead to more capable AI security systems protecting you in the future.
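Here is a minimal sketch of what a verifiable reward signal could look like, assuming simple exact-match scoring against the graph-derived ground truth. The benchmark's actual reward function may use partial credit or a different scoring scheme; this only illustrates why graph-derived answers make rewards checkable.

```python
# Hypothetical exact-match reward against the graph-derived ground truth.
def reward(agent_answer: str, ground_truth: str) -> float:
    """Return 1.0 for a correct answer, 0.0 otherwise."""
    return 1.0 if agent_answer.strip().lower() == ground_truth.strip().lower() else 0.0

# Averaging per-question rewards gives the kind of aggregate score reported
# in the paper (one correct, one wrong -> 0.5 here).
episode_rewards = [
    reward("203.0.113.7", "203.0.113.7"),
    reward("198.51.100.2", "203.0.113.7"),
]
print(sum(episode_rewards) / len(episode_rewards))
```

Because every answer can be checked mechanically, the same signal that grades the benchmark could, in principle, drive reinforcement learning of investigation agents.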
The Surprising Finding
What might surprise you is the current performance of these AI models. Despite the rapid advancements in LLMs, the task remains quite difficult: with the base setting, the average reward across all evaluated models is 0.249, and the best score achieved was only 0.368, according to the study findings. That leaves substantial headroom for future research.
This is surprising because LLMs excel at many complex language tasks. Cyber threat investigation, however, demands deep contextual understanding and logical reasoning across vast, heterogeneous data. It challenges the common assumption that current LLMs can easily handle such intricate, multi-step analytical processes; the team revealed that even top models struggle significantly with these real-world scenarios.
What Happens Next
The creation of ExCyTIn-Bench marks a significant step, providing a clear path for improving AI in cybersecurity. We can expect to see more research focused on these specific challenges. Code and data for the benchmark are coming soon, which will allow other researchers to contribute and build upon these findings. New models and techniques are likely to emerge within the next 12-18 months.
For example, future LLM agents might integrate more specialized security knowledge or dedicated reasoning modules to better connect the dots in complex attack chains. The industry implications are clear: better benchmarks lead to better AI, which will eventually mean stronger digital defenses for everyone. Your company’s security posture could greatly benefit from these advancements. The benchmark also makes the pipeline reusable and readily extensible to new logs, as mentioned in the release, supporting continuous improvement and adaptation.
