LLMs Struggle with Deployable Cloud Infrastructure, New Study Finds

A recent paper reveals that leading AI models fall short in generating functional Infrastructure-as-Code for cloud deployments.

Despite advances, Large Language Models (LLMs) are failing to create truly deployable Infrastructure-as-Code (IaC), according to new research. The study found that six state-of-the-art LLMs achieved only 20.8% average deployability, highlighting a critical gap between syntactic correctness and practical utility.

By Mark Ellison

January 6, 2026

4 min read

Key Facts

  • Six state-of-the-art LLMs achieved only 20.8% deployability for Infrastructure-as-Code (IaC).
  • Current LLM evaluations for IaC often overlook deployability, focusing instead on syntactic correctness.
  • The research was accepted by FSE 2026, indicating its significance in software engineering.
  • The paper is titled "Deployability-Centric Infrastructure-as-Code Generation: Fail, Learn, Refine, and Succeed through LLM-Empowered DevOps Simulation".
  • The study highlights the need for LLMs to generate functional, not just syntactically correct, cloud infrastructure templates.

Why You Care

Ever wondered if AI could fully automate your cloud infrastructure setup? Imagine telling an AI exactly what you need, and it just works. This vision of effortless cloud provisioning is still a distant dream, according to a new study. The research highlights a significant challenge for Large Language Models (LLMs) in generating truly functional cloud configurations. Why should this matter to you? If you’re relying on AI for your development and operations (DevOps), understanding these limitations is crucial for your project’s success.

What Actually Happened

A paper titled “Deployability-Centric Infrastructure-as-Code Generation: Fail, Learn, Refine, and Succeed through LLM-Empowered DevOps Simulation” sheds light on an essential issue. The research, accepted at FSE 2026, investigates how well LLMs generate Infrastructure-as-Code (IaC). IaC is the practice of managing and provisioning computing infrastructure through machine-readable definition files. While LLMs show promise in creating these templates from natural language, their output often lacks a vital quality: deployability. That means the generated code might look correct but won’t actually work in a real cloud environment.

Tianyi Zhang and four other authors conducted this study. They found that current evaluations focus on syntactic correctness, overlooking whether the IaC configuration files can actually be deployed. Across the six state-of-the-art LLMs they tested, the average deployability score was surprisingly low: just 20.8%.

Why This Matters to You

This finding has direct implications for anyone involved in cloud development or DevOps. If you’re using or considering LLMs to automate your infrastructure, you need to understand this limitation. Think of it as building a car that looks finished but won’t start. The problem isn’t the blueprint’s appearance, but its functionality.

“Current evaluation focuses on syntactic correctness while ignoring deployability, the essential measure of the utility of IaC configuration files,” the paper states. This means a significant gap exists between what LLMs produce and what developers actually need. What kind of impact could this have on your creation pipeline?

Consider this breakdown of LLM IaC generation issues:

  • Syntactic Correctness: The code follows grammar rules.
  • Semantic Correctness: The code makes logical sense.
  • Deployability: The code runs successfully in a real environment.

For example, an LLM might generate a YAML file for an AWS S3 bucket. However, it might miss a crucial permission setting or a region-specific configuration. This oversight would prevent actual deployment. Your team would then spend valuable time debugging AI-generated code. This negates the efficiency gains you hoped for.
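The distinction between the three levels above can be made concrete with a small sketch. The template and the "required settings" check below are purely illustrative, not a real CloudFormation validator: the point is that a template can pass a syntax check while still missing properties a real deployment would need.

```python
import json

# A hypothetical CloudFormation-style template for an S3 bucket.
# It is syntactically valid JSON, but (for illustration) omits a
# property such as AccessControl that a deployment target may require.
template_text = """
{
  "Resources": {
    "MyBucket": {
      "Type": "AWS::S3::Bucket",
      "Properties": {"BucketName": "example-bucket"}
    }
  }
}
"""

def is_syntactically_valid(text):
    """Syntactic check only: does the template parse at all?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def missing_deployment_settings(text, required=("AccessControl",)):
    """Toy deployability heuristic (hypothetical): list resource
    properties that are absent but assumed necessary to deploy."""
    doc = json.loads(text)
    missing = []
    for name, resource in doc.get("Resources", {}).items():
        props = resource.get("Properties", {})
        for key in required:
            if key not in props:
                missing.append((name, key))
    return missing

print(is_syntactically_valid(template_text))       # True: it parses fine
print(missing_deployment_settings(template_text))  # [('MyBucket', 'AccessControl')]
```

The template sails through the syntax check, yet the heuristic flags a gap that would only surface at deployment time, which is exactly the kind of mismatch the study measures.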

The Surprising Finding

Here’s the twist: despite the capabilities of modern LLMs, their performance on deployability was strikingly low. The study found that the six LLMs achieved only 20.8% deployability on average. This figure is quite unexpected. Many might assume that if an LLM can generate syntactically correct code, it should also be functional. This research challenges that assumption directly.

Why is this so surprising? We often hear about LLMs writing complex code snippets or even entire functions. We might expect them to handle cloud infrastructure definitions with similar proficiency. However, the nuances of cloud environments, with their intricate dependencies and real-world constraints, prove to be a much harder test. It seems that ‘looking right’ is not the same as ‘working right’ in the complex world of cloud deployments.

What Happens Next

The findings from this research, accepted for FSE 2026, suggest a clear path forward. Developers and researchers need to shift their focus. The emphasis should move from just generating correct syntax to ensuring actual deployability. This will likely involve more feedback loops and simulation environments for LLMs. We can expect to see new tools emerging in the next 12-18 months that address this gap.

For example, imagine a system where an LLM generates IaC. The output is then immediately deployed in a simulated cloud environment. If it fails, the system provides specific error feedback to the LLM for refinement. This iterative ‘fail, learn, refine, and succeed’ approach, as the paper’s title suggests, is crucial. For you, this means staying updated on LLM-powered DevOps simulation tools that can help bridge the current deployability gap. The industry implications are significant, pushing for more robust and reliable AI assistance in cloud infrastructure management.
