Why You Care
Have you ever tried to finish a famous poem, only to find yourself grasping for the exact words? Imagine if AI struggled with this too, especially with culturally significant texts. A new study introduces GhazalBench, a tool to evaluate how well large language models (LLMs) understand and recall Persian ghazals—a vital part of Iranian culture. This research directly impacts how useful these AI models can be for you in real-world, culturally sensitive applications.
What Actually Happened
Researchers Ghazal Kalhor and Yadollah Yaghoobzadeh introduced GhazalBench, a specialized benchmark, as detailed in the paper. This tool assesses how large language models interact with Persian ghazals under ‘usage-grounded conditions.’ These conditions simulate how people actually use poetry, like quoting or completing verses from partial cues. The benchmark focuses on two main abilities for LLMs: producing faithful prose paraphrases of couplets and accessing canonical verses. This means checking if the AI can both understand the meaning and remember the exact wording of famous poems.
According to the paper, the study evaluated several proprietary and open-weight multilingual LLMs. The researchers aimed to see how these models performed on both meaning interpretation and precise recall of Persian poetry. This evaluation provides crucial insights into the current capabilities and limitations of AI when dealing with rich cultural content.
Why This Matters to You
This research reveals a significant finding for anyone interested in AI and cultural preservation. The study finds that while LLMs generally capture the poetic meaning of ghazals, they struggle with exact verse recall in completion-based tasks. This means your AI assistant might understand the gist of a poem but fail to quote it perfectly. However, recognition-based tasks, where the AI only needs to identify a verse rather than generate it, show much better performance. This suggests that the challenge isn't always about understanding, but about precise generation.
For example, imagine you’re using an AI to help you write a speech that includes a famous Persian ghazal. If the AI can’t recall the exact verse, it could lead to inaccuracies or a loss of cultural authenticity. This is where GhazalBench comes in, providing a clearer picture of AI’s capabilities.
Key Findings from GhazalBench:
- Meaning Capture: LLMs generally understand the poetic meaning of ghazals.
- Exact Recall: Models struggle with precise verse recall in completion tasks.
- Recognition vs. Recall: Recognition-based tasks significantly reduce the performance gap.
- Training Exposure: Limitations are linked to training data, not inherent architectural flaws.
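The gap between completion and recognition tasks described above can be illustrated with a minimal scoring sketch. Note that the function names, placeholder verses, and scoring rules here are illustrative assumptions, not GhazalBench's actual protocol:

```python
# Sketch contrasting completion-style (exact recall) scoring with
# recognition-style (multiple-choice) scoring. The strings below are
# placeholders, not GhazalBench data.

def score_completion(model_output: str, canonical_verse: str) -> bool:
    """Completion task: the model must reproduce the verse exactly."""
    return model_output.strip() == canonical_verse.strip()

def score_recognition(model_choice: int, correct_index: int) -> bool:
    """Recognition task: the model only picks the correct option."""
    return model_choice == correct_index

canonical = "placeholder for a canonical second hemistich"

# A paraphrase-like guess fails the exact-recall check...
assert not score_completion("a close paraphrase of the verse", canonical)

# ...but identifying the same verse among distractors can still succeed.
options = ["distractor 1", canonical, "distractor 2"]
assert score_recognition(options.index(canonical), 1)
```

The strict equality in `score_completion` is what makes generation so much harder than recognition: one changed word fails the whole verse, whereas a multiple-choice setup only asks the model to distinguish the canonical line from distractors.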
How might this impact your daily interactions with AI, especially when dealing with diverse cultural content?
The Surprising Finding
Here’s the twist: the research shows a ‘consistent dissociation’ in LLM performance. Models generally grasp poetic meaning but struggle with exact verse recall. This is surprising because you might expect an AI to excel at both. What’s more, a parallel evaluation on English sonnets showed markedly higher recall performance, according to the paper. This suggests that these limitations are tied to differences in training exposure rather than inherent architectural constraints. In simpler terms, it’s not that the AI can’t do it; it’s that it hasn’t been trained enough on Persian poetry.
This challenges the common assumption that general-purpose LLMs are equally adept across all languages and cultural contexts. It highlights a specific gap in their ability to handle the nuanced interplay of meaning and form in Persian ghazals. The team revealed that this disparity points directly to the need for more diverse and culturally specific training data for AI models.
What Happens Next
The findings from GhazalBench emphasize the need for new evaluation frameworks. These frameworks must jointly assess meaning, form, and cue-dependent access to culturally significant texts, as the paper argues. Expect more benchmarks like GhazalBench to emerge, targeting similar challenges in other languages and cultural domains. For example, future applications might include AI tools specifically trained to assist scholars or artists with historical texts, ensuring cultural accuracy.
For readers, this means you should be aware of the limitations of current LLMs when working with non-English or highly specialized cultural content. Always cross-reference AI-generated poetic content with reliable sources. The industry implications are clear: AI developers must prioritize more diverse and culturally rich training datasets. This will help bridge the gap between an AI’s general understanding and its ability to handle the specific formal demands of various cultural expressions. This research, according to the authors, is a vital step toward making AI truly globally competent.
