Why You Care
Ever wonder why your AI sometimes gives you brilliant answers, and other times, it just… doesn’t? What if a single word or phrase could drastically change its performance? New research from Mohamed Insaf Ismithdeen, Muhammad Uzair Khattak, and Salman Khan suggests this is precisely the case for Large Multimodal Models (LMMs).
This finding is crucial for anyone using or building AI. It means the way you phrase your questions directly impacts the AI’s accuracy. This insight could reshape how we interact with AI and evaluate its true capabilities. Are you getting the best out of your AI tools?
What Actually Happened
A recent paper, “Promptception: How Sensitive Are Large Multimodal Models to Prompts?” delves into the often-overlooked aspect of prompt design. The study focuses on LMMs used in Multiple-Choice Question Answering (MCQA). According to the authors, prompt sensitivity in these models remains poorly understood.
The researchers developed ‘Promptception,’ a systematic framework for evaluating prompt sensitivity. It comprises 61 prompt types, organized into 15 categories and 6 supercategories. The team evaluated 10 LMMs, ranging from open-source options to proprietary models like GPT-4o and Gemini 1.5 Pro, against three MCQA benchmarks: MMStar, MMMU-Pro, and MVBench.
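To make that setup concrete, here is a minimal sketch of what a prompt-sensitivity check for MCQA might look like in practice. The prompt templates, the `ask_model` call, and the dataset format are illustrative assumptions for this sketch, not the authors’ actual code or prompts.

```python
# Hypothetical sketch of a prompt-sensitivity check for MCQA.
# `ask_model` stands in for whichever LMM API you are using; the prompt
# templates and dataset fields below are illustrative, not from the paper.

PROMPT_VARIANTS = [
    "Answer with the letter of the correct option.\n{question}\n{options}",
    "Read the question and pick the best choice (A-D).\n{question}\n{options}",
    "{question}\n{options}\nRespond with only one letter.",
]

def evaluate_variant(template, dataset, ask_model):
    """Return the accuracy of one prompt template over an MCQA dataset."""
    correct = 0
    for item in dataset:  # each item: question text, options text, gold letter
        prompt = template.format(question=item["question"], options=item["options"])
        prediction = ask_model(prompt).strip().upper()[:1]
        correct += prediction == item["answer"]
    return correct / len(dataset)

def prompt_sensitivity(dataset, ask_model):
    """Accuracy per prompt variant, plus the spread (max - min) across variants."""
    scores = {t: evaluate_variant(t, dataset, ask_model) for t in PROMPT_VARIANTS}
    spread = max(scores.values()) - min(scores.values())
    return scores, spread
```

The spread returned here is the kind of accuracy deviation the study measures: the same model and the same questions, with only the prompt wording changing.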
Why This Matters to You
This research has practical implications for your daily AI use. The study found that even minor variations in prompt phrasing can lead to accuracy deviations. These deviations can be as high as 15% for certain prompts and models, as detailed in the paper. This variability poses a significant challenge for transparent AI evaluation.
Imagine you’re a content creator using an LMM to generate marketing copy. If a slight rephrasing of your prompt leads to a 15% drop in quality, you’re not getting consistent results. This makes it difficult to trust the AI’s output reliably. The researchers note that models often show their best performance only with carefully selected prompts.
Consider this scenario:
| Prompt Variation | Expected Accuracy | Actual Accuracy (Example) |
| --- | --- | --- |
| ‘Summarize this document.’ | High | 85% |
| ‘Give me a summary of this document.’ | High | 70% |
| ‘Can you please summarize this document for me?’ | High | 75% |
This table illustrates how subtle changes can impact outcomes. How much attention do you pay to your prompt wording? This study suggests you should pay a lot more. What’s more, the paper states that more robust and fair model evaluation is now possible with its new Prompting Principles.
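To put a number on that spread, here is a tiny calculation using the illustrative accuracies from the table above (hypothetical values for this example, not results from the paper):

```python
# Accuracies from the illustrative table above (hypothetical values).
accuracies = {
    "Summarize this document.": 0.85,
    "Give me a summary of this document.": 0.70,
    "Can you please summarize this document for me?": 0.75,
}

best = max(accuracies, key=accuracies.get)
worst = min(accuracies, key=accuracies.get)
deviation = accuracies[best] - accuracies[worst]

print(f"Best prompt:  {best!r} ({accuracies[best]:.0%})")
print(f"Worst prompt: {worst!r} ({accuracies[worst]:.0%})")
print(f"Accuracy deviation: {deviation:.0%}")  # 15 percentage points
```

A 15-point gap between the best and worst phrasing is exactly the kind of deviation the study reports for some prompt-model combinations.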
The Surprising Finding
Here’s the twist: the study revealed a surprising difference between proprietary and open-source models. The research shows that proprietary models, such as GPT-4o and Gemini 1.5 Pro, exhibit greater sensitivity to prompt phrasing. This reflects a tighter alignment with instruction semantics, according to the paper.
Conversely, open-source models were found to be steadier in their responses. However, the paper explains that they struggle with nuanced and complex phrasing. This finding challenges the common assumption that all AI models behave similarly. It highlights that the underlying architecture influences how prompts are interpreted.
For example, if you’re using a proprietary model for highly specific tasks, precise prompt engineering is essential. If you’re using an open-source model, you might find it more forgiving of simpler prompts. However, you might need to be more direct for complex instructions.
What Happens Next
The insights from ‘Promptception’ will likely influence future AI development and evaluation. The study, accepted to EMNLP 2025, proposes Prompting Principles tailored to both proprietary and open-source LMMs. This could lead to more standardized testing methods within the next 12-18 months.
One concrete example of a future application could be automated prompt optimization tools. These tools could analyze your input and suggest rephrasing for better accuracy. Industry implications include a push for more transparent AI benchmarks. Developers might start sharing prompt libraries that yield optimal results.
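As a rough sense of what such a tool might do, here is a hedged sketch of a brute-force prompt optimizer that tries several rephrasings and keeps the best-scoring one. The `rephrase` and `score` callables are placeholders for whatever paraphrasing model and evaluation metric a real tool would plug in.

```python
# Hypothetical sketch of an automated prompt optimizer.
# `rephrase` and `score` are placeholders: a real tool might use a
# paraphrasing model and a small validation set of labeled examples.

def optimize_prompt(base_prompt, rephrase, score, n_candidates=5):
    """Try several rephrasings of a prompt and return the highest-scoring one."""
    candidates = [base_prompt] + [rephrase(base_prompt) for _ in range(n_candidates)]
    scored = [(score(p), p) for p in candidates]
    best_score, best_prompt = max(scored, key=lambda pair: pair[0])
    return best_prompt, best_score
```

Even this naive search-and-score loop captures the basic idea: if prompt wording moves accuracy by double digits, it is worth treating the prompt itself as something to tune.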
For you, this means a future where AI interactions are more predictable. Start experimenting with different prompt phrasings for your current AI tasks. This will help you understand your specific model’s sensitivities. The team reports that their framework enables more robust and fair model evaluation, which is a win for everyone.
