Gemini 2.5 Computer Use Model: AI Agents Navigate UIs

Google DeepMind's new model empowers AI to interact with web and mobile interfaces like humans.

Google DeepMind has released the Gemini 2.5 Computer Use model, allowing AI agents to navigate and interact with user interfaces. This model, available through the Gemini API, can perform tasks like filling forms and clicking buttons. It shows strong performance in web and mobile control benchmarks.


By Katie Rowan

October 8, 2025

5 min read


Why You Care

Ever wish your digital assistant could do more than just answer questions? What if it could actually use your apps for you? Google DeepMind just made a big leap forward in that direction. They’ve launched the Gemini 2.5 Computer Use model. This new AI can interact with user interfaces just like you do. It promises to redefine how AI agents assist with everyday digital tasks. Your digital life could soon become much smoother and more automated.

What Actually Happened

Google DeepMind recently introduced the Gemini 2.5 Computer Use model. This specialized AI is built on the capabilities of Gemini 2.5 Pro. It is designed to power agents that can interact directly with user interfaces. According to the announcement, this model is now available in preview via the Gemini API. Developers can access it on Google AI Studio and Vertex AI. The team revealed that this model outperforms others in web and mobile control benchmarks. It also boasts lower latency, making interactions quicker and more efficient. The computer_use tool in the Gemini API exposes its core functions.
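
A minimal sketch of how a developer might enable the tool, assuming the google-genai Python SDK: the computer_use tool field, the Environment enum, and the preview model ID below are assumptions drawn from the announcement and may differ from the shipped SDK.

```python
# Minimal sketch (not the official quickstart) of enabling the computer_use tool.
# The Tool/ComputerUse/Environment names and the model ID are assumptions based
# on the preview announcement and may differ in the shipped SDK.
from google import genai
from google.genai import types

client = genai.Client()  # expects an API key in the environment

config = types.GenerateContentConfig(
    tools=[types.Tool(
        computer_use=types.ComputerUse(
            environment=types.Environment.ENVIRONMENT_BROWSER
        )
    )]
)

response = client.models.generate_content(
    model="gemini-2.5-computer-use-preview-10-2025",  # assumed preview model ID
    contents=["Open the pricing page and add the starter plan to the cart."],
    config=config,
)

# Rather than a plain text answer, the response is expected to carry a function
# call describing a UI action, such as a click or a text entry.
print(response.candidates[0].content.parts)
```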

This model works in a loop. It takes your request, a screenshot of the digital environment, and a history of recent actions. The model then analyzes these inputs. It generates a response, typically a function call representing a UI action. This could be clicking a button or typing text, as detailed in the blog post. After the action executes, a new screenshot and URL are sent back. This iterative process continues until the task is complete. This allows for complex, multi-step interactions with web and mobile applications.
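
To make that loop concrete, here is a rough Python sketch of the observe-and-act cycle. The helpers passed in (request_next_action, execute_ui_action, capture_screenshot) are hypothetical stand-ins for the Gemini API call and a browser-automation layer; only the loop structure mirrors the description above.

```python
from dataclasses import dataclass, field

@dataclass
class UIAction:
    name: str                  # e.g. "click_at", "type_text_at", or "done"
    args: dict = field(default_factory=dict)

def run_agent(user_request, request_next_action, execute_ui_action,
              capture_screenshot, max_steps=20):
    """Drive the observe-act loop until the model signals the task is complete."""
    history = []                        # recent actions, fed back to the model
    screenshot = capture_screenshot()   # current state of the UI

    for _ in range(max_steps):
        # 1. Send the request, the latest screenshot, and the action history.
        action = request_next_action(user_request, screenshot, history)

        # 2. Stop when the model indicates the task is finished.
        if action.name == "done":
            break

        # 3. Execute the proposed UI action (click, type, scroll, ...).
        execute_ui_action(action)
        history.append(action)

        # 4. Capture the new state and iterate.
        screenshot = capture_screenshot()
```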

Why This Matters to You

This new model has significant implications for how you interact with software. Imagine your AI assistant not just telling you the weather, but actually booking your flight. Or perhaps it could fill out a complicated online form for you. The ability for AI to natively interact with graphical user interfaces is a crucial next step. It moves us toward truly general-purpose agents. This means less time spent on repetitive digital tasks for you.

For example, think about online shopping. Instead of manually navigating multiple pages to find a specific item, your AI could do it. It could apply filters, compare prices, and even complete the purchase with your confirmation. This capability extends beyond simple tasks. It can handle complex interactions behind logins, like managing your bank accounts or health records (with proper security, of course). This could free up valuable time in your day.
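
That "with your confirmation" detail suggests a simple safeguard pattern: gate actions you consider sensitive behind an explicit user prompt before the agent executes them. A minimal sketch, with all names hypothetical rather than part of any official schema:

```python
# Hypothetical confirmation gate for sensitive steps such as purchases or
# anything behind a login. The action names here are illustrative only.
SENSITIVE_ACTIONS = {"place_order", "submit_payment", "transfer_funds"}

def confirm_if_sensitive(action_name: str, description: str) -> bool:
    """Return True if the agent may proceed with this action."""
    if action_name not in SENSITIVE_ACTIONS:
        return True
    answer = input(f"The agent wants to: {description}. Allow? [y/N] ")
    return answer.strip().lower() == "y"
```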

Key Capabilities of the Gemini 2.5 Computer Use Model (illustrated with a short sketch after the list):

  • Native UI Interaction: Directly clicks, types, and scrolls on web and mobile interfaces.
  • Form Filling: Automatically completes and submits online forms.
  • Interactive Element Manipulation: Operates dropdowns, filters, and other dynamic elements.
  • Behind Login Operations: Can navigate and perform tasks within secure, logged-in environments.
  • Lower Latency: Faster response times compared to previous models.
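
The model expresses these capabilities as function calls. The shapes below are purely illustrative: the action names and pixel-coordinate convention are assumptions, not the documented schema.

```python
# Illustrative shapes of UI-action function calls an agent loop might receive.
# The action names and coordinate convention are assumptions for illustration.
example_actions = [
    {"name": "click_at", "args": {"x": 420, "y": 180}},                          # click a button
    {"name": "type_text_at", "args": {"x": 300, "y": 240, "text": "Jane Doe"}},  # fill a form field
    {"name": "scroll_document", "args": {"direction": "down"}},                  # reveal more content
]

for action in example_actions:
    print(f"{action['name']}({action['args']})")
```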

How much time could you save if your AI could handle your most tedious online tasks?

The Surprising Finding

Here’s a twist: while many AI models interact with software through structured APIs, this model excels at direct UI interaction. According to the announcement, the Gemini 2.5 Computer Use model performs strongly on multiple web and mobile control benchmarks. This is notable because direct UI interaction has long been considered harder for AI: operating graphical interfaces the way humans do presents unique challenges, including visual recognition and understanding context. The model’s success in these areas challenges common assumptions about AI limitations and suggests a more intuitive future for AI assistance.

The model leads in key benchmarks:

  • Online-Mind2Web: The Gemini 2.5 Computer Use model achieved the highest score.
  • WebVoyager: This model also demonstrated superior performance.
  • AndroidWorld: It led the pack in mobile UI control tasks.

This performance is particularly notable. It shows the model’s ability to navigate unstructured and dynamic digital environments effectively. The team notes that the model is optimized primarily for web browsers, but it also shows strong promise for mobile UI control tasks. This indicates broad applicability across different platforms.

What Happens Next

The model is currently available in preview, and we can expect wider integration and refinement in the coming months. Developers are already building with it through Google AI Studio and Vertex AI, and the first consumer-facing applications could emerge in early to mid-2026. These applications could range from personal assistants to automated customer service bots. Imagine an AI that can troubleshoot your software issues by actually using the program. The company reports that feedback from developers will be crucial in refining the model for broader release.

Actionable advice for you: keep an eye on your favorite apps. Future updates might include new AI-powered features that automate tasks you currently do manually. The industry implications are vast: this could reshape how businesses handle online operations and how individuals manage their digital lives. The model is not yet optimized for desktop OS-level control, but its web and mobile capabilities open many doors. This marks a significant step towards more autonomous and capable AI agents.
