Anthropic’s latest Claude 3.5 Sonnet AI model has a new feature in public beta that can control a computer by looking at a screen, moving a cursor, clicking buttons, and typing text. The new feature, called “computer use,” is available today on the API, allowing developers to direct Claude to work on a computer like a human does, as shown on a Mac in the video below.
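On the API side, computer use is exposed as a special tool type on the Messages API. The sketch below assembles a request body the way the public beta documented it at launch; the tool type string, beta flag, and parameter names are assumptions drawn from that beta and may change as the feature evolves.

```python
# Sketch of a "computer use" request payload for the Anthropic Messages API.
# The tool type ("computer_20241022") and field names reflect the public beta
# at launch; treat them as assumptions and check the current docs.

def build_computer_use_request(prompt, width=1024, height=768):
    """Assemble the JSON body for a Messages API call with the computer tool."""
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "tools": [
            {
                "type": "computer_20241022",   # beta computer-use tool type
                "name": "computer",
                "display_width_px": width,     # resolution of the screenshots
                "display_height_px": height,   # Claude will be shown
            }
        ],
        "messages": [{"role": "user", "content": prompt}],
    }

request = build_computer_use_request("Open the report and save it as a PDF.")
```

In the beta, a body like this is sent along with an `anthropic-beta: computer-use-2024-10-22` header, and Claude replies with tool-use actions (take a screenshot, move the mouse, click, type) that the developer's own code must actually execute on the machine and report back.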
Microsoft’s Copilot Vision feature and OpenAI’s desktop app for ChatGPT have shown what their AI tools can do based on seeing your computer’s screen, and Google has similar capabilities in its Gemini app on Android phones. But they haven’t gone to the next step of widely releasing tools ready to click around and perform tasks for you like this. Rabbit promised similar capabilities for its R1, which it has yet to deliver.
Anthropic does caution that computer use is still experimental and can be “cumbersome and error-prone.” The company says, “We’re releasing computer use early for feedback from developers, and expect the capability to improve rapidly over time.”
There are many actions that people routinely do with computers (dragging, zooming, and so on) that Claude can’t yet attempt. The “flipbook” nature of Claude’s view of the screen—taking screenshots and piecing them together, rather than observing a more granular video stream—means that it can miss short-lived actions or notifications.
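The "flipbook" loop described above can be sketched as a simple screenshot-then-act cycle; the function names here are illustrative, not Anthropic's, and the point is that the model only ever sees the still frames the loop captures.

```python
def run_agent_loop(take_screenshot, ask_model, execute_action, max_turns=10):
    """Drive one task with a screenshot -> model -> action loop.

    The model only ever sees the instants captured by take_screenshot();
    anything that appears and disappears between two frames (a toast
    notification, a brief progress dialog) is invisible to it.
    """
    history = []
    for _ in range(max_turns):
        frame = take_screenshot()           # one still "flipbook" frame
        action = ask_model(history, frame)  # model picks click/type/done
        if action["type"] == "done":
            break
        execute_action(action)              # host code moves the real cursor
        history.append(action)
    return history
```

Because sampling is discrete rather than a continuous video stream, any event shorter than the interval between two `take_screenshot()` calls simply never reaches the model, which is the limitation the passage describes.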
Also, this version of Claude has apparently been told to steer clear of certain sensitive activities: Anthropic says it has “measures to monitor when Claude is asked to engage in election-related activity, as well as systems for nudging Claude away from activities like generating and posting content on social media, registering web domains, or interacting with government websites.”
Meanwhile, Anthropic says its new Claude 3.5 Sonnet model has improvements in many benchmarks and is offered to customers at the same price and speed as its predecessor:
The updated Claude 3.5 Sonnet shows wide-ranging improvements on industry benchmarks, with particularly strong gains in agentic coding and tool use tasks. On coding, it improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models—including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding. It also improves performance on TAU-bench, an agentic tool use task, from 62.6% to 69.2% in the retail domain, and from 36.0% to 46.0% in the more challenging airline domain.