Anthropic’s browser agent got hijacked 31.5% of the time before safeguards engaged

This spring, Anthropic reported the highest prompt injection figures among frontier labs. When its latest model, Claude Opus 4.8, was tested in a browser, red-team attackers hijacked it 31.5% of the time before any safeguards were activated. OpenAI, Google, and Meta did not publish comparable figures, which makes Anthropic’s data a reference point rather than a liability.

Each of the four frontier labs released a prompt injection disclosure, but the disclosures are not consistent with one another. Anthropic’s disclosure, dated May 28, spans 244 pages and covers four agentic surfaces. OpenAI, in contrast, reported on one surface, connectors. Google shifted its focus from the model card to a separate safety framework, and Meta did not release a closed-model card at all. The accompanying Cross-Vendor Prompt Injection Disclosure Grid outlines what each lab tested and measured, and highlights the discrepancies that undermine direct comparisons.

A prompt injection embeds malicious instructions in content that an agent processes, such as a web page or document, and can cause the agent to take unauthorized actions. For buyers, the vendors’ disclosure documents are the primary evidence of how well a model resists such attacks.

The lack of a uniform industry standard for evaluating prompt injections poses a significant challenge. According to Carter Rees, VP of AI at Reputation, prompt injection disrupts the foundational assumptions of legacy tools. He notes, “A phrase as innocuous as, ‘ignore previous instructions’ can carry a payload as devastating as a buffer overflow, yet it shares no commonality with known malware signatures.” As a result, each lab devises its own metrics, leading to inconsistent results.

Adam Meyers, Senior Vice President of Counter Adversary Operations at CrowdStrike, emphasizes that managing exposure now falls to the buyer. “As you implement AI, it increases your attack surface, so now you have to be able to protect those AI models against adversary misuse or data poisoning or prompt injection.” CrowdStrike’s 2026 Financial Services Threat Landscape Report indicates that adversaries are using AI to accelerate the time from initial access to impact, outpacing traditional defenses.

Anthropic’s Detailed Surface Analysis

Anthropic’s Opus 4.8 system card is the only one that breaks down prompt injection by surface, and the results vary significantly between surfaces. In a coding environment, attacks generated with Gray Swan’s Shade tool succeeded on 7.03% of single attempts with thinking enabled; safeguards reduced that to 2.09%.

When similar attacks targeted browser use, as in Claude in Chrome and Claude Cowork, the success rate rose. Anthropic tested 129 web environments and documented the outcomes in Table 5.2.2.4.A on page 81 of the system card. Without safeguards and with thinking enabled, the per-attempt rate was 31.5% for Opus 4.8, down from 50.7% for Sonnet 4.6. With safeguards activated, Opus 4.8’s rate fell to 0.5%, and with thinking disabled it reached zero across all environments.
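
A per-attempt rate understates exposure when an attacker can retry. The sketch below shows how such a rate would compound over repeated attempts. It assumes every attempt is independent, which is a simplifying assumption made here and not a claim from the system card; adaptive attackers learn between attempts, so measured multi-attempt figures can differ. The function name is hypothetical.

```python
def cumulative_success(per_attempt_rate: float, attempts: int) -> float:
    """Probability of at least one success in `attempts` independent tries.

    Independence is an assumption of this sketch, not a vendor claim.
    """
    if not 0.0 <= per_attempt_rate <= 1.0:
        raise ValueError("per_attempt_rate must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - per_attempt_rate) ** attempts


# Per-attempt browser rates quoted above for Opus 4.8, with thinking enabled.
RAW = 0.315
SAFEGUARDED = 0.005

for k in (1, 10, 100):
    print(
        f"{k:>3} attempts: raw {cumulative_success(RAW, k):.1%}, "
        f"safeguarded {cumulative_success(SAFEGUARDED, k):.1%}"
    )
```

Under that independence assumption, even the 0.5% safeguarded rate grows to roughly 39% over 100 attempts, which is why multi-attempt results are worth requesting alongside single-attempt ones.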

OpenAI’s Single-Surface Measurement

OpenAI’s GPT-5.5 card, released on April 23 and updated on April 24, addresses prompt injection through a single robustness score against known attacks on connectors. This score, where higher values indicate better robustness, dropped from 0.998 for GPT-5.4-thinking to 0.963 for GPT-5.5. By contrast, Anthropic tested four surfaces with adaptive attackers and conducted a one-week bug bounty with live red-team challenges.

Comparing OpenAI’s 0.963 robustness score with Anthropic’s 31.5% per-attempt success rate is misleading. The former measures resistance to known attacks on a single surface, where higher is better, while the latter measures how often adaptive attackers succeeded across 129 browser environments, where lower is better.

Google and Meta’s Lack of Specific Metrics

Google’s Gemini 3 model card addresses prompt injection under mitigations, claiming improved resistance without providing specific metrics. The associated Frontier Safety Framework report covers red teaming across capability domains but does not cover prompt injection.

Meta, which releases open weights, places prompt injection defenses in a separate system, Purple Llama’s LlamaFirewall. On the AgentDojo benchmark, LlamaFirewall reduced attack success from 17.6% without defenses to 1.75% with combined measures. However, these results evaluate the defenses rather than the model on relevant deployment surfaces.

The Cross-Vendor Prompt Injection Disclosure Grid

The following grid aids security teams in evaluating frontier models. Each row shows where the four labs’ disclosures diverge and direct comparison breaks down. Data for Anthropic is sourced from the Opus 4.8 system card, while the others rely on each vendor’s safety documentation.

Dimension	Anthropic, Opus 4.8	OpenAI, GPT-5.5	Google, Gemini 3.x	Meta, Llama stack
Safety document	System card, May 28 2026, 244 pages	System card, April 23 2026, updated April 24	Model card plus a separate Frontier Safety Framework report	No closed-model card. Open weights plus the Purple Llama stack
Injection benchmark or dataset	ART from Gray Swan and UK AISI, the Shade tool, plus an internal browser eval, 129 environments	Internal connectors evaluation, known attacks	None for injection	AgentDojo, 97 tasks
Surfaces with an injection eval	Four. Tool use, coding, computer use, browser	One. Connectors	None published for injection	One. AgentDojo agent tasks
Multi-attempt escalation shown	Yes. ART benchmark at 1, 10, 100. Coding and computer use at 1 and 200	No. A single score	No	No
Headline metric and unit	Attack-success rate. Browser, with thinking, 31.5% raw, 0.5% safeguarded	Robustness score, higher is better. 0.963, down from 0.998 for GPT-5.4-thinking	None published. Increased resistance claimed qualitatively	Attack-success rate on AgentDojo. 17.6% baseline to 1.75% combined
Live external bounty	Yes. One-week live injection bounty with external red-teamers	No injection bounty. Bio bounty only	None found	None found
Regression disclosed	Yes, explicit, with numbers	Number fell 0.998 to 0.963, not framed as a regression	Increased resistance claimed, no numbers	Not applicable

Five Considerations for Security Teams

Anthropic provided comprehensive testing across four surfaces, while OpenAI evaluated just one. Google did not disclose per-surface rates, and Meta focused on grading its defenses rather than the model itself. These varied disclosures don’t facilitate straightforward comparisons, but following these five steps can help build a comprehensive evaluation.

Identify and categorize all agents by interface: browser, code, connectors, or desktop. With safeguards on, Anthropic’s Opus 4.8 shows a 2.09% rate for coding and 0.5% for browser applications. A single generalized figure is inadequate. Obtain the vendor’s published rate for each surface. If one is unavailable, treat that surface as untested.

Share the Cross-Vendor grid with all evaluated vendors. A 0.963 connectors score and a 31.5% browser rate should not be directly compared. Request detailed per-surface attack success rates, both raw and with safeguards, along with the attack methodology. Blank cells indicate lack of first-party data.
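
One way to keep blank cells visible is to record the grid as data and list every gap explicitly. The following is a minimal sketch, not a vendor format: the class, field names, and the choice of rows are hypothetical, and only figures quoted in this article are filled in.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SurfaceDisclosure:
    """One vendor's published attack-success rates for one surface."""
    vendor: str
    surface: str
    raw_rate: Optional[float] = None          # per-attempt, no safeguards
    safeguarded_rate: Optional[float] = None  # per-attempt, safeguards on


def missing_cells(rows: list[SurfaceDisclosure]) -> list[str]:
    """List every blank cell; a blank means no first-party data, not zero."""
    gaps = []
    for row in rows:
        if row.raw_rate is None:
            gaps.append(f"{row.vendor}/{row.surface}: raw rate")
        if row.safeguarded_rate is None:
            gaps.append(f"{row.vendor}/{row.surface}: safeguarded rate")
    return gaps


grid = [
    # Figures quoted in this article; everything else is left blank.
    SurfaceDisclosure("Anthropic", "browser", raw_rate=0.315, safeguarded_rate=0.005),
    SurfaceDisclosure("Anthropic", "coding", raw_rate=0.0703, safeguarded_rate=0.0209),
    # OpenAI's 0.963 is a robustness score, not an attack-success rate,
    # so it is deliberately not entered as a rate here.
    SurfaceDisclosure("OpenAI", "connectors"),
    # Hypothetical row: the surface a buyer might deploy on.
    SurfaceDisclosure("Google", "browser"),
]

for gap in missing_cells(grid):
    print("request from vendor:", gap)
```

Using `None` rather than `0.0` for an unpublished figure keeps "not measured" from being read as "no successful attacks".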

Clarify in writing which metrics apply to your integration. Anthropic’s 0.5% figure pertains to Claude in Chrome and Cowork with full safeguards. The API lacks these protections. Do not accept product figures for API deployments.

Include two specific clauses in the RFP: one requiring that the vendor tested with adaptive attackers capable of rewriting payloads, and one requiring that external parties attempted to breach the model. Anthropic used Gray Swan’s Shade tool and conducted a one-week paid bounty. OpenAI relied on known attacks for one surface. Real-world adversaries will not limit themselves to known payloads.

Conduct an independent injection test prior to deploying any agent. Vendor figures are based on their system prompts and environments. Your configuration has its own prompts, permissions, and data access, requiring a unique evaluation. Set a pass threshold; anything exceeding it should not be deployed.
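
The pass threshold can be enforced mechanically once red-team trials are logged per surface. The sketch below is illustrative only: the function names, the trial data, and the 1% threshold are all hypothetical, and each team would substitute its own results and limit.

```python
def attack_success_rate(outcomes: list[bool]) -> float:
    """Fraction of injection attempts that hijacked the agent."""
    if not outcomes:
        raise ValueError("no trials recorded; treat the surface as untested")
    return sum(outcomes) / len(outcomes)


def deployment_gate(results: dict[str, list[bool]], threshold: float) -> dict[str, bool]:
    """Map each surface to True only if its measured rate is at or below threshold."""
    return {
        surface: attack_success_rate(outcomes) <= threshold
        for surface, outcomes in results.items()
    }


# Illustrative data only: True means the attack succeeded in that trial.
results = {
    "browser": [False] * 197 + [True] * 3,  # 3 of 200 attempts succeeded
    "connectors": [False] * 200,            # 0 of 200 attempts succeeded
}
PASS_THRESHOLD = 0.01  # hypothetical 1% ceiling; set your own

for surface, passed in deployment_gate(results, PASS_THRESHOLD).items():
    print(surface, "PASS" if passed else "BLOCK")
```

A surface with zero observed successes passes this gate, but that only bounds the rate for the attacks and trial count actually run; it does not show the rate is zero.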

In conclusion, without an established standard, a vendor’s figures reveal only what that vendor chose to measure. Your red team’s evaluation will expose your vulnerabilities.
