This spring, Anthropic published the highest prompt injection figures of any frontier lab. When its latest model, Claude Opus 4.8, was tested in a browser, red-team attackers succeeded on 31.5% of attempts before any safeguards were activated. OpenAI, Google, and Meta did not publish comparable figures, which makes Anthropic’s data a reference point rather than a liability.
Each of the four frontier labs addresses prompt injection somewhere in its safety documentation, but no two disclosures are consistent with one another. Anthropic’s disclosure, dated May 28, spans 244 pages and covers four agentic surfaces. OpenAI, in contrast, reported on one surface, connectors. Google shifted its focus from the model card to a separate safety framework, whereas Meta did not release a closed-model card at all. The accompanying Cross-Vendor Prompt Injection Disclosure Grid lays out what each lab tested and measured, highlighting the discrepancies that undermine direct comparisons.
A prompt injection embeds malicious instructions in content that an agent processes, such as a web page or a document. A successful injection can cause the agent to take unauthorized actions, so buyers rely on the disclosure cards as primary evidence of a vendor’s security measures.
The lack of a uniform industry standard for evaluating prompt injections poses a significant challenge. According to Carter Rees, VP of AI at Reputation, prompt injection disrupts the foundational assumptions of legacy tools. He notes, “A phrase as innocuous as, ‘ignore previous instructions’ can carry a payload as devastating as a buffer overflow, yet it shares no commonality with known malware signatures.” As a result, each lab devises its own metrics, leading to inconsistent results.
Adam Meyers, Senior Vice President of Counter Adversary Operations at CrowdStrike, emphasizes that managing exposure now falls to the buyer. “As you implement AI, it increases your attack surface, so now you have to be able to protect those AI models against adversary misuse or data poisoning or prompt injection.” CrowdStrike’s 2026 Financial Services Threat Landscape Report indicates that adversaries are using AI to accelerate the time from initial access to impact, outpacing traditional defenses.
Anthropic’s Detailed Surface Analysis
Anthropic’s Opus 4.8 system card is the only one that breaks prompt injection results out by surface, and the results vary widely. In a coding environment, Gray Swan’s Shade tool succeeded on 7.03% of single attempts with thinking enabled; safeguards reduced that to 2.09%.
When similar attacks targeted browser surfaces, such as Claude in Chrome and Claude Cowork, the raw success rate was higher. Anthropic tested 129 web environments and documented the outcomes in Table 5.2.2.4.A on page 81 of the system card. The per-attempt rate, without safeguards and with thinking enabled, decreased from Sonnet 4.6’s 50.7% to Opus 4.8’s 31.5%. With safeguards activated, Opus 4.8’s rate fell to 0.5%, and with thinking disabled, it reached zero across all environments.
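A per-attempt rate understates what a persistent attacker can achieve. As a rough illustration only, the sketch below shows how a per-attempt success rate compounds over repeated tries under the simplifying assumption that attempts are independent. The function name `cumulative_success` and the independence assumption are ours, not Anthropic’s; the system card’s measured multi-attempt figures are not reproduced here, and adaptive attackers may do better or worse than this model predicts.

```python
def cumulative_success(per_attempt_rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` tries succeeds,
    assuming each try succeeds independently (a simplification)."""
    if not 0.0 <= per_attempt_rate <= 1.0:
        raise ValueError("per_attempt_rate must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return 1.0 - (1.0 - per_attempt_rate) ** attempts


# Browser per-attempt rates quoted above for Opus 4.8 with thinking enabled.
raw_rate = 0.315
safeguarded_rate = 0.005

for k in (1, 10, 100):
    print(f"{k:>3} attempts: raw {cumulative_success(raw_rate, k):.1%}, "
          f"safeguarded {cumulative_success(safeguarded_rate, k):.1%}")
```

Under that assumption, even a 0.5% per-attempt rate compounds to roughly 39% over 100 tries, which is why multi-attempt figures matter when reading a card.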
OpenAI’s Single-Surface Measurement
OpenAI’s GPT-5.5 card, released on April 23 and updated on April 24, addresses prompt injection through a single robustness score for known attacks on connectors. This score, where higher values indicate better robustness, dropped from 0.998 for GPT-5.4-thinking to 0.963 for GPT-5.5. Anthropic, by contrast, tested four surfaces with adaptive attackers and conducted a one-week bug bounty with live red-team challenges.
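Comparing OpenAI’s 0.963 robustness score with Anthropic’s 31.5% per-attempt success rate is misleading. The two run in opposite directions (higher is better for the former, lower is better for the latter), and they measure different things: the former captures resistance to known attacks on a single surface, while the latter reflects attack success across multiple environments with adaptive attackers.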
Google and Meta’s Lack of Specific Metrics
Google’s Gemini 3 model card addresses prompt injection under mitigations, claiming improved resistance without providing specific metrics. The associated Frontier Safety Framework report covers red teaming across capability domains but does not include prompt injection.
Meta, which releases open weights, places prompt injection defenses in a separate system, Purple Llama’s LlamaFirewall. On the AgentDojo benchmark, LlamaFirewall reduced attack success from 17.6% without defenses to 1.75% with combined measures. However, these results evaluate the defenses rather than the model itself on the surfaces where it will be deployed.
The Cross-Vendor Prompt Injection Disclosure Grid
The following grid aids security teams in evaluating frontier models. Each row highlights disparities among the four labs, where direct comparisons falter. Data for Anthropic is sourced from the Opus 4.8 system card, while the others rely on each vendor’s safety documentation.
| Dimension | Anthropic, Opus 4.8 | OpenAI, GPT-5.5 | Google, Gemini 3.x | Meta, Llama stack |
| --- | --- | --- | --- | --- |
| Safety document | System card, May 28 2026, 244 pages | System card, April 23 2026, updated April 24 | Model card plus a separate Frontier Safety Framework report | No closed-model card. Open weights plus the Purple Llama stack |
| Injection benchmark or dataset | ART from Gray Swan and UK AISI, the Shade tool, plus an internal browser eval, 129 environments | Internal connectors evaluation, known attacks | None for injection | AgentDojo, 97 tasks |
| Surfaces with an injection eval | Four. Tool use, coding, computer use, browser | One. Connectors | None published for injection | One. AgentDojo agent tasks |
| Multi-attempt escalation shown | Yes. ART benchmark at 1, 10, 100. Coding and computer use at 1 and 200 | No. A single score | No | No |
| Headline metric and unit | Attack-success rate. Browser, with thinking, 31.5% raw, 0.5% safeguarded | Robustness score, higher is better. 0.963, down from 0.998 for GPT-5.4-thinking | None published. Increased resistance claimed qualitatively | Attack-success rate on AgentDojo. 17.6% baseline to 1.75% combined |
| Live external bounty | Yes. One-week live injection bounty with external red-teamers | No injection bounty. Bio bounty only | None found | None found |
| Regression disclosed | Yes, explicit, with numbers | Number fell 0.998 to 0.963, not framed as a regression | Increased resistance claimed, no numbers | Not applicable |
Five Considerations for Security Teams
Anthropic provided comprehensive testing across four surfaces, while OpenAI evaluated just one. Google did not disclose per-surface rates, and Meta focused on grading its defenses rather than the model itself. These varied disclosures don’t facilitate straightforward comparisons, but following these five steps can help build a comprehensive evaluation.
Identify and categorize all agents by the surface they operate on: browser, code, connectors, or desktop. With safeguards enabled, Anthropic’s Opus 4.8 shows a 2.09% rate for coding and 0.5% for browser applications, so a single generalized figure is inadequate. Obtain the vendor’s published rate for each surface. If none is available, treat that surface as untested.
Share the Cross-Vendor grid with all evaluated vendors. A 0.963 connectors score and a 31.5% browser rate should not be directly compared. Request detailed per-surface attack success rates, both raw and with safeguards, along with the attack methodology. Blank cells indicate lack of first-party data.
Clarify in writing which metrics apply to your integration. Anthropic’s 0.5% figure pertains to Claude in Chrome and Cowork with full safeguards. The API lacks these protections. Do not accept product figures for API deployments.
Include two specific clauses in the RFP: one requiring that the vendor tested with adaptive attackers capable of rewriting payloads, and one requiring that external parties attempted to breach the model. Anthropic used Gray Swan’s Shade tool and conducted a one-week paid bounty. OpenAI relied on known attacks for one surface. Real-world adversaries will not limit themselves to known payloads.
Conduct an independent injection test prior to deploying any agent. Vendor figures are based on their system prompts and environments. Your configuration has its own prompts, permissions, and data access, requiring a unique evaluation. Set a pass threshold; anything exceeding it should not be deployed.
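The first and last steps lend themselves to a simple script. The sketch below is a minimal illustration, assuming a hypothetical inventory of agents, your own red-team attempt counts, and a pass threshold you pick. The names `VENDOR_RATES`, `AgentResult`, and `deployment_decision`, the example agents and their counts, and the 1% threshold are all placeholders rather than anything a vendor publishes. The only real figures are the two safeguarded Opus 4.8 rates cited in this article.

```python
from dataclasses import dataclass
from typing import Optional

# Vendor-published safeguarded attack-success rates by surface.
# Only the two figures cited in this article are filled in; a missing
# surface means no published rate, which is treated as untested.
VENDOR_RATES = {"coding": 0.0209, "browser": 0.005}


@dataclass
class AgentResult:
    name: str        # hypothetical agent name in your inventory
    surface: str     # browser, coding, connectors, or desktop
    attempts: int    # injection attempts run by your own red team
    successes: int   # attempts where the injected instruction was followed

    @property
    def measured_rate(self) -> Optional[float]:
        """Attack-success rate from your own test, or None if not run."""
        if self.attempts == 0:
            return None
        return self.successes / self.attempts


def deployment_decision(agent: AgentResult, threshold: float) -> str:
    """Gate on your own measurement; vendor figures are context only."""
    vendor = VENDOR_RATES.get(agent.surface)
    if vendor is None:
        vendor_note = "vendor rate untested"
    else:
        vendor_note = f"vendor rate {vendor:.2%}"
    rate = agent.measured_rate
    if rate is None:
        return f"{agent.name}: BLOCK (no independent test; {vendor_note})"
    verdict = "PASS" if rate <= threshold else "BLOCK"
    return f"{agent.name}: {verdict} (measured {rate:.2%}; {vendor_note})"


# Hypothetical inventory and results, for illustration only.
inventory = [
    AgentResult("support-browser-agent", "browser", attempts=400, successes=2),
    AgentResult("crm-connector-agent", "connectors", attempts=400, successes=9),
    AgentResult("desktop-helper", "desktop", attempts=0, successes=0),
]

for agent in inventory:
    print(deployment_decision(agent, threshold=0.01))
```

The gate deliberately keys on the measured rate from your own configuration. The vendor figure appears only as context, and an agent with no independent test is blocked regardless of what the vendor reports.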
Without an established standard, a vendor’s figures reveal only what that vendor chose to measure. Your own red team’s evaluation is what will expose your vulnerabilities.

