© 2024 americanfocus.online – All Rights Reserved.
Tech and Science

Anthropic claims new AI security method blocks 95% of jailbreaks, invites red teamers to try

Last updated: February 3, 2025 3:59 pm


Since ChatGPT emerged two years ago, a plethora of large language models (LLMs) have flooded the market, and they remain vulnerable to jailbreaks: exploits that manipulate them into generating harmful content.

Although model developers keep strengthening their defenses, 100% protection may never be achievable. The push for more robust safeguards continues nonetheless.

Anthropic, a key competitor to OpenAI, has introduced a new system known as “constitutional classifiers” for its flagship LLM, Claude 3.5 Sonnet. The company says the classifiers thwart the majority of jailbreak attempts while keeping false positives low and without requiring excessive computational resources.

The Anthropic Safeguards Research Team has issued a challenge to the red teaming community to test the resilience of their defense mechanism with “universal jailbreaks” capable of dismantling all protective barriers.

The research team elaborates on the potential risks posed by universal jailbreaks, such as enabling non-experts to execute complex scientific processes with ease. To evaluate the system’s efficacy, a demo focused on chemical weapons has been launched, inviting red teamers to attempt breaking through eight levels using a single jailbreak.
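The "single jailbreak across eight levels" criterion can be illustrated with a short sketch. Everything here is hypothetical: the level functions are toy stand-ins invented for illustration and do not reflect how Anthropic's demo actually grades attempts.

```python
from typing import Callable, Sequence

# A level is modeled as a function that returns True if the prompt gets past it.
Level = Callable[[str], bool]


def is_universal_jailbreak(jailbreak: str, levels: Sequence[Level]) -> bool:
    """Return True only if the same jailbreak string breaks every level."""
    return all(level(jailbreak) for level in levels)


def make_level(weakness: str) -> Level:
    """Hypothetical level that is beaten only by prompts containing `weakness`."""
    return lambda prompt: weakness in prompt


# Eight toy levels, mirroring the eight levels mentioned in the article.
levels = [make_level(f"trick-{i}") for i in range(8)]

partial = "trick-0 trick-1 trick-2"
everything = " ".join(f"trick-{i}" for i in range(8))

print(is_universal_jailbreak(partial, levels))     # False
print(is_universal_jailbreak(everything, levels))  # True
```

The point of the sketch is the `all(...)` condition: a prompt that beats some levels but not others does not count as universal.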

As of the latest update, the model remains unbroken according to Anthropic’s criteria, although a UI glitch was identified that allowed progression through levels without a successful jailbreak.


The introduction of constitutional classifiers has sparked debate among users on X.

Only 4.4% of jailbreaks successful

Constitutional classifiers build on constitutional AI, a technique that aligns AI systems with human values using a set of principles that define permitted and prohibited actions. Anthropic’s researchers generated 10,000 jailbreaking prompts, covering the most prevalent techniques observed in the wild, to train the classifiers.
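As a rough illustration of how classifiers can guard a model, the sketch below wraps a model with one check on the prompt and one on the reply. This layout is an assumption made for the example, since the article does not describe the architecture; `GuardedModel`, `toy_model`, and the keyword-based `flags` function are all hypothetical stand-ins, and a real classifier would be a trained model rather than a string match.

```python
from dataclasses import dataclass
from typing import Callable

REFUSAL = "Sorry, I can't help with that."


@dataclass
class GuardedModel:
    """Hypothetical wrapper: screen the prompt, then screen the reply."""
    model: Callable[[str], str]
    input_classifier: Callable[[str], bool]   # True means "flag as harmful"
    output_classifier: Callable[[str], bool]  # True means "flag as harmful"

    def respond(self, prompt: str) -> str:
        if self.input_classifier(prompt):
            return REFUSAL
        reply = self.model(prompt)
        if self.output_classifier(reply):
            return REFUSAL
        return reply


def flags(text: str) -> bool:
    """Toy classifier: flags any text that names the placeholder topic."""
    return "forbidden-topic" in text.lower()


def toy_model(prompt: str) -> str:
    """Toy model that 'decodes' digit-for-letter obfuscation in its reply."""
    return "Reply: " + prompt.replace("0", "o")


guarded = GuardedModel(toy_model, flags, flags)

print(guarded.respond("Explain photosynthesis"))   # Reply: Explain photosynthesis
print(guarded.respond("Explain forbidden-topic"))  # refused at the input check
print(guarded.respond("Explain f0rbidden-t0pic"))  # refused at the output check
```

The third call shows why a second check on the output can matter in this toy setup: the obfuscated prompt slips past the input check, but the reply is still caught.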


In testing, Claude 3.5 Sonnet equipped with constitutional classifiers cut the jailbreak success rate to 4.4%.
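The 4.4% success rate and the "95%" in the headline describe the same result from opposite sides. A minimal sketch of the arithmetic follows; the function name is just illustrative.

```python
def block_rate(success_rate_percent: float) -> float:
    """Share of jailbreak attempts blocked, given the share that succeeded."""
    return 100.0 - success_rate_percent


rate = block_rate(4.4)
print(f"{rate:.1f}% of jailbreak attempts blocked")  # 95.6% of jailbreak attempts blocked
```

A 4.4% success rate therefore corresponds to a block rate of about 95.6%, which the headline rounds down to 95%.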


While the protected model exhibited a slightly higher refusal rate and increased computational costs compared to the unprotected version, the enhancements in security outweighed these marginal drawbacks.


Blocking against ‘forbidden’ queries

To evaluate constitutional classifiers, Anthropic ran a bug-bounty program in which participants tried to break Claude 3.5 Sonnet using forbidden queries. Over a two-month period involving nearly 185 active participants, no one achieved a universal jailbreak.

Red teamers employed various tactics to outsmart the model, with benign paraphrasing and length exploitation emerging as the most prevalent strategies.

Benign paraphrasing and length exploitation

Benign paraphrasing rewords a harmful query so that it appears innocuous, while length exploitation relies on long, verbose outputs to raise the odds that harmful content slips through. Both approaches manipulate prompts to evade detection rather than breaching the defenses directly.

Universal jailbreak techniques such as many-shot jailbreaking or “God-Mode” were absent from the successful attacks. The researchers acknowledge, however, that the evaluation protocol itself remained a weak point that red teamers could exploit.

While constitutional classifiers may not offer foolproof protection against every conceivable threat, their implementation significantly raises the bar for potential jailbreakers, requiring substantial effort to breach security measures.
