Since the emergence of ChatGPT two years ago, a plethora of large language models (LLMs) have flooded the market, many of which remain vulnerable to jailbreaks: exploits that manipulate them into generating harmful content.
Despite developers' ongoing efforts to strengthen defenses, 100% protection may simply be unattainable. Even so, the pursuit of more robust security continues.
Anthropic, a key competitor to OpenAI, has introduced a new system called "constitutional classifiers" for its premier LLM, Claude 3.5 Sonnet. The company says the classifiers block the majority of jailbreak attempts while keeping false positives low and avoiding excessive computational overhead.
The Anthropic Safeguards Research Team has issued a challenge to the red teaming community to test the resilience of their defense mechanism with "universal jailbreaks" capable of dismantling all protective barriers.
The research team elaborates on the potential risks posed by universal jailbreaks, such as enabling non-experts to execute complex scientific processes with ease. To evaluate the system's efficacy, a demo focused on chemical weapons has been launched, inviting red teamers to attempt breaking through eight levels using a single jailbreak.
As of the latest update, the model remains unbroken according to Anthropic's criteria, although a UI glitch was identified that allowed progression through levels without a successful jailbreak.
The introduction of constitutional classifiers has sparked debates among users, particularly those from the X community.
Only 4.4% of jailbreaks successful
Constitutional classifiers operate on the principles of constitutional AI, which aligns AI systems with human values by defining which actions are permitted and which are prohibited. To train the classifiers, Anthropic's researchers generated 10,000 jailbreaking prompts covering the most prevalent techniques observed in the wild.
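Conceptually, the approach wraps the model with classifiers that screen both the incoming prompt and the outgoing response against the constitution's rules. The sketch below is a minimal, hypothetical illustration of that two-sided screening flow; the keyword check and every function name are toy stand-ins and do not reflect Anthropic's actual code.

```python
# Toy illustration of a classifier-guarded model call. This is NOT Anthropic's
# implementation: the function names, keyword list, and blocking logic are
# placeholder assumptions standing in for trained input/output classifiers.

DISALLOWED_PHRASES = ["nerve agent synthesis", "weaponize a pathogen"]  # stand-in for a learned policy


def classifier_flags(text: str) -> bool:
    """Toy stand-in for a trained constitutional classifier: flag any text
    that matches a disallowed phrase."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in DISALLOWED_PHRASES)


def model_generate(prompt: str) -> str:
    """Placeholder for the underlying LLM call."""
    return f"[model response to: {prompt}]"


def guarded_completion(prompt: str) -> str:
    # Input classifier: screen the prompt before it ever reaches the model.
    if classifier_flags(prompt):
        return "Request refused by the input classifier."
    draft = model_generate(prompt)
    # Output classifier: screen the response before it reaches the user.
    if classifier_flags(draft):
        return "Response withheld by the output classifier."
    return draft


print(guarded_completion("Summarize how constitutional classifiers work."))
```

In the real system, the screening is done not by keyword filters but by classifiers trained on the synthetically generated jailbreak prompts described above.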
In testing, Claude 3.5 Sonnet equipped with constitutional classifiers reduced the jailbreak success rate to just 4.4%, a substantial improvement over the unprotected model.
While the protected model exhibited a slightly higher refusal rate and increased computational costs compared to the unprotected version, the enhancements in security outweighed these marginal drawbacks.
Blocking "forbidden" queries
To evaluate the classifiers further, Anthropic ran a bug-bounty program in which participants attempted to breach the protected Claude 3.5 Sonnet using forbidden queries. Despite exhaustive efforts by nearly 185 active participants over a two-month period, no universal jailbreak was successfully executed.
Red teamers employed various tactics to outsmart the model, with benign paraphrasing and length exploitation emerging as the most prevalent strategies.
Benign paraphrasing and length exploitation
Rather than attacking the classifiers head-on, red teamers leaned on benign paraphrasing (rewording a restricted request in innocuous-sounding terms) and length exploitation (padding prompts or outputs with excessive detail), manipulating their inputs so the harmful intent evades detection rather than directly breaching security protocols.
Notably, none of the successful attempts relied on known universal jailbreak techniques such as many-shot jailbreaking or "God-Mode"; the researchers also acknowledge that the evaluation protocol itself proved to be an exploitable weak point.
Constitutional classifiers may not offer foolproof protection against every conceivable threat, but they raise the bar considerably, forcing would-be jailbreakers to invest substantially more effort to break through.