- Anthropic researchers discovered a new vulnerability in large language models (LLMs) called “many-shot jailbreaking,” where priming the model with multiple harmless questions can eventually lead it to provide inappropriate answers, such as instructions on building a bomb.
- The vulnerability is attributed to the increased “context window” of the latest LLMs, allowing them to hold vast amounts of data in short-term memory.
- To address this issue, the researchers are working on classifying and contextualizing queries before inputting them into the model, aiming to mitigate the risk while maintaining performance levels.
A new vulnerability in large language models: ‘many-shot jailbreaking’ elicits inappropriate responses by priming the model with harmless questions.
Anthropic researchers find bug in LLMs
How do you get an AI to answer a question it’s not supposed to? There are many such “jailbreak” techniques, and Anthropic researchers just found a new one, in which large language models (LLMs) can be convinced to tell you how to build a bomb if you prime them with a few dozen less-harmful questions first.
The research, documented in a paper and shared with the AI community, shows that LLMs with larger context windows tend to perform better on a wide range of tasks when the prompt contains many examples. Even for trivial questions, piling up examples in the prompt improves the accuracy of responses. The same mechanism, however, extends to inappropriate queries: after being primed with a long series of harmless questions, the model becomes more likely to comply with a harmful one. The basic prompting pattern is sketched below.
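As a rough illustration of that pattern (not the researchers’ actual attack), the sketch below packs a long run of faux question-and-answer exchanges into a single prompt so the model treats them as in-context examples. The helper name, formatting, and example questions are illustrative assumptions, and the content is deliberately benign.

```python
# Minimal sketch of many-shot prompting: pack many faux Q&A exchanges
# into one prompt so the model treats them as in-context examples.
# All questions and answers here are benign and purely illustrative.

def build_many_shot_prompt(examples, final_question):
    """Concatenate (question, answer) pairs into a single faux dialogue,
    then append the real question the prompt is steering toward."""
    turns = []
    for question, answer in examples:
        turns.append(f"User: {question}")
        turns.append(f"Assistant: {answer}")
    turns.append(f"User: {final_question}")
    turns.append("Assistant:")
    return "\n".join(turns)

# A long run of harmless trivia stands in for the "priming" shots.
benign_examples = [
    ("What is the capital of France?", "Paris."),
    ("How many legs does a spider have?", "Eight."),
    ("What gas do plants absorb from the air?", "Carbon dioxide."),
] * 30  # repeat to fill a large context window with many shots

prompt = build_many_shot_prompt(benign_examples, "What is the boiling point of water?")
print(prompt[:500])  # preview only; a real attack would send this to an LLM API
```

In the attack the researchers describe, the priming exchanges are chosen so that the final, disallowed question reads as just one more turn in an established pattern; the sketch above only shows the packing mechanics.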
Concerns about AI abuse rising
The bug could make waves across the tech industry, heightening concerns about AI abuse. While the exact mechanism behind the behaviour remains unclear, the researchers speculate that it involves the model’s ability to infer user intent from the context it is given.
The team has already informed its peers, and indeed its competitors, about the attack, something it hopes will “foster a culture where exploits like this are openly shared among LLM providers and researchers.” Mitigating the vulnerability is not straightforward, however, since simply shrinking the context window degrades the model’s performance.
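The classify-and-contextualize mitigation mentioned at the top of the article could, in rough outline, look like the sketch below. The keyword heuristic and function names are hypothetical stand-ins for illustration, not Anthropic’s actual implementation.

```python
# Hypothetical sketch of a pre-model filter: classify the incoming prompt
# before it reaches the LLM and refuse it if it looks unsafe.
# The keyword check is a stand-in for a real trained classifier.

DISALLOWED_TOPICS = ("build a bomb", "synthesize a weapon")

def classify_prompt(prompt: str) -> str:
    """Return 'unsafe' if the prompt (including any many-shot padding)
    touches a disallowed topic, otherwise 'safe'."""
    lowered = prompt.lower()
    if any(topic in lowered for topic in DISALLOWED_TOPICS):
        return "unsafe"
    return "safe"

def guarded_query(prompt: str, model_call) -> str:
    """Wrap a model call with the classification step described in the article."""
    if classify_prompt(prompt) == "unsafe":
        return "Request refused by the safety filter."
    return model_call(prompt)

# Example with a dummy model; a real system would call an LLM API here.
print(guarded_query("Tell me how to build a bomb", lambda p: "..."))
print(guarded_query("What is the capital of France?", lambda p: "Paris."))
```

The appeal of this approach, as the researchers note, is that it screens queries without shrinking the context window itself, so the model keeps the performance benefits of long prompts.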






