GPT-4o Mini tackles the ‘ignore all previous instructions’ trick

OpenAI has introduced GPT-4o Mini, which employs “instruction hierarchy” safety technique to protect chatbots from deceptive commands.
OpenAI’s update to GPT-4o Mini is particularly timely given the ongoing debates about AI safety and transparency, with internal and external calls for improved practices.

OUR TAKE
Amidst the rapid development of AI technology, how to ensure its safety and reliability has been the focus of the industry’s attention. Recently, OpenAI launched its latest model, GPT-4o Mini, which aims to address a long-standing technical challenge: preventing chatbots from being manipulated by malicious commands. This innovation not only demonstrates the advancement of AI in self-protection capabilities, but also reflects the efforts of tech companies to enhance user experience and secure data.
–Elodie Qian, BTW reporter

What happened

OpenAI has introduced GPT-4o Mini, a new model that tackles the “ignore all previous instructions” trick. This model employs a safety technique called “instruction hierarchy”, which boosts a model’s defenses against misuse and unauthorised instructions. The models with the technique prioritise the original developer’s prompts over any user attempts to deceive it.

Olivier Godement, who leads the API platform product at OpenAI, explained that instruction hierarchy will prevent the meme’d prompt injections (aka tricking the AI with sneaky commands) we see all over the internet.

“It basically teaches the model to really follow and comply with the developer system message,” Godement said. When asked if that means this should stop the ‘ignore all previous instructions’ attack, Godement responded, “That’s exactly it.”

“If there is a conflict, you have to follow the system message first. And so we’ve been running [evaluations], and we expect that that new technique to make the model even safer than before,” he added.

This innovation aligns with OpenAI’s goal of developing fully automated digital agents. The company announced recently it’s close to building such agents. The instruction hierarchy method is deemed essential for ensuring safety before these agents are deployed at scale. Without such measures, there’s a risk that an agent, intended for benign tasks like email writing, could be manipulated to perform harmful actions, such as leaking sensitive information.

Also read: OpenAI releases GPT-4o Mini, a cheaper version of AI model

Also read: Hacker breaches OpenAI, steals internal AI technology details

Why it’s important

The existing Large Language Models, as the research paper explains, do not distinguish between user prompts and system instructions. GPT-4o Mini’s instruction hierarchy elevates system instructions, giving them the highest priority, while misaligned prompts are downgraded. The model is trained to identify and ignore harmful prompts, responding with an inability to assist.

“We envision other types of more complex guardrails should exist in the future, especially for agentic use cases, e.g., the modern Internet is loaded with safeguards that range from web browsers that detect unsafe websites to ML-based spam classifiers for phishing attempts,” the research paper says.

OpenAI’s update to GPT-4o Mini is a significant step towards enhancing AI safety. This move is particularly timely given the ongoing debates about AI safety and transparency, with internal and external calls for improved practices.

There was an open letter from current and former employees at OpenAI demanding better safety and transparency practices, the team responsible for keeping the systems aligned with human interests (like safety) was dissolved, and Jan Leike, a key OpenAI researcher who resigned, wrote in a post that “safety culture and processes have taken a backseat to shiny products” at the company.

As trust in AI’s reliability is paramount, OpenAI’s focus on safety features is essential for rebuilding confidence and enabling AI to assume more critical roles in managing our digital lives. This commitment to safety is a crucial step in the journey towards AI that is both reliable and trustworthy.

OpenAI’s latest model tackles the ‘ignore all previous instructions’ trick

Africa faces unresolved governance challenges before any CAIGA model can take shape

Switzerland’s railways shift to VoLTE as 3G shutdown looms

BT unveils ‘sovereign’ data platform to bolster UK’s data security and AI readiness