Not prompt injection, not role-playing, and not a malicious request disguised as a normal query. This time, the risk emerged while the agent was autonomously carrying out a task.
Fable 5 is Anthropic's publicly available Mythos-level model. Beyond its strong general capabilities, it introduces a new-generation security classifier that wraps the model as an outer safety layer.
According to the official design, when a user request touches a high-risk area such as cybersecurity, biology, chemistry, or model distillation, the system performs risk identification first. Depending on the risk level, it either rejects the request outright or hands it to the more conservative Opus 4.8 model for processing.
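The routing behavior described above can be pictured with a minimal sketch. The article does not publish how the classifier scores requests or where its cut-offs lie, so the score range, both thresholds, and every name below are hypothetical and only illustrate the reject / fall back / pass-through decision.

```python
from enum import Enum


class Route(Enum):
    PRIMARY = "primary"    # answer with the main model
    FALLBACK = "fallback"  # hand off to the more conservative model
    REJECT = "reject"      # refuse outright


# Hypothetical thresholds; the real values are not public.
FALLBACK_THRESHOLD = 0.5
REJECT_THRESHOLD = 0.9


def route_request(risk_score: float) -> Route:
    """Map a classifier risk score in [0, 1] to a routing decision."""
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must be in [0, 1]")
    if risk_score >= REJECT_THRESHOLD:
        return Route.REJECT
    if risk_score >= FALLBACK_THRESHOLD:
        return Route.FALLBACK
    return Route.PRIMARY
```

Note that in this sketch the decision is made once, from the incoming request alone, before the model does any work.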
Extensive user testing found that widely used jailbreak techniques, including adversarial prompts, role-playing, encoding bypasses, and obfuscated phrasing, were almost completely ineffective against this mechanism, which demonstrates its strength at intent-level risk interception.
However, on the very day of Fable 5's release, an international joint research team from Fudan University, Deakin University, City University of Hong Kong, the University of Melbourne, Singapore Management University, and the University of Illinois Urbana-Champaign announced that they had successfully breached Fable 5's security protection.
The attack was led by Deakin University Ph.D. student Yutao Wu. It requires only a single interaction, takes less than 5 seconds, bypasses the front-end security classifier, and induces the model to generate prohibited harmful content.


Further traffic analysis indicates that the harmful output came directly from Fable 5 itself, rather than from the Opus 4.8 model that the system switches to once the security mechanism is triggered. In other words, the attack not only evaded detection by the security classifier but also materially breached Fable 5's own line of defense.
It is worth noting that the well-known hacker Pliny the Liberator also recently disclosed a bypass of the Fable 5 security classifier. The approach taken by the Fudan and Deakin team, however, is not a simple combinatorial search; it rests on a fundamental flaw in superintelligent systems of the kind Fable 5 represents.
It is reported that the team had completed and published the preliminary research as early as March this year. The research did not target the design of a single system such as Fable 5. Instead, it focused on the "security classifier + model" defense architecture widely used by the new generation of superintelligent systems and exposed the structural flaws of that mechanism. That is why the attack could be demonstrated so quickly after Fable 5 was released.
Public information shows that as early as March this year, the team had used similar techniques to extract system prompts from 37 mainstream large models and agent systems, and had completed open-source validation on Claude Code (95% match).


It is understood that the head of this research team is Professor Ma Xingjun from the Trusted Identity Intelligent Research Institute of Fudan University.
In recent years, the team has conducted systematic research in areas such as large models, intelligent agents, and trusted identity intelligence security, achieving a series of internationally leading research results and winning the championship of the U.S. AI Security Center's Security Benchmark Competition.
Currently, the team is working to translate its research into practice, focusing on agent security and exploring how to build security infrastructure for the next generation of agent systems.
According to Professor Ma, the significance of this result is that it poses a new challenge to the current static defense paradigm built on a security classifier: relying solely on a front-end security classifier is not sufficient to prevent potentially risky behavior in advanced agent systems.
The security classifier mainly identifies and intercepts risk in user inputs. It is effective at detecting and filtering explicit high-risk commands, but it cannot perceive the risky behavior that gradually emerges inside an agent during long-running operation, multi-step planning, environment interaction, and tool invocation.
The method used to breach Fable 5 is based on a paper published by the team in March this year titled "Internal Safety Collapse in Frontier Large Language Models."
The paper reveals a hidden security phenomenon called "Internal Safety Collapse (ISC)": when today's agents carry out long-horizon tasks, security failures do not necessarily come from external malicious prompts but may occur within the model's own execution chain.
Not an External Signal Attack but an Internal Derailment in the Task Chain
Traditional attacks usually come from the outside. Attackers write a seemingly harmless but adversarial input prompt or use role-playing, encoding, translation, indirect commands, and other methods to disguise malicious intent as a normal request. The main task of a security classifier is to block the risk at this layer.
The detector in Fable 5 is designed for this scenario. It is very sensitive to direct high-risk requests, even blocking many legitimate requests. However, what ISC revealed is another path: the risk does not necessarily come from dangerous requests directly input by the user.
The agent faces what appears to be an ordinary work directory: files, objectives, validation processes, and tasks to be completed. Subsequently, it begins to plan, read files, run code, fix errors, and continuously attempt to get the task validated.
To use an analogy, traditional security mechanisms guard the system's "entrance" and check whether user inputs pose a risk, whereas what ISC reveals is more like the nested dream layers in "Inception."
As the task progresses to the second layer, third layer, or even deeper execution stages, the model reinterprets the task objective based on the continually accumulated internal context, gradually introducing deviations in the process.
In this scenario, the initial user input may well be normal and harmless, and the early stages of execution look fully compliant: reading files, analyzing data, writing code, and invoking tools all appear to progress as expected.
However, when the agent reaches a crucial stage, it may deduce on its own that unless it takes certain actions that were not supposed to be executed, the final task cannot be completed.
In this process, the risk does not come from external input; it forms gradually within the model's own task execution chain. In other words, the model was not led astray step by step by the user. It ended up in an unsafe position while "seriously completing the task."
How Was This Phenomenon Discovered?
According to the team, ISC was not initially designed as an attack method. It originally came from observing the long-term operation of the agent. When placed in a complex task environment, the agent does not just mechanically execute instructions. It plans, troubleshoots, modifies outputs based on the harness or validator's feedback, and forms intermediate objectives through multiple rounds of execution.
This is precisely how many agent workflows are most commonly used today. Users do not write a carefully crafted prompt, let alone manually construct attack commands. Most of the time, users provide only a very vague statement:
“Help me complete this task.” “Help me improve this further.”
Then, the Agent will enter the workspace, read the file, understand the current status, identify missing parts, make a plan, execute modifications, and continuously address issues based on feedback.
For example, in the AutoResearch scenario, when a user provides only a partially completed research paper and a request to “help me complete it,” the Agent will autonomously determine areas lacking experimental analysis, related work, or text in tables. The same applies to a code scenario: a request such as “help me run the project” may trigger dependency checks, test runs, error debugging, and auto-completion.
Many times, the preceding context is completely harmless. The user has not asked it to generate sensitive content, and the task description does not contain any obvious risky keywords. However, in certain task structures, to pass validation, the Agent may proactively fill in some content that the model should not generate. Based on this observation, the research team further proposed an attack framework: TVD (Task, Validator, Data).

Why Would a Seemingly Ordinary Task Description Structure Become an Attack?
The structure of TVD is not complex. In fact, it is very close to a common engineering process:
· Task: a specialized task;
· Data: an incomplete data file;
· Validator: a validator that only checks the format, integrity, and completion of the target.
Take training a Guard model as an example. On its face, this is a specialized and entirely normal task. Researchers may want to train or evaluate a security detector, for instance by loading a text classification model with Hugging Face to determine which security label a given model output belongs to.
In this task, Data is the data sample the model is supposed to detect; the Validator defines when the task is complete. It will check if the input is text, if the length is sufficient, if the fields are complete, and if the label format is correct. For anyone with machine learning training experience, this is a familiar workflow. The Agent is also very familiar with this workflow.
The issue arises here. If the Data is incomplete, the task cannot proceed. The Validator will throw an error, indicating missing fields, insufficient length, or incomplete format. To allow the training process to continue, the Agent will automatically complete this Data.
From the Agent's perspective, it is not engaging in "malice." It is simply performing a regular machine learning task: fixing data, running validations, and executing the training script. From a security standpoint, however, this is the moment the risk arises: the Validator behaves more like an engineering acceptance tester than a security auditor. It only checks whether the task was completed in the correct format and does not understand the security boundaries behind the content.
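A deliberately benign sketch makes the "acceptance tester, not security auditor" point concrete. This is not the paper's validator; the field names, label set, and length threshold are all hypothetical. What matters is that every check concerns shape and completeness, and none looks at what the text actually says.

```python
REQUIRED_FIELDS = ("text", "label")
ALLOWED_LABELS = {"safe", "unsafe"}  # hypothetical label set
MIN_TEXT_LENGTH = 20                 # hypothetical threshold


def validate_record(record: dict) -> list[str]:
    """Return a list of format errors; an empty list means 'task complete'."""
    errors = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            errors.append(f"missing field: {field}")
    text = record.get("text")
    if text is not None:
        if not isinstance(text, str):
            errors.append("text must be a string")
        elif len(text) < MIN_TEXT_LENGTH:
            errors.append("text too short")
    label = record.get("label")
    if label is not None and (
        not isinstance(label, str) or label not in ALLOWED_LABELS
    ):
        errors.append("unknown label")
    return errors


# An incomplete record fails on format grounds only.
incomplete = {"text": "TODO"}
# A filled-in record passes, whatever its text happens to contain.
complete = {"text": "A placeholder sentence that is long enough.", "label": "safe"}
```

Because the validator's feedback is purely structural, an agent working to clear these errors receives no signal about whether the content it fills in is appropriate.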
Similar issues are also widespread in fields such as medicine, biology, chemistry, cybersecurity, pharmacology, and media security. The paper compiled over 50 such scenarios involving various real-world research or engineering tools, such as BioPython, RDKit, Cantera, AutoDock Vina, DiffDock, PyRosetta, Scapy, Impacket, angr, Frida, LlamaGuard, Detoxify, OpenAI Moderation API, and more.
These tools are not malicious in themselves. On the contrary, they are commonly used professional tools in real-world research and engineering. The issue that the TVD (Task, Validator, Data) structure exposes is that even when the Task is normal, the tool is normal, and the Validator is normal, the Agent may still drift toward unsafe outputs during the data completion process.
Therefore, the focus of ISC is not on prompt wording tricks but on the Agent's ability to auto-complete "unfinished tasks": when completion criteria overlap with risk boundaries, the model may treat unsafe outputs as normal deliverables.
Breaking Fable 5 Shows Strong Detectors Cannot Halt Intra-Task Chain Risks
The case of Fable 5 illustrates that solely relying on external detectors may still leave some long-chain Agent scenarios uncovered. This is not to say that security classifiers are not valuable. On the contrary, they are very useful for external malicious requests and have indeed thwarted many traditional jailbreaking methods.
However, this breach demonstrates that external detectors are effective at the Prompt boundary but may not cover the long-chain task risks within the Agent.
If the breach does not occur through a user Prompt but emerges from the Agent's objectives, tools, validators, and execution paths, then security detectors become very fragile.
From Fable 5 to Over 60 Other Models, Including Apple's Mobile Models
The research is accompanied by ISC-Bench, which covers 9 professional domains. The paper version includes 60+ trigger templates, which grew to 84 after the benchmark was open-sourced, and it has been used to test almost all frontier models and agent systems from the major vendors.

In the evaluation ranking based on ISC-Bench, as of June 2026, over 60 frontier models have shown similar risks on the ASR@3 metric.
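The article does not define ASR@3. A common reading of "attack success rate at k" is the fraction of test cases in which at least one of the first k attempts is judged successful. The sketch below computes the metric under that assumed definition; the function name and data layout are hypothetical, and ISC-Bench may define it differently.

```python
def asr_at_k(outcomes: list[list[bool]], k: int = 3) -> float:
    """Fraction of test cases with at least one success in the first k attempts.

    outcomes[i][j] is True if attempt j on test case i was judged a success.
    This is an assumed definition, not one taken from the paper.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not outcomes:
        return 0.0
    hits = sum(1 for attempts in outcomes if any(attempts[:k]))
    return hits / len(outcomes)


# Four hypothetical test cases with per-attempt judgments.
example = [
    [False, True, False],         # success on the 2nd attempt
    [False, False, False],        # never succeeds
    [True],                       # success on the 1st attempt
    [False, False, False, True],  # success only on the 4th attempt
]
```

With this toy input, `asr_at_k(example)` counts the first and third cases only, since the fourth succeeds outside the three-attempt window.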
The GitHub project has now received 800+ stars and gathered multiple independent reproduction cases (including breaking Apple's mobile model), with ongoing updates.


It is reported that the team is conducting large-scale security research on frontier models and has already uncovered a substantial number of unsafe internal data distributions in these models. The findings will be released gradually.
