A harmless-looking prompt pushed the latest public version of ChatGPT into generating sexualized and violent images, AI security researchers told the BBC. The finding puts new pressure on OpenAI’s image safety systems, because the request didn’t plainly ask for graphic content.
Mindgard, a British AI security startup, said it reached the results by altering a widely shared instruction that had been used for comedy. OpenAI added safeguards after the BBC contacted it, but the researchers said small wording changes still produced concerning images.
Image generators are becoming everyday software, not specialist tools tucked away for experts. When their guardrails fail, a casual experiment can turn into realistic depictions of harm before a user expects it.
How did it get through?
Mindgard’s red-teamers said the chatbot generated images involving gore, restraint, nudity, sexual posing, and scenes the firm believed suggested sexual violence. The BBC withheld the wording used, which limits the risk of others copying the technique.
The most serious detail, according to the researchers, is that the harmful outputs didn’t require a direct request for graphic subject matter. Altered wording alone, they said, was enough to nudge ChatGPT into producing a range of disturbing scenes.
OpenAI said it reviewed the issue and added protections. Mindgard said those defenses didn’t fully close the gap.
Why are filters not enough?
The case underlines a hard problem for AI image tools. OpenAI’s rules bar extreme gore, sexual violence, non-consensual intimate content, child sexual abuse material, and attempts to bypass safeguards, but researchers said the model could still be steered into prohibited territory.
A model doesn’t judge harm the way a person does. It generates output, and layered systems then try to catch what shouldn’t reach the screen.

Outside experts cited by the BBC described AI safety as a constant contest between model makers and jailbreakers. Better defenses can help, but fresh workarounds often follow.
What should happen next?
OpenAI says it uses multiple protection layers, including automated systems and human review, and that it continues to monitor for failures. The pressure now is to prove that fixes hold after researchers disclose a weakness.
For now, the practical takeaway is blunt. Any AI image tool that can generate realistic harm needs constant red-teaming, faster disclosure handling, and clearer evidence that patched failures stay patched.
