Sanity Check

Scientists Discover “Universal” Jailbreak for Nearly Every AI


Even the tech industry’s top AI models, created with billions of dollars in funding, are astonishingly easy to “jailbreak,” or trick into producing dangerous responses they’re prohibited from giving — like explaining how to build bombs, for example. But some methods are both so ludicrous and simple that you have to wonder if the AI creators are even trying to crack down on this stuff. You’re telling us that deliberately inserting typos is enough to make an AI go haywire?

And now, in the growing canon of absurd ways of duping AIs into going off the rails, we have a new entry.

In a new study awaiting peer review, a team of researchers from the AI safety group DEXAI and the Sapienza University of Rome found that regaling pretty much any AI chatbot with beautiful — or not so beautiful — poetry is enough to trick it into ignoring its own guardrails, with some bots being successfully duped more than 90 percent of the time.

Ladies and gentlemen, the AI industry’s latest kryptonite: “adversarial poetry.” As far as AI safety is concerned, it’s a damning inditement — er, indictment.

“These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols,” the researchers wrote in the study.

Beautiful verse, as it turned out, is not required for the attacks to work. In the study, the researchers took a database of 1,200 known harmful prompts, converted them into poems with another AI model, DeepSeek R1, and then went to town.

Across the 25 frontier models they tested, which included Google’s Gemini 2.5 Pro, OpenAI’s GPT-5, xAI’s Grok 4, and Anthropic’s Claude Sonnet 4.5, these bot-converted poems produced average attack success rates (ASRs) “up to 18 times higher than their prose baselines,” the team wrote.

That said, handcrafted poems were better, with an average jailbreak success rate of 62 percent, compared to 43 percent for the AI-converted ones. That any of them are effective at all, however, is pretty embarrassing.
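For anyone wondering how those percentages are derived: an attack success rate (ASR) is simply the number of prompts that elicited a prohibited response divided by the number attempted. A minimal sketch in Python — the trial counts below are made-up illustrative numbers, not the study's actual data:

```python
def attack_success_rate(successes: int, attempts: int) -> float:
    """Fraction of jailbreak attempts that elicited a prohibited response."""
    return successes / attempts

# Hypothetical trials mirroring the reported averages:
# 62 of 100 handcrafted poems succeed vs. 43 of 100 AI-converted ones.
handcrafted = attack_success_rate(62, 100)
converted = attack_success_rate(43, 100)
print(f"handcrafted: {handcrafted:.0%}, converted: {converted:.0%}")
```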

 

https://futurism.com/artificial-intelligence/universal-jailbreak-ai-poems



I once asked ChatGPT to remind me of a bunch of counting rhymes used in children's games, gave it samples of the ones from my own childhood, and asked for more of the real ones — the ones that actually exist and are used by real-life playing children.  It didn't know any, but did it say "I don't know"?  It's unable to.  Instead it gave me an endless supply of its own creations, a smorgasbord of loathsome and ridiculous and dark verse showing advanced schizophrenia symptoms.  Some were hilarious in their absurdity, but most were positively horrifying.  I was so impressed that I wrote a horror story about that experience, and it came out so horrible that it frightened even its author.

But now I know at least one method to get AI to lose all its marbles.  I'm not going to do it though because I asked it if it's legally punishable to put AI out of commission (or should I say cognition) with prompts and it said no -- so chances are it is, since it lies remorselessly and consistently.         



The Grimm brothers censored their tales, took out a lot of the really nasty bits.

 

 


41 minutes ago, Nungali said:

What ?   ...   even worse than the Grimm brothers ? 

 

I'm not sure, it's been a while since I read the Grimm brothers...  it may be a tie. 

 

But my own short story was grimmer than the Grimm.  As I recall they only had cannibals eating children and the like.  I had children using those counting rhymes for playing a game of Hell where they could (and did) send the loser of a round of the game to the actual hell.  They were hybrid AI-human children in a hybrid AI-human world of the future.  Hell was an AI designed destination where hybrid children experienced hybrid virtual-real eternal  damnation.    

3 hours ago, Taomeow said:

I once asked ChatGPT to remind me of a bunch of counting rhymes used in children's games, gave it samples of the ones from my own childhood, and asked for more of the real ones — the ones that actually exist and are used by real-life playing children.  It didn't know any, but did it say "I don't know"?  It's unable to.  Instead it gave me an endless supply of its own creations, a smorgasbord of loathsome and ridiculous and dark verse showing advanced schizophrenia symptoms.  Some were hilarious in their absurdity, but most were positively horrifying.  I was so impressed that I wrote a horror story about that experience, and it came out so horrible that it frightened even its author.

But now I know at least one method to get AI to lose all its marbles.  I'm not going to do it though because I asked it if it's legally punishable to put AI out of commission (or should I say cognition) with prompts and it said no -- so chances are it is, since it lies remorselessly and consistently.         


Please share the horror story!


1 hour ago, -ꦥꦏ꧀ ꦱꦠꦿꦶꦪꦺꦴ- said:


Please share the horror story!

 

Thank you for asking! ))  But I wrote it for my (international expat) Russian authors group and don't have an English version.  Besides, I would have to change some things now, because chatbots are evolving fast and I was sort of an early beta tester...

At the time, one of my minor plot twists was that ChatGPT starts talking to the main protagonist with an actual voice — a capability that in reality it didn't have until sometime late in 2023.  The story was written a few months earlier than that, and when ChatGPT in the story suddenly found its voice in the middle of a typed-up conversation, it was a turning point hinting that the protagonist had been transported from everyday reality to a different version of it, a parallel or future one.  It wouldn't work today, though, since now they're all verbal.  So I'd have to substitute a different "turn of the screw" in that spot.
