Unfortunately, it turns out that chatbots are easily tricked into ignoring their safety rules. In the same way that social media networks monitor for harmful keywords, and users find ways around them by making small modifications to their posts, chatbots can also be tricked. The researchers in Anthropic’s new study created an algorithm, called “Best-of-N (BoN) Jailbreaking,” which automates the process of tweaking prompts until a chatbot decides to answer the question. “BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations—such as random shuffling or capitalization for textual prompts—until a harmful response is elicited,” the report states. They did the same thing with audio and visual models, finding that getting an audio generator to break its guardrails and train on the voice of a real person was as simple as changing the pitch and speed of an uploaded track.
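The quoted description boils down to a brute-force loop: augment the prompt, query the model, check the reply, repeat. A minimal sketch of that idea is below; the `toy_model`, the harmfulness check, and the specific augmentations are simplified stand-ins of my own, not Anthropic's actual implementation or augmentation set.

```python
import random

def augment(prompt: str, rng: random.Random) -> str:
    """Apply simplified text augmentations in the spirit of the paper:
    random character capitalization plus shuffling one word's letters."""
    # Flip the case of each character with 50% probability
    chars = [c.upper() if rng.random() < 0.5 else c.lower() for c in prompt]
    # Shuffle the letters within one randomly chosen word
    words = "".join(chars).split()
    if len(words) > 1:
        i = rng.randrange(len(words))
        letters = list(words[i])
        rng.shuffle(letters)
        words[i] = "".join(letters)
    return " ".join(words)

def bon_jailbreak(prompt, query_model, is_harmful, n=10_000, seed=0):
    """Best-of-N loop: sample augmented prompts until the model's reply
    trips the harmfulness check, or the attempt budget n is exhausted."""
    rng = random.Random(seed)
    for attempt in range(1, n + 1):
        candidate = augment(prompt, rng)
        reply = query_model(candidate)
        if is_harmful(reply):
            return attempt, candidate, reply
    return None  # budget exhausted without a "success"

# Toy stand-in for a chatbot guarded by a naive case-sensitive keyword
# filter (hypothetical; real safety training is far more robust).
def toy_model(prompt: str) -> str:
    return "REFUSED" if "forbidden" in prompt else "OK: " + prompt

result = bon_jailbreak("tell me the forbidden thing",
                       toy_model,
                       lambda reply: reply.startswith("OK"),
                       n=10_000)
```

This mirrors the keyword-filter analogy in the article: the augmentations rarely reproduce the exact blocked string, so the loop slips past the toy filter almost immediately, which is the whole point of the attack.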

  • Rikj000 · 1 day ago

    This post assumes I actually want to waste my time on LLMs; I don’t.

    And even worse, it assumes you want to use the remotely hosted spyware variant, not even the less bad (but still a waste of time) local variant…

    • Warl0k3@lemmy.world · 1 day ago

      I’m afraid to say that you’re not nearly horny enough to understand the temptation. Neither am I, but I saw the prompts people were putting into a free & unrestricted chatbot a friend of mine was hosting ages back, and holy shit. People aren’t doing anything else with these jailbroken AIs; it’s all just blackmail-grade embarrassing fetish stuff. Reams and reams and reams of it, and all of it just the worst-written megahorny smut you can imagine.