• SmoothOperator@lemmy.world
    link
    fedilink
    arrow-up
    16
    ·
    1 year ago

    Hmm, sounds more like they are adding structures to the images such that what is clearly a picture of a dog registers as a picture of a cat to an AI. I suppose this can be done by altering the pixels in a way invisible to humans, but visible to AI, adding a cat into the “ghost pixels”.

    • Mirodir
      link
      fedilink
      arrow-up
      14
      ·
      edit-2
      1 year ago

      I went and skimmed the paper because I was curious too.

      If my skimming is correct, what they do is similar to adversarial attacks on classifiers, where a second model learns to change as few pixels as possible to confuse a classifier into giving a wrong prediction.

      Looking at the examples of dogs and cats: They find pictures of dogs where by making only minimal changes, invisible to the naked eye, they can get the autoencoder to spit out (almost) the same latent representation as an image of a cat would have. Done to enough dog-images, this will then confuse the underlying diffusion model to produce latent representations of cat images when prompted to generate a dog. Edit for clarity: Those generated latent representations would then decode into cat images.

      If my thinking doesn’t fail me, this attack could easily be thwarted by unfreezing the pretrained autoencoder. In the paper that introduced latent diffusion they write that such approaches already exist. If “Nightshade” takes off, I’m sure those approaches would be refined and used. Even just finetuning the autoencoder for a few epochs first should be enough to move the latent representations of the poisoned dog images and those of the cat images they’re meant to resemble far enough apart to make the attack meaningless.

      Edit: I also wonder how robust this attack is against just adding an imperceptible amount of noise to the poisoned images.