OpenAI just admitted it can’t identify AI-generated text. That’s bad for the internet and it could be really bad for AI models.

In January, OpenAI launched a system for identifying AI-generated text. This month, the company scrapped it.

  • lily33@lemmy.world · English · 7 up, 6 down · 11 months ago (edited)

    I don’t see how that affects my point.

    • Today’s AI detectors can’t tell the output of today’s LLMs apart from human writing.
    • Future AI detectors WILL be able to spot the output of today’s LLMs.
    • Of course, future AI detectors won’t be able to spot the output of future LLMs.

    So at any point in time, only recent text could be “contaminated”. The claim that “all text after 2023 is forever contaminated” just isn’t true. Researchers would simply have to be a bit more careful about including recent text.

    • Womble@lemmy.world · English · 13 up · 11 months ago

      Your assertion that a future AI detector will be able to detect current LLM output is dubious. If I give you the sentence “Yesterday I went to the shop and bought some milk and eggs.”, there is no way for you or any detection system to tell whether it was AI-generated with any significant degree of certainty. What can be done is statistical analysis of large datasets to see how they “smell”, but knowing that around 30% of a dataset is likely LLM-generated does not tell you which 30%, so it does not get you very far in creating a training set.

      I’m not saying that there is no solution to this problem, but blithely waving it away by saying future AI will be able to spot old AI is not a serious take.
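
      To put rough numbers on the 30% point, here is a minimal sketch of what happens if you simply drop everything a detector flags. The function name and every rate below are illustrative assumptions, not measurements of any real detector.

```python
def contamination_after_filtering(prevalence: float, tpr: float, fpr: float) -> dict:
    """Estimate what is left after dropping every document a detector flags.

    prevalence: fraction of the corpus that is AI-generated (0..1)
    tpr: fraction of AI text the detector flags (true positive rate)
    fpr: fraction of human text the detector wrongly flags (false positive rate)
    """
    ai_kept = prevalence * (1.0 - tpr)              # AI text that slips through
    human_kept = (1.0 - prevalence) * (1.0 - fpr)   # human text that survives
    kept = ai_kept + human_kept
    if kept == 0:
        raise ValueError("the filter discards the entire corpus")
    return {
        "kept_fraction": kept,                       # share of the corpus retained
        "ai_share_of_kept": ai_kept / kept,          # contamination of what is retained
        "human_text_discarded": (1.0 - prevalence) * fpr,
    }


# Assumed, illustrative numbers: 30% contamination and a weak detector.
result = contamination_after_filtering(prevalence=0.30, tpr=0.26, fpr=0.09)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

      With those assumed rates, about a quarter of the retained text is still AI-generated, and some human text has been thrown away as well. A better detector changes the numbers, but the filtered set is only clean if the detector is close to perfect.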

      • lily33@lemmy.world · English · 2 up, 6 down · 11 months ago

        If you give me several paragraphs instead of a single sentence, do you still think it’s impossible to tell?

        • steakmeout@lemmy.world · English · 4 up · 11 months ago

          “If you zoom further out you can definitely tell it’s been shopped because you can see more pixels.”

        • steveman_ha@lemmy.world · English · 1 up · 11 months ago (edited)

          What they’re getting at (one thing, anyways) is that “indistinguishable to the model” and “the same” are two very different things.

          IIRC, one possibility is that LLMs which learn from one another will make small, incremental changes to what’s considered “acceptable” or “normal” language structure, and that over time these add up to more noticeable linguistic changes that still go unnoticed by the models.

          As it continues, this phenomenon creates a “positive feedback loop” in which the gap progressively widens – still undetected, because the quality of training data is going down – to the point where models basically “collapse” in their effectiveness.

          So even if their output is indistinguishable now, how the tech is used (I guess?) will determine whether or not a self-destructive LLM echo chamber is produced.
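
          A toy way to see that kind of feedback loop, with a deliberately swapped-in stand-in: instead of an LLM, each “generation” fits a Gaussian to a finite sample drawn from the previous generation’s fit. The maximum-likelihood variance estimate is biased low by a factor of (n - 1) / n, so the expected spread shrinks every round. The function and parameter names are hypothetical, and this says nothing quantitative about real language models.

```python
def expected_variance_by_generation(
    initial_variance: float,
    samples_per_generation: int,
    generations: int,
) -> list[float]:
    """Expected variance of a Gaussian refit, generation after generation,
    to samples drawn from its own previous fit.

    Uses the fact that the maximum-likelihood variance estimate from n
    samples has expectation (n - 1) / n times the true variance.
    """
    if samples_per_generation < 2:
        raise ValueError("need at least two samples per generation")
    shrink = (samples_per_generation - 1) / samples_per_generation
    variances = [initial_variance]
    for _ in range(generations):
        # Each refit loses, on average, a fixed fraction of the spread.
        variances.append(variances[-1] * shrink)
    return variances


# Assumed, illustrative settings: 10 samples per generation, 20 generations.
variances = expected_variance_by_generation(1.0, 10, 20)
for generation in (0, 1, 5, 10, 20):
    print(f"generation {generation}: expected variance {variances[generation]:.3f}")
```

          No single step looks dramatic (each one loses 10% in this setup), but after 20 rounds the expected variance is down to roughly 12% of where it started. That is the “unnoticed until it isn’t” shape, in miniature.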

    • diffuselight@lemmy.world · English · 2 up, 1 down · 11 months ago

      There is not enough entropy in text to detect even current model output. It’s game over.

    • vrighter · English · 1 up · 11 months ago

      No, they won’t. We have already built the models that we have already built. Any current works in progress are the future AI you are talking about. And we just can’t do it. OpenAI themselves have admitted that the detectors they tried making just didn’t work. And they won’t, because language is not just the statistical correlations between words that have already been written in the past.