Using model-generated content in training causes irreversible defects, a team of researchers says. “The tails of the original content distribution disappear,” writes co-author Ross Anderson from the University of Cambridge in a blog post. “Within a few generations, text becomes garbage, as Gaussian distributions converge and may even become delta functions.”

Here’s the study: http://web.archive.org/web/20230614184632/https://arxiv.org/abs/2305.17493
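The collapse Anderson describes can be seen in a toy setting. This is my own minimal sketch, not the paper's experiment: repeatedly fit a Gaussian to a small sample, then resample from the fit. Estimation noise compounds across generations, and the fitted spread drifts toward zero, i.e. the distribution's tails vanish and it heads toward a delta function:

```python
# Toy model-collapse sketch (illustrative only, not the paper's setup):
# each "generation" fits a Gaussian to the previous generation's samples
# and then trains on (samples from) that fit. The variance estimate follows
# a multiplicative random walk with negative log-drift, so the spread decays.
import random
import statistics

def fit_and_resample(samples):
    """Fit mean/stdev to the samples, then draw a fresh sample of the same size."""
    mu = statistics.fmean(samples)
    sigma = statistics.stdev(samples)
    return [random.gauss(mu, sigma) for _ in range(len(samples))]

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(20)]  # generation 0: real data
start_spread = statistics.stdev(data)

for generation in range(500):  # train each generation on the previous one's output
    data = fit_and_resample(data)

end_spread = statistics.stdev(data)
print(f"stdev at generation 0: {start_spread:.3f}")
print(f"stdev at generation 500: {end_spread:.6f}")
```

The small sample size (20) makes the estimation noise, and hence the collapse, fast; with larger samples the same drift happens, just more slowly.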

  • Steeve@lemmy.ca
    1 year ago

    It’s an interesting problem for machine learning engineers to solve, but yeah it’s not apocalyptic for LLMs or anything. Probably just another preprocessing stage if anything.