This is a proposal by some AI bro to add a file called llms.txt that contains a version of your website's text that is easier for LLMs to process. It's a similar idea to the robots.txt file for webcrawlers.
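As I understand the proposal, it's just a plain Markdown file served at /llms.txt: a title, a short summary, and lists of links to LLM-friendly pages. Roughly along these lines (names and URLs made up for the sketch):

```markdown
# Example Project

> One-paragraph summary of the site, written for machine consumption.

## Docs

- [Quick start](https://example.com/docs/quickstart.md): install and first steps
- [API reference](https://example.com/docs/api.md): full function listing

## Optional

- [Changelog](https://example.com/changelog.md): release history
```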

Wouldn’t it be a real shame if everyone added this file to their websites and filled them with complete nonsense. Apparently you only need to poison 0.1% of the training data to get an effect.

  • haverholm@kbin.earth · 37 points · 1 month ago

    Theoretically speaking, what level of nonsense are we talking about in order to really mess up a model's training?

    a) Something that doesn’t represent the actual contents of the website (like posting “The Odyssey” to the llms.txt of a software documentation site),

    b) a randomly generated wall of real words out of context, or

    c) just straight lorem ipsum filler?
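
    For scale, option (b) is about the cheapest to produce. A toy sketch, with a made-up word list:

    ```python
    # Toy illustration of option (b): real words, zero meaning.
    # The word list here is invented for the example; any dictionary file would do.
    import random

    WORDS = ("kettle ambient decree failover mollusk verbose quorum "
             "saffron bulkhead treacle paginate drizzle").split()

    def word_salad(n_words=500, seed=None):
        """Return n_words real words strung together in a meaningless order."""
        rng = random.Random(seed)
        return " ".join(rng.choice(WORDS) for _ in range(n_words))

    if __name__ == "__main__":
        print(word_salad(60, seed=1))
    ```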

    • raoul@lemmy.sdf.org · 25 points (1 down) · 1 month ago

      We could respect this convention the same way the IA webcrawlers respect robots.txt 🤷‍♂️

      • Tower@lemm.ee · 9 points · 1 month ago

        Do webcrawlers from places other than Iowa respect that file differently?

      • DaGeek247@fedia.io · 4 points · 1 month ago

        I've had a page that bans by IP listed as 'don't visit here' in my robots.txt file for seven months now. It's not listed anywhere else. I have no banned IPs on there yet. Admittedly, I've only had 15 visitors in the past six months though.
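
        The trick, roughly: the trap URL appears only as a Disallow line in robots.txt, so anything requesting it is ignoring the file on purpose. A minimal sketch of the idea (paths and port are made up, not my actual setup):

        ```python
        # Sketch of a robots.txt honeypot: the trap path is advertised nowhere
        # except as a Disallow rule, so any client that requests it has either
        # read robots.txt and ignored it, or never read it at all. Either way,
        # its IP gets recorded and blocked.
        from http.server import BaseHTTPRequestHandler, HTTPServer

        ROBOTS_TXT = b"User-agent: *\nDisallow: /secret-trap/\n"  # hypothetical trap path
        banned_ips = set()

        class Handler(BaseHTTPRequestHandler):
            def do_GET(self):
                ip = self.client_address[0]
                if ip in banned_ips:
                    self.send_error(403, "Banned")
                elif self.path == "/robots.txt":
                    self.send_response(200)
                    self.send_header("Content-Type", "text/plain")
                    self.end_headers()
                    self.wfile.write(ROBOTS_TXT)
                elif self.path.startswith("/secret-trap/"):
                    banned_ips.add(ip)  # visited the trap: ban this IP
                    self.send_error(403, "Banned")
                else:
                    self.send_response(200)
                    self.end_headers()
                    self.wfile.write(b"hello\n")

        if __name__ == "__main__":
            HTTPServer(("", 8080), Handler).serve_forever()
        ```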

  • ad_on_is@lemm.ee · 3 points · 25 days ago

    So AI should get the most relevant info, while we (humans) have to fight through ads and popups and shit… At this point, I feel discriminated against.