TL;DR: Once trained, AI will be able to generate new data comparable in quality to its training data, thereby rendering any training data absolutely worthless. The time to sell data at a reasonable price is now, and those locking their data behind huge financial barriers (such as Twitter and Reddit) are stupidly HODLing a rapidly depreciating asset.

  • SmolSlime@burggit.moe · 1 year ago

    Will there be a point where additional training data won’t improve AI any further? 🤔

    • SquishyPillow@burggit.moe (OP) · 1 year ago

      There is a point where more data yields diminishing returns, and might even backfire. It is likely that ChatGPT has already reached this point, and will not improve without changes to the model architecture.
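
      To make “diminishing returns” concrete, here’s a toy illustration of my own (the constants are completely made up, not fit to any real model): scaling studies tend to fit loss as a power law in dataset size, so each 10x more data buys a smaller absolute improvement.

      ```python
      # Toy power-law loss curve: L(D) = E + B * D**(-beta)
      # E, B, beta are made-up constants, for illustration only.
      E, B, beta = 1.7, 50.0, 0.3

      def loss(tokens_billions):
          return E + B * tokens_billions ** -beta

      prev = None
      for d in [10, 100, 1000, 10000]:
          cur = loss(d)
          gain = "" if prev is None else f"  (improvement: {prev - cur:.3f})"
          print(f"{d:>6}B tokens -> loss {cur:.3f}{gain}")
          prev = cur
      # Each 10x increase in data yields a smaller absolute drop in loss.
      ```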

      Also, additional data may bias the usefulness of a generative model towards specific use cases. Fine-tuning an LLM on nothing but Python code will make it better at generating Python code, but won’t improve its ability to do ERP or other story-driven tasks, for example.
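
      As a rough sketch of what that kind of narrow fine-tuning looks like (the model name, corpus file, and hyperparameters below are placeholders I picked for illustration, not anything anyone in this thread actually ran):

      ```python
      # Minimal causal-LM fine-tuning sketch using Hugging Face transformers.
      # "gpt2" and "python_corpus.txt" are stand-ins for illustration.
      from datasets import load_dataset
      from transformers import (AutoModelForCausalLM, AutoTokenizer,
                                Trainer, TrainingArguments)

      model_name = "gpt2"
      tokenizer = AutoTokenizer.from_pretrained(model_name)
      tokenizer.pad_token = tokenizer.eos_token
      model = AutoModelForCausalLM.from_pretrained(model_name)

      # A plain-text file containing nothing but Python source code.
      raw = load_dataset("text", data_files={"train": "python_corpus.txt"})

      def tokenize(batch):
          out = tokenizer(batch["text"], truncation=True,
                          max_length=512, padding="max_length")
          out["labels"] = out["input_ids"].copy()  # standard causal-LM objective
          return out

      train = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

      trainer = Trainer(
          model=model,
          args=TrainingArguments(output_dir="py-finetune",
                                 num_train_epochs=1,
                                 per_device_train_batch_size=4),
          train_dataset=train,
      )
      trainer.train()
      # The resulting model drifts toward code completion and away from
      # conversational or story-driven generation.
      ```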

      • soulnull@burggit.moe (mod) · edited · 1 year ago

        Can confirm. It seems counterintuitive, but more data means more resources, more indexing, and more room for error.

        In my experimentation with RVC, I’ve tried all sorts of dataset sizes, and I’ve found my 2-hour datasets take forever and produce subpar results. 5-15 minutes’ worth of speech data is the sweet spot. No amount of training seems to fix it; overtraining is counterproductive, and the model just can’t seem to figure out what to do with all of that data.

        Granted, different models have different strengths and will certainly produce different results, but how many times have you been researching something and found conflicting pieces of information? If it’s 1 out of 10 pieces of data, that’s easy enough to resolve, but in a larger dataset that becomes 10 out of 100 conflicting pieces… It’s still 10%, but now there are 10 pieces the model has to figure out how to interpret, even if the other 90 agree with each other. Just like us, it can reach a point where it’s simply too much information to deal with.

        Definitely a point of diminishing returns.