The big AI models are running out of training data (and it turns out most of the training data was produced by fools and the intentionally obtuse), so this might mark the end of rapid model advancement

  • lurkerlady [she/her]@hexbear.net
    24 days ago

    This is accurate, and I’m actually going to explain why. These big model companies (Google, ClosedAI, etc.) parasitize the open-weights/open-source community that actually makes good LoRAs, fine-tunes, and research papers. Consumer hardware simply hasn’t gotten good and cheap enough for very good fine-tune training, and that’s why this is all slowly petering out. In a couple of generations of consumer GPUs, which will be when we get consumer GPUs geared towards AI (i.e. super high VRAM counts of like 70 GB+ for an affordable sub-700 USD price), we might see another leap forward in this tech. Though I will say that this mostly pertains to LLMs; image generation models like Stable Diffusion have a lot of tricks up their sleeves that can still be explored. Most of the recent research and tweaking has been about giving the AI a structure to build on, to guide it rather than letting it take random stabs at things, in order to improve outputs. Some people have been doing things like hard-coding color theory, photographic framing, etc., and interpreting human language to trigger that hard-coded logic.
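
    For a rough sense of why VRAM is the number that matters here, this is a back-of-the-envelope sketch of how much memory the weights alone take up at different precisions. The function name and the precision table are just for illustration, and it ignores gradients, optimizer state, and activations, so real fine-tuning needs noticeably more than this.

```python
# Rough estimate of memory needed just to hold a model's weights.
# Assumption: 1 GB = 1e9 bytes; gradients, optimizer state, and
# activations are NOT counted, so training needs more than this.

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(num_params: float, bytes_per_param: float) -> float:
    """Return the approximate size of the weights in gigabytes."""
    return num_params * bytes_per_param / 1e9


def report(num_params: float) -> dict:
    """Map each precision name to the weight size in GB."""
    return {name: weights_gb(num_params, b) for name, b in BYTES_PER_PARAM.items()}


if __name__ == "__main__":
    for label, n in [("13b", 13e9), ("70b", 70e9)]:
        sizes = ", ".join(f"{k}: {v:.1f} GB" for k, v in report(n).items())
        print(f"{label} -> {sizes}")
```

    Under those assumptions a 13b model at fp16 already needs about 26 GB for the weights, which is more than most consumer cards have today.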

    We’ve had statistical models like these since the 50s. Consumer hardware has always been the big materialist bottleneck; this is all powered by small research teams and hobbyist nerds. You can throw a ton of money at it and have a giant research team, but adding 400b more parameters to your 13b model or running a gigantic locked-down datacenter is going to get you diminishing returns.

    Also, synthetic data can be useful. People are hating on it in this thread, but it’s a great way to reinforce good habits in the AI and to help it interpret garbled code and speech that would otherwise confuse it. I sometimes feel like people just see something about ‘AI bad’ and upvote it without trying to understand where it is useful and where it is not.

      • lurkerlady [she/her]@hexbear.net
        24 days ago

        Synthetic data is basically a fancy way of saying ‘I’m properly formatting data and reinforcing the AI’s good outputs’. Rearranging words, fixing/adding tags, that sort of thing. It’s generated with various tools that usually have an LLM or VLM plugged in, though some are as simple as a regex script.
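
        To make the "as simple as a regex script" end of that concrete, here is a minimal sketch of a tag cleaner. The comma-separated tag format, the function name, and the sample caption are all made up for illustration; real datasets and tools will have their own conventions.

```python
import re


def normalize_tags(caption: str) -> str:
    """Clean a comma/semicolon-separated tag string (hypothetical format).

    - splits on runs of commas or semicolons
    - turns underscores and repeated whitespace into single spaces
    - lowercases every tag
    - drops empty tags and duplicates, keeping first-seen order
    """
    raw_tags = re.split(r"\s*[,;]+\s*", caption.strip())
    seen = set()
    cleaned = []
    for tag in raw_tags:
        tag = re.sub(r"[_\s]+", " ", tag).strip().lower()
        if tag and tag not in seen:
            seen.add(tag)
            cleaned.append(tag)
    return ", ".join(cleaned)


if __name__ == "__main__":
    messy = "Blue_Sky ,, portrait;  blue  sky , Soft_Lighting"
    print(normalize_tags(messy))
```

        The tools with an LLM or VLM plugged in do the same kind of job, just on cases a regex can't handle, like rewriting a caption or adding tags that were missing.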

    • MacN'Cheezus@lemmy.today
      24 days ago

      Better hardware isn’t going to change anything except scale if the underlying approach stays the same. LLMs are not intelligent, they’re just guessing a bunch of words that are statistically most likely to satisfy the user’s request based on their training data. They don’t actually understand what they’re saying.
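
      As a toy illustration of "guessing the statistically most likely word", here is a bigram counter that picks the most frequent follower of a word in its training text. This is a deliberately tiny stand-in, not how an LLM is built (those use neural networks over tokens), and the function names and sample corpus are invented for the example.

```python
from collections import Counter, defaultdict


def train_bigrams(text: str) -> dict:
    """Count, for every word, which words follow it in the text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def most_likely_next(counts: dict, word: str):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


if __name__ == "__main__":
    corpus = "the cat sat on the mat the cat ran"
    model = train_bigrams(corpus)
    print(most_likely_next(model, "the"))
    print(most_likely_next(model, "zebra"))
```

      The model only ever reproduces patterns from its training text: a word it has never seen gets no prediction at all.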