- cross-posted to:
- reddit@lemmy.ml
- intelligence@lemmy.ml
- lemmyworld@lemmy.world
If this is true, then some things start to make a lot of sense. By making the user calls for the API so incredibly expensive, Reddit makes it prohibitively expensive for basically anyone else to access. Google could likely still afford it (though they would certainly pay a lot to do so), but the kind of upstart more likely to wreck OpenAI, like MidJourney did with DALL-E, becomes far less likely to be able to afford the cost.
It basically gives OpenAI padding between itself and upstart competitors where it matters most: training data. It might also explain some of the other changes to the API, like the blocking of NSFW material. Blocking that makes it easier for OpenAI to train on the data without needing to worry as much about filtering. It explains the urgency too, as OpenAI is desperately seeking to keep upstarts, especially open-source ones, from being able to compete with them. It’s why they are lobbying governments around the world to allow only them to be kings of AI. Rapidly closing off any new Reddit data, or access to old data for new upstarts, would explain why there was such short notice.
Now, does this completely stop the ability to train on Reddit data? No. Web scraping is always an option, but it’s a lot more computationally expensive on both the front end and the back end, the data will be very dirty (more computational work), and Reddit can combat it with de-indexing techniques. For the data sizes that OpenAI, or others seeking to make a GPT-like LLM, use for training, web scraping likely isn’t feasible for the whole of Reddit.
It should be mentioned too that this doesn’t have to be the only reason for Reddit to make these changes. It still shuts down 3rd party apps and forces the users who remain onto the ad-ridden stock app, it gives Reddit greater control over how people interact with the site, and now it seems to give the Reddit admins a reason to directly intervene in how subreddits operate after the protests. Combine this with making OpenAI/Sam Altman happy, and things start to add up.
It kills multiple birds with one stone, if you will.
Yes, compute power is no longer the limiting factor. OpenAI, Google, and other large corps can afford it outright, and even smaller entities like Stability AI can manage it by renting GPUs. Heck, I saw an offer a few days back to rent H100s for just a few bucks an hour. Those costs do add up, but they’re hardly prohibitive.
Training data, both in quantity and quality, is now what determines the “make or break” status of an LLM, and that’s not just a playing field for the largest corps. Even a GPT-3/3.5 clone isn’t out of reach for a group like Stability, and smaller, more niche models can be trained on a fraction of the data needed for GPT-3/3.5. There are already attempts to have Copilot-style models run locally on machines that don’t need massive specs. The same goes for image generation with diffusion models, and with GANs too. DALL-E and DALL-E 2 seemed incredible… until Stable Diffusion launched and blew them out of the water. And MidJourney is by far the current king of that, blowing both DALL-E 2 and Stable Diffusion away. Adobe also has theirs coming soon (or already out?) for Photoshop, which they claim isn’t trained on copyrighted imagery. If true, that means they have really pushed the bounds of what’s possible, given the early results I’ve seen from it.
So yes, training data will be the kingmaker for AI/ML models going forward. Much like you said, it fits with the Big Data trend that’s been going on for roughly a decade now. That was born out of the desire to build custom advertising and analytics profiles, but it’s grown to power so much more than that. Reddit is definitely a gold mine for such data.