How can Lemmy scale?

HelloLemmySup@sh.itjust.works · 1 year ago

How can Lemmy scale?

MentalEdge@sopuli.xyz · edit-2 1 year ago

You’ve misunderstood. An instance does not contain all content from every other instance, only the communities that at least one of its users has specifically requested by entering the community’s ID in the !name@instan.ce format in search.

This means that the Star Trek instance will mostly only ever need to host Star Trek content. It won’t get flooded with everything else on the entire network as it grows. Some portion of it, yes, as users on the Star Trek instance will inevitably sub to at least some stuff outside it, too.

Additionally, pictures and media are cached, but not permanently federated. When you upload a picture, you may have noticed the link becoming one that points to the instance you’re posting from. This doesn’t change even when that post gets federated to other instances; they still fetch that image from the instance it was posted from (unless it’s a recent post, in which case the image may well be cached as well).

This means that what gets federated is mostly just a bunch of text data, and even then, just the subset that is needed. A much lighter load.

At the smallest scale, you could have a node with just one user, perhaps that user creates a community or two. But this means that that instance will ONLY EVER store the subs of that one user, and the content of the communities they created. Not even close to the total content of the entire fediverse.
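A toy sketch of that idea in Python, not Lemmy's actual data model: the `Instance` class, the community names, and the `.example` domains are all made up for illustration. In practice a community's home instance would only push activities to instances that have followers; the filter in `receive` stands in for that.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Toy model of an instance (hypothetical, not Lemmy's schema)."""
    name: str
    subscribed: set = field(default_factory=set)  # community IDs users asked for
    stored: list = field(default_factory=list)    # (community, text) pairs kept locally

    def subscribe(self, community: str) -> None:
        self.subscribed.add(community)

    def receive(self, community: str, text: str) -> bool:
        """Keep a federated post only if someone here subscribed to its community."""
        if community not in self.subscribed:
            return False
        self.stored.append((community, text))
        return True


# A single-user instance whose one user follows one community.
solo = Instance("solo.example")
solo.subscribe("!startrek@trek.example")

network_posts = [
    ("!startrek@trek.example", "Favourite captain?"),
    ("!cooking@food.example", "Sourdough tips"),
    ("!startrek@trek.example", "Best episode?"),
]
for community, text in network_posts:
    solo.receive(community, text)

print(len(solo.stored), "of", len(network_posts), "posts stored")
```

Only the two posts from the subscribed community end up in `solo.stored`; the rest of the network never lands on that instance.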

HelloLemmySup@sh.itjust.works · 1 year ago

OK, that’s a bit better. I didn’t know about that detail.

Still, that only moves the problem to the future. As I understand it, you should pick an instance at random to sign up on and then access the rest from there. Then it’s only a matter of time before enough users who signed up on the Star Trek instance subscribe to enough big communities for the problem to appear, no?

maegul (he/they)@lemmy.ml · 1 year ago

Yea, I think you’re right. Once any instance has enough users with enough interests and subscriptions to enough communities, you get a scenario where a good portion of the whole network is duplicated on every node, or many nodes, of the network. This is how the fediverse works, and I’ve yet to see anyone seriously address what this looks like at large scales and long timelines.

Storage space isn’t too expensive I guess, so maybe it’s something we can just solve when we come to it.

But the problem may be worse with threadiverse platforms (lemmy/kbin and any other grouped or threaded platform) for exactly the reason you highlight … the whole community and all of its discussions get duplicated. For microblogging platforms, things are more granular, as it’s only single posts by people who are followed that are duplicated.

It may not be fatal and may be something we can solve when we get there, which makes sense as getting up to a significant scale of users is tough in its own right … but it’d sure be nice to see someone think through the numbers.

HelloLemmySup@sh.itjust.works · 1 year ago

That’s why, in my mind, something like a consensus algorithm with the data duplicated N times, where N < the number of instances with subscribed users, would make more sense. As it is right now, I can’t see it scaling past the few instances that can afford to keep it running.

MentalEdge@sopuli.xyz · edit-2 1 year ago

This is literally how the entire internet works. You are describing CDNs.

Additionally, from the perspective of the protocol (ActivityPub), the difference you are describing doesn’t exist.

Communities are “users” which can be “followed” (subscribed to) by other “real” users. Essentially they are bot users that other users can post content through, to their followers. There is nothing different in how the threadiverse functions compared to the fediverse at large. Only its format.
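A rough sketch of what that looks like on the wire, written as Python dicts. `Group`, `Follow`, `Announce`, `Create` and `Note` are standard ActivityStreams types, and as far as I know Lemmy uses `Group` actors for communities and `Announce` to re-share activities to followers, but the exact fields Lemmy sends differ. Every URL and the `follow`/`announce` helpers here are hypothetical.

```python
import json

AS_CONTEXT = "https://www.w3.org/ns/activitystreams"

# A community is just another actor, of type Group (hypothetical URLs).
community = {
    "@context": AS_CONTEXT,
    "type": "Group",
    "id": "https://trek.example/c/startrek",
    "preferredUsername": "startrek",
    "inbox": "https://trek.example/c/startrek/inbox",
    "followers": "https://trek.example/c/startrek/followers",
}


def follow(user_id: str, target: dict) -> dict:
    """Subscribing is a Follow activity aimed at the community actor."""
    return {
        "@context": AS_CONTEXT,
        "type": "Follow",
        "actor": user_id,
        "object": target["id"],
    }


def announce(target: dict, activity: dict) -> dict:
    """The community re-shares a user's activity to its followers."""
    return {
        "@context": AS_CONTEXT,
        "type": "Announce",
        "actor": target["id"],
        "to": [target["followers"]],
        "object": activity,
    }


user = "https://solo.example/u/alice"
follow_activity = follow(user, community)

# A user posts "through" the community; the community announces it.
post = {
    "type": "Create",
    "actor": user,
    "object": {"type": "Note", "content": "Favourite captain?"},
}
announced = announce(community, post)

print(json.dumps(follow_activity, indent=2))
print(json.dumps(announced, indent=2))
```

Swap the `Group` actor for a `Person` and the same `Follow` shape describes following someone on a microblog, which is the point being made above.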

maegul (he/they)@lemmy.ml · 1 year ago

Except we’re not CDNs. Instances aren’t run by companies, most of the time, but by volunteers. Not that I have anything against companies running instances, but it’s not what the fediverse is about.

So the question of resources is a sensible one, whether or not the current protocol and architecture has worked in the past.

And the threadiverse’s difference in format is precisely the difference I’m highlighting. Sync for microblogs is over the posts of individual users that are followed. Sync for the threadiverse is over a community, which comprises many users’ posts. Communities with threads versus single-user microblogs … this is the format difference. And it’s the difference in what gets synced. Right? Please correct me if I’m wrong.

And so, if I’m right, the question of how much gets duplicated also differs.

Whether the threadiverse has more duplication depends on the details, of course. My reasoning was that it would be easier for more duplication to occur on the threadiverse, as whole collections of conversations of many users will be duplicated simply from one user’s single subscription. This is compared to the microblog platforms, where users often only follow hundreds of people (my impression only).

Of course, it may be that any user’s output is distributed over many communities, so that communities turn out not to be larger overall (maybe this was your point). And also, as you say, caching and duplicating is how the internet works, so we should have ways to handle it.

All in all though, it would be nice to have some basic numerical analysis done, especially if we want people to start instances without worrying about getting burnt by ballooning costs.

Irisos@lemmy.umainfo.live · edit-2 1 year ago

Honestly, that “problem” will only appear a few dozen years after an instance starts federating with others.

Most content is plain text, which is inexpensive to store. The only real weakness is images, if your instance’s users post them a lot.

But it can be easily mitigated by configuring how big file uploads can be and encouraging the use of links.

MentalEdge@sopuli.xyz · 1 year ago

There are a lot of ways to mitigate this. The total activity of a single day is negligible, so the problem you’re presenting is really the inevitability of an ever-growing amount of data needing to be stored.

But that is the same issue any online service ever has had to deal with. And there are so many solutions. An instance admin might choose to delete inactive users or communities, or only choose to keep data for, say, 10 years.

You bring up the inevitability of there being enough users to eventually sub to everything. But that assumes infinite users. Any instance is only ever going to sub to a subset of the rest of the fediverse. Even if some instances grow so large that they sub to most of it, they won’t need to host more than the text data. The entire text of Wikipedia fits on a thumb drive.

And you forget one more thing: the more users on an instance sub to the same thing, the more of the database can be shared, since they are storing the same subs with the same comments.

Yes, storage and resource usage increase as the user count increases, but efficiency goes up along with it. That single-user instance would be using WAY more space per user than a multi-user instance would.
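A back-of-the-envelope sketch of that per-user effect, assuming each subscribed community is stored once per instance no matter how many local users follow it. The community sizes and user names are invented numbers, not measurements.

```python
def stored_posts(subscriptions: dict, community_sizes: dict) -> int:
    """Total posts an instance stores: each distinct subscribed community counts once."""
    distinct = set().union(*subscriptions.values()) if subscriptions else set()
    return sum(community_sizes[c] for c in distinct)


# Hypothetical post counts per community.
community_sizes = {"startrek": 1000, "cooking": 500, "linux": 2000}

# One-user instance versus a four-user instance with overlapping subs.
single = {"alice": {"startrek", "linux"}}
shared = {
    "alice": {"startrek", "linux"},
    "bob": {"startrek", "cooking"},
    "carol": {"linux"},
    "dave": {"startrek", "linux"},
}

single_per_user = stored_posts(single, community_sizes) / len(single)
shared_per_user = stored_posts(shared, community_sizes) / len(shared)

print(single_per_user)  # 3000.0 posts stored per user
print(shared_per_user)  # 875.0 posts stored per user
```

Total storage still grows (3000 to 3500 posts here), but it grows far more slowly than the user count because overlapping subscriptions are only stored once.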

HelloLemmySup@sh.itjust.works · 1 year ago

I think it’s not so much storage as requests that I’m worried about. If a small instance wants to join and a few users subscribe to a few big communities, then it potentially needs to process a lot of updates from the pubsub. I would imagine these messages are optimized so you get many updates within the same message, but still.

Can the small instance be federated only one way? Meaning big can see small and comment in small, but small can’t see big communities (only comments made from big to its own small communities)?

MentalEdge@sopuli.xyz · edit-2 1 year ago

In the future, we’ll likely gain migration tools similar to Mastodon’s. This means we’ll be able to “split” any instances that get too large to function, if such a thing ever happens.

If half of the users move away but stay subscribed to a given sub on that overloaded server, this would still reduce the load significantly, because any interaction now goes to the new server once, when it is synced, and then that server handles pushing it out to all of its users.
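To put rough numbers on the split scenario, here is a small sketch that assumes the home server sends each activity once per remote instance (shared-inbox style) rather than once per remote user. The helper names, user counts, and domains are hypothetical.

```python
def deliveries_per_activity(subscribers: dict, home: str) -> int:
    """Remote deliveries the home instance makes for one activity (one per instance)."""
    remote_instances = {inst for inst in subscribers.values() if inst != home}
    return len(remote_instances)


def local_users_served(subscribers: dict, home: str) -> int:
    """Subscribers whose feeds the home instance has to serve itself."""
    return sum(1 for inst in subscribers.values() if inst == home)


HOME = "big.example"

# Before: all 1000 subscribers live on the overloaded instance.
before = {f"user{i}": HOME for i in range(1000)}

# After: half of them migrated to a new instance but kept their subscription.
after = {
    f"user{i}": (HOME if i < 500 else "new.example")
    for i in range(1000)
}

print(deliveries_per_activity(before, HOME), local_users_served(before, HOME))
print(deliveries_per_activity(after, HOME), local_users_served(after, HOME))
```

Under that assumption the overloaded server gains a single extra outgoing delivery per activity while the number of users it serves directly drops by half.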

As for custom setups, I don’t see why they wouldn’t be possible. The server software would simply have to be made to work that way. AFAIK ActivityPub, the standard, doesn’t have anything in it that would make federation all or nothing.

bionicjoey@lemmy.ca · 1 year ago

At the smallest scale, you could have a node with just one user, perhaps that user creates a community or two. But this means that that instance will ONLY EVER store the subs of that one user

Honestly, once Lemmy becomes a bit more mature and stable, I will probably end up doing this. Self-hosting seems like a great way to fully break any dependence on external actors.