Anna's Archive is looking for volunteers to run mirrors

Andromxda 🇺🇦🇵🇸🇹🇼@lemmy.dbzer0.com · 8 months ago

maxprime@lemmy.ml · 8 months ago

For anyone wanting to contribute but on a smaller and more feasible scale, you can help distribute their database using torrents.

https://annas-archive.org/torrents

empireOfLove2@lemmy.dbzer0.com · edit-2 8 months ago

I know the last time this came up there was a lot of user resistance to the torrent scheme. I’d be willing to seed 200-500gb but having minimum torrent archive sizes of like 1.5TB and larger really limits the number of people willing to give up that storage, as well as defeats a lot of the resiliency of torrents with how bloody long it takes to get a complete copy. I know that 1.5TB takes a massive chunk out of my already pretty full NAS, and I passed on seeding the first time for that reason.

It feels like they didn’t really subdivide the database as much as they should have…

maxprime@lemmy.ml · 8 months ago

There are plenty of small torrents. Use the torrent generator and tell the script how much space you have and it will give you the “best” (least seeded) torrents whose sum is the size you give it. It doesn’t have to be big, even a few GB is suitable for some smaller torrents.
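To make the idea concrete, here is a rough sketch of how a "least seeded first, within a space budget" picker could work. This is not the actual Anna's Archive generator; the `Torrent` fields, the `pick_torrents` helper, and the sample data are all made up for illustration, and the greedy strategy is just one simple way to do it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Torrent:
    # Hypothetical record; the real torrent list may expose different fields.
    name: str
    size_gb: float
    seeders: int


def pick_torrents(torrents, budget_gb):
    """Greedy pick: walk torrents from least to most seeded and take
    every one that still fits in the remaining space budget."""
    chosen = []
    remaining = budget_gb
    for t in sorted(torrents, key=lambda t: (t.seeders, t.size_gb)):
        if t.size_gb <= remaining:
            chosen.append(t)
            remaining -= t.size_gb
    return chosen


# Made-up sample data, not real seeder counts.
sample = [
    Torrent("a", 1500, 2),
    Torrent("b", 300, 3),
    Torrent("c", 40, 12),
    Torrent("d", 120, 1),
    Torrent("e", 5, 4),
]

picked = pick_torrents(sample, budget_gb=500)
print([t.name for t in picked], sum(t.size_gb for t in picked))
```

With a 500 GB budget, the 1.5 TB torrent is skipped even though it is poorly seeded, which is the trade-off being discussed here: a small budget can only ever help the torrents that fit in it.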

empireOfLove2@lemmy.dbzer0.com · edit-2 8 months ago

Almost all the small torrents that I see pop up are already relatively well seeded (~10 seeders) though, which reinforces the fact that A. the torrents most desperately needing seeders are the older, largest ones and B. large torrents don’t attract seeders because of unreasonable space requirements.

Admittedly, newer torrents seem to be split into 300gb or less pieces, which is good, but there’s still a lot of monster torrents in that list.

GravitySpoiled@lemmy.ml · 8 months ago

Thx.

Do you know how useful it is to host such a torrent? Who is accessing the content via that torrent?

maxprime@lemmy.ml · 8 months ago

Anyone who wants to. I think a lot of LLM trainers access them.

GravitySpoiled@lemmy.ml · 8 months ago

Doesn’t sound like I should host some of it. I’d be more down to host it for end users.

umbrella@lemmy.ml · 8 months ago

how big is the database?

books can’t be that big, but i’m guessing the selection is simply huge?

xrtxn@lemmy.sdf.org · 8 months ago

The selection is literally all books that can be found on the internet.

tsonfeir@lemm.ee · 8 months ago

So how big is that?

Index@feddit.nl · 8 months ago

According to their total dataset size excluding duplicates, over 900 TB

rufus · 8 months ago

Sure, that’s a bit more than $65,000 per year with Backblaze.
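For anyone wanting to check the math, a back-of-the-envelope version is below. The $6 per TB per month rate is an assumption (roughly Backblaze B2's list price for storage around the time of this thread, ignoring egress and API fees), and `yearly_storage_cost` is just a made-up helper name.

```python
# Assumed storage-only price; check the provider's current pricing page.
PRICE_PER_TB_MONTH_USD = 6.0


def yearly_storage_cost(size_tb, price_per_tb_month=PRICE_PER_TB_MONTH_USD):
    """Storage-only cost per year in USD (no egress, no API fees)."""
    return size_tb * price_per_tb_month * 12


for size_tb in (900, 913.1):
    print(f"{size_tb} TB -> ${yearly_storage_cost(size_tb):,.0f} per year")
```

Under that assumed rate, 900 TB comes out a little under $65k and the 913.1 TB total quoted elsewhere in this thread comes out at about $65,700, so the estimate is in the right ballpark.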

tsonfeir@lemm.ee · 8 months ago

Shit, my synology has more than that… alas, it is full of movie “archives”

state_electrician · 8 months ago

You run a petabyte Synology at home?

tsonfeir@lemm.ee · 8 months ago

Well, it’s not just a single synology, it’s got a bunch of expansion units, and there are multiple host machines.

dutchkimble@lemy.lol · 8 months ago

I’m guessing you’re talking GBs?

tsonfeir@lemm.ee · 8 months ago

Nope.

dutchkimble@lemy.lol · 8 months ago

That’s awesome - how many drives and of what sizes do you have? Also, why Synology instead of a higher-end, enterprise-grade solution at this point?

FigMcLargeHuge@sh.itjust.works · edit-2 8 months ago

They put a link in with the total…

Total (excluding duplicates): 133,708,037 files, 913.1 TB
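Those two numbers also answer the "books can’t be that big" question: dividing one by the other gives the average file size. A quick sketch, assuming the site reports decimal terabytes (1 TB = 10^12 bytes), which is a guess on my part:

```python
TOTAL_FILES = 133_708_037   # from the totals quoted above
TOTAL_TB = 913.1            # from the totals quoted above
BYTES_PER_TB = 10**12       # assumption: decimal TB, not TiB

# Average size per file, in megabytes (10^6 bytes).
avg_mb = TOTAL_TB * BYTES_PER_TB / TOTAL_FILES / 10**6
print(f"average file size: {avg_mb:.2f} MB")
```

That works out to just under 7 MB per file on average, so individual books really are small; it is the roughly 134 million files that add up.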

umbrella@lemmy.ml · edit-2 8 months ago

wait what? how expensive is it to buy and run? is it practical at all, what are the common snags? always wanted to get into doing some archiving.

tsonfeir@lemm.ee · 8 months ago

It’s an investment. It’s like the price of a small car. But it was built over time, so not like one lump sum.

Originally, it was to have easier access to my already insane Blu-ray collection. But I started getting discs from Redbox, rental stores, libraries, etc. They are full rips, not that compressed PB stuff. Now there are like 3000 movies and fuck knows how many TV shows.

A lot of my effort was to have the best release available. Or, have things that got canceled. Like the Simpsons episode with MJ, which is unavailable to stream.

Snags… well, Synology is sooo easy. Once you figure out how you want your drives set up, there’s nothing to it.

Whatever you do, always have redundant drives. Yes, you lose space, but eventually one of them is gonna die and you don’t want to lose data.

redcalcium@lemmy.institute · 8 months ago

You should write a will instructing your family to send those disks to the Internet Archive for preservation if something happens to you.

AmbiguousProps@lemmy.today · 8 months ago

Correct me if I’m wrong, but they only index shadow libraries and do not host any files themselves (unless you count the torrents). So, you don’t need 900+ TB of storage to create a mirror.

Pussista@sh.itjust.works · 8 months ago

I imagine a couple of terabytes at the very least, though, I could be underestimating how many books have got deDRMed so far.

tsonfeir@lemm.ee · 8 months ago

Apparently it’s 900TB

Pussista@sh.itjust.works · 8 months ago

Girl, what? No wonder they’re having trouble hosting their archive. Does Anna’s Archive host copyrighted content as well or is all that copyleft?

redcalcium@lemmy.institute · 8 months ago

They host academic papers and books, most of which are copyrighted. They recently got in trouble for scraping a book metadata service to generate a list of books that haven’t been archived yet: https://torrentfreak.com/lawsuit-accuses-annas-archive-of-hacking-worldcat-stealing-2-2-tb-data-240207/

Pussista@sh.itjust.works · 8 months ago

Is hosting all that stuff even legal? I mean, they’re not making any money off of it, but they’re still a “piracy” hub. How have they survived this long?

AmbiguousProps@lemmy.today · edit-2 8 months ago

They index, not host, no? (Unless you count the torrents, which are distributed)

smnwcj@fedia.io · 8 months ago

The archive includes copyrighted works. Often multiple copies of each work, across different formats.

FreudianCafe@lemmy.ml · 8 months ago

I guess more than 5?

spiderman@ani.social · edit-2 8 months ago

bigger than zlib or project Gutenberg?

redcalcium@lemmy.institute · 8 months ago

It is huge! They claimed to have preserved about 5% of the world’s books.

umbrella@lemmy.ml · 8 months ago

oh i actually thought it was way more! there wasn’t a single book i wanted (or even thought to look up) that i didn’t actually find in there.

HeartyOfGlass@lemm.ee · 8 months ago

Could anyone broad-stroke the security requirements for something like this? Looks like they’ll pay for hosting up to a certain amount, and between that and a pipeline to keep the mirror updated I’d think it wouldn’t be tough to get one up and running.

Just looking for theory - what are the logistics behind keeping a mirror like this secure?

thanksforallthefish@literature.cafe · edit-2 8 months ago

Could be worth asking on selfhosted (how do I link a sub on Lemmy?). They probably have more relevant experience with this sort of thing.

Edit

Does this work ?

https://lemmy.world/c/selfhosted

can@sh.itjust.works · 8 months ago

!selfhosted@lemmy.world might work for more people.

rufus · edit-2 8 months ago

!datahoarder@lemmy.ml

Is probably more suitable. I’d be interested in the total size, though.

catloaf@lemm.ee · 8 months ago

900 TB, according to other comments here.

Illecors@lemmy.cafe · 8 months ago

Is it an all-or-nothing sort of deal?

catloaf@lemm.ee · 8 months ago

There are partial torrents, also according to the other comments.

Spunky Monkey@lemm.ee · 8 months ago

It does. 😉

obviouspornalt@lemmynsfw.com · 8 months ago

They outline it pretty well here:

https://annas-blog.org/how-to-run-a-shadow-library.html

tsonfeir@lemm.ee · 8 months ago

This is a fascinating read

Vigilante@lemmy.today · 8 months ago

Also link any ways to donate if they’re accepting that.

Andromxda 🇺🇦🇵🇸🇹🇼@lemmy.dbzer0.com · 8 months ago

https://annas-archive.org/donate

matcha_addict@lemy.lol · 8 months ago

I had no idea about this project. Is it like a better search engine for libgen etc?

weirdo_from_space@sh.itjust.works · 8 months ago

It searches through the Libgen mirrors and Z-Library, and has its own mirrors of the files they serve on top of that. I think it was created as a response to Z-Library’s domain getting seized but I could be wrong.

Andromxda 🇺🇦🇵🇸🇹🇼@lemmy.dbzer0.com · 8 months ago

It has way more content than Libgen