ancuuiqter@lemmy.world to

Piracy: ꜱᴀɪʟ ᴛʜᴇ ʜɪɢʜ ꜱᴇᴀꜱ@lemmy.dbzer0.comEnglish · 3 years ago

Copyright lawsuits against Meta and OpenAI mention shadow libraries, including Library Genesis, as sources of training data

archive.md

cross-posted from: https://lemmy.world/post/1330512

Below are direct quotes from the filings.

OpenAI

As noted in Paragraph 32, supra, the OpenAI Books2 dataset can be estimated to contain about 294,000 titles. The only “internet-based books corpora” that have ever offered that much material are notorious “shadow library” websites like Library Genesis (aka LibGen), Z-Library (aka B-ok), Sci-Hub, and Bibliotik. The books aggregated by these websites have also been available in bulk via torrent systems. These flagrantly illegal shadow libraries have long been of interest to the AI-training community: for instance, an AI training dataset published in December 2020 by EleutherAI called “Books3” includes a recreation of the Bibliotik collection and contains nearly 200,000 books. On information and belief, the OpenAI Books2 dataset includes books copied from these “shadow libraries,” because those are the most sources of trainable books most similar in nature and size to OpenAI’s description of Books2.

Meta

Bibliotik is one of a number of notorious “shadow library” websites that also includes Library Genesis (aka LibGen), Z-Library (aka B-ok), and Sci-Hub. The books and other materials aggregated by these websites have also been available in bulk via torrent systems. These shadow libraries have long been of interest to the AI-training community because of the large quantity of copyrighted material they host. For that reason, these shadow libraries are also flagrantly illegal.

This article from Ars Technica covers a few more details. Filings are viewable at the law firm’s site here.

Chat

SinJab0n@mujico.org · 3 years ago
I agree with u, it should be free to every PERSON who wants it.

As I said before, that's the fundamental difference between individuals and a company stealing.
- Vendetta9076@sh.itjust.works · 3 years ago
  We don't agree. It's not stealing, and companies should have access to the same free information.
