A group of authors filed a lawsuit against Meta, alleging the unlawful use of copyrighted material in developing its Llama 1 and Llama 2 large language models....
Fair use covers research, but creating a training database for your commercial product is distinctly different from research. They’re not publishing scientific papers, along with their data, which others can verify; they are developing a commercial product for profit. Even compared to traditional R&D this is markedly different, as they aren’t building a prototype - the test version will eventually become the finished product.
The way fair use works is that a judge first decides whether it fits into one of the categories - news, education, research, criticism, or comment. This does not really fit into the category of “research”, because it isn’t research, it’s the final product in an interim stage. However, even if it were considered research, the next step in fair use is the nature, in particular whether it is commercial. AI is highly commercial.
AI should not even be classified in a fair use category, but even if it were, it should not be granted any exemption because of how commercial it is.
They use other people’s work to profit. They should pay for it.
Facebook steals the data of individuals. They should pay for that, too. We don’t exchange our data for access to their website (or for access to some 3rd party Facebook pays to put a pixel on). The website is provided free of charge, and they try to shoehorn another transaction into the fine print of the terms and conditions, where the user gives up their data free of charge. It is not proportionate, and the user’s data is taken without proper consideration (i.e. payment, in terms of the core principles of contract law).
Frankly, it is unsurprising that an entity like Facebook, which so egregiously breaks the law and abuses the rights of every human being who uses the internet, would try to abuse content creators in such a fashion. Their abuse needs to be stopped, in all forms, and they should be made to pay for all of it.
They’re not publishing scientific papers, along with their data, which others can verify;
Not that I think this is really relevant here but I’m pretty sure Meta has published scientific papers on Llama and the Llama 1 & 2 models are open and accessible to anyone.
No, that is relevant. However, I would still argue that a paper without enough data to replicate their work (i.e. releasing the code of their LLM) isn’t really anything that should qualify as research. The whole point of academia is that someone else verifies your work - or rather, they try to prove you wrong.
They have released it on GitHub. The code is only about 500 lines. But releasing the model is arguably more important, because that sort of compute is not affordable to any mortals.
Yeah I mean what they’ve released is essentially the design of the battery and starter system, without the design of the actual motor. You can’t replicate their product and prove their work with what they’ve published.
I said fair use covers news, education, research, criticism, or comment.
for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research
Then I said the next thing considered is whether it is commercial.
In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include—
(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes
I didn’t cover everything in the law, I just covered the relevant points in a way that could be easily understood and related to the subject at hand.
My point is that the copying AI does isn’t really research, but even if it were considered research it is absolutely commercial and thus should not have a fair use exemption.
You need to read this carefully. It’s a statute. It means exactly what it says.
purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research
Such as means that these are examples. This is not a complete list.
the factors to be considered shall include
All of these factors must be considered. It does not mean that other factors cannot be considered. These are not categories.
A commercial purpose does not rule out a finding of fair use (and vice versa). It must be considered and that is all.
I don’t think that Meta’s use can be classed as commercial. Presumably, they do hope that the research budget will pay off eventually. But what must be considered is the particular copying in question. Llama 2’s license looks to me fairly non-commercial.
Eventually, fair use derives from the constitution. Copyright is a limitation on the freedom of the press (and of speech). But it cannot completely do away with these freedoms. The examples given in the statute here could not be banned completely even if they were not mentioned.
I’ve seen a number of far-right commenters admit that this money grab would harm AI development (a “useful Art”). I think mostly these commenters hold some far-right ideology à la Ayn Rand that values property over society, but some may just be selfish and believe that they would personally benefit. Either way, it’s straight up anti-constitutional.
Here’s the summary for the Wikipedia article you mentioned in your comment:
The Copyright Clause (also known as the Intellectual Property Clause, Copyright and Patent Clause, or the Progress Clause) describes an enumerated power listed in the United States Constitution (Article I, Section 8, Clause 8). The clause, which is the basis of copyright and patent laws in the United States, states that: [the United States Congress shall have power] To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.
Such as means that these are examples. This is not a complete list.
AI developers have explicitly invoked the research exemption. That is why I focused on that. I disagree that what they do is “research” for the reasons I gave previously. Bringing up the fact there are other exemptions is beside the point - they aren’t claiming any other exemption!
All of these factors must be considered. It does not mean that other factors cannot be considered. These are not categories.
Sure, but I never said that commerciality was the only thing that should be considered. My claim here is simply that it is so overwhelmingly commercial in nature that it overrides anything else and thus they should not be awarded the privilege of an exemption.
A commercial purpose does not rule out a finding of fair use (and vice versa).
A commercial purpose might not rule out of a finding of fair use. That does not mean it cannot rule out such a finding. All factors must be considered, but any one factor can outweigh the others.
I never said it was an exclusive category, I just brought it up as the most significant factor - one which is not reasonably overruled by any of the others in this circumstance. In fact, every one of those arguably fails. To give detail:
1. The copying is commercial in nature. They sell AI services. It’s offered very cheaply right now, even for free for limited personal use, but eventually that will change as their demand for profit grows.
2. The nature of the copied work is varied and includes all kinds of work, commercial and non-commercial. The copying is pervasive.
3. The whole work has been copied into the training database. Significant portions of the work can be, and have been, reproduced by the finished product, in spite of the finished product allegedly not containing the original work in its database. Furthermore, even if a human genuinely believes they aren’t copying something they read before, that does not mean they are innocent of copyright infringement - it is the similarity of the two works that is the determining factor.
4. AI work is already flooding the market and pushing out original creators. Children’s books are one area where this is happening extensively - not only does this make it harder for genuine authors to get a break in the market, but they’re effectively training children to think AI work is normal. It’s not hard to see us headed to a future where people think AI is “real” and original work is “fake”, simply by volume.
I will admit, not all of those arguments are very strong (particularly 4.). However 1. is the strongest and I think overrides any argument the other way for any other.
I don’t think that Meta’s use can be classed as commercial. Presumably, they do hope that the research budget will pay off eventually.
Those two statements contradict one another. Of course they want it to be commercial eventually - or, rather, they want to eventually turn a profit. Hell, AI is already being used in a commercial manner: if you want to make significant or non-personal use of AI systems currently on the market, you have to pay for it.
Eventually, fair use derives from the constitution.
Setting aside the fact that AI extends far beyond the borders of the US and its constitution, fair use and copyright are derived from copyright law, which is written by Congress. The Constitution grants Congress the right to write such laws, but no one is “invoking the Constitution” when they enforce copyright or claim fair use. The Constitution gives permission, but the law forms the definition.
AI is not simply a “useful Art”. It is a commercial venture that exploits original work without duly compensating the authors of said work. Congress has a greater duty to protect those original authors than it does a business that seeks to exploit their work. I say this as someone who has never really made much of anything original myself. I play a bit of music, but don’t compose and just do covers. I probably (lol limewire definitely) infringe on copyright - but I do so exclusively in a non-commercial manner.
Blurting out “far-right” is borderline a personal insult - one which is laughably far from the mark when addressed towards me - and points to you clutching at straws to cling to a frivolous argument.
I now feel the need to ask, why do you so passionately defend AI businesses here? Why do you support them?
Are you that infatuated with the novelty of their product that you have let go of objectivity?
I also have to emphasise again that I’m a little disgusted that you made this political. You’ve tried to build an argument that “it is a Constitutional right” to infringe copyright in order to have AI tools, and you’re implying that anyone who opposes that idea is some kind of far-right nutjob. I hadn’t even heard of Ayn Rand before you mentioned her, but have you actually read her work, or did you just watch the Atlas Shrugged movie and form your opinions from internet memes?
I’d actually probably agree with you about AI - if it was non-commercial in nature and truly for the benefit of the people. As it is, I think you are blinded by the sheen of a new toy, without realising it’s coated in lead paint.
A commercial purpose might not rule out of a finding of fair use.
ARRRRG I spent so long reviewing this comment, over and over and over again, and still there were words wrong. I’m not editing it though, I want the comment to stay clean.
(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole
(4) the effect of the use upon the potential market for or value of the copyrighted work.
The main argument for this being fair use is both that a single work of copyright bears little to no relationship to the end product, and that the model itself does not affect the market for - or value of - the copyrighted work (note: the market for additional works produced is not what is in question here, it is the market for the work that has been copied).
The main argument for this being fair use is both that a single work of copyright bears little to no relationship to the end product
It bears relationship to the end product when the end product reproduces the original work.
that the model itself does not affect the market for - or value of - the copyrighted work
Given that AI is poised to take over the position of original writers and flood the market with fake work, copying not only their words but their very style, I’d argue it does affect the value of existing work. With children’s books already being heavily written by AI, it seems quite likely that we will before too long get to the point where people expect things to be written by AI, thus devaluing true creative and original work.
I appreciate your enthusiasm here, but the law (and precedent reading of the law) simply does not bear out a clear interpretation like you’re suggesting.
It bears relationship to the end product when the end product reproduces the original work.
This is not how copyright has been applied when speaking of other machine learning processes using logical regression that is considered fair use, as in Text and Data Mining classifications(TDM) (proposed class 7(a) and 7(b) (page 102) in Recommendation of the Register of Copyrights 2021). The model itself is simply a very large regression model that has created metadata analysis from unstructured data sources. When determining whether an LLM fits into this fair use category, they will look at what the model is and how it is created, not at whether it can be prompted to recreate a similar work. To quote from Comments in Response of Notice of Inquiry on the matter:
Understanding the process of training foundation models is relevant to the generative AI systems’ fair use defenses because the scope of copyright protection does not extend to “statistical information” such as “word frequencies, syntactic patterns, and thematic markers.” Processing in-copyright works to extract “information about the original [work]” does not infringe because it does not “replicat[e] protected expression.”
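To make “statistical information” a bit more concrete, here is a toy Python sketch. It is an illustration only: it is not taken from the comments quoted above, and it is nothing like how an LLM is actually trained. It counts word frequencies with the standard library's `collections.Counter`; the function name `word_frequencies` and the two sample sentences are made up for the example. The point is that two different orderings of the same words give identical counts, so the counts on their own do not preserve the original sequence of words.

```python
from collections import Counter
import re


def word_frequencies(text: str) -> Counter:
    """Count how often each lowercase word appears in the text (hypothetical helper)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Two made-up sentences that use the same words in a different order.
original = "the cat chased the dog"
reordered = "the dog chased the cat"

freq_original = word_frequencies(original)
freq_reordered = word_frequencies(reordered)

# The sentences differ, yet their frequency tables compare equal,
# because counting discards word order.
print(original == reordered)
print(freq_original == freq_reordered)
```

A real model learns far richer statistics than a word count, which is part of why the question of reproducing training data comes up at all.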
Granted, what is novel about this particular case (LLMs generally) is their apparent ability to re-construct substantially similar works from the same overall process of TDM. Acknowledged, but to borrow again from the same comments as above:
Yet, in limited situations, Generative AI models do copy the training data. So unlike prior copy-reliant technologies that courts have held are fair use, it is impossible to say categorically that inputs and outputs of Generative AI will always be fair use. We note in addition that some have argued that the ability of Generative AI to produce artifacts that could pass for human expression and the potential scale of such production may have implications not seen in previous non-expressive use cases. The difficulty with such arguments is that the harm asserted does not flow from the communication of protected expression to any human audience.
Basically, they are asserting that applying copyright to this use, which falls outside of its explicit scope, would not prevent the same harm being caused by the same technology created without the use of the copyrighted works. Any work sufficiently described in publicly available text data could be reconstructed by a sufficiently weighted regression model and the correct prompting. For example, if I described a desired output in enough detail in my input to the model, the output could be substantially similar to a protected work, regardless of its lack of representation in the training data.
I happen to agree that these AI models represent a threat to the work and livelihoods of real artists, and that the benefit as currently captured by billion-dollar companies is a substantial problem that must be addressed, but I simply do not think the application of copyright in this manner is appropriate (as it will prevent legitimate uses of the technology), nor do I think it would sufficiently prevent future consolidation of wealth through the use of these models.
Never mind my personal objections to copyright law on the basis of my worldview - I just don’t think copyright is the correct tool to use for the desired protection.
This is not how copyright has been applied when speaking of other machine learning processes using logical regression that is considered fair use, as in Text and Data Mining classifications(TDM) (proposed class 7(a) and 7(b) (page 102) in Recommendation of the Register of Copyrights 2021).
Your link is merely proposed recommendations. That is not legislation nor case law. Also, the sections on TDM that you reference clearly state (my emphasis):
for the purpose of scholarly research and teaching.
I think this is even more abundantly clear that the research exemption does not apply. AI “research” is in no way “scholarly”, it is commercial product development and thus does not align with fair use copyright exemptions.
It’s also not talking about building AI, but circumventing DRM in order to preserve art. They’re saying that there should be an exemption to the illegal practice of circumventing DRM in certain, limited circumstances. However, they’re still only suggesting this! So not only does this not apply to your argument, it isn’t even actually in force.
To put your other link into context, this also is not law, but comments from legal professors.
Understanding the process of training foundation models is relevant to the generative AI systems’ fair use defenses because the scope of copyright protection does not extend to “statistical information” such as “word frequencies, syntactic patterns, and thematic markers.” Processing in-copyright works to extract “information about the original [work]” does not infringe because it does not “replicat[e] protected expression.”
The flaw here is that the work isn’t processed in situ, it is copied into a training database, then processed. The processing may be fine, but the copying is illegal.
If they had a legitimate license to view the work for their purpose, and processed in situ, that might be different.
The difficulty with such arguments is that the harm asserted does not flow from the communication of protected expression to any human audience.
The argument here is that, while it sometimes infringes copyright, the harm it causes isn’t primarily from the infringing act. Not always, though that depends. If AI is used to pass off as someone else, then the AI manufacturer has built a tool that facilitates an illegal act, by copying the original work.
However, this, again, ignores the fact that the commercial enterprise has copied the data into their training database without duly compensating the rightsholder.
Critical to understanding whether this applies is to understand “use” in the first place. I would argue it’s even more important because it’s a threshold question in whether you even need to read 107.
17 U.S. Code § 106 - Exclusive rights in copyrighted works
Subject to sections 107 through 122, the owner of copyright under this title has the exclusive rights to do and to authorize any of the following:
(1) to reproduce the copyrighted work in copies or phonorecords;
(2) to prepare derivative works based upon the copyrighted work;
(3) to distribute copies or phonorecords of the copyrighted work to the public by sale or other transfer of ownership, or by rental, lease, or lending;
(4) in the case of literary, musical, dramatic, and choreographic works, pantomimes, and motion pictures and other audiovisual works, to perform the copyrighted work publicly;
(5) in the case of literary, musical, dramatic, and choreographic works, pantomimes, and pictorial, graphic, or sculptural works, including the individual images of a motion picture or other audiovisual work, to display the copyrighted work publicly; and
(6) in the case of sound recordings, to perform the copyrighted work publicly by means of a digital audio transmission.
Copyright protects just what it sounds like: the right to “copy” or reproduce a work along the lines of the examples given above. It is not clear that use in training AI falls into any of these categories. The question mainly relates to items 1 and 2.
If you read through the court filings against OpenAI and Stability AI, much of the argument is based around trying to make a claim under case 1. If you put a model into an output loop you can get it to reproduce small sections of training data that include passages from copyrighted works, although of course nowhere near the full corpus can be retrieved because the model doesn’t contain anything close to a full data set - the models are much too small and that’s also not how the transformer architecture works. But in some cases, models can preserve and output brief sections of text or distorted images that appear highly similar to at least portions of training data. Even so, it’s not clear that this is protected under copyright law because they are small snippets that are not substitutes for the original work, and don’t affect the market for it.
Case 2 would be relevant if an LLM were classified as a derivative work. But LLMs are also not derivative works in the conventional definition, which is things like translated or abridged versions, or different musical arrangements in the case of music.
For these reasons, it is extremely unclear whether copyright protections are even invoked, because the nature of the use in model training does not clearly fall under any of the enumerated rights. This is not the first time this has happened, either - the DMCA of 1998 amended the Copyright Act of 1976 to add cases relating to online music distribution as the previous copyright definitions did not clearly address online filesharing.
There are a lot of strong opinions about the ethics of training models and many people are firm believers that either it should or shouldn’t be allowed. But the legal question is much more hazy, because AI model training was not contemplated even in the DMCA. I’m watching these cases with interest because I don’t think the law is at all settled here. My personal view is that an act of Congress would be necessary to establish whether use of copyrighted works in training data, even for purposes of developing a commercial product, should be one of the enumerated protections of copyright. Under current law, I’m not certain that it is.
(1) to reproduce the copyrighted work in copies or phonorecords
The works are copied in their entirety and reproduced in the training database. AI businesses do not deny this is copying, but instead claim it is research and thus has a fair use exemption.
I argue it is not research, but product development - and furthermore, unlike traditional R&D, it is not some prototype that is different and separate from the commercial product. The prototype is the commercial product.
(2) to prepare derivative works based upon the copyrighted work
AI can and has reproduced significant portions of copyrighted work, even in spite of the fact that the finished product allegedly does not include the work in its database (it just read the training database).
Furthermore, even if a human genuinely and honestly believes they’re writing something original, that does not matter when they reproduce work that they have read before. What determines copyright infringement is the similarity of the two works.
If you read through the court filings against OpenAI and Stability AI, much of the argument is based around trying to make a claim under case 1.
The position that I take is that the arguments made against OpenAI and Stability AI in court are not complete. They’re not quite good enough. However, that doesn’t mean there isn’t a valid argument that is good enough. I just hope we don’t get a ruling in favour of AI businesses simply because the people challenging them didn’t employ the right ammunition.
With regards to Case 2, I refer back to my comment about the similarity of the work. The argument isn’t that the LLM itself is an infringement of copyright, but that the LLM, as designed by the business, infringes copyright in the same way a human would.
I definitely agree it is all extremely unclear. However, I maintain that the textual definition of the law absolutely still encompasses the feeling that people’s work is being ripped off for a commercial venture. Because it is so commercial, original authors are being harmed as they will not see any benefit from the commercial profits.
I would also like to point you to my other comment, which I put a lot of time into and where I expanded on many other points (link to your instance’s version): https://lemmy.world/comment/6706240
The works are copied in their entirety and reproduced in the training database. AI businesses do not deny this is copying, but instead claim it is research and thus has a fair use exemption.
The copying of the data is not, by itself, infringement. It depends on the use and purpose of the copied data, and the defense argues that training a model against the data is fair use under TDM use-cases.
AI can and has reproduced significant portions of copyrighted work, even in spite of the fact that the finished product allegedly does not include the work in its database (it just read the training database).
The model does not have a ‘database’, it is a series of transform nodes weighted against unstructured data. The transformation of the copyrighted works into a weighted regression model is what is being argued is fair use.
Furthermore, even if a human genuinely and honestly believes they’re writing something original, that does not matter when they reproduce work that they have read before.
Yup, and it isn’t the act of that human reading a copyrighted work that is considered as infringement, it is the creation of the work that is substantially similar. In the same analogy, it wouldn’t be the creation of the AI model that is the infringement, but each act of creation thereafter that is substantially similar to a copyrighted work. But this comes with a bunch of other problems for the plaintiffs, and would be a losing case without merit.
The position that I take is that the arguments made against OpenAI and Stability AI in court are not complete
The argument isn’t that the LLM itself is an infringement of copyright, but that the LLM, as designed by the business, infringes copyright in the same way a human would.
Trying really hard not to come off as rude, but there’s a good reason why this isn’t the argument being put forward in the lawsuits. If this was their argument, the LLM could be considered a commissioned agent, placing the liability on the agent commissioning the work (e.g. the human prompting the work) - not OpenAI or Stability - in much the same way a company is held responsible for the work produced by an employee.
I really do understand the anger and frustration apparent in these comments, but I would really like to encourage you to learn a bit more about the basis for these cases before spending substantial effort writing long responses.
The copying of the data is not, by itself, infringement.
Copyright is absolute. The rightsholder has complete and total right to dictate how it is copied. Thus, any unauthorised copying is copyright infringement. However, fair use gives exemption to certain types of copying. The copyright is still being infringed, because the rightsholder’s absolute rights are being circumvented, however the penalty is not awarded because of fair use.
This is all just pedantry, though, and has no practical significance. Saying “fair use means copyright has not been infringed” doesn’t change anything.
it is a series of transform nodes weighted against unstructured data.
That’s a database. Or perhaps rather some kind of 3D array - which could just be considered an advanced form of database. But yeah, you’re right here, you win this pedantry round lol. 1-1.
it wouldn’t be the creation of the AI model that is the infringement, but each act of creation thereafter that is substantially similar to a copyrighted work. But this comes with a bunch of other problems for the plaintiffs, and would be a losing case without merit.
Yeah I don’t want to go down the avenue of suing the AI itself for infringement. However…[1][2][3]
Trying really hard not to come off as rude
You’re not coming off as rude at all with what you’ve said, in fact I welcome and appreciate your rebuttals.
I really do understand the anger and frustration apparent in these comments, but I would really like to encourage you to learn a bit more about the basis for these cases before spending substantial effort writing long responses.
You say that as if I haven’t enjoyed fleshing out the ideas and sharing them. By the way, right now I’m sharing with you lemmy’s hidden citation feature :o)
Although, I was much happier replying to you before I just saw the downvotes you’ve apparently given me across the board. That’s a bit poor behaviour on your part, you shouldn’t downvote just because you disagree - and you can’t even say that I’m wrong as a justification when the whole thing is being heavily debated and adjudicated over whether it is right or wrong.
I thought we were engaging in a positive manner, but apparently you’ve been spitting in my face.
but there’s a good reason why this isn’t the argument being put forward in the lawsuits.
The LLM absolutely could be considered an agent, but the way it acts is merely prompted by the user. The actual behaviour is dictated by the organisation that built it. In any case, this is only my backup argument if you even consider the initial copying to be research - which it isn’t.
Copyright is absolute. The rightsholder has complete and total right to dictate how it is copied.
Really and truly, this is not how this works. The exemptions granted by the office of the registrar are granting an exemption to copyright claims against fair uses. It isn’t talking about whether the claim can be awarded damages, it’s talking about the claim being exempt in entirety. You can think about copyright as an exemption to the first amendment right to free speech, and the exemption to copyright as describing where that ‘right’ does not apply. Copyright holders do not get to control the use of their work where fair use has been determined by the registrar, which is reconsidered every 3 years.
This is all just pedantry, though, and has no practical significance. Saying “fair use means copyright has not been infringed” doesn’t change anything.
True enough, but it seems like it’s important for your understanding in how copyright works.
Or perhaps rather some kind of 3D array - which could just be considered an advanced form of database. But yeah, you’re right here, you win this pedantry round lol. 1-1.
I wasn’t being pedantic, that distinction is important for how copyright is conceptualized. The AI model is the thing being considered for infringement, so it’s important to note that the works being claimed within it do not exist as such within the model. The ‘3-d array’ does not contain copyrighted works. You can think of it as a large metadata file, describing how to construct language as analyzed through the training data. The nature and purpose of the ‘work’ is night-and-day different from the works being claimed, and ‘database’ is a clear misrepresentation (possibly even intentionally so) of what it is.
Yeah I don’t want to go down the avenue of suing the AI itself for infringement.
That was exactly what you pivoted to in your comment here, I’m not sure why you’re now saying you don’t want to go down that avenue. I’m confused what you’re arguing at this point.
Although, I was much happier replying to you before I just saw the downvotes you’ve apparently given me across the board. That’s a bit poor behaviour on your part, you shouldn’t downvote just because you disagree - and you can’t even say that I’m wrong as a justification when the whole thing is being heavily debated and adjudicated over whether it is right or wrong.
I’ve down-voted your comments because they contain inaccuracies and could be misleading to others. You shouldn’t let my grading of your comments reflect my attitude towards you; I’m sure you’re a fine individual. Downvotes don’t mean anything on Lemmy anyway, and I’m not sure ‘spitting in your face’ is a fair or accurate description, but I don’t want to invalidate your feelings, so I apologize for making you feel that way as that wasn’t my intent.
Fair use covers research, but creating a training database for your commercial product is distinctly different from research. They’re not publishing scientific papers, along with their data, which others can verify;
Since when is there a legal requirement to publish the results of your research?
They use other peoples’ work to profit. They should pay for it.
Sorry but that’s just not how the world works. A big part of it is just plain practicality - how could you possibly find out who to pay? If I wanted to pay you one cent for the right to learn from things you’ve written on the fediverse, how would I even contact you? Or even find out who you are since I assume TWeaK isn’t your real name. And how would I get the money to you?
Like it or not, a lot of value created doesn’t get paid for. That’s just the way the world works… and among other things, Fair Use codifies that fact into law.
Facebook steals the data of individuals. They should pay for that, too. We don’t exchange our data for access to their website (or for access to some 3rd party Facebook pays to put a pixel on)
Facebook isn’t “stealing” that data. Third party websites voluntarily put tracking pixels on their site with full awareness that visitors are going to be tracked. That’s why they do it - the website operator is given access to all of the data Facebook picks up. If you have a complaint, it should primarily be with the website operator, especially if they don’t ask the user for permission first (a lot of sites ask these days, and I always say no personally. I also run a browser extension that blocks it on sites that don’t ask).
My answer to both your comments is that just because a lot of people get away with breaking the law and abusing people’s rights doesn’t mean it hasn’t happened and they can’t or shouldn’t be held to account.
It’s Amazon, I’m pretty sure they know who to pay for any book they want. They are all already on their platform, with payment information too. They don’t even have to ask the author where to send the money, they already know. They could do it whenever they want, they have the funds to do it, billions of dollars laying around.
They’ve just decided they’d rather have it for free and keep the money.
But don’t you dare think you could do the same, you should pay for copyrighted work, and you should buy it on one of Amazon’s services that sell you access to said copyrighted work. Fucking peasant…
Not that I think this is really relevant here but I’m pretty sure Meta has published scientific papers on Llama and the Llama 1 & 2 models are open and accessible to anyone.
No that is relevant, however I would still argue that a paper without enough data to replicate their work (ie releasing the code of their LLM) isn’t really anything that should qualify as research. The whole point of academia is that someone else verifies your work - or rather, they try to prove you wrong.
They have released it on GitHub. The code is only about 500 lines. But releasing the model is arguably more important because that sort of compute is not affordable to any mortals.
Yeah I mean what they’ve released is essentially the design of the battery and starter system, without the design of the actual motor. You can’t replicate their product and prove their work with what they’ve published.
That’s not at all how fair use works.
That is exactly how fair use works. Look up the legislation and quote where it says I’m wrong.
Sure. I mean, not sure why you wouldn’t just look it up yourself but ok. It takes like 60 seconds to look up and copy/paste.
Notwithstanding the provisions of sections 106 and 106A, the fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include— (1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work. The fact that a work is unpublished shall not itself bar a finding of fair use if such finding is made upon consideration of all the above factors.
So where does that say I’m wrong?
I said fair use covers news, education, research, criticism, or comment.
Then I said the next thing considered is whether it is commercial.
I didn’t cover everything in the law, I just covered the relevant points in a way that could be easily understood and related to the subject at hand.
My point is that the copying AI does isn’t really research, but even if it were considered research it is absolutely commercial and thus should not have a fair use exemption.
You need to read this carefully. It’s a statute. It means exactly what it says.
Such as means that these are examples. This is not a complete list.
All of these factors must be considered. It does not mean that other factors cannot be considered. These are not categories.
A commercial purpose does not rule out a finding of fair use (and vice versa). It must be considered and that is all.
I don’t think that Meta’s use can be classed as commercial. Presumably, they do hope that the research budget will pay off eventually. But what must be considered is the particular copying in question. Llama 2’s license looks to me fairly non-commercial.
Ultimately, fair use derives from the constitution. Copyright is a limitation on the freedom of the press (and of speech). But it cannot completely do away with these freedoms. The examples given in the statute here could not be banned completely even if they were not mentioned.
The US Constitution itself allows congress to create copyrights. Or more precisely, it empowers congress to promote the Progress of Science and useful Arts by creating copyrights. That’s another limitation.
I’ve seen a number of far-right commenters admit that this money grab would harm AI development (a “useful Art”). I think mostly these commenters hold some far-right ideology à la Ayn Rand that values property over society, but some may just be selfish and believe that they would personally benefit. Either way, it’s straight up anti-constitutional.
Here’s the summary for the wikipedia article you mentioned in your comment:
The Copyright Clause (also known as the Intellectual Property Clause, Copyright and Patent Clause, or the Progress Clause) describes an enumerated power listed in the United States Constitution (Article I, Section 8, Clause 8). The clause, which is the basis of copyright and patent laws in the United States, states that: [the United States Congress shall have power] To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.
AI developers have explicitly invoked the research exemption. That is why I focused on that. I disagree that what they do is “research” for the reasons I gave previously. Bringing up the fact there are other exemptions is beside the point - they aren’t claiming any other exemption!
Sure, but I never said that commerciality was the only thing that should be considered. My claim here is simply that it is so overwhelmingly commercial in nature that it overrides anything else and thus they should not be awarded the privilege of an exemption.
A commercial purpose might not rule out a finding of fair use. That does not mean it cannot rule out such a finding. All factors must be considered, but any one factor can outweigh the others.
I never said it was an exclusive category, I just brought it up as the most significant factor - one which is not reasonably overruled by any of the others in this circumstance. In fact, every one of those arguably fails. To give detail:
I will admit, not all of those arguments are very strong (particularly 4.). However 1. is the strongest and I think overrides any argument the other way for any other.
Those two statements contradict one another. Of course they want it to be commercial eventually - or, rather, they want to eventually turn a profit. Hell, AI is already being used in a commercial manner: if you want to make significant or non-personal use of AI systems currently on the market, you have to pay for it.
Setting aside the fact that AI extends far beyond the borders of the US and its constitution, fair use and copyright are derived from copyright law, which is written by Congress. The Constitution grants Congress the right to write such laws, but no one is “invoking the Constitution” when they enforce copyright or claim fair use. The Constitution gives permission, but the law forms the definition.
AI is not simply a “useful Art”. It is a commercial venture that exploits original work without duly compensating the authors of said work. Congress has a greater duty to protect those original authors than it does a business that seeks to exploit their work. I say this as someone who has never really made much of anything original myself. I play a bit of music, but don’t compose and just do covers. I probably (lol limewire definitely) infringe on copyright - but I do so exclusively in a non-commercial manner.
Blurting out “far-right” is borderline a personal insult - one which is laughably far from the mark when addressed towards me - and points to you clutching at straws to cling to a frivolous argument.
I now feel the need to ask, why do you so passionately defend AI businesses here? Why do you support them?
Are you that infatuated with the novelty of their product that you have let go of objectivity?
I also have to emphasise again that I’m a little disgusted that you made this political. You’ve tried to build an argument that “it is a Constitutional right” to infringe copyright in order to have AI tools, and you’re implying that anyone who opposes that idea is some kind of far-right nutjob. I hadn’t even heard of Ayn Rand before you mentioned her, but have you actually read her work, or did you just watch the Atlas Shrugged movie and form your opinions from internet memes?
I’d actually probably agree with you about AI - if it was non-commercial in nature and truly for the benefit of the people. As it is, I think you are blinded by the sheen of a new toy, without realising it’s coated in lead paint.
ARRRRG I spent so long reviewing this comment, over and over and over again, and still there were words wrong. I’m not editing it though, I want the comment to stay clean.
Pretty sure @General_Effort@lemmy.world is referring to this portion:
The main argument for this being fair use is both that a single work of copyright bears little to no relationship to the end product, and that the model itself does not affect the market for - or value of - the copyrighted work (note: the market for additional works produced is not what is in question here, it is the market for the work that has been copied).
Sorry for the double reply, but I did also expand further upon (3) and (4), and other aspects, in my latest reply to /u/General_Effort@lemmy.world (link to your instance’s version): https://midwest.social/comment/6225045
It bears relationship to the end product when the end product reproduces the original work.
Given that AI is poised to take over the position of original writers and flood the market with fake work, copying not only their words but their very style, I’d argue it does affect the value of existing work. With children’s books already being heavily written by AI, it seems quite likely that we will before too long get to the point where people expect things to be written by AI, thus devaluing true creative and original work.
I appreciate your enthusiasm here, but the law (and precedent reading of the law) simply does not bear out a clear interpretation like you’re suggesting.
This is not how copyright has been applied to other machine learning processes using logical regression that are considered fair use, as in the Text and Data Mining (TDM) classifications (proposed classes 7(a) and 7(b), page 102, in the Recommendation of the Register of Copyrights 2021). The model itself is simply a very large regression model that has created metadata analysis from unstructured data sources. When determining whether an LLM fits into this fair use category, they will look at what the model is and how it is created, not at whether it can be prompted to recreate a similar work. To quote from Comments in Response of Notice of Inquiry on the matter:
Granted, what is novel about this particular case (LLMs generally) is their apparent ability to re-construct substantially similar works from the same overall process of TDM. Acknowledged, but to borrow again from the same comments as above:
Basically, they are asserting that applying copyright to this use that falls outside of its explicit scope would not prevent the same harm caused by that same technology created without the use of the copyrighted works. Any work sufficiently described in publicly available text data could be reconstructed by a sufficiently weighted regression model and the correct prompting. E.g. - if I described a desired output sufficiently enough in my input to the model, the output could be substantially similar to a protected work, regardless of its lack of representation in the training data.
I happen to agree that these AI models represent a threat to the work and livelihoods of real artists, and that the benefit as currently captured by billion-dollar companies is a substantial problem that must be addressed, but I simply do not think the application of copyright in this manner is appropriate (as it will prevent legitimate uses of the technology), nor do I think it is sufficiently preventative of future consolidation of wealth by the use of these models.
Nevermind my personal objections to copyright law on the basis of my worldview - I just don’t think copyright is the correct tool to use for the desired protection.
Your link is merely proposed recommendations. That is not legislation nor case law. Also, the sections on TDM that you reference clearly state (my emphasis):
I think this is even more abundantly clear that the research exemption does not apply. AI “research” is in no way “scholarly”, it is commercial product development and thus does not align with fair use copyright exemptions.
It’s also not talking about building AI, but circumventing DRM in order to preserve art. They’re saying that there should be an exemption to the illegal practice of circumventing DRM in certain, limited circumstances. However, they’re still only suggesting this! So not only does this not apply to your argument, it isn’t even actually in force.
To put your other link into context, this also is not law, but comments from legal professors.
The flaw here is that the work isn’t processed in situ, it is copied into a training database, then processed. The processing may be fine, but the copying is illegal.
If they had a legitimate license to view the work for their purpose, and processed in situ, that might be different.
The argument here is that, while it sometimes infringes copyright, the harm it causes isn’t primarily from the infringing act. Not always, though that depends. If AI is used to pass off as someone else, then the AI manufacturer has built a tool that facilitates an illegal act, by copying the original work.
However, this, again, ignores the fact that the commercial enterprise has copied the data into their training database without duly compensating the rightsholder.
Critical to understanding whether this applies is to understand “use” in the first place. I would argue it’s even more important because it’s a threshold question in whether you even need to read 107.
17 U.S. Code § 106 - Exclusive rights in copyrighted works. Subject to sections 107 through 122, the owner of copyright under this title has the exclusive rights to do and to authorize any of the following: (1) to reproduce the copyrighted work in copies or phonorecords; (2) to prepare derivative works based upon the copyrighted work; (3) to distribute copies or phonorecords of the copyrighted work to the public by sale or other transfer of ownership, or by rental, lease, or lending; (4) in the case of literary, musical, dramatic, and choreographic works, pantomimes, and motion pictures and other audiovisual works, to perform the copyrighted work publicly; (5) in the case of literary, musical, dramatic, and choreographic works, pantomimes, and pictorial, graphic, or sculptural works, including the individual images of a motion picture or other audiovisual work, to display the copyrighted work publicly; and (6) in the case of sound recordings, to perform the copyrighted work publicly by means of a digital audio transmission.
Copyright protects just what it sounds like- the right to “copy” or reproduce a work along the examples given above. It is not clear that use in training AI falls into any of these categories. The question mainly relates to items 1 and 2.
If you read through the court filings against OpenAI and Stability AI, much of the argument is based around trying to make a claim under case 1. If you put a model into an output loop you can get it to reproduce small sections of training data that include passages from copyrighted works, although of course nowhere near the full corpus can be retrieved because the model doesn’t contain anything close to a full data set - the models are much too small and that’s also not how the transformer architecture works. But in some cases, models can preserve and output brief sections of text or distorted images that appear highly similar to at least portions of training data. Even so, it’s not clear that this is protected under copyright law because they are small snippets that are not substitutes for the original work, and don’t affect the market for it.
Case 2 would be relevant if an LLM were classified as a derivative work. But LLMs are also not derivative works in the conventional definition, which is things like translated or abridged versions, or different musical arrangements in the case of music.
For these reasons, it is extremely unclear whether copyright protections are even invoked, because the nature of the use in model training does not clearly fall under any of the enumerated rights. This is not the first time this has happened, either - the DMCA of 1998 amended the Copyright Act of 1976 to add cases relating to online music distribution as the previous copyright definitions did not clearly address online filesharing.
There are a lot of strong opinions about the ethics of training models and many people are firm believers that either it should or shouldn’t be allowed. But the legal question is much more hazy, because AI model training was not contemplated even in the DMCA. I’m watching these cases with interest because I don’t think the law is at all settled here. My personal view is that an act of congress would be necessary to establish whether use of copyrighted works in training data, even for purposes of developing a commercial product, should be one of the enumerated protections of copyright. Under current law, I’m not certain that it is.
This is an extremely good write-up, thank you for this.
The works are copied in their entirety and reproduced in the training database. AI businesses do not deny this is copying, but instead claim it is research and thus has a fair use exemption.
I argue it is not research, but product development - and furthermore, unlike traditional R&D, it is not some prototype that is different and separate from the commercial product. The prototype is the commercial product.
AI can and has reproduced significant portions of copyrighted work, even in spite of the fact that the finished product allegedly does not include the work in its database (it just read the training database).
Furthermore, even if a human genuinely and honestly believes they’re writing something original, that does not matter when they reproduce work that they have read before. What determines copyright infringement is the similarity of the two works.
The position that I take is that the arguments made against OpenAI and Stability AI in court are not complete. They’re not quite good enough. However, that doesn’t mean there isn’t a valid argument that is good enough. I just hope we don’t get a ruling in favour of AI businesses simply because the people challenging them didn’t employ the right ammunition.
With regards to Case 2, I refer back to my comment about the similarity of the work. The argument isn’t that the LLM itself is an infringement of copyright, but that the LLM, as designed by the business, infringes copyright in the same way a human would.
I definitely agree it is all extremely unclear. However, I maintain that the textual definition of the law absolutely still encompasses the feeling that people’s work is being ripped off for a commercial venture. Because it is so commercial, original authors are being harmed as they will not see any benefit from the commercial profits.
I would also like to point you to my other comment, which I put a lot of time into and where I expanded on many other points (link to your instance’s version): https://lemmy.world/comment/6706240
The copying of the data is not, by itself, infringement. It depends on the use and purpose of the copied data, and the defense argues that training a model against the data is fair use under TDM use-cases.
The model does not have a ‘database’, it is a series of transform nodes weighted against unstructured data. The transformation of the copyrighted works into a weighted regression model is what is being argued is fair use.
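To make the "weights, not a database" point a bit more concrete, here is a minimal toy sketch in Python. It is only an analogy under stated assumptions: the helper name `fit_line` and the synthetic data are made up for illustration, and a two-parameter line fit is nothing like a real LLM, which has billions of weights and, as noted elsewhere in this thread, can sometimes reproduce snippets of training text.

```python
# Toy analogy only: a least-squares line fit, NOT an LLM.
# `fit_line` and the synthetic data below are hypothetical illustrations.

def fit_line(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# 1000 synthetic training examples lying on the line y = 2x + 1.
training_data = [(float(x), 2.0 * x + 1.0) for x in range(1000)]

# The fitted "model" is just two numbers, however large the training set was.
model = fit_line(training_data)
print(len(training_data), "training points ->", len(model), "parameters")
```

The fitted parameters summarise a pattern in the data rather than storing the individual data points. Whether that analogy carries over to LLMs in a legally meaningful way is exactly what is being argued over here.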
yup, and it isn’t the act of that human reading a copyrighted work that is considered as infringement, it is the creation of the work that is substantially similar. In the same analogy, it wouldn’t be the creation of the AI model that is the infringement, but each act of creation thereafter that is substantially similar to a copyrighted work. But this comes with a bunch of other problems for the plaintiffs, and would be a losing case without merit.
Trying really hard not to come off as rude, but there’s a good reason why this isn’t the argument being put forward in the lawsuits. If this was their argument, the LLM could be considered a commissioned agent, placing the liability on the agent commissioning the work (e.g. the human prompting the work) - not OpenAI or Stability - in much the same way a company is held responsible for the work produced by an employee.
I really do understand the anger and frustration apparent in these comments, but I would really like to encourage you to learn a bit more about the basis for these cases before spending substantial effort writing long responses.
Copyright is absolute. The rightsholder has complete and total right to dictate how it is copied. Thus, any unauthorised copying is copyright infringement. However, fair use gives exemption to certain types of copying. The copyright is still being infringed, because the rightsholder’s absolute rights are being circumvented, however the penalty is not awarded because of fair use.
This is all just pedantry, though, and has no practical significance. Saying “fair use means copyright has not been infringed” doesn’t change anything.
That’s a database. Or perhaps rather some kind of 3D array - which could just be considered an advanced form of database. But yeah, you’re right here, you win this pedantry round lol. 1-1.
Yeah I don’t want to go down the avenue of suing the AI itself for infringement. However…[1][2][3]
You’re not coming off as rude at all with what you’ve said, in fact I welcome and appreciate your rebuttals.
You say that as if I haven’t enjoyed fleshing out the ideas and sharing them. By the way, right now I’m sharing with you lemmy’s hidden citation feature :o)
Although, I was much happier replying to you before I just saw the downvotes you’ve apparently given me across the board. That’s a bit poor behaviour on your part, you shouldn’t downvote just because you disagree - and you can’t even say that I’m wrong as a justification when the whole thing is being heavily debated and adjudicated over whether it is right or wrong.
I thought we were engaging in a positive manner, but apparently you’ve been spitting in my face.
The LLM absolutely could be considered an agent, but the way it acts is merely prompted by the user. The actual behaviour is dictated by the organisation that built it. In any case, this is only my backup argument if you even consider the initial copying to be research - which it isn’t.
Really and truly, this is not how this works. The exemptions granted by the office of the registrar are granting an exemption to copyright claims against fair uses. It isn’t talking about whether the claim can be awarded damages, it’s talking about the claim being exempt in entirety. You can think about copyright as an exemption to the first amendment right to free speech, and the exemption to copyright as describing where that ‘right’ does not apply. Copyright holders do not get to control the use of their work where fair use has been determined by the registrar, which is reconsidered every 3 years.
Oh, but they are researching how to massively profit off stuff they steal from other people, so it counts again.
Tl;Dr I may pirate anything I want because I want many items and cannot figure out how to pay for every item individually