We need a new license that forbids all training. That is the only way to stop big corporations from doing this.
To my understanding, if the material is publicly available or obtained legally (i.e., not pirated), then training a model with it falls under fair use, at least in the US and some other jurisdictions.
If the training is established as fair use, the underlying license is effectively irrelevant. The term you added would just be void if someone ever brought it to court.
Wouldn't it be still legal to train on the data due to fair use?
I don't think it's fair use, but everyone on Earth disagrees with me. So even with the standard default licence that prohibits absolutely everything, humanity minus one considers it fair use.
We need a ruling that LLM generated code enters public domain automatically and can't be covered by any license.
Would such a license fall under the definition of free software? Difficult to say. Counter-proposition: a license which permits training if the model is fully open.
It isn't that difficult: a license that restricts how the program is used is a non-free software license.
"The freedom to run the program as you wish, for any purpose (freedom 0)."
My next project will be released under a GPL-like license with exactly this condition added. If you train a model on this code, the model must be open source & open weights.
Not sure why the FSF or any other organization hasn't released a license like this years ago already.
Because it would violate freedom zero. Also, if you added such terms to the GNU GPL, they would be considered "further restrictions" and anyone could remove them (see section 7 of the GNU GPL version 3).
Model weights, source, and output.
The article goes deep into the two cases deemed most relevant, but really there is a wide swath of similar cases, all focused on defining sharper borders than ever around what is essentially the question "exactly when does it become copyright violation", with plenty of seemingly "obvious" answers that quickly conflict with each other.
I also have the feeling it will be much like Google LLC v. Oracle America, Inc.: much of this won't really be clearly resolved until the end of the decade. I also wouldn't be surprised if seemingly very different answers ended up bubbling up in the different cases, driven by the specifics of the domain.
Not a lawyer, just excited to see the outcomes :).
Ideally, Congress would just settle this basket of copyright concerns, as they explicitly have the power to do—and have done so repeatedly in the specific context of computers and software.
The article repeatedly treats license and contract as though they are the same, even though the sidebar links to a post that discusses the difference.
A lot of it boils down to whether training an LLM is a breach of copyright of the training materials, which is not specific to GPL or open source.
To my understanding, if the material is publicly available or obtained legally (i.e., not pirated), then training a model with it falls under fair use.
Once training is established as fair use, it doesn't really matter if the license is MIT, GPL, or a proprietary one.
That is just the sort of point I am trying to make. That is a copyright law issue, not a contractual one. If the GPL is a contract then you are in breach of contract regardless of fair use or equivalents.
fair use only applies in the united states (and Poland, and a very limited set of others)
https://en.wikipedia.org/wiki/Fair_use#/media/File:Fair_use_...
and it is certainly not part of the Berne Convention
in almost every country in the world even timeshifting using your VCR and ripping your own CDs is copyright infringement
France and most of Europe have a fair-use-like private copying exception (https://fr.wikipedia.org/wiki/Copie_priv%C3%A9e) but also have a mandatory levy on every storage medium sold, to recover the "lost fees" due to that exception
> To my understanding, if the material is publicly available or obtained legally (i.e., not pirated), then training a model with it falls under fair use.
Is this legally settled?
A GPL license is a contract in most other countries. Just probably not in the US.
And the current norm that the trillion dollar companies have lobbied for is that you can train on copyrighted material all you want so that's the reality we are living in. Everything ever published is all theirs.
I am really surprised that media businesses, which are extremely influential around the world, have not pushed back against this more. I wonder whether they are looking at the cost savings they will get from the technology as a worthwhile trade-off.
In practice it wouldn't matter a whit if they lobbied for it or not.
Lobbying is for people trying to stop them; externalities are for the little people.
It's not specific to open source but it's most clearly enforceable with open source as there will be many contributors from many jurisdictions with the one unifying factor being they all made their copyright available under the same license terms.
With proprietary or, more importantly, single-owner code, it's far easier for this to end up in a settlement rather than being dragged out into an actual ruling, enforcement action, and establishment of precedent.
That's the key detail. It's not specific to GPL or open source, but if you want to see these orgs held to account and some precedent established, focusing on GPL and FOSS licensed code is the clearest path to that.
I honestly think that the most extreme take that "any output of an LLM falls under all the copyright of all its training data" is not really defensible, especially when contrasted with human learning, and would be curious to hear conflicting opinions.
My view is that copyright in general is a pretty abstract and artificial concept; thus the corresponding regulation needs to justify itself by being useful, i.e. encouraging and rewarding content creation.
/sidenote: Copyright as-is barely holds up there; I would argue that nobody (not even old established companies) is significantly encouraged or incentivised by potential revenue more than 20 years in the future (much less current copyright durations). The system also leads to bad resource allocation, with almost all the rewards ending up with a small handful of the most successful producers. This effectively externalizes large portions of the cost of "raising" artists.
I view AI overlap through the same lens: if current copyright rules would lead to undesirable outcomes (by making all AI training or use illegal/infeasible) then the law or its interpretation simply has to be changed.
And then also to all code made from the GPL'd AI model?
A program's output is likely not owned by the program's authors. For example, if you create a document with Microsoft Word, you are the one who owns it, not Microsoft.
Unless the license says otherwise. The fact that Word doesn't (I wouldn't even be sure if that was true, honestly, especially for the online versions) doesn't mean anything.
They could start selling a version of Word tomorrow that gives them the right to train from everything you type on your entire computer into any program. Or that requires you to relinquish your rights to your writing and to license it back from Microsoft, and to only be able to dispute this through arbitration. They could add a morals clause.
I might be crazy, and I'd love to hear from somebody who knows about this, but I've been assuming that AI companies have been pulling GPL code out of the training material specifically to avoid this.
Corporations have always talked about the virality of GPL, sometimes (but not always) to the point of exaggeration. You'd think that after getting the proof of concept done the AI companies would be running away at full speed from setting a bomb like that in their goldmine.
Putting in tons of commonly read books and scientific papers is safer, they can just eventually cross-license with the massive conglomerates that own everything. But the GPL is by nature hostile, and has been openly and specifically hostile from the beginning. MIT and Apache, etc. you can just include a fistful of licenses to download, or even come up with architectures that track names to add for attribution-ware. But the GPL will obviously (and legitimately) claim to have relicensed the entire model and maybe all its output (unless they restricted it to LGPL.)
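For what "pulling GPL code out" could look like mechanically, here is a minimal sketch, assuming the corpus is a mapping of file paths to file text and that files carry `SPDX-License-Identifier` headers. The header convention is real; the function names and the filtering policy are hypothetical and not any vendor's actual pipeline.

```python
import re
from typing import Dict, Optional

# Matches headers such as "// SPDX-License-Identifier: GPL-3.0-or-later".
SPDX_RE = re.compile(r"SPDX-License-Identifier:\s*([^\s*/]+)")

# Assumed policy: treat the whole GPL family as "do not train on".
COPYLEFT_PREFIXES = ("GPL-", "AGPL-", "LGPL-")


def spdx_license(source: str) -> Optional[str]:
    """Return the first SPDX identifier found in the file text, or None."""
    match = SPDX_RE.search(source)
    return match.group(1) if match else None


def is_copyleft(license_id: Optional[str]) -> bool:
    """True if the identifier starts with one of the GPL-family prefixes."""
    return license_id is not None and license_id.startswith(COPYLEFT_PREFIXES)


def filter_corpus(files: Dict[str, str]) -> Dict[str, str]:
    """Keep only files whose SPDX header is not a GPL-family license."""
    return {
        path: text
        for path, text in files.items()
        if not is_copyleft(spdx_license(text))
    }
```

Note the weak spot: a file with no header at all passes straight through, and so does GPL code that someone copied into a repo without the header. A filter like this only catches code that labels itself.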
Wouldn't you just pull it out?
Not crazy - there's a rational self-interest in doing this.
But I'm not certain that the relevant players have the same consequence-fearing mindset that you do, and to be honest they're probably right. The theft is too great to calculate the consequences, and by the time it's settled, what are you gonna do - turn off Forster's machine?
I hope you're right in at least some cases!
If you were a thoughtful, careful, law-abiding business, yes.
I submit the evidence suggests the genAI companies have none of those attributes.
What triggers me is how insistent Claude Code is on adding "co-authored by Claude" in commits, in spite of my settings and an instruction in CLAUDE.md. I wish all these tech bros were as willing to credit the human shoulders on which their products are built. But they'd be much less successful in our current system if they were that kind of people.
Try changing the system prompt or switch to opencode [0] - they allegedly reverse engineered Claude Code, and so the performance you get with Claude models should be very similar to Claude Code.
[0] https://github.com/sst/opencode
I've changed the settings and added the instruction to the prompt, hence my frustration :)
there's an option for claude to disable co-authoring, see: https://code.claude.com/docs/en/settings
{ "includeCoAuthoredBy": false }
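If you'd rather not hand-edit the JSON, a minimal sketch that merges that key into an existing settings file without clobbering the other keys. The helper name is made up, and `settings_path` is whichever settings file the linked page tells you to edit; I'm not assuming a location here.

```python
import json
from pathlib import Path


def disable_co_authored_by(settings_path: Path) -> dict:
    """Set includeCoAuthoredBy to false, preserving any other keys."""
    settings = {}
    if settings_path.exists():
        # Load the current settings so unrelated keys survive the rewrite.
        settings = json.loads(settings_path.read_text(encoding="utf-8"))
    settings["includeCoAuthoredBy"] = False
    # Create the parent directory if the file did not exist yet.
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings
```

Call it as `disable_co_authored_by(Path("path/to/settings.json"))` with the real path substituted in.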
As someone who has spent more time than most developing open source software, I will say I genuinely dislike copyleft and GPL.
For those who are into freedom, I don't see how dictating how you use what you build in such a manner is in the spirit of free and open.
Just my opinion on it, to each their own on the matter.
I had a very similar view once, and have since understood that this is mainly a difference in perspective:
It's easy as a developer to slip into a role where you want to build/package (maybe sell) some software product with minimal obligations. BSD-likes are obviously great there.
But the GPL follows a different perspective: It tries to make sure that every user of any software product is always capable of tinkering and changing it himself, and the more permissive licenses do not help there because they don't prevent (or even discourage!) companies from just selling you stripped and obfuscated binary blobs that put you fully at the vendor's mercy.
It's not dictating how you use what you build? It's dictating how you redistribute what you build on top of other people's work.
Ok but I just have no interest in imposing restrictions on how people distribute what I build in such a manner either. That's just me.
As somebody who thinks that people currently own the code that they write, I wonder why you're in people's business who want to write GPL'd software.
Are you complaining about proprietary software? I hear the restrictions are a lot tighter for Photoshop's source code, or iOS's, but for some reason you are one of the people who hate GPL as a hobby. Please don't show up whining about "spirits" when Amazon puts you out of business.
I'm not in anyone's business, just sharing my opinion on GPL. I understand why people go GPL / AGPL; it's just not for me. To each their own if they want to go down that path.
I disagree as someone who has also spent a huge amount of time on open source software. It’s all GPL or AGPL :)
That's your prerogative. It's just not for me and GPL is basically something I avoid when possible.
GPL and copyright in general don't apply to billionaires, so pretty much a non-topic.
It's just a side cost of doing business, because asking for forgiveness is cheaper and faster than asking for permission.
"Information wants to be free"? Many individuals pirated movies and games and got away with it. Of course two wrongs don't make a right and all that. Nonetheless one should be compensated for creating material that AI is trained on, for the same reason copyright holders are compensated: to incentivize people to produce it.
With an attitude like that they don't
I thought the whole concept of a viral license was legally questionable to begin with. There haven't been cases about this, as far as I know, and GPL virality enforcement has just been done by the community.
The GPL was tested in court as early as 2006 [1] and plenty of times since. There are no serious doubts about its enforceability.
[1] https://www.fsf.org/news/wallace-vs-fsf
I know it's not popular on HN to have anything but supportive statements around GPL, and I'm a big GPL supporter myself, but there is nuance in what is being said here.
That case was important, but it's not about the virality. There have been no concluded court cases involving the virality portion causing the rest of the code to also be GPL'd, but there are plenty involving enforcement of GPL on the GPL code itself.
The distinction is important because the article is about the virality causing the whole LLM model to be GPL'd, not just about the GPL'd code itself.
I'd like to think it wouldn't be a problem to enforce, but I've also never seen a court ruling truly about the virality portion to back that up either - which is all GP is saying.
There is no "virality", and the article's use of "propagation" to mean the same thing is wrong. The GPL doesn't "cause" anything to be GPLed that hasn't been explicitly licensed under the GPL by the owner of its copyright. The GPL grants a license to use the copyrighted material to which it applies. Satisfying the terms of that license for a particular use may require that you license other code under the GPL, but if you don't, the GPL can't magically make that code GPLed. You will, however, not be covered by the license, so unless your use is permitted for some other reason (e.g. fair use or a different license you have been granted) your use of the original code will be a violation of copyright. All of this has been repeatedly tested in court.
It's sad to see Microsoft's FUD still festering 20 years later.
That case has little to do with the license itself and nothing to do with its virality.
There have been a number of cases, which are linked from Wikipedia (https://en.wikipedia.org/wiki/GNU_General_Public_License#Leg...) - most recently Entr’Ouvert v. Orange had a strong judgement (under French law) in favour of the GPL.
Conversely, to my knowledge there has been no court decision that indicates that the GPL is _not_ enforceable. I think you might want to be more familiar with the area before you decide if it's legally questionable or not.
I'm not suggesting that you avoid following it. I'm just not that convinced it's enforceable in the US. The French ruling is good, though.
If you don't like the license, then don't accept it.
You are then restricted by copyright just like with any other creation.
If I include the source code of Windows in my product, I can't simply choose to re-license it to, say, public domain and give it to someone else. The license that I have from Microsoft to allow me to use their code won't let me - it provides restrictions. It's just as "viral" as the GPL.
I like the GPL. I just don't know how much you can actually enforce it.
Also, "don't use my code" is not viral. If you break the MSFT license, you pay them, which is a very well-tested path in courts. The idea of forced public disclosure does not seem to be.