Hi folks,
Recap:
The modern practice of AI has blurred the boundary between code and data, which leads to potential ambiguity in the interpretation of the definition of
open source as well as of the respective licenses. Such ambiguous interpretation in fact deviates from and violates the spirit of free software.
On Fri, 24 Feb 2023 at 05:23, Charles Plessy <plessy@debian.org> wrote:
- Joe tests the NN with the 10+1 images of TS and decides if the NN is
fine or not. If he decides that it is fine and it can go into
production, then Joe's employer should share all above stated.
Instead, if he decides that it is crap, he will trash it and he need
not share anything, because the sharing would have zero value for
anyone. This is compliant with the fair use clause, in which I
explicitly added "testing" as a condition that avoids sharing. After all,
if no value is produced, why should we force Joe to share his
failure? In particular cases a failure (a vulnerability) is valuable
information, but for security reasons it is better that Joe is not
forced to comply with the GPLv3 terms. It is better to give Joe the
freedom to share only the information that he considers safe to
share in public. However, if Joe's company does business with this -
providing a PoC to a client - then they have to comply with GPLv3,
because commercial and business activities are covered by GPLv3.
Dear Mo,
thank you for the heads-up.
I was using permissive licenses in the past thinking about making life
easier for individuals, but I feel robbed by massive scraping to train
AI models.
Just in case, I updated my email signature.
Also, is there a DFSG-free license that forces the training dataset and
the result of the training process to be open source if a work under
that license is present in the training data? Would GPLv3 be sufficient?
Cloud technologies posed a challenge to the GPLv2 because under that
license everyone has the right to change the code without sharing it.
If I am not mistaken, the GPLv3 was developed to clarify some
ambiguous language in the GPLv2, mostly with respect to patents. It
doesn't address SaaS -- you are still free to modify the code and keep
your modifications private, even if you run a publicly accessible
service on the modified code.
The Affero GPL <https://www.gnu.org/licenses/agpl-3.0.html> was
developed to specifically address SaaS. This license requires that if
you run a service over a network, you must offer the corresponding
source code to all users of the service.
Charles Plessy wrote:
Also, is there a DFSG-free license that forces the training dataset and
the result of the training process to be open source if a work under
that license is present in the training data? Would GPLv3 be sufficient?
As I understand, that is an open legal question. The Affero GPL would
be such a license *if* the training dataset were considered part
of the code. While that does seem to make sense, as AI code is
essentially non-functional without the training, I am not aware that
there has ever been a pronouncement by a court of law that affirms or
denies it, nor am I aware of any free/open source license that
contains language that deals specifically with that issue, and I'm
pretty sure that there's a lot of room for lawyers to argue their point.
"Russ" == Russ Allbery <rra@debian.org> writes:
Russ, I'm sure you are aware, but things get very interesting if the
input to AI training is not fair use.
In particular, if Github copilot is a derivative work of everything fed
to it (including all the copylefted works), that gets kind of awkward
for Microsoft.
Perhaps the Github user agreement grants permission for every copyright holder who has a Github account.
But for everyone else, things could be very interesting.
I hope this helps us - as the open-source and software-libre
community - to acknowledge and be convinced of the great responsibility
that rests on our shoulders. Such a responsibility cannot be delegated to
a few, because the stakes are too high.
My proposal to apply the GPLv3 or AGPLv3 - not directly to an object
but - to a collection of objects using the database protection,
automatically also solves the problem of a blurry "fair use"
definition. However, to be more incisive about "fair use", it is
better to declare explicitly what is not "fair use". Otherwise, we
risk having to explain this in court. Like in this file header:
https://github.com/robang74/isar/blob/evo2/meta/recipes-support/expand-on-first-boot/files/expand-last-partition.sh
# (C) 2022, Roberto A. Foglietta <roberto.foglietta@gmail.com>
# SPDX-License-Identifier: all rights reserved, but fair use allowed
# Fair use includes test, learning and marketing but not sales, redistribution
# leasing, renting or every other commercial/business activities without the
# consent of the author. Every company or individual allowed to use this
# code behind these limitations will be listed here below, if any.
"Roberto A. Foglietta" <roberto.foglietta@gmail.com> writes:
My proposal to apply the GPLv3 or AGPLv3 - not directly to an object
but - to a collection of objects using the database protection,
automatically also solves the problem of a blurry "fair use"
definition. However, to be more incisive about "fair use", it is
better to declare explicitly what is not "fair use". Otherwise, we
risk having to explain this in court. Like in this file header:
https://github.com/robang74/isar/blob/evo2/meta/recipes-support/expand-on-first-boot/files/expand-last-partition.sh
# (C) 2022, Roberto A. Foglietta <roberto.foglietta@gmail.com>
# SPDX-License-Identifier: all rights reserved, but fair use allowed
# Fair use includes test, learning and marketing but not sales, redistribution
# leasing, renting or every other commercial/business activities without the
# consent of the author. Every company or individual allowed to use this
# code behind these limitations will be listed here below, if any.
I'm afraid this is not how fair use works. The whole point of fair use is that the copyright holder has no control over uses that are fair use.
They can grant additional rights with a copyright license, but they cannot stop legal fair use, no matter what they write in their license and no
matter what their personal opinions are about what would fall into fair
use.
As a general principle, as a free software advocate, I approve of an expansive definition of fair use and believe that far more uses of copyrighted material should be fair use than are normally considered fair
use today. Expansive definitions of fair use are a key legal component to enabling reverse engineering and compatible replacement of non-free
software with free software, for example.
On Sun, 26 Feb 2023 at 21:47, Russ Allbery <rra@debian.org> wrote:
I'm afraid this is not how fair use works. The whole point of fair use is
that the copyright holder has no control over uses that are fair use.
They can grant additional rights with a copyright license, but they cannot
stop legal fair use, no matter what they write in their license and no
matter what their personal opinions are about what would fall into fair
use.
I am sorry for having confused you while trying to explain a simple fact:
- fair use as a legal term is a blurry one
- fair use cannot be limited, only expanded (as I did over there)
- fair use could include {testing, learning, storage} and usually it does
HOWEVER
- fair use cannot include {business, commercial, marketing} rights in
any way or under any conditions
WHY?
Because the principle behind the existence of copyright is to protect
the authors' exclusivity over those {business, commercial, marketing}
rights.
Because copyleft is a copyright that trades exclusive rights for
freedom instead of money, this certainly holds for copyleft
as well.
CONCLUSION
We might have problems in identifying all the fair use cases but we
can be very certain about what is NOT fair use.
(in another e-mail about database/collection protection)
- fair use cannot include {business, commercial, marketing} rights in
any way or under any conditions
On Mon, 27 Feb 2023 at 07:16, Russ Allbery <rra@debian.org> wrote:
"Roberto A. Foglietta" <roberto.foglietta@gmail.com> writes:
No court ruling was ever emitted in favour of Google vs Oracle
leveraging fair use but it was an agreement between the two parties
supported by Microsoft.
https://arstechnica.com/tech-policy/2021/04/how-the-supreme-court-saved-the-software-industry-from-api-copyrights/
As you can learn from the Ars Technica article linked above.
"Roberto A. Foglietta" <roberto.foglietta@gmail.com> writes:
- fair use cannot include {business, commercial, marketing} rights in
any way or under any conditions
This is definitely not true in the United States; there is a Supreme
Court decision saying the exact opposite. The ruling in Google
v. Oracle said Google's commercial and business use of Oracle's
copyrighted APIs met the test for fair use.
You can't reconstruct the law from first principles without looking at
the actual test that is applied by courts. (And as mentioned this may be
different in different jurisdictions, for additional complexity.) In the
US there's a four-part balancing test for fair use, and the analysis can
be quite complicated.
A totally automatic procedure like web crawling and web indexing
fits your example perfectly. However, the input collection that
an ML/AI training system needs is a protectable work, because the data
has to be structured, selected and properly labeled, even if these
activities are carried out by rules, as happens when using SQL for
databases.
So, web indexes and statistics are created over input collections
that are *not* creative works, and these tools access every
copyrighted work under fair use as long as they respect the robots:no
meta-tag when it is applied to a copyrighted work. Training an
ML/AI is another story entirely: its input collections are
protectable collections under copyright law.
"Roberto A. Foglietta" <roberto.foglietta@gmail.com> writes:
A totally automatic procedure like web crawling and web indexing
fits your example perfectly. However, the input collection that
an ML/AI training system needs is a protectable work, because the data
has to be structured, selected and properly labeled, even if these
activities are carried out by rules, as happens when using SQL for
databases.
Yes, I agree, I think that a trained AI model is a protectable work.
However, it is not protectable *by you* unless you're the one who wrote
the model and chose its training.
Therefore, putting a clause in your copyright license saying that if your
work is incorporated into an AI model, that AI model as a collection is
covered by some particular license is not really a thing you can do. The
best you can do is the standard GPL thing of saying that you don't have to
license your collection under any particular license, but if you don't,
you don't have any right to include this specific work. Maybe that's what
you were getting at, and I just didn't understand.
On Mon, 27 Feb 2023 at 07:16, Russ Allbery <rra@debian.org> wrote:
This is definitely not true in the United States; there is a Supreme
Court decision saying the exact opposite. The ruling in Google
v. Oracle said Google's commercial and business use of Oracle's
copyrighted APIs met the test for fair use.
It is true despite a single US case judgment.
No court ruling was ever emitted in favour of Google vs Oracle
leveraging fair use but it was an agreement between the two parties
supported by Microsoft.
I can reconstruct the interpretation of a law from basic principles;
otherwise it would not be a law but something that appeared from
nothing, with no legal roots and no legal authority.
Moreover, it does not matter how fair use is defined in the many different
jurisdictions around the world. By the principle of copyright, it cannot
allow activities like {business, commercial, marketing} without the
consent of the author or of the license.
- then I decided to protect my projects' repositories as databases
(collections), in addition to the standard way of protecting the code with
a well-known license
- because of the copyright law on databases, if someone creates a
larger database that contains my database or a part of it, then they
have to comply with the license that I chose to protect my project as
a database.
You see, it is a very simple and straightforward concept. The only two
ways to get out of this are: 1. invalidate the database copyright law,
2. make a law under which the training input collection is not covered
by copyright law. In both cases every employee could take
home a copy of a database or a copy of AI training inputs and share it
with the rest of the world. Moreover, 1. includes 2., while
2. would seriously undermine the database copyright law because
every database could be a training set for an AI/ML engine.
Russ, do you agree? :-)
No. It's entirely possible that using databases as training sets for an
AI/ML engine is fair use under existing United States law and precedent as
long as that use is sufficiently transformative (the first factor of the
test, and I suspect the most important one here). The obvious example is
a search engine, which performs a similar transformation of clearly
copyrighted works into a new service with a different purpose, without the
explicit permission of the copyright holders.
This is the reason why people have focused so much on GitHub Copilot's
willingness to insert large blocks of code from other projects verbatim.
Reproducing code from other projects is less transformative and looks more
like simple copying, and therefore opens GitHub to a legal argument that
their AI model is not sufficiently transformative to be fair use.
Because the principle of the copyright existence is about protecting
the authors' exclusive of that {business, commercial, marketing}
rights.
2. if an author does not enforce a right for a long period of time,
then that right is lost under the principle of "usucapio"
(in Latin)
On Mon, 2023-02-27 at 01:45 +0100, Roberto A. Foglietta wrote:
Because the principle of the copyright existence is about protecting
the authors' exclusive of that {business, commercial, marketing}
rights.
The purpose of copyright is allegedly (in the USA) "To promote the
Progress of Science and useful Arts, by securing for limited Times to
Authors and Inventors the exclusive Right to their respective Writings
and Discoveries.".
"Roberto" == Roberto A Foglietta <roberto.foglietta@gmail.com> writes:
Roberto> On Mon, 27 Feb 2023 at 19:08, Russ Allbery <rra@debian.org> wrote:
>>
>> No. It's entirely possible that using databases as training sets
>> for an AI/ML engine is fair use under existing United States law
>> and precedent as long as that use is sufficiently transformative
>> (the first factor of the test, and I suspect the most important
>> one here).
Roberto> Considering what you reported in the previous e-mail about
Roberto> US national law in 17 U.S.C. § 107 in 1976, It is not
Roberto> possible to use an entire or a significant portion of a
Roberto> database for {business, commercial, marketing} purposes
Roberto> without the copyright holder.
Please stop!
It's clear that you are not building support for your argument.
You've made your case to the best of your ability and not been
convincing.
But beyond that, this discussion is no longer on topic for
debian-project.
Debian cannot decide what the law is.
We've established that this situation is complicated.
You've proposed various things that someone could do to limit the use of
free software in AI training sets.
Other people have pointed out that may or may not work.
You think it will.
You haven't managed to convince your critics.
We won't know until this gets hashed out in courts.
That's about the level of detail appropriate for debian-project.
Further discussion of this issue at this time on this list does not
serve the community.
We won't know until this gets hashed out in courts.
Debian cannot decide what the law is.
Anyone who has some urgency about doing business can employ me,
and I will set up a near-complete solution for them that I did not
explain to everyone.
[[[ To any NSA and FBI agents reading my email: please consider ]]]
[[[ whether defending the US Constitution against all enemies, ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]
> About upgrading A/L/GPLv3 in A/L/GPLv4, it seems to me quite an
> urgent thing to do but challenging it in a court might happen years
> from now. So there is a lot of time for preparation.
Making a new version of the GPL is a big effort, and I'm the one who
has to lead it. I have not been able to follow this discussion; it
was long and complicated. If it described a reason to change the GPL,
I could not see it.
Would you please tell me the problem that you think the GPL needs to
be changed for?
From that context and point of view, there are three main points I want to contribute to this discussion:
> I do not know personally "Bradley M. Kuhn" <bkuhn@sfconservancy.org>
> but I appreciate very much his answer in which he set several points
> https://lists.debian.org/debian-project/2023/03/msg00004.html
I will take a look. Thanks.