• A policy on use of AI-generated content in Debian

    From Tiago Bortoletto Vaz@21:1/5 to All on Thu May 2 20:10:02 2024
    Hi,

I would like Debian to discuss and decide on the usage of AI-generated content within the project. I fear that we are already facing negative consequences in some areas of Debian as a result of its use. If I happen to be wrong, I'm still afraid that it will happen very soon.

    You might already know that recently Gentoo made a strong move in this context and drafted their AI policy:

    - https://wiki.gentoo.org/wiki/Project:Council/AI_policy
    - https://www.mail-archive.com/gentoo-dev@lists.gentoo.org/msg99042.html

I personally agree with the author's rationale on the aspects pointed out (copyright, quality and ethical ones). But at this point I guess we might have more questions than answers, that's why I think it'd be helpful to have some input before suggesting any concrete proposals. Perhaps the most important step now is to get an idea of how Debian folks actually feel about this matter.
And how we feel about moving in a similar direction to what the Gentoo project did.

    If things move in a somewhat consensual direction, I intend to bring the discussion to -vote in order to discuss a possible GR.

    Bests,

    --
    tvaz

• From Ansgar 🙀@21:1/5 to Tiago Bortoletto Vaz on Thu May 2 20:40:01 2024
    Hi,

On Thu, 2024-05-02 at 14:01 -0400, Tiago Bortoletto Vaz wrote:
> I would like Debian to discuss and decide on the usage of
> AI-generated content within the project.

    It's just another tool that might or might not be non-free like people
    using Photoshop, Google Chrome, Gmail, Windows, ... to make
    contributions. Or a spamfilter to filter out some.

> You might already know that recently Gentoo made a strong move in
> this context and drafted their AI policy:
>
> - https://wiki.gentoo.org/wiki/Project:Council/AI_policy
> - https://www.mail-archive.com/gentoo-dev@lists.gentoo.org/msg99042.html


I think that is a bad policy: we don't ban Tor because it can be used
for copyright violations, terrorism, stalking, hacking or other unethical
things. We don't have a general ban on human contributions due to
quality concerns. I don't see why AI, as yet another tool, should be
different.

    Ansgar

  • From Dominik George@21:1/5 to All on Thu May 2 21:00:01 2024
    Hi,

> It's just another tool that might or might not be non-free like people
> using Photoshop, Google Chrome, Gmail, Windows, ... to make
> contributions. Or a spamfilter to filter out some.

    That's entirely not the point.

    It is not about **the tool** being non-free, but the result of its use being non-free.

    Generative AI tools **produce** derivatives of other people's copyrighted works.

    That said, we already have the necessary policies in place:

    * d/copyright must be accurate
    * all sources must be reproducible from their preferred form of modification

    Both are not possible using generative AI.
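
To make the first point concrete, here is a minimal sketch of a machine-readable d/copyright stanza (names, years and licenses purely illustrative):

    Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
    Upstream-Name: example-tool

    Files: src/*
    Copyright: 2020-2024 Jane Upstream <jane@example.org>
    License: GPL-2+

    Files: src/vendored/util.c
    Copyright: 2018 Someone Else
    License: Expat

For code an LLM emitted from unknown training sources, there is no honest way to fill in the Copyright and License fields of such a stanza.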

    -nik

  • From Sam Hartman@21:1/5 to All on Thu May 2 22:20:02 2024
    "Ansgar" == Ansgar 🙀 <ansgar@43-1.org> writes:

    Ansgar> Hi,
    Ansgar> On Thu, 2024-05-02 at 14:01 -0400, Tiago Bortoletto Vaz wrote:
    >> I would like Debian to discuss and decide on the usage of AI-
    >> generated content within the project.

    Ansgar> It's just another tool that might or might not be non-free like people
    Ansgar> using Photoshop, Google Chrome, Gmail, Windows, ... to make
    Ansgar> contributions. Or a spamfilter to filter out some.

    I tend to agree with the above. AI is just another tool, and I trust
    DDs to use it appropriately.

    I probably would not use AI to write large blocks of code, because I
    find that auditing the quality of AI generated code is harder than just
    writing the code in most cases.

    I might:

    * use debgpt to guess answers to questions about packaging that I could
    verify in some manner.

    * Use generative AI to suggest names of projects, help improve
    descriptions, summarize content, etc.

* See if generative AI could help produce a message with a desired
tone.

    --Sam

  • From Mo Zhou@21:1/5 to Tiago Bortoletto Vaz on Thu May 2 22:20:02 2024
On 5/2/24 14:01, Tiago Bortoletto Vaz wrote:
> You might already know that recently Gentoo made a strong move in this context
> and drafted their AI policy:
>
> - https://wiki.gentoo.org/wiki/Project:Council/AI_policy
> - https://www.mail-archive.com/gentoo-dev@lists.gentoo.org/msg99042.html

People might not already know that I wrote this 4 years ago: https://salsa.debian.org/deeplearning-team/ml-policy/-/blob/master/ML-Policy.rst

  • From Mo Zhou@21:1/5 to Dominik George on Thu May 2 22:20:02 2024
On 5/2/24 14:47, Dominik George wrote:
> That's entirely not the point.
>
> It is not about **the tool** being non-free, but the result of its use being non-free.
>
> Generative AI tools **produce** derivatives of other people's copyrighted works.

Yes. That includes the case where LLMs generate copyrighted content with large portions of overlap. For instance, https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html
because that copyrighted content is part of their original training dataset.
Apart from LLMs (large language models), image generation models and other generative AIs will also do something similar, partly copying their copyrighted training data into the generated results, to some extent.

> That said, we already have the necessary policies in place:
>
> * d/copyright must be accurate
> * all sources must be reproducible from their preferred form of modification
>
> Both are not possible using generative AI.

    Both are possible.

For example, a developer may use an LLM to aid programming, and the LLM may copy some code from a copyrighted source. But the developer is very unlikely to be able to tell whether the generated code contains a verbatim copy of copyrighted content, let alone the source of those copyrighted parts, if any.

Namely, if we look at new software projects, we do not know whether the code files are purely human-written or written with some aid from AI.

Similar things happen with other file types, such as images. For instance, you may ask a generative AI to generate a logo, or some artwork as part of a software project. And those generated results, with or without further modifications, can be stored in .ico, .jpg, and .png formats, etc.

Now, the problem is, FTP masters will not question the reproducibility of a code file, or a .png file. If the upstream author does not acknowledge the use of AI during the development process, it is highly likely that nobody else on earth will know.

This does not sound like a situation where we can take any action to improve. My only opinion towards this is to trust the upstream authors' acknowledgements.

BTW, ML-Policy foresaw such an issue and covered it to some extent: https://salsa.debian.org/deeplearning-team/ml-policy/-/blob/master/ML-Policy.rst
See the "Generated Artifacts" section.

It seems that the draft Open Source AI Definition does not cover content generated by AI models yet: https://discuss.opensource.org/t/draft-v-0-0-8-of-the-open-source-ai-definition-is-available-for-comments/315

  • From Russ Allbery@21:1/5 to Tiago Bortoletto Vaz on Thu May 2 22:20:02 2024
    Tiago Bortoletto Vaz <tiago@debian.org> writes:

> I personally agree with the author's rationale on the aspects pointed
> out (copyright, quality and ethical ones). But at this point I guess we
> might have more questions than answers, that's why I think it'd be
> helpful to have some input before suggesting any concrete proposals.
> Perhaps the most important step now is to get an idea of how Debian
> folks actually feel about this matter. And how we feel about moving in
> a similar direction to what the Gentoo project did.

    I'm dubious of the Gentoo approach because it is (as they admit)
    unenforceable, which to me means that it's not a great policy. A position statement, maybe, but that's a different sort of thing.

    I also agree in part with Ansgar: we don't make policies against what
    tools people use locally for developing software.

    I think the piece that has the most direct impact on Debian is if the
    output from the AI software is found to be a copyright infringement and therefore something that Debian does not have permission to redistribute
    or that violates the DFSG. But we're going to be facing that problem with upstreams as well, so the scope of that problem goes far beyond the
    question of direct contributions to Debian, and I don't think direct contributions to Debian will be the most significant part of that problem.

    This is going to be a tricky and unsettled problem for some time, since
it's both legal (in multiple jurisdictions) and moral, and it's quite
    possible that the legal judgments will not align with moral judgments.
    (Around copyright, this is often the case.) I'm dubious of our ability to
    get ahead of the legal process on this, given that it's unlikely that
    we'll even be able to *detect* if upstreams are using AI. I think this is
    a place where it's better to plan on being reactive than to attempt to be proactive. If we get credible reports that software in Debian is not redistributable under the terms of the DFSG, we should deal with that like
    we would with any other DFSG violation. That may involve making judgment
    calls about the legality of AI-generated content, but hopefully this will
    have settled out a bit in broader society before we're forced to make a decision on a specific case.

    I also doubt that there is much alignment within Debian about the morality
    of copyright infringement in general. We're a big-tent project from that perspective. Our project includes people who believe all software
    copyright is an ill-advised legal construction that limits people's
    freedom, and people who believe strongly in moral rights expressed through copyright and in the right of an author to control how their work is used.
    We could try to reach some sort of project consensus on the moral issues
    here, but I'm a bit dubious we would be successful.

    At the moment, my biggest concern about the practical impact of AI is that
    most of the output is low-quality garbage and, because it's now automated,
    the volume of that low-quality garbage can be quite high. (I am
    repeatedly assured by AI advocates that this will improve rapidly. I
    suppose we will see. So far, the evidence that I've seen has just led me
    to question the standards and taste of AI advocates.) But I don't think dealing with this requires any new *policies*. I think it's a fairly
    obvious point of Debian collaboration that no one should deluge their
    fellow project members in low-quality garbage, and if that starts
    happening, I think we have adequate mechanisms to complain and ask that it
    stop without making new policy.

    About the only statement that I've wanted to make so far is to say that
    anyone relying on AI to summarize important project resources like Debian Policy or the Developers Guide or whatnot is taking full responsibility
    for any resulting failures. If you ask an AI to read Policy for you and
    it spits out nonsense or lies, this is not something the Policy Editors
    have any time or bandwidth to deal with.

    --
    Russ Allbery (rra@debian.org) <https://www.eyrie.org/~eagle/>

  • From Tiago Bortoletto Vaz@21:1/5 to Dominik George on Fri May 3 00:10:01 2024
On Thu, May 02, 2024 at 08:47:31PM +0200, Dominik George wrote:
> Hi,
>
> > It's just another tool that might or might not be non-free like people
> > using Photoshop, Google Chrome, Gmail, Windows, ... to make
> > contributions. Or a spamfilter to filter out some.
>
> That's entirely not the point.
>
> It is not about **the tool** being non-free, but the result of its use being non-free.
>
> Generative AI tools **produce** derivatives of other people's copyrighted works.
>
> That said, we already have the necessary policies in place:
>
> * d/copyright must be accurate
> * all sources must be reproducible from their preferred form of modification
>
> Both are not possible using generative AI.

That sounds right, but those policies relate mainly to Debian packages, while Debian also releases a fair amount of other kinds of content.

    Bests,

    --
    tvaz

  • From Sam Hartman@21:1/5 to All on Fri May 3 01:30:01 2024
    "Dominik" == Dominik George <natureshadow@debian.org> writes:


    Dominik> Generative AI tools **produce** derivatives of other people's copyrighted works.

    Dominik> That said, we already have the necessary policies in place:

    Russ pointed out this is a fairly complicated claim.

    It is absolutely true that generative AI models have produced output
    that contains copyrighted text.

The questions of whether that text is an infringing derivative of those copyrighted works are making their way through a number of lawsuits in
my country, at least.
    And as Russ points out the moral issues are going to be even harder to
    figure out.

    I don't think it is as simple as you write above, and I agree with
    Russ's thoughts on the situation.

    --Sam

  • From Tiago Bortoletto Vaz@21:1/5 to Russ Allbery on Fri May 3 03:00:01 2024
    Hi Russ,

On Thu, May 02, 2024 at 11:59:10AM -0700, Russ Allbery wrote:
> Tiago Bortoletto Vaz <tiago@debian.org> writes:
>
> > I personally agree with the author's rationale on the aspects pointed
> > out (copyright, quality and ethical ones). But at this point I guess we
> > might have more questions than answers, that's why I think it'd be
> > helpful to have some input before suggesting any concrete proposals.
> > Perhaps the most important step now is to get an idea of how Debian
> > folks actually feel about this matter. And how we feel about moving in
> > a similar direction to what the Gentoo project did.
>
> I'm dubious of the Gentoo approach because it is (as they admit)
> unenforceable, which to me means that it's not a great policy. A position
> statement, maybe, but that's a different sort of thing.

Right, note that they acknowledged this policy is a work in progress. Not perfect, but 'something needed to be done, quickly'. It's hard to find a balance here, but I kind of share this sense of urgency.

Also, I agree that a statement is indeed a more appropriate tool for the circumstance. Although I mention Gentoo's policy, I acknowledge that I should have worded the title of my first message better, as proposing a policy (in Debian terms a policy is really a *Policy* and has huge implications!) didn't reflect my intentions very well.

    [...]

> About the only statement that I've wanted to make so far is to say that
> anyone relying on AI to summarize important project resources like Debian
> Policy or the Developers Guide or whatnot is taking full responsibility
> for any resulting failures. If you ask an AI to read Policy for you and
> it spits out nonsense or lies, this is not something the Policy Editors
> have any time or bandwidth to deal with.

This point resonates with problems we might be facing already, for instance
in the NM process and also in Debconf submissions (there's no point in going into details here because so far we can't prove anything, and even if we could, of course we wouldn't bring any of those involved into the public arena). So I'm actually more concerned about LLMs being mindlessly applied in our communication processes (NM, bts, debconf, irc, planet, wiki, website, debian.net stuff, etc) than about someone using AI-assisted code in our infra, at least for now.

    Again, I correct myself and emphasize that I would rather discuss a possible statement than a policy for Debian on this matter.

    Bests,

    --
    tvaz

  • From Charles Plessy@21:1/5 to All on Fri May 3 06:30:01 2024
On Thu, May 02, 2024 at 02:01:20PM -0400, Tiago Bortoletto Vaz wrote:

> I would like Debian to discuss and decide on the usage of AI-generated content
> within the project.

    Hi Tiago,

as a Debian developer I refrain from using commercial AI to generate
code used in my packaging work or native packages, because I think that
these systems are copyright laundering machines that make it possible to
suck up the energy invested in Free Software and transfer it into
proprietary works (and to a lesser extent into un-GPL works).

If I were to hear that other Debian developers use them in that context, I
would seriously question whether there is any value in spending my
volunteer time keeping debian/copyright files accurate to the level
of detail our Policy asks for. When the world and we ourselves have
given up on respecting Free Software copyrights and passing on attribution,
I will not see the point in spending time doing more than the bare minimum,
as for instance in Anaconda, where you just get "License: MIT" and the
right to download the sources and check the year of attribution
and the names of contributors yourself.

This said, I have not found time to try debgpt and feel guilty about
this. If there were a tool trained on source code for which the authors
gave their consent, where the license terms were compatible, and which
provided its output under the most viral terms (probably AGPL), I would
love to use it and give attribution to the community of contributors to
the software.

So in summary, I would probably vote for a GR calling against the use of
current commercial AI for generating Debian packaging, native, or
infrastructure code, unless of course good arguments to the contrary are
provided. This said, I think that we cannot and should not police
people who do not respect the call.

    Have a nice day,

    Charles

    --
Charles Plessy, Nagahama, Yomitan, Okinawa, Japan
Debian Med packaging team: http://www.debian.org/devel/debian-med
Tooting from work: https://fediscience.org/@charles_plessy
Tooting from home: https://framapiaf.org/@charles_plessy

• From Ansgar 🙀@21:1/5 to Dominik George on Fri May 3 07:50:01 2024
On Thu, 2024-05-02 at 20:47 +0200, Dominik George wrote:
> > It's just another tool that might or might not be non-free like people
> > using Photoshop, Google Chrome, Gmail, Windows, ... to make
> > contributions. Or a spamfilter to filter out some.
>
> That's entirely not the point.
>
> It is not about **the tool** being non-free, but the result of its use being non-free.
>
> Generative AI tools **produce** derivatives of other people's copyrighted works.

    They *can* do that, but so can humans (and will). Humans look at a
    product or code and write new code that sometimes resembles the
    original very much.

    The claim "everything a generative AI tool outputs is a derivative
    work" would be rather bold.

> That said, we already have the necessary policies in place:
>
> * d/copyright must be accurate
> * all sources must be reproducible from their preferred form of modification
>
> Both are not possible using generative AI.

    They are, just as they are for human results. The preferred form of modification can be based on the output from a generative AI,
    especially if it is further edited.

    But this is not something new: a camera or microphone records data, but
    we use the captured data (and not the original source) as the preferred
    form of modification. Sometimes even after generous preprocessing by
    (non-free) firmware.

    (We don't include human neural network data as part of the preferred
    form of modification either ;-))

    Ansgar

  • From Dominik George@21:1/5 to All on Fri May 3 11:50:01 2024
    Hi,

> > Generative AI tools **produce** derivatives of other people's copyrighted works.
>
> They *can* do that, but so can humans (and will). Humans look at a
> product or code and write new code that sometimes resembles the
> original very much.

Can I ask the LLM what it was probably inspired by?

Can I show the LLM another work and ask it whether there might be a chance they got inspired by it (and get a different answer than that it probably sucked in everything, so yes)?

    Is there a chance that the LLM did not only read some source code, but also had friendly interactions with the author?

    Admittedly, at that point, we get into philosophical questions, which I don't consider any less important for free software.

    -nik

  • From Andrey Rakhmatullin@21:1/5 to Charles Plessy on Fri May 3 14:30:01 2024
On Fri, May 03, 2024 at 01:04:29PM +0900, Charles Plessy wrote:
> If I were to hear that other Debian developers use them in that context, I
> would seriously question whether there is any value in spending my
> volunteer time keeping debian/copyright files accurate to the level
> of detail our Policy asks for.
There is a popular old opinion, unrelated to AI, that there is not.

    --
    WBR, wRAR

  • From Jose-Luis Rivas@21:1/5 to Tiago Bortoletto Vaz on Fri May 3 16:10:01 2024
On Thu May 2, 2024 at 9:21 PM -03, Tiago Bortoletto Vaz wrote:
> Right, note that they acknowledged this policy is a work in progress. Not
> perfect, but 'something needed to be done, quickly'. It's hard to find a
> balance here, but I kind of share this sense of urgency.
>
> [...]
>
> This point resonates with problems we might be facing already, for instance
> in the NM process and also in Debconf submissions (there's no point in going
> into details here because so far we can't prove anything, and even if we could,
> of course we wouldn't bring any of those involved into the public arena). So I'm
> actually more concerned about LLMs being mindlessly applied in our communication
> processes (NM, bts, debconf, irc, planet, wiki, website, debian.net stuff, etc)
> than about someone using AI-assisted code in our infra, at least for now.


    Hi Tiago,

It seems you have more context than the rest, which provides a sense of
urgency for you, where others do not have this same information and
can't share this sense of urgency.

If I were to assume based on the little context you shared, I would say
there's someone doing an NM application using an LLM, answering stuff with
an LLM and passing all their communications through LLMs.

In that case, there's even less point in making a policy about it, in my
opinion. Since, as you stated, you can't prove anything, and ultimately
it would land in the hands of the people approving submissions or NMs to
judge whether the person is qualified or not. And you can't block
communications containing LLM-generated content when you can't even prove it's
LLM-generated content. How would you enforce it?

And I doubt a statement would do much either. What would be
communicated? "Communications produced by LLMs are troublesome"? I don't
know if there's much substance for a statement of that sort.

OTOH, an LLM-assisted rewrite of your own content may help non-native
English speakers write better and improve communication
effectiveness. Hence, saying "communications produced by LLMs are
troublesome" would be troublesome itself, since how can you as a
receiver tell whether it's their own content or someone else's content?

Some may say "a statement could at least be used as a pointer to say
'these are our expectations regarding use of AI'", but ultimately it is in
the hands of those judging to filter out or not. And if those judging
can't even prove whether AI was used, what's the point?

I can't see the point of "something needs to be done" without clear
reasoning about what that something is expected to achieve.

    --Jose

  • From Stefano Zacchiroli@21:1/5 to Tiago Bortoletto Vaz on Fri May 3 18:30:02 2024
On Thu, May 02, 2024 at 08:21:28PM -0400, Tiago Bortoletto Vaz wrote:
> So I'm actually more concerned about LLMs being mindlessly applied in
> our communication processes (NM, bts, debconf, irc, planet, wiki,
> website, debian.net stuff, etc) than about someone using AI-assisted code
> in our infra, at least for now.

    On that front, useful "related work" are the policies that scientific
    journals and conferences (which are exposed *a lot* to this, given their
    main activity is vetting textual documents) have put in place about
    this.

    The general policy usually contains two main points (paraphrased below):

    (1) You are free to use AI tools to *improve* your content, but not to
    create it from scratch for you.

This point is particularly important for non-native English speakers,
who can benefit a lot more than natives from tool support for tasks
like proofreading/editing. I suspect the Debian community might be
particularly sensitive to this argument. (And note that on this one
the barrier between ChatGPT-based proofreading and other grammar/
style checkers will become more and more blurry in the future.)

    (2) You need to disclose the fact you have used AI tools, and how you
    have used them.

    Exactly as in your case, Tiago, people managing scientific journals and conferences have absolutely no way of checking if these rules are
    respected or not. (They have access to large-scale plagiarism detection
    tools, which is a related but different concern.) They just ask people
    to *state* they followed this policy upon submission, but that's it.

    If your main concern is people using LLMs or the like in some of the
    processes you mention, a checkbox requiring such a statement upon
    submission might go a longer way than a project-wide statement (which
    will sit in d-d-a unknown to n-m applicants a few years from now).
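
As a minimal sketch, such a checkbox plus disclosure statement could look like this (wording purely illustrative, not an agreed Debian text):

    [ ] I did not use AI tools to prepare this submission.
    [ ] I used AI tools to prepare this submission, in the following
        way (tool and manner of use): _______________________________

This mirrors the journal policies above: people attest to both the fact and the manner of use, even if the attestation cannot be verified.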

    Cheers
    --
Stefano Zacchiroli . zack@upsilon.cc . https://upsilon.cc/zack
Full professor of Computer Science, Télécom Paris, Polytechnic Institute of Paris
Co-founder & CTO Software Heritage
https://twitter.com/zacchiro . https://mastodon.xyz/@zacchiro

  • From Tiago Bortoletto Vaz@21:1/5 to Jose-Luis Rivas on Fri May 3 19:00:01 2024
    Hi Jose,

Thanks for your input, I have a few comments:

On Fri, May 03, 2024 at 11:02:47AM -0300, Jose-Luis Rivas wrote:
> On Thu May 2, 2024 at 9:21 PM -03, Tiago Bortoletto Vaz wrote:
> > Right, note that they acknowledged this policy is a work in progress. Not
> > perfect, but 'something needed to be done, quickly'. It's hard to find a
> > balance here, but I kind of share this sense of urgency.
> >
> > [...]
> >
> > This point resonates with problems we might be facing already, for instance
> > in the NM process and also in Debconf submissions (there's no point in going
> > into details here because so far we can't prove anything, and even if we could,
> > of course we wouldn't bring any of those involved into the public arena). So I'm
> > actually more concerned about LLMs being mindlessly applied in our communication
> > processes (NM, bts, debconf, irc, planet, wiki, website, debian.net stuff, etc)
> > than about someone using AI-assisted code in our infra, at least for now.
>
> Hi Tiago,
>
> It seems you have more context than the rest, which provides a sense of
> urgency for you, where others do not have this same information and
> can't share this sense of urgency.

    Yes.

> If I were to assume based on the little context you shared, I would say
> there's someone doing an NM application using an LLM, answering stuff with
> an LLM and passing all their communications through LLMs.
>
> In that case, there's even less point in making a policy about it, in my
> opinion. Since, as you stated, you can't prove anything, and ultimately
> it would land in the hands of the people approving submissions or NMs to
> judge whether the person is qualified or not. And you can't block
> communications containing LLM-generated content when you can't even prove it's
> LLM-generated content. How would you enforce it?

Hmm, I tend to disagree here. Proving by investigation isn't the only way to get at some truth about the situation. We can get it by simply asking the person whether they used an LLM to generate their work (be it an answer to NM questions, or a contribution to the Debian website, or an email to this mailing list...). In that scenario, having a policy, a position statement or even a gentle guideline would make a huge difference in the ongoing exchange.

> And I doubt a statement would do much either. What would be
> communicated? "Communications produced by LLMs are troublesome"? I don't
> know if there's much substance for a statement of that sort.

Just to set the scene a little on how I think about the issue: when I brought up this discussion, I didn't have in mind someone evil attempting to use AI to deliberately disrupt the project. We know already that policies or statements are never sufficient to deal with people in this category. Rather, I see many people (mostly younger contributors) who are coming to use LLMs in their daily life in a quite mindless way -- which of course is none of our business if they do so in their private life. However, the issues that can arise from using this kind of technology without much consideration in a community like Debian are not obvious to everyone, and I don't expect every Debian contributor to have a sufficiently good understanding of the matter, or maturity, at the moment they start contributing to the project. We can draw an analogy here with the CoC and the Diversity Statement: they might seem quite obvious to some, and less so to others.

So far I've felt a certain resistance to adopting something as sharp as what Gentoo did (which, as I've said, I agree with). However, I still have the feeling that a position in the form of a statement or even a guideline could help us both avoid and mitigate possible problems in the future.

    Bests,

    --
    tvaz

  • From Russ Allbery@21:1/5 to Stefano Zacchiroli on Fri May 3 19:50:01 2024
    Stefano Zacchiroli <zack@debian.org> writes:

> (1) You are free to use AI tools to *improve* your content, but not to
> create it from scratch for you.
>
> This point is particularly important for non-native English speakers,
> who can benefit a lot more than natives from tool support for tasks
> like proofreading/editing. I suspect the Debian community might be
> particularly sensitive to this argument. (And note that on this one
> the barrier between ChatGPT-based proofreading and other grammar/
> style checkers will become more and more blurry in the future.)

    This underlines a key point to me, which is that "AI" is a marketing term,
    not a technical classification. Even LLMs, a more technical
    classification, can be designed to do different things, and I expect
    hybrid models to become more widespread as the limitations of trying to do literally everything via an LLM become more apparent.

Grammar checkers, automated translation, and autocorrect are all useful
tools in their appropriate place. Some people have moral concerns about
how they're constructed and other people don't. I'm not sure we'll have a
consensus on that. So far, at least, there don't seem to be the sort of
legal challenges for those types of applications that there are for the
"write completely new text based on a prompt" style of LLM.

    Just on a personal note, I do want to make a plea to non-native English speakers to not feel like you need to replace your prose with something generated by an LLM.

    I don't want to understate the benefits of grammar checking, translation,
    and other tools, and I don't want to underestimate the frustration and difficulties in communicating in a non-native language. I think ethical
    tools to assist with that are great. But I would much rather puzzle out
    odd or less-than-fluent English, extend assumptions of good will, and work through the occasional misunderstanding, if that means I can interact with
    a real human voice.

    I know, I know, supposedly this is all getting better, but so much of the
    text produced by ChatGPT and similar tools today sounds like a McKinsey consultant trying to sell war crimes to a marketing executive. Yes, it's precisely grammatical and well-structured English. It's also sociopathic, completely soulless, and almost impossible to concentrate on because it's
    full of the sort of slippery phrases and opaque verbosity of a politician trying to distract from some sort of major scandal. I want to talk to
    you, another human being, not to an LLM trained to sound like a corporate
    web site.

    --
    Russ Allbery (rra@debian.org) <https://www.eyrie.org/~eagle/>

  • From Sam Hartman@21:1/5 to All on Fri May 3 20:00:02 2024
    "Tiago" == Tiago Bortoletto Vaz <tiago@debian.org> writes:

    Tiago> Hi Jose,
    Tiago> Thanks for you input, I have a few comments:

    Tiago> On Fri, May 03, 2024 at 11:02:47AM -0300, Jose-Luis Rivas wrote:
    >> On Thu May 2, 2024 at 9:21 PM -03, Tiago Bortoletto Vaz wrote:
>> > Right, note that they acknowledged this policy is a work in progress. Not
    >> > perfect, but 'something needed to be done, quickly'. It's hard to find a
    >> > balance here, but I kind of share this sense of urgency.
    >> >
    >> > [...]
    >> >
>> > This point resonates with problems we might be facing already, for instance
>> > in the NM process and also in Debconf submissions (there's no point in going
>> > into details here because so far we can't prove anything, and even if we could,
>> > of course we wouldn't bring any of those involved into the public arena). So I'm
>> > actually more concerned about LLMs being mindlessly applied in our communication
>> > processes (NM, bts, debconf, irc, planet, wiki, website, debian.net stuff, etc)
>> > than about someone using AI-assisted code in our infra, at least for now.
    >> >
    >>
    >> Hi Tiago,
    >>
    >> It seems you have more context than the rest which provides a sense of
    >> urgency for you, where others do not have this same information and
    >> can't share this sense of urgency.

    Tiago> Yes.

    Oh, wow, I had no idea that your argument for urgency came from the NM
    case.

I actually think that NM would not benefit from a policy here.
We already have a fairly good standard: did you prove to your
application manager, your advocates, and the reviewers (FD or DAM as
appropriate) that you can be trusted and that you have the necessary
technical and other skills to be a DD?

    I think it's fairly clear that using an LLM to answer questions in the
    NM process does not show that you have the technical skills.
    (Using it instead of reading a man page for similar results and then
    going and doing the work might be fine, but cutting and pasting an
    answer to an application question into the message you send to your AM
    clearly doesn't demonstrate your own technical skill.)

As an AM, I would find that an applicant who used an LLM as more than a
possibly-incorrect man page without telling me had violated my trust. I
don't need a policy to come to that conclusion. I don't think I would
have any trouble convincing DAM or FD to back my decision.

    I think coming up with a policy for this situation is going to be
    tricky.

    Do I mind an applicant asking an LLM to refresh their memory on how to
    import a new upstream version?
    No, not at all.
Do they need to cite the LLM in their answer?
If it really is a memory refresh and they know the material well enough
to have confidence that the LLM answer is correct, I do not think they
need to cite it.
If they don't know the material well enough to know the LLM is correct,
then an LLM is a bad choice.

    But the same is true of a human I might ask.
If I asked you to remind me of something about importing a new upstream,
    and it really was just a reminder, I would not cite your contribution
    unless I used a significant chunk of text you had written.
    If you gave me bad info and I didn't catch it, then we learn I probably
    should not be trusted to pick good sources for my education.

    --Sam

  • From Mo Zhou@21:1/5 to Stefano Zacchiroli on Sat May 4 04:00:01 2024
On 5/3/24 12:10, Stefano Zacchiroli wrote:
> On that front, useful "related work" are the policies that scientific
> journals and conferences (which are exposed *a lot* to this, given their
> main activity is vetting textual documents) have put in place about
> this.
Indeed. Here are some examples:
Nature: https://www.nature.com/nature-portfolio/editorial-policies/ai
ICML: https://icml.cc/Conferences/2023/llm-policy
CVPR: https://cvpr.thecvf.com/Conferences/2024/ReviewerGuidelines
      https://cvpr.thecvf.com/Conferences/2024/AuthorGuidelines

Some additional points to the two from Stefano:
1. Nature does not allow an LLM to be an author.
2. CVPR holds an author who used an LLM responsible for all of the LLM's faults.
3. CVPR agrees that paper reviewers who skip their work with an LLM
   are harming the community.
> The general policy usually contains two main points (paraphrased below):
>
> (1) You are free to use AI tools to *improve* your content, but not to
> create it from scratch for you.
Polishing language is the case where I find LLMs most useful. But in fact,
as an author, when I really care about the quality of whatever I wrote,
I find even the state-of-the-art LLMs (such as ChatGPT4) poor in logic and
poor at understanding my deeper insights. They eventually turn into a
smart language tutor for me.
> (2) You need to disclose the fact you have used AI tools, and how you
> have used them.
Yes, it is commonly encouraged to acknowledge the use of AI tools.
> Exactly as in your case, Tiago, people managing scientific journals and
> conferences have absolutely no way of checking if these rules are
> respected or not. (They have access to large-scale plagiarism detection
> tools, which is a related but different concern.) They just ask people
> to *state* they followed this policy upon submission, but that's it.
If a cheater who uses an LLM is lazy enough not to edit the LLM outputs
at all, you will find it super easy to identify on your own whether a chunk
of text was produced by an LLM. For example, I used ChatGPT basically
every day in March, and its answers always feel like they are organized in
the same format. No human answers questions in the same boring format all
the time.
> If your main concern is people using LLMs or the like in some of the
> processes you mention, a checkbox requiring such a statement upon
> submission might go a longer way than a project-wide statement (which
> will sit in d-d-a unknown to n-m applicants a few years from now).
In the long run, there is no way to enforce a ban on the use of AI across
this project. What is doable, from my point of view, is to confirm that
a person acknowledges the issues, potential risks and implications of
the use of AI tools, and to hold people who use AI responsible for the
AI's faults.

After all, it's easy to classify one's intention in using AI -- it is either
good or bad. If NM applicants can easily get the answer to an
NM question, maybe it is time to refresh the question? After all, nobody
can stop one from learning from AI outputs when they need suggestions
or reference answers -- and they are responsible for the answer
if the AI is wrong.

Apart from deliberately conducting bad acts using AIs, one thing that seems
benign but is harmful to the community is slacking off and skipping important
work with AIs. But still, this can be covered by a single rule as well --
"Let the person who uses AI be responsible for the AI's faults."

Simple, and doable.

  • From Jack Warkentin@21:1/5 to Mo Zhou on Mon May 6 21:40:01 2024
    Hi Everybody

I am an 86-year-old long-time user of Debian GNU/Linux (at least 20
years). And I subscribe to the debian-project mailing list as well as a
couple of other Debian mailing lists. I sometimes have problems
understanding abbreviations and acronyms used on these lists and
occasionally in package documentation.

While reading this thread I could not understand the "NM" acronym (and
some other abbreviations as well). I finally found out NM's meaning by
looking at https://www.debian.org/sitemap and reading "Debian New
Members Corner".

It would be helpful if Debian would create (and keep up to date) a web
page of acronyms and abbreviations used by Debian literati. Or is there
already such a page, but not listed on the site map?

    Regards

    Jack

    Jack Warkentin, phone 902-404-0457, email jwrk@eastlink.ca
    39 Inverness Avenue, Halifax, Nova Scotia, Canada, B3P 1X6

Mo Zhou wrote:
[...]
  • From Tiago Bortoletto Vaz@21:1/5 to All on Wed May 8 19:10:01 2024
    This is much more of a general note regarding this thread.

Apparently we are far from a consensus on an official Debian position
regarding the use of generative AI as a whole in the project.
We'll therefore be content to use the resources we have and let each team
handle such content using their own criteria -- though this is not what I
expected... expectations adjusted and all is fine :-)

    I'd be particularly happy to incorporate suggestions from Zack and
    others in the areas I work on in Debian. Thanks anyway to everyone for
    the input, especially Mo Zhou, Russ and Zack. I hope this debate will
    come up again at a time when we better understand the consequences of
    all this.

    On 2024-05-03 21:32, Mo Zhou wrote:
    [...]

    Bests,

    --
    tvaz
