• A policy on use of AI-generated content in Debian

    From Tiago Bortoletto Vaz@21:1/5 to All on Thu May 2 20:10:02 2024
    Hi,

I would like Debian to discuss and decide on the usage of AI-generated content within the project. I fear that we are already facing negative consequences in some areas of Debian as a result of its use. If I happen to be wrong, I'm still afraid that it will happen very soon.

    You might already know that recently Gentoo made a strong move in this context and drafted their AI policy:

    - https://wiki.gentoo.org/wiki/Project:Council/AI_policy
    - https://www.mail-archive.com/gentoo-dev@lists.gentoo.org/msg99042.html

I personally agree with the author's rationale on the aspects pointed out (copyright, quality and ethical ones). But at this point I guess we might have more questions than answers, that's why I think it'd be helpful to have some input before suggesting any concrete proposals. Perhaps the most important step now is to get an idea of how Debian folks actually feel about this matter.
And how we feel about moving in a similar direction to what the Gentoo project did.

    If things move in a somewhat consensual direction, I intend to bring the discussion to -vote in order to discuss a possible GR.

    Bests,

    --
    tvaz

• From Ansgar 🙀@21:1/5 to Tiago Bortoletto Vaz on Thu May 2 20:40:01 2024
    Hi,

On Thu, 2024-05-02 at 14:01 -0400, Tiago Bortoletto Vaz wrote:
> I would like Debian to discuss and decide on the usage of
> AI-generated content within the project.

    It's just another tool that might or might not be non-free like people
    using Photoshop, Google Chrome, Gmail, Windows, ... to make
    contributions. Or a spamfilter to filter out some.

> You might already know that recently Gentoo made a strong move in
> this context and drafted their AI policy:
>
> - https://wiki.gentoo.org/wiki/Project:Council/AI_policy
> - https://www.mail-archive.com/gentoo-dev@lists.gentoo.org/msg99042.html


I think that is a bad policy: we don't ban Tor because it can be used
for copyright violations, terrorism, stalking, hacking or other unethical
things. We don't have a general ban on human contributions due to
quality concerns. I don't see why AI, as yet another tool, should be
different.

    Ansgar

  • From Dominik George@21:1/5 to All on Thu May 2 21:00:01 2024
    Hi,

> It's just another tool that might or might not be non-free like people
> using Photoshop, Google Chrome, Gmail, Windows, ... to make
> contributions. Or a spamfilter to filter out some.

    That's entirely not the point.

    It is not about **the tool** being non-free, but the result of its use being non-free.

    Generative AI tools **produce** derivatives of other people's copyrighted works.

    That said, we already have the necessary policies in place:

    * d/copyright must be accurate
    * all sources must be reproducible from their preferred form of modification

    Both are not possible using generative AI.
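
To make the first point concrete, here is a minimal sketch of a machine-readable d/copyright stanza (names, years and licenses purely illustrative):

    Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
    Upstream-Name: example-tool

    Files: src/*
    Copyright: 2020-2024 Jane Upstream <jane@example.org>
    License: GPL-2+

    Files: src/vendored/util.c
    Copyright: 2018 Someone Else
    License: Expat

For code an LLM emitted from unknown training sources, there is no honest way to fill in the Copyright and License fields of such a stanza.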

    -nik

  • From Sam Hartman@21:1/5 to All on Thu May 2 22:20:02 2024
    "Ansgar" == Ansgar 🙀 <ansgar@43-1.org> writes:

    Ansgar> Hi,
    Ansgar> On Thu, 2024-05-02 at 14:01 -0400, Tiago Bortoletto Vaz wrote:
    >> I would like Debian to discuss and decide on the usage of AI-
    >> generated content within the project.

    Ansgar> It's just another tool that might or might not be non-free like people
    Ansgar> using Photoshop, Google Chrome, Gmail, Windows, ... to make
    Ansgar> contributions. Or a spamfilter to filter out some.

    I tend to agree with the above. AI is just another tool, and I trust
    DDs to use it appropriately.

    I probably would not use AI to write large blocks of code, because I
    find that auditing the quality of AI generated code is harder than just
    writing the code in most cases.

    I might:

    * use debgpt to guess answers to questions about packaging that I could
    verify in some manner.

    * Use generative AI to suggest names of projects, help improve
    descriptions, summarize content, etc.

* See if generative AI could help produce a message with a desired
tone.

    --Sam

  • From Mo Zhou@21:1/5 to Tiago Bortoletto Vaz on Thu May 2 22:20:02 2024
On 5/2/24 14:01, Tiago Bortoletto Vaz wrote:
> You might already know that recently Gentoo made a strong move in this context
> and drafted their AI policy:
>
> - https://wiki.gentoo.org/wiki/Project:Council/AI_policy
> - https://www.mail-archive.com/gentoo-dev@lists.gentoo.org/msg99042.html

People might not already know that I wrote this 4 years ago: https://salsa.debian.org/deeplearning-team/ml-policy/-/blob/master/ML-Policy.rst

  • From Mo Zhou@21:1/5 to Dominik George on Thu May 2 22:20:02 2024
On 5/2/24 14:47, Dominik George wrote:
> That's entirely not the point.
>
> It is not about **the tool** being non-free, but the result of its use being non-free.
>
> Generative AI tools **produce** derivatives of other people's copyrighted works.

Yes. That includes the case where LLMs generate copyrighted content with large portions of overlap. For instance, https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html
because that copyrighted content is part of their original training dataset.
Apart from LLMs (large language models), image generation models and other generative AIs will also do something similar, partly copying their copyrighted training data into the generated results, to some extent.

> That said, we already have the necessary policies in place:
>
> * d/copyright must be accurate
> * all sources must be reproducible from their preferred form of modification
>
> Both are not possible using generative AI.

    Both are possible.

For example, a developer may use an LLM to aid programming, and the LLM may copy some code from a copyrighted source. But the developer is very unlikely to be able to tell whether the generated code contains a verbatim copy of copyrighted content, let alone the source of those copyrighted parts, if any.

Namely, if we look at new software projects, we do not know whether the code files are purely human-written or written with some aid from AI.

Similar things happen with other file types, such as images. For instance, you may ask a generative AI to generate a logo, or some artwork as part of a software project. And those generated results, with or without further modifications, can be stored in .ico, .jpg, and .png formats, etc.

Now, the problem is, FTP masters will not question the reproducibility of a code file, or a .png file. If the upstream author does not acknowledge the use of AI during the development process, it is highly likely that nobody else on earth will know.

This does not sound like a situation where we can take any action to improve. My only opinion towards this is to trust the upstream authors' acknowledgements.

BTW, ML-Policy foresaw such an issue and covered it to some extent: https://salsa.debian.org/deeplearning-team/ml-policy/-/blob/master/ML-Policy.rst
See the "Generated Artifacts" section.

It seems that the draft Open Source AI Definition does not cover content generated by AI models yet: https://discuss.opensource.org/t/draft-v-0-0-8-of-the-open-source-ai-definition-is-available-for-comments/315

  • From Russ Allbery@21:1/5 to Tiago Bortoletto Vaz on Thu May 2 22:20:02 2024
    Tiago Bortoletto Vaz <tiago@debian.org> writes:

> I personally agree with the author's rationale on the aspects pointed
> out (copyright, quality and ethical ones). But at this point I guess we
> might have more questions than answers, that's why I think it'd be
> helpful to have some input before suggesting any concrete proposals.
> Perhaps the most important step now is to get an idea of how Debian
> folks actually feel about this matter. And how we feel about moving in
> a similar direction to what the Gentoo project did.

    I'm dubious of the Gentoo approach because it is (as they admit)
    unenforceable, which to me means that it's not a great policy. A position statement, maybe, but that's a different sort of thing.

    I also agree in part with Ansgar: we don't make policies against what
    tools people use locally for developing software.

    I think the piece that has the most direct impact on Debian is if the
    output from the AI software is found to be a copyright infringement and therefore something that Debian does not have permission to redistribute
    or that violates the DFSG. But we're going to be facing that problem with upstreams as well, so the scope of that problem goes far beyond the
    question of direct contributions to Debian, and I don't think direct contributions to Debian will be the most significant part of that problem.

    This is going to be a tricky and unsettled problem for some time, since
it's both legal (in multiple jurisdictions) and moral, and it's quite
    possible that the legal judgments will not align with moral judgments.
    (Around copyright, this is often the case.) I'm dubious of our ability to
    get ahead of the legal process on this, given that it's unlikely that
    we'll even be able to *detect* if upstreams are using AI. I think this is
    a place where it's better to plan on being reactive than to attempt to be proactive. If we get credible reports that software in Debian is not redistributable under the terms of the DFSG, we should deal with that like
    we would with any other DFSG violation. That may involve making judgment
    calls about the legality of AI-generated content, but hopefully this will
    have settled out a bit in broader society before we're forced to make a decision on a specific case.

    I also doubt that there is much alignment within Debian about the morality
    of copyright infringement in general. We're a big-tent project from that perspective. Our project includes people who believe all software
    copyright is an ill-advised legal construction that limits people's
    freedom, and people who believe strongly in moral rights expressed through copyright and in the right of an author to control how their work is used.
    We could try to reach some sort of project consensus on the moral issues
    here, but I'm a bit dubious we would be successful.

    At the moment, my biggest concern about the practical impact of AI is that
    most of the output is low-quality garbage and, because it's now automated,
    the volume of that low-quality garbage can be quite high. (I am
    repeatedly assured by AI advocates that this will improve rapidly. I
    suppose we will see. So far, the evidence that I've seen has just led me
    to question the standards and taste of AI advocates.) But I don't think dealing with this requires any new *policies*. I think it's a fairly
    obvious point of Debian collaboration that no one should deluge their
    fellow project members in low-quality garbage, and if that starts
    happening, I think we have adequate mechanisms to complain and ask that it
    stop without making new policy.

    About the only statement that I've wanted to make so far is to say that
    anyone relying on AI to summarize important project resources like Debian Policy or the Developers Guide or whatnot is taking full responsibility
    for any resulting failures. If you ask an AI to read Policy for you and
    it spits out nonsense or lies, this is not something the Policy Editors
    have any time or bandwidth to deal with.

    --
    Russ Allbery (rra@debian.org) <https://www.eyrie.org/~eagle/>

  • From Tiago Bortoletto Vaz@21:1/5 to Dominik George on Fri May 3 00:10:01 2024
On Thu, May 02, 2024 at 08:47:31PM +0200, Dominik George wrote:
> Hi,
>
> > It's just another tool that might or might not be non-free like people
> > using Photoshop, Google Chrome, Gmail, Windows, ... to make
> > contributions. Or a spamfilter to filter out some.
>
> That's entirely not the point.
>
> It is not about **the tool** being non-free, but the result of its use being non-free.
>
> Generative AI tools **produce** derivatives of other people's copyrighted works.
>
> That said, we already have the necessary policies in place:
>
> * d/copyright must be accurate
> * all sources must be reproducible from their preferred form of modification
>
> Both are not possible using generative AI.

That sounds right, but those policies relate mainly to Debian packages, while Debian also releases a fair amount of other kinds of content.

    Bests,

    --
    tvaz

  • From Sam Hartman@21:1/5 to All on Fri May 3 01:30:01 2024
    "Dominik" == Dominik George <natureshadow@debian.org> writes:


    Dominik> Generative AI tools **produce** derivatives of other people's copyrighted works.

    Dominik> That said, we already have the necessary policies in place:

    Russ pointed out this is a fairly complicated claim.

    It is absolutely true that generative AI models have produced output
    that contains copyrighted text.

The questions of whether that text is an infringing derivative of those copyrighted works are making their way through a number of lawsuits in
my country, at least.
    And as Russ points out the moral issues are going to be even harder to
    figure out.

    I don't think it is as simple as you write above, and I agree with
    Russ's thoughts on the situation.

    --Sam

  • From Tiago Bortoletto Vaz@21:1/5 to Russ Allbery on Fri May 3 03:00:01 2024
    Hi Russ,

On Thu, May 02, 2024 at 11:59:10AM -0700, Russ Allbery wrote:
> Tiago Bortoletto Vaz <tiago@debian.org> writes:
>
> > I personally agree with the author's rationale on the aspects pointed
> > out (copyright, quality and ethical ones). But at this point I guess we
> > might have more questions than answers, that's why I think it'd be
> > helpful to have some input before suggesting any concrete proposals.
> > Perhaps the most important step now is to get an idea of how Debian
> > folks actually feel about this matter. And how we feel about moving in
> > a similar direction to what the Gentoo project did.
>
> I'm dubious of the Gentoo approach because it is (as they admit)
> unenforceable, which to me means that it's not a great policy. A position
> statement, maybe, but that's a different sort of thing.

Right, note that they acknowledged this policy is a work in progress. Not perfect, but 'something needed to be done, quickly'. It's hard to find a balance here, but I kind of share this sense of urgency.

Also, I agree that a statement is indeed a more appropriate tool for the circumstance. Although I mention Gentoo's policy, I acknowledge that I should have worded the title of my first message better, as proposing a policy (in Debian terms a policy is really a *Policy* and has huge implications!) didn't reflect my intentions very well.

    [...]

> About the only statement that I've wanted to make so far is to say that
> anyone relying on AI to summarize important project resources like Debian
> Policy or the Developers Guide or whatnot is taking full responsibility
> for any resulting failures. If you ask an AI to read Policy for you and
> it spits out nonsense or lies, this is not something the Policy Editors
> have any time or bandwidth to deal with.

This point resonates with problems we might be facing already, for instance
in the NM process and also in Debconf submissions (there's no point in going into details here because so far we can't prove anything, and even if we could, of course we wouldn't bring any of those involved into the public arena). So I'm actually more concerned about LLMs being mindlessly applied in our communication processes (NM, bts, debconf, irc, planet, wiki, website, debian.net stuff, etc) than about someone using AI-assisted code in our infra, at least for now.

    Again, I correct myself and emphasize that I would rather discuss a possible statement than a policy for Debian on this matter.

    Bests,

    --
    tvaz

  • From Charles Plessy@21:1/5 to All on Fri May 3 06:30:01 2024
On Thu, May 02, 2024 at 02:01:20PM -0400, Tiago Bortoletto Vaz wrote:

> I would like Debian to discuss and decide on the usage of AI-generated content
> within the project.

    Hi Tiago,

as a Debian developer I refrain from using commercial AI to generate
code used in my packaging work or native packages, because I think that
these systems are copyright laundering machines that make it possible to
suck up the energy invested in Free Software and transfer it into
proprietary works (and to a lesser extent into un-GPL works).

If I were to hear that other Debian developers use them in that context, I
would seriously question whether there is any value in spending my
volunteer time keeping debian/copyright files accurate to the level
of detail our Policy asks for. When the world and we ourselves have
given up on respecting Free Software copyrights and passing on attribution,
I will not see the point in spending time doing more than the bare minimum,
as for instance in Anaconda, where you just get "License: MIT" and the
right to download the sources and check the year of attribution
and the names of contributors yourself.

This said, I have not found time to try debgpt and feel guilty about
this. If there were a tool trained on source code for which the authors
gave their consent, where the license terms were compatible, and which
provided its output under the most viral terms (probably AGPL), I would
love to use it and give attribution to the community of contributors to
the software.

So in summary, I would probably vote for a GR calling against the use of
current commercial AI for generating Debian packaging, native, or
infrastructure code, unless of course good arguments to the contrary are
provided. This said, I think that we cannot and should not police
people who do not respect the call.

    Have a nice day,

    Charles

    --
Charles Plessy, Nagahama, Yomitan, Okinawa, Japan
Debian Med packaging team: http://www.debian.org/devel/debian-med
Tooting from work: https://fediscience.org/@charles_plessy
Tooting from home: https://framapiaf.org/@charles_plessy

• From Ansgar 🙀@21:1/5 to Dominik George on Fri May 3 07:50:01 2024
On Thu, 2024-05-02 at 20:47 +0200, Dominik George wrote:
> > It's just another tool that might or might not be non-free like people
> > using Photoshop, Google Chrome, Gmail, Windows, ... to make
> > contributions. Or a spamfilter to filter out some.
>
> That's entirely not the point.
>
> It is not about **the tool** being non-free, but the result of its use being non-free.
>
> Generative AI tools **produce** derivatives of other people's copyrighted works.

    They *can* do that, but so can humans (and will). Humans look at a
    product or code and write new code that sometimes resembles the
    original very much.

    The claim "everything a generative AI tool outputs is a derivative
    work" would be rather bold.

> That said, we already have the necessary policies in place:
>
> * d/copyright must be accurate
> * all sources must be reproducible from their preferred form of modification
>
> Both are not possible using generative AI.

    They are, just as they are for human results. The preferred form of modification can be based on the output from a generative AI,
    especially if it is further edited.

    But this is not something new: a camera or microphone records data, but
    we use the captured data (and not the original source) as the preferred
    form of modification. Sometimes even after generous preprocessing by
    (non-free) firmware.

    (We don't include human neural network data as part of the preferred
    form of modification either ;-))

    Ansgar

  • From Dominik George@21:1/5 to All on Fri May 3 11:50:01 2024
    Hi,

> > Generative AI tools **produce** derivatives of other people's copyrighted works.
>
> They *can* do that, but so can humans (and will). Humans look at a
> product or code and write new code that sometimes resembles the
> original very much.

Can I ask the LLM what it was probably inspired by?

Can I show the LLM another work and ask it whether there might be a chance they got inspired by it (and get a different answer than that it probably sucked in everything, so yes)?

    Is there a chance that the LLM did not only read some source code, but also had friendly interactions with the author?

    Admittedly, at that point, we get into philosophical questions, which I don't consider any less important for free software.

    -nik

  • From Andrey Rakhmatullin@21:1/5 to Charles Plessy on Fri May 3 14:30:01 2024
On Fri, May 03, 2024 at 01:04:29PM +0900, Charles Plessy wrote:
> If I were to hear that other Debian developers use them in that context, I
> would seriously question whether there is any value in spending my
> volunteer time keeping debian/copyright files accurate to the level
> of detail our Policy asks for.
There is a popular old opinion, unrelated to AI, that there is not.

    --
    WBR, wRAR

  • From Jose-Luis Rivas@21:1/5 to Tiago Bortoletto Vaz on Fri May 3 16:10:01 2024
On Thu May 2, 2024 at 9:21 PM -03, Tiago Bortoletto Vaz wrote:
> Right, note that they acknowledged this policy is a work in progress. Not
> perfect, but 'something needed to be done, quickly'. It's hard to find a
> balance here, but I kind of share this sense of urgency.
>
> [...]
>
> This point resonates with problems we might be facing already, for instance
> in the NM process and also in Debconf submissions (there's no point in going
> into details here because so far we can't prove anything, and even if we could,
> of course we wouldn't bring any of those involved into the public arena). So I'm
> actually more concerned about LLMs being mindlessly applied in our communication
> processes (NM, bts, debconf, irc, planet, wiki, website, debian.net stuff, etc)
> than about someone using AI-assisted code in our infra, at least for now.


    Hi Tiago,

It seems you have more context than the rest, which provides a sense of
urgency for you, where others do not have this same information and
can't share this sense of urgency.

If I were to assume based on the little context you shared, I would say
there's someone doing an NM application using an LLM, answering stuff with
an LLM and passing all their communications through LLMs.

In that case, there's even less point in making a policy about it, in my
opinion. Since, as you stated, you can't prove anything, and ultimately
it would land in the hands of the people approving submissions or NMs to
judge whether the person is qualified or not. And you can't block
communications containing LLM-generated content when you can't even prove it's
LLM-generated content. How would you enforce it?

And I doubt a statement would do much either. What would be
communicated? "Communications produced by LLMs are troublesome"? I don't
know if there's much substance for a statement of that sort.

OTOH, an LLM-assisted rewrite of your own content may help non-native
English speakers write better and improve communication
effectiveness. Hence, saying "communications produced by LLMs are
troublesome" would be troublesome itself, since how can you as a
receiver tell whether it's their own content or someone else's content?

Some may say "a statement could at least be used as a pointer to say
'these are our expectations regarding use of AI'", but ultimately it is in
the hands of those judging to filter out or not. And if those judging
can't even prove whether AI was used, what's the point?

I can't see the point of "something needs to be done" without clear
reasoning about what that something is expected to achieve.

    --Jose

  • From Stefano Zacchiroli@21:1/5 to Tiago Bortoletto Vaz on Fri May 3 18:30:02 2024
On Thu, May 02, 2024 at 08:21:28PM -0400, Tiago Bortoletto Vaz wrote:
> So I'm actually more concerned about LLMs being mindlessly applied in
> our communication processes (NM, bts, debconf, irc, planet, wiki,
> website, debian.net stuff, etc) than about someone using AI-assisted code
> in our infra, at least for now.

    On that front, useful "related work" are the policies that scientific
    journals and conferences (which are exposed *a lot* to this, given their
    main activity is vetting textual documents) have put in place about
    this.

    The general policy usually contains two main points (paraphrased below):

    (1) You are free to use AI tools to *improve* your content, but not to
    create it from scratch for you.

This point is particularly important for non-native English speakers,
who can benefit a lot more than natives from tool support for tasks
like proofreading/editing. I suspect the Debian community might be
particularly sensitive to this argument. (And note that on this one
the barrier between ChatGPT-based proofreading and other grammar/
style checkers will become more and more blurry in the future.)

    (2) You need to disclose the fact you have used AI tools, and how you
    have used them.

    Exactly as in your case, Tiago, people managing scientific journals and conferences have absolutely no way of checking if these rules are
    respected or not. (They have access to large-scale plagiarism detection
    tools, which is a related but different concern.) They just ask people
    to *state* they followed this policy upon submission, but that's it.

    If your main concern is people using LLMs or the like in some of the
    processes you mention, a checkbox requiring such a statement upon
    submission might go a longer way than a project-wide statement (which
    will sit in d-d-a unknown to n-m applicants a few years from now).
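
As a minimal sketch, such a checkbox plus disclosure statement could look like this (wording purely illustrative, not an agreed Debian text):

    [ ] I did not use AI tools to prepare this submission.
    [ ] I used AI tools to prepare this submission, in the following
        way (tool and manner of use): _______________________________

This mirrors the journal policies above: people attest to both the fact and the manner of use, even if the attestation cannot be verified.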

    Cheers
    --
Stefano Zacchiroli . zack@upsilon.cc . https://upsilon.cc/zack
Full professor of Computer Science, Télécom Paris, Polytechnic Institute of Paris
Co-founder & CTO Software Heritage
https://twitter.com/zacchiro . https://mastodon.xyz/@zacchiro

  • From Tiago Bortoletto Vaz@21:1/5 to Jose-Luis Rivas on Fri May 3 19:00:01 2024
    Hi Jose,

Thanks for your input, I have a few comments:

On Fri, May 03, 2024 at 11:02:47AM -0300, Jose-Luis Rivas wrote:
> On Thu May 2, 2024 at 9:21 PM -03, Tiago Bortoletto Vaz wrote:
> > Right, note that they acknowledged this policy is a work in progress. Not
> > perfect, but 'something needed to be done, quickly'. It's hard to find a
> > balance here, but I kind of share this sense of urgency.
> >
> > [...]
> >
> > This point resonates with problems we might be facing already, for instance
> > in the NM process and also in Debconf submissions (there's no point in going
> > into details here because so far we can't prove anything, and even if we could,
> > of course we wouldn't bring any of those involved into the public arena). So I'm
> > actually more concerned about LLMs being mindlessly applied in our communication
> > processes (NM, bts, debconf, irc, planet, wiki, website, debian.net stuff, etc)
> > than about someone using AI-assisted code in our infra, at least for now.
>
> Hi Tiago,
>
> It seems you have more context than the rest, which provides a sense of
> urgency for you, where others do not have this same information and
> can't share this sense of urgency.

    Yes.

> If I were to assume based on the little context you shared, I would say
> there's someone doing an NM application using an LLM, answering stuff with
> an LLM and passing all their communications through LLMs.
>
> In that case, there's even less point in making a policy about it, in my
> opinion. Since, as you stated, you can't prove anything, and ultimately
> it would land in the hands of the people approving submissions or NMs to
> judge whether the person is qualified or not. And you can't block
> communications containing LLM-generated content when you can't even prove it's
> LLM-generated content. How would you enforce it?

Hmm, I tend to disagree here. Proving by investigation isn't the only way to get at some truth about the situation. We can get it by simply asking the person whether they used an LLM to generate their work (be it an answer to NM questions, or a contribution to the Debian website, or an email to this mailing list...). In that scenario, having a policy, a position statement or even a gentle guideline would make a huge difference in the ongoing exchange.

> And I doubt a statement would do much either. What would be
> communicated? "Communications produced by LLMs are troublesome"? I don't
> know if there's much substance for a statement of that sort.

Just to set the scene a little on how I think about the issue: when I brought up this discussion, I didn't have in mind someone evil attempting to use AI to deliberately disrupt the project. We know already that policies or statements are never sufficient to deal with people in this category. Rather, I see many people (mostly younger contributors) who are coming to use LLMs in their daily life in a quite mindless way -- which of course is none of our business if they do so in their private life. However, the issues that can arise from using this kind of technology without much consideration in a community like Debian are not obvious to everyone, and I don't expect every Debian contributor to have a sufficiently good understanding of the matter, or maturity, at the moment they start contributing to the project. We can draw an analogy here with the CoC and the Diversity Statement: they might seem quite obvious to some, and less so to others.

So far I've felt a certain resistance to adopting something as sharp as what Gentoo did (which, as I've said, I agree with). However, I still have the feeling that a position in the form of a statement or even a guideline could help us both avoid and mitigate possible problems in the future.

    Bests,

    --
    tvaz

  • From Russ Allbery@21:1/5 to Stefano Zacchiroli on Fri May 3 19:50:01 2024
    Stefano Zacchiroli <zack@debian.org> writes:

> (1) You are free to use AI tools to *improve* your content, but not to
> create it from scratch for you.
>
> This point is particularly important for non-native English speakers,
> who can benefit a lot more than natives from tool support for tasks
> like proofreading/editing. I suspect the Debian community might be
> particularly sensitive to this argument. (And note that on this one
> the barrier between ChatGPT-based proofreading and other grammar/
> style checkers will become more and more blurry in the future.)

    This underlines a key point to me, which is that "AI" is a marketing term,
    not a technical classification. Even LLMs, a more technical
    classification, can be designed to do different things, and I expect
    hybrid models to become more widespread as the limitations of trying to do literally everything via an LLM become more apparent.

Grammar checkers, automated translation, and autocorrect are all useful
tools in their appropriate place. Some people have moral concerns about
how they're constructed and other people don't. I'm not sure we'll have a
consensus on that. So far, at least, there don't seem to be the sort of
legal challenges for those types of applications that there are for the
"write completely new text based on a prompt" style of LLM.

    Just on a personal note, I do want to make a plea to non-native English speakers to not feel like you need to replace your prose with something generated by an LLM.

    I don't want to understate the benefits of grammar checking, translation,
    and other tools, and I don't want to underestimate the frustration and difficulties in communicating in a non-native language. I think ethical
    tools to assist with that are great. But I would much rather puzzle out
    odd or less-than-fluent English, extend assumptions of good will, and work through the occasional misunderstanding, if that means I can interact with
    a real human voice.

    I know, I know, supposedly this is all getting better, but so much of the
    text produced by ChatGPT and similar tools today sounds like a McKinsey consultant trying to sell war crimes to a marketing executive. Yes, it's precisely grammatical and well-structured English. It's also sociopathic, completely soulless, and almost impossible to concentrate on because it's
    full of the sort of slippery phrases and opaque verbosity of a politician trying to distract from some sort of major scandal. I want to talk to
    you, another human being, not to an LLM trained to sound like a corporate
    web site.

    --
    Russ Allbery (rra@debian.org) <https://www.eyrie.org/~eagle/>

  • From Sam Hartman@21:1/5 to All on Fri May 3 20:00:02 2024
    "Tiago" == Tiago Bortoletto Vaz <tiago@debian.org> writes:

    Tiago> Hi Jose,
    Tiago> Thanks for you input, I have a few comments:

    Tiago> On Fri, May 03, 2024 at 11:02:47AM -0300, Jose-Luis Rivas wrote:
    >> On Thu May 2, 2024 at 9:21 PM -03, Tiago Bortoletto Vaz wrote:
>> > Right, note that they acknowledged this policy is a work in progress. Not
    >> > perfect, but 'something needed to be done, quickly'. It's hard to find a
    >> > balance here, but I kind of share this sense of urgency.
    >> >
    >> > [...]
    >> >
>> > This point resonates with problems we might be facing already, for instance
>> > in the NM process and also in Debconf submissions (there's no point in going
>> > into details here because so far we can't prove anything, and even if we could,
>> > of course we wouldn't bring any of those involved into the public arena). So I'm
>> > actually more concerned about LLMs being mindlessly applied in our communication
>> > processes (NM, bts, debconf, irc, planet, wiki, website, debian.net stuff, etc)
>> > than about someone using AI-assisted code in our infra, at least for now.
    >> >
    >>
    >> Hi Tiago,
    >>
    >> It seems you have more context than the rest which provides a sense of
    >> urgency for you, where others do not have this same information and
    >> can't share this sense of urgency.

    Tiago> Yes.

    Oh, wow, I had no idea that your argument for urgency came from the NM
    case.

I actually think that NM would not benefit from a policy here.
We already have a fairly good standard: did you prove to your
application manager, your advocates, and the reviewers (FD or DAM as
appropriate) that you can be trusted and that you have the necessary
technical and other skills to be a DD?

    I think it's fairly clear that using an LLM to answer questions in the
    NM process does not show that you have the technical skills.
    (Using it instead of reading a man page for similar results and then
    going and doing the work might be fine, but cutting and pasting an
    answer to an application question into the message you send to your AM
    clearly doesn't demonstrate your own technical skill.)

As an AM, I would find that an applicant who used an LLM as more than a
possibly-incorrect man page without telling me had violated my trust. I
don't need a policy to come to that conclusion. I don't think I would
have any trouble convincing DAM or FD to back my decision.

    I think coming up with a policy for this situation is going to be
    tricky.

    Do I mind an applicant asking an LLM to refresh their memory on how to
    import a new upstream version?
    No, not at all.
Do they need to cite the LLM in their answer?
If it really is a memory refresh and they know the material well enough
to have confidence that the LLM answer is correct, I do not think they
need to cite it.
If they don't know the material well enough to know the LLM is correct,
then an LLM is a bad choice.

    But the same is true of a human I might ask.
If I asked you to remind me of something about importing a new upstream,
    and it really was just a reminder, I would not cite your contribution
    unless I used a significant chunk of text you had written.
    If you gave me bad info and I didn't catch it, then we learn I probably
    should not be trusted to pick good sources for my education.

    --Sam

  • From Mo Zhou@21:1/5 to Stefano Zacchiroli on Sat May 4 04:00:01 2024
On 5/3/24 12:10, Stefano Zacchiroli wrote:
> On that front, useful "related work" are the policies that scientific
> journals and conferences (which are exposed *a lot* to this, given their
> main activity is vetting textual documents) have put in place about
> this.
Indeed. Here are some examples:
Nature: https://www.nature.com/nature-portfolio/editorial-policies/ai
ICML: https://icml.cc/Conferences/2023/llm-policy
CVPR: https://cvpr.thecvf.com/Conferences/2024/ReviewerGuidelines
      https://cvpr.thecvf.com/Conferences/2024/AuthorGuidelines

Some additional points to the two from Stefano:
1. Nature does not allow an LLM to be an author.
2. CVPR holds an author who used an LLM responsible for all of the LLM's faults.
3. CVPR agrees that paper reviewers who skip their work with an LLM
   are harming the community.
> The general policy usually contains two main points (paraphrased below):
>
> (1) You are free to use AI tools to *improve* your content, but not to
> create it from scratch for you.
Polishing language is the case where I find LLMs most useful. But in fact,
as an author, when I really care about the quality of whatever I wrote,
I find even the state-of-the-art LLMs (such as ChatGPT4) poor in logic and
poor at understanding my deeper insights. They eventually turn into a
smart language tutor for me.
> (2) You need to disclose the fact you have used AI tools, and how you
> have used them.
Yes, it is commonly encouraged to acknowledge the use of AI tools.
> Exactly as in your case, Tiago, people managing scientific journals and
> conferences have absolutely no way of checking if these rules are
> respected or not. (They have access to large-scale plagiarism detection
> tools, which is a related but different concern.) They just ask people
> to *state* they followed this policy upon submission, but that's it.
If a cheater who uses an LLM is lazy enough not to edit the LLM outputs
at all, you will find it super easy to identify on your own whether a chunk
of text was produced by an LLM. For example, I used ChatGPT basically
every day in March, and its answers always feel like they are organized in
the same format. No human answers questions in the same boring format all
the time.
> If your main concern is people using LLMs or the like in some of the
> processes you mention, a checkbox requiring such a statement upon
> submission might go a longer way than a project-wide statement (which
> will sit in d-d-a unknown to n-m applicants a few years from now).
In the long run, there is no way to enforce a ban on the use of AI across
this project. What is doable, from my point of view, is to confirm that
a person acknowledges the issues, potential risks and implications of
the use of AI tools, and to hold people who use AI responsible for the
AI's faults.

After all, it's easy to classify one's intention in using AI -- it is either
good or bad. If NM applicants can easily get the answer to an
NM question, maybe it is time to refresh the question? After all, nobody
can stop one from learning from AI outputs when they need suggestions
or reference answers -- and they are responsible for the answer
if the AI is wrong.

Apart from deliberately conducting bad acts using AIs, one thing that seems
benign but is harmful to the community is slacking off and skipping important
work with AIs. But still, this can be covered by a single rule as well --
"Let the person who uses AI be responsible for the AI's faults."

Simple, and doable.

  • From Jack Warkentin@21:1/5 to Mo Zhou on Mon May 6 21:40:01 2024
    Hi Everybody

I am an 86-year-old long-time user of Debian GNU/Linux (at least 20
years). And I subscribe to the debian-project mailing list as well as a
couple of other Debian mailing lists. I sometimes have problems
understanding abbreviations and acronyms used on these lists and
occasionally in package documentation.

While reading this thread I could not understand the "NM" acronym (and
some other abbreviations as well). I finally found out NM's meaning by
looking at https://www.debian.org/sitemap and reading "Debian New
Members Corner".

It would be helpful if Debian would create (and keep up to date) a web
page of acronyms and abbreviations used by Debian literati. Or is there
already such a page, but not listed on the site map?

    Regards

    Jack

    Jack Warkentin, phone 902-404-0457, email jwrk@eastlink.ca
    39 Inverness Avenue, Halifax, Nova Scotia, Canada, B3P 1X6

Mo Zhou wrote:
[...]
  • From Tiago Bortoletto Vaz@21:1/5 to All on Wed May 8 19:10:01 2024
    This is much more of a general note regarding this thread.

Apparently we are far from a consensus on an official Debian position
regarding the use of generative AI as a whole in the project.
We'll therefore be content to use the resources we have and let each team
handle such content using their own criteria -- though this is not what I
expected... expectations adjusted and all is fine :-)

    I'd be particularly happy to incorporate suggestions from Zack and
    others in the areas I work on in Debian. Thanks anyway to everyone for
    the input, especially Mo Zhou, Russ and Zack. I hope this debate will
    come up again at a time when we better understand the consequences of
    all this.

    On 2024-05-03 21:32, Mo Zhou wrote:
    [...]

    Bests,

    --
    tvaz
