• Re: Re: Proposal -- Interpretation of DFSG on Artificial Intelligence

From Stephan Verbücheln@21:1/5 to All on Mon Apr 28 17:50:02 2025
    XPost: linux.debian.vote

    > Is the change technical or legal/philosophical? You could call this
    > a Turing test for copyright.

    This is not a new issue at all. I remember that back in the day, in
    order to legally reverse engineer a computer program, companies had to
    set up two separate teams of developers.
    One team reads the code and writes documentation. The second team reads
    the documentation and writes the new code. It was crucial that no
    member of the second team ever saw the original code, in order to rule
    out any copyright issues.

    Processing of experiences into expert opinion is IMHO not directly
    comparable with the compilation of source code to a binary.
    The DFSG does not only apply to programming languages and program
    binaries. For all data blobs in Debian packages, it is preferred to
    include the scripts that generate them; for images, it is preferred
    to have the SVG source over the generated pixel graphics, etc.
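
    For illustration, a rough sketch of that preference in practice,
    assuming a hypothetical icon.svg and the cairosvg library:

      # Regenerate the shipped pixel graphic from its SVG source at build
      # time, so the package carries the preferred form (the SVG) plus the
      # script that generates the blob. File names are hypothetical.
      import cairosvg

      cairosvg.svg2png(url="icon.svg", write_to="icon.png",
                       output_width=64, output_height=64)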

    There is a reason why the relevant licenses do not define “source
    code” as being written in a programming language readable by humans.
    Instead, they define it like this (example from GPLv3):

    The “source code” for a work means the preferred form of the work for making modifications to it.

    In that definition, training data is quite obviously relevant. No one
    tweaks neural network model weights manually.

    Compare this to the previously mentioned example of S-boxes in
    cryptography. They are small and usually created manually.
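
    To make the contrast concrete, a rough Python sketch; the table shows
    the first entries of the well-known AES S-box, and the weight array
    is a random stand-in for a real model:

      # An S-box is a small, hand-authored lookup table. The table itself
      # is the preferred form for modifying it, i.e. it is its own source.
      # First 16 of the 256 entries of the AES S-box:
      AES_SBOX = [
          0x63, 0x7C, 0x77, 0x7B, 0xF2, 0x6B, 0x6F, 0xC5,
          0x30, 0x01, 0x67, 0x2B, 0xFE, 0xD7, 0xAB, 0x76,
          # ... the remaining 240 entries are elided here
      ]

      # Neural network weights, by contrast, are millions of opaque floats
      # that nobody edits by hand; the preferred form for modifying them
      # is the training data plus the training code.
      import numpy as np

      weights = np.random.randn(1_000_000).astype(np.float32)  # stand-in
      print(f"{weights.size} parameters, e.g. {weights[0]:+.6f}")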

    Regards
    Stephan

  • From Gunnar Wolf@21:1/5 to All on Tue Apr 29 02:10:01 2025
    XPost: linux.debian.vote

    Stephan Verbücheln said [Mon, Apr 28, 2025 at 03:46:27PM +0000]:
    > (...)
    > The “source code” for a work means the preferred form of the work
    > for making modifications to it.
    >
    > In that definition, training data is quite obviously relevant. No one
    > tweaks neural network model weights manually.
    >
    > Compare this to the previously mentioned example of S-boxes in
    > cryptography. They are small and usually created manually.

    I understand that, when you consider trained models as the "thing" to
    be modified, the preferred form of modification is the model itself:
    what RAG does is take a base trained LLM (conferring the "mastery" of
    language) and train over it with the domain-specific knowledge.

  • From Aigars Mahinovs@1:229/2 to All on Mon Apr 28 18:30:01 2025
    XPost: linux.debian.vote
    From: aigarius@gmail.com

    On Mon, 28 Apr 2025 at 17:46, Stephan Verbücheln <verbuecheln@posteo.de> wrote:

    > > Is the change technical or legal/philosophical? You could call this
    > > a Turing test for copyright.
    >
    > This is not a new issue at all. I remember that back in the day, in
    > order to legally reverse engineer a computer program, companies had to
    > set up two separate teams of developers.
    > One team reads the code and writes documentation. The second team reads
    > the documentation and writes the new code. It was crucial that no
    > member of the second team ever saw the original code, in order to rule
    > out any copyright issues.


    But does it? If we consider the product of trained knowledge to be a
    derivative work of the training input, then the documentation produced
    by the first team would also be tainted by the copyright of the
    original code. Such an interpretation therefore defeats the whole
    two-team process.

    And many modern LLMs are actually trained in stages: there is a very
    large model that is trained on the source data, and then there are
    compact models that are trained on the outputs of the first model.
    This is called model distillation.
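
    A minimal sketch of that idea, assuming hypothetical teacher and
    student models (PyTorch-style; not anyone's actual training pipeline):

      # The compact "student" is trained to match the output distribution
      # of the large "teacher" rather than the original source data.
      import torch
      import torch.nn.functional as F

      teacher = torch.nn.Linear(16, 4)  # stand-in for a large trained model
      student = torch.nn.Linear(16, 4)  # compact model being distilled
      optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
      T = 2.0  # temperature softens the teacher's distribution

      for _ in range(100):
          x = torch.randn(32, 16)  # unlabeled example inputs
          with torch.no_grad():
              teacher_probs = F.softmax(teacher(x) / T, dim=-1)
          student_logp = F.log_softmax(student(x) / T, dim=-1)
          loss = F.kl_div(student_logp, teacher_probs, reduction="batchmean")
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()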

    And then there are other methods of getting new information into
    already-trained models at runtime, such as the RAG technique: with
    that, an LLM may contain only fundamental information and then reach
    out to load additional data sources relevant to the specific query,
    like an expert going online and checking prices and availability of
    various products before advising you what to choose for your planned
    build. At this point the LLM+RAG is just a smart web browser.
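
    A minimal sketch of that retrieval step; the corpus, the embed()
    stand-in and the llm() stub are all hypothetical:

      # Fetch documents relevant to the query and prepend them to the
      # prompt, so fresh data enters at query time rather than being
      # baked into the trained weights.
      import numpy as np

      corpus = ["GPU A costs 499 EUR, in stock.",
                "GPU B costs 649 EUR, sold out."]

      def embed(text: str) -> np.ndarray:
          # Stand-in embedding; a real system would use a trained encoder.
          rng = np.random.default_rng(abs(hash(text)) % 2**32)
          return rng.standard_normal(64)

      def retrieve(query: str, k: int = 1) -> list[str]:
          q = embed(query)
          scores = [float(q @ embed(doc)) for doc in corpus]
          return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

      def llm(prompt: str) -> str:
          return prompt  # stub: a real system would call a language model

      def answer(query: str) -> str:
          context = "\n".join(retrieve(query))
          return llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

      print(answer("How much does GPU A cost?"))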

    (Sadly, I am *not* an expert on modern AI technologies)

    --
    Best regards,
    Aigars Mahinovs
