• Some fun facts about AI and a few thoughts about the software ecosystem

    From M. Zhou@21:1/5 to All on Wed Mar 29 10:40:01 2023
    Hi folks,

    I seem to be good at starting lengthy mailing list threads. That said, I encountered
    some fun facts while reading papers today, and I think this is also a good chance for
    me to write down a batch of other relevant thoughts.

    TL;DR: well, I wrote too much again. I have thrown my mail at ChatGPT for a brief
    summary. The following are two versions; they are plain copies without my edits.


    <<< begin TLDR version 1 generated by chatgpt
    The email consists of various interesting points related to AI and licensing issues in software. The first point
    highlights how some state-of-the-art LLMs avoid using GPL code. The author discusses the potential audiences for a
    license revision or a brand new license. The second point mentions the importance of reproducibility of LLMs. The third
    point explains how AI's impact on the software ecosystem is increasing, which is inevitable, and it is challenging to
    enforce the declaration of AI usage. Finally, the author discusses some ongoing and future works on the Debian side
    related to AI and licensing issues.
    >>> end TLDR version 1 generated by chatgpt

    <<< begin TLDR version 2 generated by chatgpt
    The email discusses some fun facts and relevant thoughts related to licensing issues and the impact of AI in the
    software ecosystem. The author mentions that the licensing issue of trained deep neural networks and their outputs is
    complicated, and some AI software projects may avoid GPL code usage in their training data to prevent potential
    licensing issues. The author also discusses the increasing impact of AI in the software ecosystem, and its potential use
    in generating code, images, and texts. The email concludes by mentioning ongoing and future works on the Debian side.
    >>> end TLDR version 2 generated by chatgpt


    --[[ Fun Fact 1: GPL code usage may be avoided in state-of-the-art LLMs [2]


    LLaMA [3] is one of the state-of-the-art LLMs that you can download and deploy on a
    local machine. Its training data includes GitHub, but the authors only used software
    projects licensed under Apache-2.0, BSD, and MIT.

    The licensing of trained deep neural networks, as well as of their outputs
    (generated texts, generated code, generated images, etc.), is already a mess. That
    said, at least part of the research community surely knows the complicated
    implications of using GPL code for training. Otherwise they would not have to avoid
    such a large pile of high-quality code.

    People mentioned some potential licensing work in the previous related thread [1],
    but I don't see a clear and practical goal for the free software community to reach.
    There are two types of potential audiences for a license revision or a brand new license.

    (1) The first type is free software authors. If an author does not want their
    code to become part of the super AI that will destroy the world someday [4],
    a special license or a special clause could be used to forbid its use in
    AI training datasets. But isn't it funny that "training a neural network"
    would be excluded from software freedom?

    Meanwhile, excluding this code from the training datasets won't hurt
    the LLM trainers, because a large portion of differently licensed projects
    remains usable.

    (2) The second type of potential audience is the AI software upstreams. In my
    opinion, there is almost nothing for the free software communities to do here.
    If we write license terms that look funny to the AI software upstreams,
    they will simply not play with these licenses.


    --[[ Fun Fact 2: Reproducibility of LLMs


    The LLaMA paper [3] emphasizes that the training of these models only involves
    publicly available datasets (no proprietary hidden datasets, no undocumented datasets).
    Before the downstream software communities start to complain about reproducibility,
    the research community will have complained about the same thing far in advance.


    --[[ Recall 1: ML-Policy


    If I had to trim the ML-Policy down to one single sentence, it would be the definition
    of "toxic candy" -- a pre-trained neural network that is somehow (very likely
    incorrectly) licensed under an open source software license is still very likely problematic.

    This definition will become more and more useful as more software projects try to
    integrate neural networks for interesting applications. It works as a warning whenever
    you see a giant binary blob (sometimes the network can be small... only several
    megabytes or so) in the upstream source, regardless of its license.


    --[[ Fun Fact 3: AI's impact on the software ecosystem is increasing


    Even though the copyright and licensing status of neural networks is still a mess,
    the trend is unstoppable. If you keep an eye on the GitHub trending list, you will
    see the ratio of AI software climbing.
    Even if we hesitate to introduce AI software into our archive, the impact of AI
    will inevitably and gradually flow into our free archive:

    (1) a code snippet might be generated by AI, and then modified by the upstream
    author without declaring the participation of AI.
    (2) documentation texts might be generated by AI. With a state-of-the-art
    LLM, you can simply throw in an undocumented code snippet and let
    the model explain what the piece of code does.
    (3) pictures, icons, and SVGs might be generated by AI.
    (4) ...

    It is impossible to enforce a declaration of AI usage everywhere it applies. Even
    worse, detecting AI-generated results is largely a dead end --
    the goal of generative AI is exactly to produce indistinguishable results.
    As long as the AI is strong enough, detecting it will be nearly impossible. There
    are some papers about detection, but I will refrain from expanding on this excessively.


    --[[ Recall 2: SIMDebian


    This is a deprecated attempt to bump the ISA baseline in order to use modern
    CPU intrinsics. One of my motivations for proposing it was that neural network
    computation can be brutal; bumping the ISA baseline helps significantly if you
    run the network on a CPU.

    Just for reference. However, as long as the user has a GPU, running a neural
    network on the CPU is little more than a waste of time.


    --[[ Recall 3: Debian User Package Repository


    This is a deprecated attempt to create an ebuild-like, source-based distribution
    of .deb packages. One of the motivations for proposing it was that redistributing
    AI software with neural networks through the archive is problematic ... but it is
    OK if the neural network is downloaded locally by the user through a script, and
    the package is built locally by the end user. As for the problematic licensing
    issue... the software works anyway, and the components in question are not
    distributed by us.

    Just for reference. This is not important now. Surely there are too many non-standard ways to install software.


    --[[ Some ongoing and future works on the Debian side


    Debian has always provided a solid base system [5], upon which upper-layer
    application collections like PyPI, Anaconda, and Docker work very well. For many
    intricate reasons, such as the clearly limited volunteer bandwidth, the Debian
    archive is not suitable as an alternative to these ecosystems.
    I'll refrain from expanding on this to avoid going off topic; please ask
    if you want to read more about it.

    That said, we can still incorporate some of the most important software
    infrastructure into our archive, such as deep learning frameworks and the
    neural network acceleration libraries. The upper-layer applications are
    not discussed here.

    PyTorch is currently the most prevalent deep learning framework, and it is in
    good shape in our archive as well. A random trending AI project on GitHub
    will most likely be based on PyTorch nowadays. I have just uploaded the
    CUDA version of PyTorch to the NEW queue. While I can still handle
    this package on my own, its compilation and testing are brutal [7]. You are
    welcome to join me in its maintenance if you are interested...

    In my opinion, TensorFlow will gradually fade away in favor of JAX [6].
    I really don't suggest that anyone pursue TensorFlow packaging as of 2023.
    I have already orphaned the whole TensorFlow dependency tree under my name.

    (I acknowledge that I'm a PyTorch user and am biased against TensorFlow's
    obscure API and terrible documentation.)


    See below if you want to get involved.


    --[[ Team Advertisement


    Debian Deep Learning Team <debian-ai@lists.debian.org> welcomes
    new contributors. The mailing list is currently abused for general discussion
    and for two tracks of development work:

    1. https://salsa.debian.org/deeplearning-team
    Deep Learning frameworks

    2. https://salsa.debian.org/rocm-team
    ROCm is AMD's free software counterpart to Nvidia's proprietary CUDA.
    (I wouldn't bother to create this team if it were non-free)

    There could be an Intel/SYCL team in the future, but Intel is not yet ready
    to upstream their SYCL implementation into LLVM. I'll only try this by
    myself once PyTorch starts to support Intel/SYCL.

    Thanks for reading the long mail.
    Hope you find some interesting topics and inspirations here.


    [1] https://lists.debian.org/debian-project/2023/02/msg00017.html
    [2] LLM = Large Language Model, such as GPT-3, GPT-4, etc.
    [3] https://arxiv.org/pdf/2302.13971.pdf
    [4] Yes, please write as many bugs as possible in your code. Your bugs could
    be heroic if they choke a super AI trying to destroy the world. (I'm not serious.)
    [5] IIRC, one of the UNIX philosophy goes, "do one thing, and do it well".
    [6] https://github.com/google/jax
    [7] Debomatic-amd64 has a Xeon E5-2697v3 (IIRC). It takes ~3 hours
    for a full build plus checks of the CPU version of PyTorch without ccache.
    The CUDA version will only take longer.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)