• Re: Community renewal and project obsolescence

    From Steffen Möller@21:1/5 to All on Fri Dec 29 13:40:01 2023
    Sent: Thursday, 28 December 2023 at 20:02
    From: "Mo Zhou" <lumin@debian.org>
    To: debian-project@lists.debian.org
    Subject: Re: Community renewal and project obsolescence

    On 12/28/23 10:34, Rafael Laboissière wrote:

    * M. Zhou <lumin@debian.org> [2023-12-27 19:00]:

    Thanks for the code and the figure. Indeed, the trend is confirmed by
    fitting a linear model count ~ year to the new members list. The
    coefficient is -1.39 member/year, which is significantly different
    from zero (F[1,22] = 11.8, p < 0.01). Even when we take out the data
    from year 2001, which could be interpreted as an outlier, the trend is
    still significant, with a drop of 0.98 member/year (F[1,21] = 8.48,
    p < 0.01).

    I thought about using some models from population statistics, so we
    could get data about the DD birth rate and the DD retire/leave rate,
    as well as a prediction. But since the descendants of DDs are not
    naturally new DDs, the typical population models are not likely to
    work well. The birth of a DD is more like a mutation, sort of.

    Anyway, we do not need sophisticated math models to draw the conclusion
    that Debian is an aging community. And yet, we don't seem to have a
    good way to reshape the curve using Debian's funds -- this is one of
    the key problems behind the data.

    What hypotheses do we have on what influences the number of active individuals?

    Positive factors
    * Location of DebConf (with many or not so many devs affording to attend)
    * Popular platforms like the Raspberry Pi working with a Debian derivative
    * Debian packaging teams on salsa
    * Self-education
    * Impression the DD status makes on outsiders/your next employer
    * Pleasant interactions on mailing lists with current or past team members
    * Team building with other DDs on projects of interest

    Negative factors
    * Advent of homebrew+conda
    * Containers
    * Increasing workloads as one ages and does not give packages up
    * Work-life balance
    * Migrating to upstream
    * Delay between what upstream releases and what is available in our distro
    * Unpleasant interactions on mailing lists with current or past team members

    Do you have a better list?
    I keep thinking about what the last significant change in Debian may have been -- what comes to mind is salsa.debian.org. Am I missing anything?
    And I think the change I would like to see the most is a variant of brew/salsa for Debian, preferably mostly automated, so that we always have a way to install the very latest with Debian.

    Best,
    Steffen

  • From Charles Plessy@21:1/5 to All on Sat Dec 30 21:30:01 2023
    On Fri, Dec 29, 2023 at 01:14:29PM +0100, Steffen Möller wrote:

    What hypotheses do we have on what influences the number of active individuals?

    When I was a kid I played with a lot of pirated copies of Amiga and
    then PC games, and I felt a bit of melancholy thinking that what
    appeared to be the golden days took place when I was still busy
    learning to walk and speak. I wondered if I had been born too late.
    Then I was introduced to Linux and Debian. That was a big thing, a big
    challenge for me to learn it, and a big reward to be part of it. At
    that time I never imagined that the next big thing would be diversity,
    inclusion and justice, but being part of Debian unexpectedly connected
    me to it. Now when I look back I do not worry about being born too
    late. I would like to say to young people that joining a thriving
    community is the best way to journey beyond one's imagination.

    Of course, we need to show how we are thriving. On my wishlist for
    2024, there is of course AI. Can we have a DebGPT that will allow us
    to interact with our mailing list archives using natural language?
    Can that DebGPT produce code that we know derives from a training set
    that only includes works for which people really consented that their
    copyrights and licenses will be dissolved? Can it be the single entry
    point for our whole infrastructure? I wish I could say "DebGPT, please
    accept all these loongarch64 patches and upload the packages now", or
    "DebGPT, update debian/copyright now and show me the diff". I am not
    able to develop DebGPT and confess I am not investing my time in
    learning to do it. But can we attract the people who want to tinker in
    this direction? Not because we are the best AI team, but because we
    are one of the hearts of software freedom, and that freedom is deeply
    connected to everybody's futures.

    Well, it is too late to invoke Santa Claus, but this said, best
    wishes for 2024!

    Charles

    --
    Charles Plessy
    Nagahama, Yomitan, Okinawa, Japan
    Debian Med packaging team   http://www.debian.org/devel/debian-med
    Tooting from work, https://fediscience.org/@charles_plessy
    Tooting from home, https://framapiaf.org/@charles_plessy

  • From Mo Zhou@21:1/5 to Charles Plessy on Sun Dec 31 04:00:01 2023
    On 12/30/23 15:06, Charles Plessy wrote:

    On Fri, Dec 29, 2023 at 01:14:29PM +0100, Steffen Möller wrote:
    What hypotheses do we have on what influences the number of active individuals?
    When I was a kid I played with a lot of pirated copies of Amiga and
    then PC games, and I felt a bit of melancholy thinking that what
    appeared to be the golden days took place when I was still busy
    learning to walk and speak. I wondered if I had been born too late.
    Then I was introduced to Linux and Debian.

    If you don't mind sharing more of your story -- how were you introduced
    to Linux and Debian? Can we reproduce it?

    For me this is not reproducible. The beginning of my story is similar
    to yours. The difference is that, at that time, Windows was the only
    PC operating system I was aware of. And I suffered a lot from it and
    its ecosystem: aggressive reboots, aggressive pop-up windows and ads
    completely out of my control, enormous difficulty learning and
    understanding its internals given a very limited budget for books,
    enormous difficulty learning the C programming language on it. Visual
    Studio did a great job of confusing me with a huge amount of irrelevant
    details and a complicated user interface when I wanted to try the code
    from the K&R C book as a newbie (without any educational resource
    available or affordable). I forgot why I chose this book, but it was
    the correct one to buy.

    One day, out of curiosity, I searched for "free of charge operating
    systems" so that I could get rid of Windows. Then I got Ubuntu 11.10.
    Its frequent "internal errors" drove me to try other Linux distros in
    VirtualBox, including Debian squeeze and Fedora. While squeeze was the
    ugliest of them all in terms of desktop environment, it crashed
    significantly less than the rest. I was happy with my choice. Linux
    does not reboot unless I decide it should. It does not pop up ads,
    because the malware (however useful) is not available under Linux. It
    does not prevent me from trying to understand how it works, even if I
    can hardly grasp the source code. And `gcc hello-world.c` is
    ridiculously easy for learning programming compared to using Visual
    Studio.

    I was confused again -- why is all of this free of charge? I tried to
    learn more, until the Debian Social Contract, the DFSG and the material
    written by the FSF (mostly Stallman) completely blew my mind. With the
    source code within my reach, I am able to really tame my computer. The
    day I realized that is the day I added "becoming a DD" to my dream
    list.

    That was a big thing, a big challenge for me to learn
    it, and a big reward to be part of it. At that time I never imagined
    that the next big thing would be diversity, inclusion and justice, but
    being part of Debian unexpectedly connected me to it. Now when I look
    back I do not worry about being born too late. I would like to say to
    young people that joining a thriving community is the best way to
    journey beyond one's imagination.

    Ideally yes, but people's minds are also shaped by the economy.

    In developing countries, where most people are still struggling to
    survive and feed a family, unpaid volunteer work is respected most of
    the time, but seldom well understood. One needs to build up a very
    strong motivation before taking action to overcome the barrier of
    societal bias.

    That is partly one of the reasons why Chinese DDs are so scarce even
    though China has a very large population. In contrast, most DDs are
    from developed countries.

    I like the interpretation of how human society works in the book
    "Sapiens: A Brief History of Humankind". Basically, what connects
    people all over the world and forms this community is a commonly
    believed simple story -- we want to build a free and universal
    operating system. (I'm sad to see this sentence removed from
    debian.org.) That common belief is the ground on which we build trust
    and start collaboration.

    So, essentially, renewing the community means spreading that simple
    story to the young people who seek something that Debian/FOSS can
    provide. I don't know how to achieve it. But I do know that my story
    is completely unreproducible.

    Of course, we need to show how we are thriving. On my wishlist for
    2024, there is of course AI.

    In case people interested in this topic do not know: we have a
    dedicated mailing list for that:

    https://lists.debian.org/debian-ai/


    The key word GPT successfully toggled my "write-a-long-response" button.
    Here we go.

    Can we have a DebGPT that will allow us to
    interact with our mailing list archives using natural language?

    I have tried asking ChatGPT Debian-related questions. While ChatGPT is
    very good at general Linux questions, it turns out that its training
    data does not contain much Debian-specific knowledge. The quality of
    the training data really matters for an LLM's performance, especially
    the amount of book-quality data. The Debian mailing lists are too
    noisy compared to Wikipedia dumps and books.

    While the training set of the proprietary ChatGPT is a secret, you can
    have a peek at the Pile dataset frequently used by many "open-source"
    LLMs. BTW, the formal definition of "open-source AI" is still a work
    in progress at the OSI. I'll bring this back to Debian when the OSI
    makes the draft public for comments, some time in 2024.

    https://en.wikipedia.org/wiki/The_Pile_(dataset)

    The dataset contains "Ubuntu Freenode IRC" logs, but not any dump from
    Debian servers.

    Thus, technically, there are two straightforward ways to build DebGPT:

    (1) Adopt an "open-source" LLM that supports a very large context
    length, and directly feed the dump of Debian knowledge into its
    context. This is known as the "in-context learning" capability of
    LLMs. It enables a wide range of prompt-engineering methods without
    any further training of the model. If you are interested in this, you
    can read OpenAI's InstructGPT paper as a start.
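    To make (1) concrete, here is a minimal, hypothetical sketch of
    in-context learning with the HuggingFace transformers pipeline; the
    model name and the local dump file are placeholders:

      # Hypothetical sketch of approach (1): prepend a Debian knowledge dump to
      # the user's question and let a long-context chat LLM answer from it.
      # "some-open-llm" and "debian-knowledge.txt" are placeholders.
      from transformers import pipeline

      generator = pipeline("text-generation", model="some-open-llm")

      with open("debian-knowledge.txt") as f:   # e.g. an excerpt of Policy or a list archive
          context = f.read()

      question = "Which section of Debian Policy covers maintainer scripts?"
      prompt = f"{context}\n\nQuestion: {question}\nAnswer:"

      # No weights are updated; the Debian knowledge lives entirely in the prompt.
      print(generator(prompt, max_new_tokens=256)[0]["generated_text"])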

    (2) Fine-tune an "open-source" LLM on the Debian knowledge dump with
    LoRA. This greatly reduces the hardware needed for training. According
    to the LoRA paper, training or full fine-tuning of GPT-3 (175B)
    requires 1.2TB of GPU memory in total (while the best consumer-grade
    GPU provides merely 24GB); LoRA reduces that to 350GB without losing
    model performance (in terms of generation quality). That said, a
    7B-parameter LLM is much easier and cheaper to deal with using LoRA.
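    As a rough illustration of (2), this is roughly what attaching LoRA
    adapters to an open 7B model looks like with the HuggingFace peft
    library; the model name is a placeholder, and the data loading and
    training loop are omitted:

      # Hypothetical sketch of approach (2): wrap a 7B decoder-only model with
      # LoRA adapters so that only the small low-rank matrices are trained.
      from transformers import AutoModelForCausalLM
      from peft import LoraConfig, get_peft_model

      base = AutoModelForCausalLM.from_pretrained("some-open-7b-llm")  # placeholder

      lora = LoraConfig(
          r=8,                                   # rank of the low-rank update matrices
          lora_alpha=16,
          target_modules=["q_proj", "v_proj"],   # attention projections; depends on the architecture
          lora_dropout=0.05,
          task_type="CAUSAL_LM",
      )
      model = get_peft_model(base, lora)
      model.print_trainable_parameters()         # typically well under 1% of the base model
      # ... fine-tune `model` on the Debian knowledge dump with a normal training loop ...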

    All the prevalent "large" language models are decoder-only
    Transformers, and the training objective is simply next-word
    prediction. So the Debian mailing lists can be organized into a tree
    structure of mail nodes, and the training objective becomes predicting
    the next mail node, following the next-word-prediction paradigm.
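    As a small sketch of that idea, assuming a raw mbox dump were available
    locally (which, as discussed below, is not currently the case for the
    public lists): group messages by their In-Reply-To header and flatten
    each thread into one training document.

      # Hedged sketch: build reply trees from an mbox dump and flatten each
      # thread into a single document for next-token (next-mail-node) prediction.
      # "debian-project.mbox" is a hypothetical local dump.
      import mailbox
      from collections import defaultdict

      box = mailbox.mbox("debian-project.mbox")
      by_id = {}
      children = defaultdict(list)
      for msg in box:
          mid = msg["Message-ID"]
          by_id[mid] = msg
          children[msg["In-Reply-To"]].append(mid)   # replies grouped under their parent

      def flatten(mid, depth=0):
          """Depth-first walk: a mail node followed by its replies, in thread order."""
          body = by_id[mid].get_payload(decode=True) or b""
          yield ">" * depth + " " + body.decode(errors="replace")
          for child in children.get(mid, []):
              yield from flatten(child, depth + 1)

      roots = [m for m in by_id if by_id[m]["In-Reply-To"] not in by_id]
      threads = ["\n".join(flatten(root)) for root in roots]   # one document per thread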

    How can one download the Debian public mailing list dumps?

    Can
    that DebGPT produce code that we know derives from a training set that
    only includes works for which people really consented that their
    copyrights and licenses will be dissolved?

    A tough cutting-edge research issue. But first, let's wait and see the
    outcome of the lawsuit of The New York Times against OpenAI+Microsoft:

    https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html

    The outcome of this lawsuit might be a milestone in this area. It will definitely impact future lawsuits on LLMs and copyrighted code usage.

    Can it be the single entry
    point for our whole infrastructure? I wish I could say "DebGPT, please accept all these loongarch64 patches and upload the packages now", or "DebGPT, update debian/copyright now and show me the diff".

    The training data would be the Salsa dump. What you describe is actually doable.

    For each git commit, the first part of the prompt is the files before modification. The user instruction is the git commit message. The
    expected prediction result is the git diff.
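    For instance, a hedged sketch of extracting such (files-before,
    commit-message, diff) triples from one cloned Salsa repository with
    plain git; the sample format is just an assumption:

      # Hypothetical sketch: turn a repository's history into training triples
      # of (files before the change, commit message, resulting diff).
      import subprocess

      def git(*args, repo="."):
          return subprocess.run(["git", "-C", repo, *args],
                                capture_output=True, text=True, check=True).stdout

      commits = git("rev-list", "--reverse", "HEAD").split()
      samples = []
      for commit in commits[1:]:                                  # the root commit has no parent
          message = git("log", "-1", "--format=%B", commit)
          diff = git("diff", f"{commit}^", commit)
          before = ""
          for path in git("diff", "--name-only", f"{commit}^", commit).split():
              try:
                  before += git("show", f"{commit}^:{path}")      # file content before the change
              except subprocess.CalledProcessError:
                  pass                                            # file was added in this commit
          samples.append({"prompt": before + "\n" + message, "completion": diff})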

    The Debian Deep Learning team (debian-ai@l.d.o) has some AMD GPUs in
    its unofficial infrastructure. We are not far from being able to really
    do something. AMD GPUs with ROCm (open source) allow us to train neural
    networks at decent speed without the proprietary CUDA. The team is
    still working on packaging the missing dependencies for the ROCm
    variant of PyTorch. The CPU variant (python3-torch) and the CUDA
    variant (python3-torch-cuda) are already in unstable. python3-torch-rocm
    is on my todo list.

    PyTorch is already the most widely used training tool. Please forget
    about TensorFlow in this respect. JAX is replacing TensorFlow, but
    PyTorch's user base is still overwhelmingly larger.

    I am not
    able to develop DebGPT and confess I am not investing my time in
    learning to do it. But can we attract the people who want to tinker in
    this direction?

    Debian funds should be able to cover the hardware requirements and
    training expenses even if they are somewhat expensive. The more
    expensive thing is the time of domain experts. I could train such a
    model, but I clearly do not have the bandwidth for that.

    Please help the Debian community to spread its common belief to more
    domain experts.

    Not because we are the best AI team, but because we are
    one of the hearts of software freedom, and that freedom is deeply
    connected to everybody's futures.
    Academia is working hard on making large generative models (not
    limited to text generation) easier to customize. I'm optimistic about
    the future.
    Well, it is too late to invoke Santa Claus, but this said, best
    wishes for 2024!
    Best wishes for 2024!

  • From Mo Zhou@21:1/5 to Mo Zhou on Sun Dec 31 05:30:01 2023
    On 12/30/23 21:40, Mo Zhou wrote:

    I am not
    able to develop DebGPT and confess I am not investing my time in
    learning to do it. But can we attract the people who want to tinker in
    this direction?

    Debian funds should be able to cover the hardware requirements and
    training expenses even if they are somewhat expensive. The more
    expensive thing is the time of domain experts. I could train such a
    model, but I clearly do not have the bandwidth for that.

    No. I changed my mind.

    I can actually quickly wrap some Debian-specific prompts around an
    existing chat LLM. This is easy and does not need expensive hardware
    (although it may still require 1-2 GPUs with 24GB of memory for
    inference), nor any training procedure.
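    A minimal sketch of that wrapping, assuming any locally runnable open
    chat model (the model name is a placeholder, and this is not
    necessarily how the debgpt repository is implemented):

      # Hedged sketch of "prompt wrapping": stuff a debhelper man page into the
      # prompt of an existing chat LLM and ask a Debian-specific question.
      # "some-open-chat-llm" is a placeholder model name.
      import subprocess
      from transformers import pipeline

      chat = pipeline("text-generation", model="some-open-chat-llm")

      manpage = subprocess.run(["man", "dh_install"],
                               capture_output=True, text=True).stdout
      question = "How do I list extra files to be installed into my binary package?"

      prompt = (
          "You are DebGPT, an assistant for Debian packaging.\n\n"
          f"Reference material (man dh_install):\n{manpage}\n\n"
          f"User question: {question}\nAnswer:"
      )
      print(chat(prompt, max_new_tokens=200)[0]["generated_text"])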

    The project repo has been created here: https://salsa.debian.org/deeplearning-team/debgpt

    I have enabled issues, and maybe people interested in this can redirect
    the detailed discussions to the repo issues.

    I'm sure it is already possible to let an LLM read the long Policy
    document, or the debhelper man pages, for us and provide some
    suggestions or patches. The things I'm uncertain about are (1) how well
    a smaller LLM, like the 7B or 13B ones, can do compared to proprietary
    LLMs in this case; and (2) how well a smaller LLM performs when it is
    quantized to int8 or even int4 for laptops.

    Oh, BTW, the dependencies needed by the project are not yet complete in
    the Debian archive.

  • From Jeremy Stanley@21:1/5 to Mo Zhou on Sun Dec 31 19:10:01 2023
    On 2023-12-30 21:40:03 -0500 (-0500), Mo Zhou wrote:
    [...]
    How can one download the Debian public mailing list dumps?
    [...]

    I think you'd have to scrape the HTML (MHonArc) archives. The last
    update I remember is that the listmasters are intentionally not
    providing raw archives, though perhaps that 15-year-old decision
    could be revisited if there are new compelling reasons:

    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=161440#39
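    In case someone wants to try, a hedged sketch of what scraping one
    month of a list's MHonArc archive could look like; the URL pattern is
    what lists.debian.org appears to serve today, but treat it as an
    assumption and be gentle with the server:

      # Hedged sketch: collect the per-message pages of one month of a list's
      # MHonArc archive; a real scraper should also strip the HTML and headers.
      import re
      import time
      import urllib.request

      base = "https://lists.debian.org/debian-project/2023/12/"
      index = urllib.request.urlopen(base + "maillist.html").read().decode("utf-8", "replace")

      for name in sorted(set(re.findall(r"msg\d+\.html", index))):
          page = urllib.request.urlopen(base + name).read().decode("utf-8", "replace")
          print(name, len(page))          # placeholder for actual parsing/cleaning
          time.sleep(1)                   # be polite to the archive server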

    Alternatively, I suppose a DD with access to the raw archive data on
    the server could (perhaps after some discussion with the
    listmasters) perform LLM training on those, but would probably need
    to sanitize it and weed out the spam when doing so.
    --
    Jeremy Stanley

