On 12/30/23 15:06, Charles Plessy wrote:
> On Fri, Dec 29, 2023 at 01:14:29PM +0100, Steffen Möller wrote:
>> What hypotheses do we have on what influences the number of active
>> individuals?
> When I was a kid I was playing with a lot of pirated copies of Amiga
> and then PC games, and I had a bit of melancholy thinking that what
> appeared to be golden days took place when I was still busy learning
> to walk and speak. I wondered if I was born too late. Then I was
> introduced to Linux and Debian.
If you don't mind sharing more of your story -- how were you introduced
to Linux and Debian? Can we reproduce it?
For me this is not reproducible. The beginning of my story is similar to
yours, except that at the time Windows was the only PC operating system
I was aware of. And I suffered a lot from it and its ecosystem:
aggressive reboots, aggressive pop-up windows and ads completely out of
my control, enormous difficulty learning and understanding its internals
given a very limited budget for books, and enormous difficulty learning
the C programming language on it. Visual Studio did a great job of
confusing me with a huge amount of irrelevant detail and a complicated
user interface when I wanted to try the code from the K&R C book as a
newbie (without any educational resource available or affordable). I
forgot why I chose this book, but it was the correct one to buy.
One day, out of curiosity, I searched for "free of charge operating
systems" so that I could get rid of Windows. That led me to Ubuntu
11.10. Its frequent "internal errors" drove me to try other Linux
distros in VirtualBox, including Debian squeeze and Fedora. While
squeeze had the ugliest desktop environment of them all, it crashed
significantly less than the rest. I was happy with my choice. Linux does
not reboot unless I decide to do so. It does not pop up ads, because the
malware (however "useful") is not available under Linux. It does not
prevent me from trying to understand how it works, even if I can hardly
grasp the source code. And `gcc hello-world.c` is ridiculously easy for
learning programming compared to using Visual Studio.
I was confused again -- why is all of this free of charge? I tried to
learn more, until the Debian Social Contract, the DFSG and the material
written by the FSF (mostly Stallman) completely blew my mind. With the
source code within my reach, I am able to really tame my computer. The
day I realized that is the day I added "becoming a DD" to my dream list.
> That was a big thing, a big challenge for me to learn
> it, and a big reward to be part of it. At that time I never imagined
> that the next big thing was diversity, inclusion and justice, but
> being part of Debian unexpectedly connected me to it. Now when I look
> back I do not worry about being born too late. I would like to say to
> young people that joining a thriving community is the best way to
> journey beyond one's imagination.
Ideally yes, but people's minds are also affected by the economy.
In developing countries, where most people are still struggling to
survive and feed a family, unpaid volunteer work is respected most of
the time, but seldom well understood. One needs to build up a very
strong motivation before taking action to overcome the barrier of
societal bias.
That's partly one of the reasons why Chinese DDs are so scarce while
China has a very large population. In contrast, most DDs are from
developed countries.
I like the interpretation of how human society works from the book
"Sapiens: A Brief History of Humankind". Basically, what connects people
all over the world, forming this community, is a commonly believed
simple story -- we want to build a free and universal operating system.
(I'm sad to see this sentence removed from debian.org.) The common
belief is the ground on which we build trust and start collaboration.
So, essentially, renewing the community means spreading this simple
story to the young people who seek something that Debian/FOSS can
provide. I don't know how to achieve that. But I do know that my story
is completely unreproducible.
> Of course, we need to show how we are thriving. On my wishlist for
> 2024, there is of course AI.
In case people interested in this topic do not know, we have a dedicated
ML for that:
https://lists.debian.org/debian-ai/
The keyword GPT successfully toggled my "write-a-long-response" button.
Here we go.
> Can we have a DebGPT that will allow us to
> interact with our mailing list archives using natural language?
I have tried asking ChatGPT Debian-related questions. While ChatGPT is
very good at general Linux questions, it turns out that its training
data does not contain much Debian-specific knowledge. The quality of
training data really matters for an LLM's performance, especially the
amount of book-quality data. The Debian MLs are too noisy compared to a
Wikipedia dump and books.
While the training set of the proprietary ChatGPT is a secret, you can
have a peek at the Pile dataset frequently used by many "open-source"
LLMs. BTW, the formal definition of "open-source AI" is still a work in
progress at the OSI. I'll bring this back to Debian when the OSI makes
the draft public for comments sometime in 2024.
https://en.wikipedia.org/wiki/The_Pile_(dataset)
The dataset contains "Ubuntu Freenode IRC" logs, but no dump from Debian
servers.
Thus, technically, in order to build DebGPT, there are two
straightforward solutions (minimal sketches of both follow the list):
(1) Adopt an "open-source" LLM that supports a very large context
length, and directly feed the dump of Debian knowledge into its context.
This is known as the "in-context learning" capability of LLMs. It
enables a wide range of prompt-engineering methods without any further
training of the model. In case you are interested in this, you can read
OpenAI's InstructGPT paper as a start.
(2) Fine-tune an "open-source" LLM on the Debian knowledge dump with
LoRA. This greatly reduces the training-hardware requirement. According
to the LoRA paper, full fine-tuning of GPT-3 (175B) requires 1.2TB of
GPU memory in total (while the best consumer-grade GPU provides merely
24GB); LoRA reduces that to 350GB without losing model performance (in
terms of generation quality). That said, a 7B-parameter LLM is much
easier and cheaper to deal with using LoRA.
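
To make these concrete: below is a minimal sketch of option (1),
in-context learning, assuming the transformers pipeline as a local
runner. The model name, the dump file and the question are placeholders,
not settled choices:

```python
from transformers import pipeline

# Placeholder model: any "open-source" LLM with a long context window.
generator = pipeline("text-generation", model="some-long-context-llm")

debian_dump = open("debian-knowledge.txt").read()  # hypothetical dump
question = "How do I declare Multi-Arch: foreign for my package?"

# No training at all: the Debian knowledge rides along in the prompt.
prompt = ("Use the following Debian documentation to answer.\n\n"
          f"{debian_dump}\n\nQuestion: {question}\nAnswer:")
print(generator(prompt, max_new_tokens=200)[0]["generated_text"])
```

And a minimal sketch of option (2), assuming the peft library for LoRA;
again the model name and hyperparameters are placeholders:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("some-open-7b-llm")  # placeholder
# LoRA freezes the base weights and trains only small adapter matrices,
# which is where the memory savings in the LoRA paper come from.
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                    task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()  # a tiny fraction of the 7B weights
# ...then train on the Debian dump with any standard causal-LM trainer.
```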
All the prevalent "large" language models are decoder-only Transformers,
and their training objective is simply next-word prediction. So the
Debian mailing lists can be organized into tree structures of mail
nodes, and the training objective becomes predicting the next mail node,
following the next-word-prediction paradigm.
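
As an illustration, here is a minimal sketch of that linearization using
only the Python standard library. The mbox filename is a placeholder,
and it naively assumes single-part, plain-text mails:

```python
import mailbox
from collections import defaultdict

box = mailbox.mbox("debian-devel.mbox")  # placeholder dump file
by_id, children = {}, defaultdict(list)
for msg in box:
    by_id[msg["Message-ID"]] = msg
    children[msg["In-Reply-To"]].append(msg["Message-ID"])  # None = root

def chains(mid, prefix=()):
    """Yield every root-to-leaf chain of message IDs in a thread."""
    path = prefix + (mid,)
    if not children[mid]:
        yield path
    for child in children[mid]:
        yield from chains(child, path)

# Each chain becomes one training sequence: given the earlier mails,
# the model learns to predict the next mail node, token by token.
for root in children[None]:
    for path in chains(root):
        sequence = "\n\n".join(by_id[m].get_payload() for m in path)
```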
How can one download the Debian public mailing list dumps?
> Can
> that DebGPT produce code that we know derives from a training set that
> only includes works for which people really consented that their
> copyrights and licenses will be dissolved?
That is a tough, cutting-edge research issue. But first, let's wait and
see the result of the New York Times lawsuit against OpenAI+Microsoft:
https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html
The result of this lawsuit might be a milestone in this area. It will
definitely impact future lawsuits on LLMs + copyrighted code usage.
> Can it be the single
> point of entry for our whole infrastructure? I wish I could say
> "DebGPT, please accept all these loongarch64 patches and upload the
> packages now", or "DebGPT, update debian/copyright now and show me the
> diff".
The training data would be a Salsa dump. What you described is actually
doable. For each git commit, the first part of the prompt is the files
before modification, the user instruction is the git commit message, and
the expected prediction result is the git diff.
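
A minimal sketch of that extraction, using plain git commands; the
repository path and the JSON record layout are hypothetical choices:

```python
import json
import subprocess

def git(*args, repo="."):
    """Run a git command in `repo` and return its stdout as text."""
    return subprocess.run(["git", "-C", repo, *args], capture_output=True,
                          text=True, check=True).stdout

def commit_triples(repo="."):
    """Yield one (context, instruction, target) record per non-merge commit."""
    for commit in git("rev-list", "--no-merges", "HEAD", repo=repo).splitlines():
        changed = git("diff-tree", "--no-commit-id", "--name-only", "-r",
                      commit, repo=repo).splitlines()
        before = {}
        for path in changed:
            try:   # the file may not exist in the parent (newly added)
                before[path] = git("show", f"{commit}^:{path}", repo=repo)
            except subprocess.CalledProcessError:
                before[path] = ""
        yield {"context": before,                  # files before the change
               "instruction": git("show", "-s", "--format=%B", commit, repo=repo),
               "target": git("show", "--format=", commit, repo=repo)}  # the diff

for triple in commit_triples():
    print(json.dumps(triple))  # one JSON record per commit
```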
The Debian Deep Learning team (debian-ai@l.d.o) has some AMD GPUs in its
unofficial infrastructure. It will not be long before we can really do
something: AMD GPUs with ROCm (open source) allow us to train neural
networks at a decent speed without the proprietary CUDA. The team is
still working on packaging the missing dependencies for the ROCm variant
of PyTorch. The CPU variant (python3-torch) and the CUDA variant
(python3-torch-cuda) are already in unstable; python3-torch-rocm is on
my todo list.
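
As a side note, the ROCm build exposes AMD GPUs through the same
torch.cuda API as the CUDA build, so existing training code stays
portable. A quick check that runs on any of the three variants:

```python
import torch

# ROCm builds of PyTorch drive AMD GPUs through the torch.cuda API,
# so the same training code runs on CUDA and ROCm alike.
print("ROCm/HIP build:", torch.version.hip)  # None on CPU/CUDA builds
device = "cuda" if torch.cuda.is_available() else "cpu"
print(torch.randn(2, 3, device=device).device)
```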
PyTorch is already the most widely used training tool. Please forget
TensorFlow in this respect. JAX is replacing TensorFlow, but PyTorch's
user base is still overwhelmingly larger.
> I am not
> able to develop DebGPT and confess I am not investing my time in
> learning to do it. But can we attract the people who want to tinker in
> this direction?
Debian funds should be able to cover the hardware requirements and
training expenses even if they are slightly expensive. The more
expensive thing is the time of domain experts. I could train such a
model, but I clearly do not have the bandwidth for it.
Please help the Debian community to spread its common belief to more
domain experts.
> Not because we are the best AI team, but because we are
> one of the hearts of software freedom, and that freedom is deeply
> connected to everybody's futures.
Academia is working hard on making large generative models (not limited
to text generation) easier to customize. I'm optimistic about the
future.
> Well, it is too late for invoking Santa Claus, but this said, best
> wishes for 2024!
Best wishes for 2024!