I'm interested in package popularity. I'm aware of popcon (https://popcon.debian.org/), but I'm more interested in actual
downloads.
I am also interested in usage statistics. I feel it is much more
meaningful to work on packages that I know have a lot of users.
While neither popcon nor download stats are accurate, they still show
trends and relative numbers which can be used to make useful
conclusions. I would be glad to see if people could share ideas on
what stats we could collect and publish instead of just pointing out
flaws in various stats.
+1
I suspect that compliance with GDPR would require the data to be
stored minimally.
It seems reasonable to me that a 24-hour window would filter out most repeat downloads.
If you stream the request log and reduce it to (ip, package, version), it
will be minimal.
I think it would fit into memory, e.g. 10 million unique IP addresses
x 100 packages x 40 bytes = 40 GB
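As a rough illustration of the proposal above, here is a minimal Python sketch of reducing a request stream to (ip, package, version) and skipping repeats within a 24-hour window, plus the memory arithmetic. The record shape, the function names and the sample figures are assumptions made for illustration only; nothing here reflects how the Debian mirror logs are actually structured or processed.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)  # assumed dedup window, as suggested in the message


def count_unique_downloads(records, window=WINDOW):
    """Count downloads per package, ignoring repeats of the same
    (ip, package, version) that fall inside `window`.

    `records` is an iterable of (timestamp, ip, package, version) tuples
    in chronological order. This record shape is hypothetical.
    """
    last_counted = {}  # (ip, package, version) -> timestamp of last counted download
    counts = {}        # package -> number of counted downloads
    for ts, ip, package, version in records:
        key = (ip, package, version)
        previous = last_counted.get(key)
        if previous is not None and ts - previous < window:
            continue  # repeat download inside the window: not counted
        last_counted[key] = ts
        counts[package] = counts.get(package, 0) + 1
    return counts


def estimated_memory_bytes(unique_ips, packages_per_ip, bytes_per_entry):
    """Back-of-the-envelope size of the dedup table."""
    return unique_ips * packages_per_ip * bytes_per_entry


# The figures quoted in the message: 10 million IPs x 100 packages x 40 bytes.
print(estimated_memory_bytes(10_000_000, 100, 40) / 1e9, "GB")
```

Note that this sketch never evicts old keys, so a real processor would also need to drop entries older than the window, and the per-entry size ignores any overhead of the data structure itself.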
I presume to do some misguided popularity ranking like PyPI does, by counting the
number of downloads.
It works terribly because large organizations that actually download it many times will set up internal mirrors, so there is no chance for the value to have any meaning.
Also, on PyPI and similar sites there's an incentive to just download the files many times to increase the popularity (I provide a very nice tool to do that without consuming too much bandwidth, on my Codeberg).
Plus of course, how would we even aggregate all the download counts from all the mirrors?
Best
--
Salvo Tomaselli
"I do not feel obliged to believe that the same God who has endowed us with sense, reason and intellect has intended us to forgo their use."
-- Galileo Galilei
https://ltworf.codeberg.page/
~ 40 B seems fair? But you could also just write to disk. It'll wear out an SSD though,
Where has 100 packages come from here? That would be the average number of downloaded packages per IP per
On Sat, 2025-05-03 at 11:16 +0200, Erik Schulz wrote:
I suspect that compliance with GDPR would require the data to be
stored minimally.
It seems reasonable to me that a 24-hour window would reduce most repeat-downloads.
If you stream the request log and reduce to (ip,package,version), it
will be minimal.
I think it would fit into memory, e.g. 10 million unique IP addresses
x 100 packages x 40 bytes = 40 GB
Where has 100 packages come from here? There are 34 *thousand* source packages in bookworm, i.e. over 100 times your quoted estimate.
You also seem to have underestimated quite a bit if you believe that
you can fit an IPv6 address, a package name and a package version into
40 bytes in most cases, let alone all.
(As an aside, the RAM allocation on the logging hosts is currently
2GB.)
Regards,
Adam
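To make the size objection above concrete, here is a small Python sketch that adds up the raw bytes of one (ip, package, version) entry. The function name and the sample package and version strings are hypothetical and chosen only to show the arithmetic; the result is a lower bound, since it ignores any container or allocator overhead.

```python
import ipaddress


def entry_size_bytes(ip, package, version):
    """Raw payload size of one (ip, package, version) entry, in bytes."""
    # Packed form: 4 bytes for IPv4, 16 bytes for IPv6.
    packed_ip = ipaddress.ip_address(ip).packed
    return (
        len(packed_ip)
        + len(package.encode("utf-8"))
        + len(version.encode("utf-8"))
    )


# Hypothetical example strings, not taken from any real log.
print(entry_size_bytes("2001:db8::1", "libreoffice-l10n-en-gb", "4:7.4.7-1+deb12u8"))
print(entry_size_bytes("192.0.2.1", "bash", "5.2"))
```

With an IPv6 source address, 16 of the 40 bytes are already used before any package name or version string is stored.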
On 2025-05-03 03:35, Otto Kekäläinen wrote:
I'm interested in package popularity. I'm aware of popcon
(https://popcon.debian.org/), but I'm more interested in actual
downloads.
I am also interested in usage statistics. I feel it is much more
meaningful to work on packages that I know have a lot of users.
While neither popcon nor download stats are accurate, they still show
trends and relative numbers which can be used to make useful
conclusions. I would be glad to see if people could share ideas on
what stats we could collect and publish instead of just pointing out
flaws in various stats.
The problem is that we currently do not want to retain this data. It'd require a clear measure of usefulness, not just a "it would be nice if
we had it". And there would need to be actual criteria of what we would
be interested in. Raw download count? Some measure of bucketing by
source IP or not? What about container/hermetic builders fetching the
same ancient package over and over again from snapshot? Does the version matter?
In the end there would probably need to be a proof of concept of a log processor that's privacy-friendly and gives us the metrics that we
actually want. Hence my question what these metrics are for, except for
a fuzzy feeling of "working on the right priorities". There will be lots
of packages that are rarely downloaded and still important.
Everyone can ask "please just retain all logs and we will do analysis on
them later". Right now it'd be infeasible to get the statistics from the mirrors, and we could at most get statistics for deb.d.o. To give a
sense of scale: We are sampling 1% of cache hits and all errors right
now. That's 6.7 GB/d uncompressed (500 M/d compressed). Back of the
envelope math says that'd be 600 GB/d of raw syslog log traffic. We
should have a very good reason for collecting this much data.
Kind regards
Philipp Kern
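The scale estimate in the message above can be reproduced roughly as follows. This is only a sketch of the extrapolation: the thread gives 6.7 GB/d for a 1% sample of cache hits plus all errors, but not how that volume splits between the two, so `error_gb_per_day` is a hypothetical parameter and the 0.7 GB/d value below is a guess used purely for illustration.

```python
def estimate_full_log_volume(sampled_gb_per_day, hit_sample_rate, error_gb_per_day=0.0):
    """Extrapolate the unsampled log volume in GB per day.

    Errors are assumed to be logged in full already, so only the
    sampled cache-hit portion is scaled up by 1 / hit_sample_rate.
    """
    sampled_hits = sampled_gb_per_day - error_gb_per_day
    return sampled_hits / hit_sample_rate + error_gb_per_day


# Upper bound: treat the whole 6.7 GB/d as sampled cache hits.
print(estimate_full_log_volume(6.7, 0.01))

# Hypothetical split: 0.7 GB/d of the total is unsampled error logging.
print(estimate_full_log_volume(6.7, 0.01, error_gb_per_day=0.7))
```

Under these assumptions the result lands between roughly 600 and 670 GB/d, which is in line with the back-of-the-envelope figure quoted in the message.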