Forum: >>> Magnum BBS <<<

Re: Package statistics by downloads

From Philipp Kern@21:1/5 to Erik Schulz on Wed Apr 23 11:20:01 2025

On 2025-04-23 10:08, Erik Schulz wrote:

I'm interested in package popularity. I'm aware of popcon (https://popcon.debian.org/), but I'm more interested in actual
downloads.

What would this be useful for? You only described technical details, not
why we would want to do this.

Kind regards
Philipp Kern

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Salvo Tomaselli@21:1/5 to All on Fri May 2 01:30:01 2025

I presume do some misguided popularity ranking like pypi does, by counting the number of downloads.

It works terribly because large organizations that actually download it many times will set up internal mirrors, so there is no chance for the value to
have any meaning.

Also on pypi and similar there's an incentive to just download the files many times to increase the popularity (I provide a very nice tool to do that
without consuming too much bandwidth, on my codeberg).

Plus of course, how would we even aggregate all the download counts from all the mirrors?

Best

--
Salvo Tomaselli

"Io non mi sento obbligato a credere che lo stesso Dio che ci ha dotato di senso, ragione ed intelletto intendesse che noi ne facessimo a meno."
-- Galileo Galilei

https://ltworf.codeberg.page/

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From =?UTF-8?B?T3R0byBLZWvDpGzDpGluZW4=?@21:1/5 to All on Sat May 3 03:40:01 2025

I'm interested in package popularity. I'm aware of popcon (https://popcon.debian.org/), but I'm more interested in actual
downloads.

I am also interested in usage statistics. I feel it is much more
meaningful to work on packages that I know how have a lot of users.

While neither popcon of download stats are accurate, they still show
trends and relative numbers which can be used to make useful
conclusions. I would be glad to see if people could share ideas on
what stats we could collect and publish instead of just pointing out
flaws in various stats.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Philipp Kern@21:1/5 to All on Sat May 3 09:50:02 2025

On 2025-05-03 03:35, Otto Kekäläinen wrote:

I'm interested in package popularity. I'm aware of popcon
(https://popcon.debian.org/), but I'm more interested in actual
downloads.

I am also interested in usage statistics. I feel it is much more
meaningful to work on packages that I know how have a lot of users.

While neither popcon of download stats are accurate, they still show
trends and relative numbers which can be used to make useful
conclusions. I would be glad to see if people could share ideas on
what stats we could collect and publish instead of just pointing out
flaws in various stats.

The problem is that we currently do not want to retain this data. It'd
require a clear measure of usefulness, not just a "it would be nice if
we had it". And there would need to be actual criteria of what we would
be interested in. Raw download count? Some measure of bucketing by
source IP or not? What about container/hermetic builders fetching the
same ancient package over and over again from snapshot? Does the version matter?

In the end there would probably need to be a proof of concept of a log processor that's privacy-friendly and gives us the metrics that we
actually want. Hence my question what these metrics are for, except for
a fuzzy feeling of "working on the right priorities". There will be lots
of packages that are rarely downloaded and still important.

Everyone can ask "please just retain all logs and we will do analysis on
them later". Right now it'd be infeasible to get the statistics from the mirrors, and we could at most get statistics for deb.d.o. To give a
sense of scale: We are sampling 1% of cache hits and all errors right
now. That's 6.7 GB/d uncompressed (500 M/d compressed). Back of the
envelope math says that'd be 600 GB/d of raw syslog log traffic. We
should have a very good reason for collecting this much data.

Kind regards
Philipp Kern

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Peter B@21:1/5 to All on Sat May 3 11:30:02 2025

On 03/05/2025 02:35, Otto Kekäläinen wrote:

I am also interested in usage statistics. I feel it is much more
meaningful to work on packages that I know how have a lot of users.

+1

While neither popcon of download stats are accurate, they still show
trends and relative numbers which can be used to make useful
conclusions. I would be glad to see if people could share ideas on
what stats we could collect and publish instead of just pointing out
flaws in various stats.

i was disappointed when Ubuntu dropped publishing popcon data.

My understanding is that popcon is set up to report data to an address
that is distro dependent. Do any of our downstreams actually harvest
this info?

Maybe instead the downstream data could come to Debian with the distro
as an attribute? Without factoring in the downstream data, desktop package usage overall is likely to be understated.

Cheers,
Peter

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Adam D. Barratt@21:1/5 to Erik Schulz on Sat May 3 11:40:01 2025

On Sat, 2025-05-03 at 11:16 +0200, Erik Schulz wrote:

I suspect that compliance with GDPR would require the data to be
stored minimally.
It seems reasonable to me that a 24-hour window would reduce most repeat-downloads.
If you stream the request log and reduce to (ip,package,version), it
will be minimal.
I think it would fit into memory, e.g. 10 million unique IP adresses
x 100 packages x 40 bytes = 40 GB

Where has 100 packages come from here? There are 34 *thousand* source
packages in bookworm, i.e. over 100 times your quoted estimate.

You also seem to have underestimated quite a bit if you believe that
you can fit an IPv6 address, a package name and a package version into
40 bytes in most cases, yet alone all.

(As an aside, the RAM allocation on the logging hosts is currently
2GB.)

Regards,

Adam

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From =?UTF-8?Q?Julien_Plissonneau_Duqu=C@21:1/5 to All on Mon May 26 11:20:01 2025

Hi,

I would be interested in per-package-and-version download statistics and
trends as well.

Le 2025-05-03 09:28, Philipp Kern a écrit :

The problem is that we currently do not want to retain this data.

You're absolutely right here, there is no point in retaining the raw
data, it gets stale pretty fast anyway. It has to be processed with
minimal delay and then fed into some kind of time-series database.

It'd require a clear measure of usefulness, not just a "it would be
nice if we had it". And there would need to be actual criteria of what
we would be interested in. Raw download count? Some measure of
bucketing by source IP or not? What about container/hermetic builders fetching the same ancient package over and over again from snapshot?
Does the version matter?

It would help (as an additional data input) when having to make
decisions about keeping or removing packages, especially those with very
low popcons. I would also expect the download counts to have an
interesting significance (for the sake of estimating the installed base)
right after releasing a package update.

Having the count of total successful downloads and the count of unique
IPs for a given package+version couple (= URI) within a given time
interval would be a good start. Further refinements could be implemented
later, like segregating counts by geographical area and
consumer/corporate address range. With these schemes there are no
privacy issues as IP addresses are not retained at all in the TSDB (not
even pseudonymized/anonymized). Time resolution could be hourly for
starting, and then maybe down to the minute for recent history depending
on the required processing power and storage.

There will be lots of packages that are rarely downloaded and still important.

Indeed. That's just additional data to help making decisions in cases
where we have doubts.

Back of the envelope math says that'd be 600 GB/d of raw syslog log
traffic.

I don't think that regular syslog is a reasonable way to retrieve that
amount of data from distant hosts. I don't know what are the options
with the current cache provider, but transferring already compressed
data every hour (or a shorter interval, or streaming compressed data)
sounds better. That would amount to ~2 GiB compressed (~25 GiB
uncompressed) data every hour on average, which seems workable.

Is there any way I could get a copy of a log file (current ones with 1% sampling) for experimenting?

Cheers,

--
Julien Plissonneau Duquène

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Erik Schulz@1:229/2 to All on Wed Apr 23 10:10:01 2025

From: erikschulz184@gmail.com

I'm interested in package popularity. I'm aware of popcon (https://popcon.debian.org/), but I'm more interested in actual
downloads.

Do the debian mirrors track unique downloads (e.g. by hashed IP
address), and if no, why not? I can understand the privacy argument,
but arguably package downloads aren't particularly revealing? And data
could be aggregated daily, thus limiting exposure.

Boyuan Yang pointed out in the debian-www list that the "repository
mirrors" often use third-party CDNs. I assume it uses DNS response load-balancing.
There's potential for the request log to be biased geographically, but
it might add interesting data. Another bias would be people using a
VPN, but they'd only be counted once per exit node (so you'd have some
IPs using an extreme number of packages).

Parsing the request logs could be fairly trivial:
1. reduce to unique pairs every 24h: (ip, package)
2. sum by package

--- SoupGate-Win32 v1.05
* Origin: you cannot sedate... all the things you hate (1:229/2)

From Erik Schulz@1:229/2 to tiposchi@tiscali.it on Fri May 2 23:20:02 2025

From: erikschulz184@gmail.com

misguided popularity

I would argue a more objective description is that the measurement has bias. I.e.
- repeat-download bias.
- external-download bias, when using mirrors.
- false-download bias, when malicious actors try to manipulate the
value, for example using many IPs.

I agree that installation-counting popcon avoids the first two, but
also suffers from 'willing-participant' bias. I have no idea how
severe this bias is.
So I have to theorize: maybe server installations are heavily
underrepresented; it also doesn't count the privacy conscious, and
maybe basic users that don't understand what it does and just say no.
I.e. entire classes are missing.

So the download-counting popcon may at least provide some new insights.

to just download the files many times to increase the popularity

I assume this attack applies to popcon as well? It would be trivial to
push false numbers. I'm not familiar with how it works, but if it just
pushes a list of installed packages, then it is even more trivial to
manipulate the numbers.

The unfortunate conclusion is that we can't rely on these numbers to
track actual popularity, only whether a package is likely being used.
I.e. very low numbers may be given lower priority on mirrors, which
can be relevant for long term archives (e.g. all packages ever used in
Debian for 20 years) that may prefer to only archive packages that are
more than 0.001% likely to be used.
If servers grossly underrepresented in the sample, the data may be
unreliable for this use case.
And download-counting popcon would at-worst include unused packages.
In the worst-case, someone fake-downloading every single package would
render this statistic very hard to use.

On Fri, May 2, 2025 at 1:28 AM Salvo Tomaselli <tiposchi@tiscali.it> wrote:

I presume do some misguided popularity ranking like pypi does, by counting the
number of downloads.

It works terribly because large organizations that actually download it many times will set up internal mirrors, so there is no chance for the value to have any meaning.

Also on pypi and similar there's an incentive to just download the files many times to increase the popularity (I provide a very nice tool to do that without consuming too much bandwidth, on my codeberg).

Plus of course, how would we even aggregate all the download counts from all the mirrors?

Best

--
Salvo Tomaselli

"Io non mi sento obbligato a credere che lo stesso Dio che ci ha dotato di senso, ragione ed intelletto intendesse che noi ne facessimo a meno."
-- Galileo Galilei

https://ltworf.codeberg.page/

--- SoupGate-Win32 v1.05
* Origin: you cannot sedate... all the things you hate (1:229/2)

From Erik Schulz@1:229/2 to adam@adam-barratt.org.uk on Sat May 3 13:30:01 2025

From: erikschulz184@gmail.com

Memory usage approximations:
per tuple:
ipv6 = 16
package pointer = 3 (assuming <16777216 packages)
version pointer = 2 (assuming <65536 distinct version names)
+ some overhead

~ 40 B seems fair?

But you could also just write to disk. It'll wear out an SSD though,
and random r/w on a harddrive is slow.

Where has 100 packages come from here?

That would be the average number of downloaded packages per IP per
day. I assume some would just download a single package, while others
are installing an entire system of +1000 packages, but 100 on average
seems a fair ballpark number.
What number do you suggest?

On Sat, May 3, 2025 at 11:39 AM Adam D. Barratt
<adam@adam-barratt.org.uk> wrote:

On Sat, 2025-05-03 at 11:16 +0200, Erik Schulz wrote:

I suspect that compliance with GDPR would require the data to be
stored minimally.
It seems reasonable to me that a 24-hour window would reduce most repeat-downloads.
If you stream the request log and reduce to (ip,package,version), it
will be minimal.
I think it would fit into memory, e.g. 10 million unique IP adresses
x 100 packages x 40 bytes = 40 GB

Where has 100 packages come from here? There are 34 *thousand* source packages in bookworm, i.e. over 100 times your quoted estimate.

You also seem to have underestimated quite a bit if you believe that
you can fit an IPv6 address, a package name and a package version into
40 bytes in most cases, yet alone all.

(As an aside, the RAM allocation on the logging hosts is currently
2GB.)

Regards,

Adam

--- SoupGate-Win32 v1.05
* Origin: you cannot sedate... all the things you hate (1:229/2)

From Erik Schulz@1:229/2 to phil@philkern.de on Sat May 3 11:20:01 2025

From: erikschulz184@gmail.com

I suspect that compliance with GDPR would require the data to be
stored minimally.
It seems reasonable to me that a 24-hour window would reduce most repeat-downloads.
If you stream the request log and reduce to (ip,package,version), it
will be minimal.
I think it would fit into memory, e.g. 10 million unique IP adresses x
100 packages x 40 bytes = 40 GB
The program code could ideally be generalized and used by other distros as well.

On Sat, May 3, 2025 at 10:43 AM Philipp Kern <phil@philkern.de> wrote:

On 2025-05-03 03:35, Otto Kekäläinen wrote:

I'm interested in package popularity. I'm aware of popcon
(https://popcon.debian.org/), but I'm more interested in actual
downloads.

I am also interested in usage statistics. I feel it is much more
meaningful to work on packages that I know how have a lot of users.

While neither popcon of download stats are accurate, they still show
trends and relative numbers which can be used to make useful
conclusions. I would be glad to see if people could share ideas on
what stats we could collect and publish instead of just pointing out
flaws in various stats.

The problem is that we currently do not want to retain this data. It'd require a clear measure of usefulness, not just a "it would be nice if
we had it". And there would need to be actual criteria of what we would
be interested in. Raw download count? Some measure of bucketing by
source IP or not? What about container/hermetic builders fetching the
same ancient package over and over again from snapshot? Does the version matter?

In the end there would probably need to be a proof of concept of a log processor that's privacy-friendly and gives us the metrics that we
actually want. Hence my question what these metrics are for, except for
a fuzzy feeling of "working on the right priorities". There will be lots
of packages that are rarely downloaded and still important.

Everyone can ask "please just retain all logs and we will do analysis on
them later". Right now it'd be infeasible to get the statistics from the mirrors, and we could at most get statistics for deb.d.o. To give a
sense of scale: We are sampling 1% of cache hits and all errors right
now. That's 6.7 GB/d uncompressed (500 M/d compressed). Back of the
envelope math says that'd be 600 GB/d of raw syslog log traffic. We
should have a very good reason for collecting this much data.

Kind regards
Philipp Kern

--- SoupGate-Win32 v1.05
* Origin: you cannot sedate... all the things you hate (1:229/2)

Who's Online
Recent Visitors
- Plume
  Sun Sep 14 09:34:52 2025
  from Uk via Raw
- Gretchiie
  Sun Sep 14 06:07:30 2025
  from Derry, Nh via Telnet
- Thlc
  Sat Sep 13 17:11:34 2025
  from Rognac, France via Telnet
- Thlc
  Sat Sep 13 17:04:03 2025
  from Rognac, France via Telnet
- Thlc
  Sat Sep 13 16:32:19 2025
  from Rognac, France via SSH
- Thlc
  Sat Sep 13 15:41:11 2025
  from Rognac, France via SSH
- Thlc
  Sat Sep 13 07:56:03 2025
  from Rognac, France via SSH
- Gretchiie
  Sat Sep 13 07:22:10 2025
  from Derry, Nh via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	546
Nodes:	16 (0 / 16)
Uptime:	169:01:33
Calls:	10,385
Calls today:	2
Files:	14,057
Messages:	6,416,551

Re: Package statistics by downloads

Who's Online

Recent Visitors

System Info