• Re: Package statistics by downloads

    From Philipp Kern@21:1/5 to Erik Schulz on Wed Apr 23 11:20:01 2025
    On 2025-04-23 10:08, Erik Schulz wrote:
    I'm interested in package popularity. I'm aware of popcon (https://popcon.debian.org/), but I'm more interested in actual
    downloads.

    What would this be useful for? You only described technical details, not
    why we would want to do this.

    Kind regards
    Philipp Kern

  • From Salvo Tomaselli@21:1/5 to All on Fri May 2 01:30:01 2025
    I presume it's to do some misguided popularity ranking like pypi does, by
    counting the number of downloads.

    It works terribly, because large organizations that actually download
    packages many times will set up internal mirrors, so there is no chance
    for the value to have any meaning.

    Also on pypi and similar there's an incentive to just download the files many times to increase the popularity (I provide a very nice tool to do that
    without consuming too much bandwidth, on my codeberg).

    Plus of course, how would we even aggregate all the download counts from all the mirrors?

    Best


    --
    Salvo Tomaselli

    "Io non mi sento obbligato a credere che lo stesso Dio che ci ha dotato di senso, ragione ed intelletto intendesse che noi ne facessimo a meno."
    -- Galileo Galilei

    https://ltworf.codeberg.page/

  • From Otto Kekäläinen@21:1/5 to All on Sat May 3 03:40:01 2025
    I'm interested in package popularity. I'm aware of popcon (https://popcon.debian.org/), but I'm more interested in actual
    downloads.

    I am also interested in usage statistics. I feel it is much more
    meaningful to work on packages that I know have a lot of users.

    While neither popcon nor download stats are accurate, they still show
    trends and relative numbers which can be used to draw useful
    conclusions. I would be glad to see if people could share ideas on
    what stats we could collect and publish instead of just pointing out
    flaws in various stats.

  • From Philipp Kern@21:1/5 to All on Sat May 3 09:50:02 2025
    On 2025-05-03 03:35, Otto Kekäläinen wrote:
    I'm interested in package popularity. I'm aware of popcon
    (https://popcon.debian.org/), but I'm more interested in actual
    downloads.

    I am also interested in usage statistics. I feel it is much more
    meaningful to work on packages that I know have a lot of users.

    While neither popcon nor download stats are accurate, they still show
    trends and relative numbers which can be used to draw useful
    conclusions. I would be glad to see if people could share ideas on
    what stats we could collect and publish instead of just pointing out
    flaws in various stats.

    The problem is that we currently do not want to retain this data. It'd
    require a clear measure of usefulness, not just an "it would be nice if
    we had it". And there would need to be actual criteria for what we would
    be interested in. Raw download count? Some measure of bucketing by
    source IP or not? What about container/hermetic builders fetching the
    same ancient package over and over again from snapshot? Does the version
    matter?

    In the end there would probably need to be a proof of concept of a log
    processor that's privacy-friendly and gives us the metrics that we
    actually want. Hence my question about what these metrics are for, except
    for a fuzzy feeling of "working on the right priorities". There will be
    lots of packages that are rarely downloaded and still important.

    Everyone can ask "please just retain all logs and we will do analysis on
    them later". Right now it'd be infeasible to get the statistics from the
    mirrors, and we could at most get statistics for deb.d.o. To give a
    sense of scale: we are sampling 1% of cache hits and all errors right
    now. That's 6.7 GB/d uncompressed (500 MB/d compressed). Back-of-the-
    envelope math says that'd be 600 GB/d of raw syslog traffic. We
    should have a very good reason for collecting this much data.
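
    As a rough sanity check on that extrapolation (assuming the 1% sample
    simply scales linearly, which overcounts slightly since errors are
    already logged in full):

        # Illustrative only: extrapolate the sampled log volume to full logging.
        sampled_gb_per_day = 6.7        # 1% of cache hits plus all errors
        sampling_rate = 0.01
        full_gb_per_day = sampled_gb_per_day / sampling_rate
        print(full_gb_per_day)          # ~670 GB/d, i.e. the ~600 GB/d ballpark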

    Kind regards
    Philipp Kern

  • From Peter B@21:1/5 to All on Sat May 3 11:30:02 2025
    On 03/05/2025 02:35, Otto Kekäläinen wrote:
    I am also interested in usage statistics. I feel it is much more
    meaningful to work on packages that I know have a lot of users.
    +1

    While neither popcon nor download stats are accurate, they still show
    trends and relative numbers which can be used to draw useful
    conclusions. I would be glad to see if people could share ideas on
    what stats we could collect and publish instead of just pointing out
    flaws in various stats.

    I was disappointed when Ubuntu stopped publishing popcon data.

    My understanding is that popcon is set up to report data to an address
    that is distro dependent. Do any of our downstreams actually harvest
    this info?

    Maybe instead the downstream data could come to Debian with the distro
    as an attribute? Without factoring in the downstream data, desktop
    package usage overall is likely to be understated.


    Cheers,
    Peter

  • From Adam D. Barratt@21:1/5 to Erik Schulz on Sat May 3 11:40:01 2025
    On Sat, 2025-05-03 at 11:16 +0200, Erik Schulz wrote:
    I suspect that compliance with GDPR would require the data to be
    stored minimally.
    It seems reasonable to me that a 24-hour window would reduce most repeat-downloads.
    If you stream the request log and reduce to (ip,package,version), it
    will be minimal.
    I think it would fit into memory, e.g. 10 million unique IP addresses
    x 100 packages x 40 bytes = 40 GB

    Where has 100 packages come from here? There are 34 *thousand* source
    packages in bookworm, i.e. over 100 times your quoted estimate.

    You also seem to have underestimated quite a bit if you believe that
    you can fit an IPv6 address, a package name and a package version into
    40 bytes in most cases, let alone all.

    (As an aside, the RAM allocation on the logging hosts is currently
    2GB.)
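
    To make that concrete, a rough re-estimate (my own illustrative numbers,
    taking the 10 million x 100 figures at face value and only replacing the
    40-byte entry size with something more realistic):

        # Illustrative only: the per-entry sizes below are assumptions, not measurements.
        unique_ips = 10_000_000      # as in the quoted estimate
        packages_per_ip = 100        # as in the quoted estimate
        bytes_per_entry = (
            16                       # raw IPv6 address
            + 20                     # package name (assumed average)
            + 15                     # version string (assumed average)
            + 50                     # hash table / object overhead (assumed)
        )
        total_gb = unique_ips * packages_per_ip * bytes_per_entry / 1e9
        print(total_gb)              # ~100 GB, versus 40 GB quoted and 2 GB of RAM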

    Regards,

    Adam

  • From Julien Plissonneau Duquène@21:1/5 to All on Mon May 26 11:20:01 2025
    Hi,

    I would be interested in per-package-and-version download statistics and
    trends as well.

    On 2025-05-03 09:28, Philipp Kern wrote:

    The problem is that we currently do not want to retain this data.

    You're absolutely right here: there is no point in retaining the raw
    data, as it gets stale pretty fast anyway. It has to be processed with
    minimal delay and then fed into some kind of time-series database.

    It'd require a clear measure of usefulness, not just an "it would be
    nice if we had it". And there would need to be actual criteria for what
    we would be interested in. Raw download count? Some measure of
    bucketing by source IP or not? What about container/hermetic builders
    fetching the same ancient package over and over again from snapshot?
    Does the version matter?

    It would help (as an additional data input) when having to make
    decisions about keeping or removing packages, especially those with very
    low popcons. I would also expect the download counts to be particularly
    meaningful (for the sake of estimating the installed base) right after a
    package update is released.

    Having the count of total successful downloads and the count of unique
    IPs for a given package+version pair (= URI) within a given time
    interval would be a good start. Further refinements could be implemented
    later, like segregating counts by geographical area or by
    consumer/corporate address range. With these schemes there are no
    privacy issues, as IP addresses are not retained at all in the TSDB (not
    even pseudonymized/anonymized). Time resolution could be hourly to start
    with, and then maybe down to the minute for recent history, depending
    on the required processing power and storage.
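
    As a minimal sketch of what such a processor could look like (the log
    format, field names and status handling here are assumptions for
    illustration, not the actual deb.d.o format), counting downloads and
    unique IPs per package+version for one hour and discarding the IPs
    afterwards:

        # Sketch of an hourly, privacy-friendly aggregator: raw requests go in,
        # only (package, version) -> (downloads, unique_ips) comes out.
        from collections import defaultdict

        def aggregate_hour(requests):
            """requests: iterable of (ip, package, version, status) for one hour."""
            downloads = defaultdict(int)   # (package, version) -> successful requests
            seen_ips = defaultdict(set)    # (package, version) -> IPs, dropped after the hour
            for ip, package, version, status in requests:
                if status != 200:          # count only successful downloads
                    continue
                key = (package, version)
                downloads[key] += 1
                seen_ips[key].add(ip)
            # Only the aggregates leave this function; the IP sets are discarded
            # here, so nothing identifying is written to the TSDB.
            return {key: (downloads[key], len(seen_ips[key])) for key in downloads}

        # Example: two IPs, three successful fetches of one package+version.
        sample = [
            ("2001:db8::1", "curl", "8.5.0-2", 200),
            ("2001:db8::1", "curl", "8.5.0-2", 200),
            ("2001:db8::2", "curl", "8.5.0-2", 200),
        ]
        print(aggregate_hour(sample))      # {('curl', '8.5.0-2'): (3, 2)}

    In practice the literal IP sets would probably be replaced by a
    cardinality sketch such as HyperLogLog to keep memory bounded (addresses
    are then only hashed, never stored).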

    There will be lots of packages that are rarely downloaded and still important.

    Indeed. That's just additional data to help make decisions in cases
    where we have doubts.

    Back of the envelope math says that'd be 600 GB/d of raw syslog log
    traffic.

    I don't think that regular syslog is a reasonable way to retrieve that
    amount of data from distant hosts. I don't know what the options are
    with the current cache provider, but transferring already-compressed
    data every hour (or at a shorter interval, or streaming compressed data)
    sounds better. That would amount to ~2 GiB compressed (~25 GiB
    uncompressed) data every hour on average, which seems workable.
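
    Back-of-the-envelope for those hourly figures (the compression ratio is
    assumed from the 6.7 GB/d vs 500 MB/d sample numbers quoted earlier):

        # Illustrative only: hourly transfer estimate derived from the figures above.
        full_gb_per_day = 600
        compression_ratio = 6.7 / 0.5                                    # ~13:1
        per_hour_uncompressed = full_gb_per_day / 24                     # ~25 GB/h
        per_hour_compressed = per_hour_uncompressed / compression_ratio  # ~1.9 GB/h
        print(per_hour_uncompressed, per_hour_compressed)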

    Is there any way I could get a copy of a log file (a current one, with 1%
    sampling) for experimenting?

    Cheers,

    --
    Julien Plissonneau Duquène
