• Re: Good hash for pointers

    From Richard Harnden@21:1/5 to Malcolm McLean on Thu May 23 15:37:44 2024
    On 23/05/2024 12:11, Malcolm McLean wrote:

    What is a good hash function for pointers to use in portable ANSI C?

    All your pointers from malloc are going to be unique.
    All of their low bits are going to be zero, because they need to align
    on some n-byte boundary.
    All of their high bits are going to be the same, because that's just
    how it works.

    So just take the middle: hash = ((intptr_t) ptr) >> 4 & 0xffff;

    Good enough?
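    A minimal sketch of that idea as a standalone function. It assumes
    <stdint.h> is available for uintptr_t (strict C89 does not provide it,
    as discussed later in the thread); the shift and mask are the values
    suggested above, and the function name is only illustrative:

    #include <stdint.h>

    /* Keep 16 bits from the middle of the pointer: drop the low
       alignment bits and ignore the high bits, which rarely vary. */
    static unsigned int hash_ptr_middle(const void *p)
    {
        return (unsigned int)(((uintptr_t)p >> 4) & 0xffffu);
    }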

  • From Kaz Kylheku@21:1/5 to Malcolm McLean on Thu May 23 20:34:13 2024
    On 2024-05-23, Malcolm McLean <malcolm.arthur.mclean@gmail.com> wrote:

    What is a good hash function for pointers to use in portable ANSI C?

    I don't think that it's practical to try to do this in some portable
    way, let alone to an old dialect of C from 1989.

    Let's say we have a type uintptr_t to which we can convert a void *
    pointer.

    In the TXR Lisp run time, here is what I did (paraphrased):

    switch (CHAR_BIT * sizeof (void *)) {
    case 32:
        return ((uintptr_t) ptr) >> 4;
    case 64:
        return ((uintptr_t) ptr) >> 5;
    }

    The idea is that on 32 bits, we suspect that pointers from malloc may be
    aligned on 16-byte boundaries, so the least significant four bits don't
    tell us anything useful; they are likely all zero, which is bad. On 64
    bits, we throw away five bits.

    This is actually an important hashing function because it's used
    for hashing objects that use eq equality even when compared using
    equal.

    For instance, all structs for which an equality substitute is not
    defined.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

  • From Tim Rentsch@21:1/5 to Malcolm McLean on Thu May 23 15:49:42 2024
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    What is a good hash function for pointers to use in portable
    ANSI C?

    I have a preliminary question. Do you really mean ANSI C, or
    is C99 acceptable?

  • From Tim Rentsch@21:1/5 to Malcolm McLean on Thu May 23 16:52:03 2024
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 23/05/2024 23:49, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    What is a good hash function for pointers to use in portable
    ANSI C?

    I have a preliminary question. Do you really mean ANSI C, or
    is C99 acceptable?

    C89 is better.
    But the pass has been sold.

    I'm not asking which you think is better. I'm asking about
    what your requirements are.

  • From Tim Rentsch@21:1/5 to Malcolm McLean on Thu May 23 18:39:56 2024
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 24/05/2024 00:52, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 23/05/2024 23:49, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    What is a good hash function for pointers to use in portable
    ANSI C?

    I have a preliminary question. Do you really mean ANSI C, or
    is C99 acceptable?

    C89 is better.
    But the pass has been sold.

    I'm not asking which you think is better. I'm asking about
    what your requirements are.

    C 89.
    I don't want to pull in C99 types and so on just for a hash function.

    In that case I think you are stuck with using a half-baked
    solution. The standard integer types available in C89 just
    aren't a good fit in a 64-bit world.

  • From jak@21:1/5 to All on Fri May 24 04:09:41 2024
    Chris M. Thomasson wrote:
    On 5/23/2024 5:32 PM, Chris M. Thomasson wrote:
    On 5/23/2024 4:11 AM, Malcolm McLean wrote:

    What is a good hash function for pointers to use in portable ANSI C?

    The pointers are nodes of a tree, which are read only, and I want to
    associate read/write data with them. So potentially a large number of
    pointers, and they might be consecutively ordered if they are taken
    from an array, or they might be returned from repeated calls to
    malloc() with small allocations. Obviously I have no control over
    pointer size or internal representation.


    I kind of remember an old one for 32-bit pointers. It was something
    like multiply by a prime and shift to the right by 7 or 8. Shit, I
    think. It was way back on comp.programming.threads.

    There was an old paper from microsoft that had a pointer hash. It was
    used to index into a table of locks. I need to find it.

    <https://wiki.osdev.org/Global_Descriptor_Table>

  • From David Brown@21:1/5 to Malcolm McLean on Fri May 24 09:41:54 2024
    On 24/05/2024 02:28, Malcolm McLean wrote:
    On 24/05/2024 00:52, Tim Rentsch wrote:
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 23/05/2024 23:49, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    What is a good hash function for pointers to use in portable
    ANSI C?

    I have a preliminary question.  Do you really mean ANSI C, or
    is C99 acceptable?

    C89 is better.
    But the pass has been sold.

    I'm not asking which you think is better.  I'm asking about
    what your requirements are.

    C 89.
    I don't want to pull in C99 types and so on just for a hash function.


    Tim is right that the <stdint.h> types are /vastly/ better for this sort
    of thing. And you won't find any C89/C90 compilers that don't support <stdint.h> even in C89/C90 mode. You might choose to avoid other C99
    features, in case you have any users that insist on C89/C90, but you
    can't get anywhere with just the standard C89/C90 types. On a platform
    such as 64-bit Windows, pointers are 64-bit but the biggest C89/C90
    integer type is "unsigned long", which is 32-bit on that platform. You
    can't get at the upper bits of the pointer without compiler-specific
    extensions - and if you allow extensions, you might as well allow
    <stdint.h> and make life far simpler for everyone.

    If you insist on using C89/C90, make sure that the code is compatible
    (with the same semantics) for more modern standards. There are few
    fanatics left who think C89/C90 is "better" than C99 or newer standards,
    so the majority of potential users will want C99, C11, or - perhaps most
    likely - a compiler-specific variation of one of these. It would be a
    shame if your code relied on pre-C99 integer promotion rules, or used
    implicit int, or had identifiers that conflict with newer standards.
    These would undoubtedly be far worse for users than any purely imagined problems caused by #include <stdint.h>.

    (As a pedantic quibble, I believe ANSI invariably re-publish ISO C
    standards as updated ANSI C standards. So "ANSI C" currently refers to
    C17. It is best, IMHO, to explicitly say C89 and/or C90 rather than
    "ANSI C".)

  • From bart@21:1/5 to Tim Rentsch on Fri May 24 11:14:36 2024
    On 24/05/2024 02:39, Tim Rentsch wrote:
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 24/05/2024 00:52, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 23/05/2024 23:49, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    What is a good hash function for pointers to use in portable
    ANSI C?

    I have a preliminary question. Do you really mean ANSI C, or
    is C99 acceptable?

    C89 is better.
    But the pass has been sold.

    I'm not asking which you think is better. I'm asking about
    what your requirements are.

    C 89.
    I don't want to pull in C99 types and so on just for a hash function.

    In that case I think you are stuck with using a half-baked
    solution. The standard integer types available in C89 just
    aren't a good fit in a 64-bit world.

    I assume the C89 implementation is one that can target current 64 bit
    machines.

    Then char, short, int, long long will almost certainly have widths of 8,
    16, 32 and 64 bits respectively.

    (I don't know if 'long long' was part of C89, but it sounds like Malcolm
    just doesn't want to be bothered with stdint.h, and any compiler used is
    likely to support it.

    I can't stand it either. Just today I wrote these two lines:

    typedef unsigned long long u64;
    typedef unsigned char byte;

    to avoid stdint.h and its ugly set of types.)

  • From Tim Rentsch@21:1/5 to bart on Fri May 24 06:18:15 2024
    bart <bc@freeuk.com> writes:

    On 24/05/2024 02:39, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 24/05/2024 00:52, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 23/05/2024 23:49, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    What is a good hash function for pointers to use in portable
    ANSI C?

    I have a preliminary question. Do you really mean ANSI C, or
    is C99 acceptable?

    C89 is better.
    But the pass has been sold.

    I'm not asking which you think is better. I'm asking about
    what your requirements are.

    C 89.
    I don't want to pull in C99 types and so on just for a hash
    function.

    In that case I think you are stuck with using a half-baked
    solution. The standard integer types available in C89 just
    aren't a good fit in a 64-bit world.

    I assume the C89 implementation is one that can target current 64
    bit machines.

    Then char, short, int, long long will almost certainly have widths
    of 8, 16, 32 and 64 bits respectively.

    C89 doesn't have long long.

    (I don't know if 'long long' was part of C89, but it sounds like
    Malcolm just doesn't want to be bothered with stdint.h, and any
    compiler used is likely to support it.

    What he said was C89. He didn't mention stdint.h. I take
    him at his word. If what he wants is something different,
    he should say clearly what it is, and not make people guess
    about it. (To be clear this recommendation is intended for
    every questioner, not just Malcolm.)

  • From Scott Lurndal@21:1/5 to Malcolm McLean on Fri May 24 14:51:47 2024
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
    On 24/05/2024 14:18, Tim Rentsch wrote:
    bart <bc@freeuk.com> writes:


    cJSON.c is C89. And these additions to the resource compiler were
    inspired by a mention of cJSON.c here.

    So a C89 hash for a pointer to an unsigned int would be ideal. However
    it might be impossible to write one which is both efficient in terms of
    machine instructions and a good hash function in that it distributes the
    hashes evenly given an uneven distribution of keys. And pointers
    returned from repeated calls to malloc() are going to have an odd
    distribution of values.

    Just stick the bool in your DOM[*] nodes. Forget about const.


    [*] Document Object Model (i.e. tree structure).
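    A sketch of that suggestion with a hypothetical node layout (the field
    and type names are illustrative, not taken from the actual cJSON or
    Baby X code): carry the per-node flag inside the node itself rather
    than keying an external hash table on the node's address.

    struct xml_node {
        struct xml_node *parent;
        struct xml_node *children;
        struct xml_node *next;
        char *tag;
        int seen;             /* the read/write data lives right here */
    };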

  • From David Brown@21:1/5 to bart on Fri May 24 17:00:16 2024
    On 24/05/2024 12:14, bart wrote:
    On 24/05/2024 02:39, Tim Rentsch wrote:
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 24/05/2024 00:52, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 23/05/2024 23:49, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    What is a good hash function for pointers to use in portable
    ANSI C?

    I have a preliminary question.  Do you really mean ANSI C, or
    is C99 acceptable?

    C89 is better.
    But the pass has been sold.

    I'm not asking which you think is better.  I'm asking about
    what your requirements are.

    C 89.
    I don't want to pull in C99 types and so on just for a hash function.

    In that case I think you are stuck with using a half-baked
    solution.  The standard integer types available in C89 just
    aren't a good fit in a 64-bit world.

    I assume the C89 implementation is one that can target current 64 bit machines.


    Targeting 64-bit machines does not imply support for 64-bit types using
    standard C89 integer types. You'll probably find that most, if not all,
    C89 compilers that target 64-bit processors but have 32-bit "long" will
    have extensions for a 64-bit type. (It won't be an "extended integer
    type", as defined by the C standards, but it will be good enough for the
    job.) But if you use such extensions then you are not using standard C89.

    Then char, short, int, long long will almost certainly have widths of 8,
    16, 32 and 64 bits respectively.

    (I don't know if 'long long' was part of C89, but it sounds like Malcolm
    just doesn't want to be bothered with stdint.h, and any compiler used is
    likely to support it.


    C89 does not have "long long". And "long" is 32-bit on 64-bit windows.
    There is no way in C89 to pick an integer type that is at least or
    exactly 64 bits. There is also no way to refer an integer type that is
    big enough to support an integer conversion from a pointer. (In C99, in <stdint.h>, you have "uintptr_t" and "intptr_t" which can be used if and
    only if the platform supports an integer type big enough for the integer conversion of pointers.)

  • From Michael S@21:1/5 to Malcolm McLean on Fri May 24 19:27:48 2024
    On Fri, 24 May 2024 17:10:45 +0100
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> wrote:

    So if we use uintptr_t the code might break.
    And if we use unsigned long we have to assume maximum 32 bits.
    However it's unlikely anyone will have an xml file with more than a
    billion nodes.


    It's not about the number of nodes. It's that good hash functions tend to use
    64-bit multiplication.
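    A sketch of the kind of 64-bit multiplicative hash being alluded to. It
    assumes an unsigned 64-bit type is available (unsigned long long here,
    so C99 or a common C89 extension); the constant is the usual
    golden-ratio multiplier and the function name is illustrative:

    #include <stddef.h>

    /* Multiply the pointer bits by a 64-bit odd constant (2^64 divided
       by the golden ratio) and keep the top k bits, which are the best
       mixed.  Assumes 1 <= k <= 63. */
    static unsigned long long hash_ptr_mult(void *p, unsigned k)
    {
        unsigned long long x = (unsigned long long)(size_t)p;

        x *= 0x9E3779B97F4A7C15ULL;
        return x >> (64 - k);
    }

    For a 64K-entry table, k would be 16.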

  • From bart@21:1/5 to Malcolm McLean on Sat May 25 00:54:34 2024
    On 24/05/2024 19:57, Malcolm McLean wrote:
    On 24/05/2024 19:28, Bonita Montero wrote:
    On 23.05.2024 at 13:11, Malcolm McLean wrote:

    What is a good hash function for pointers to use in portable ANSI C?

    The pointers are nodes of a tree, which are read only, and I want to
    associate read/write data with them. So potentially a large number of
    pointers, and they might be consecutively ordered if they are taken
    from an array, or they might be returned from repeated calls to
    malloc() with small allocations. Obviously I have no control over
    pointer size or internal representation.


    Use FNV.


    Here's an attempt.

    /* FNV hash of a pointer */
    static unsigned int hash(void *address)
    {
        int i;
        unsigned long answer = 2166136261;
        unsigned char *byte = (unsigned char *) &address;

        for (i = 0; i < sizeof(void *); i++)
        {
            answer *= 16777619;
            answer ^= byte[i];
        }
        return (unsigned int) (answer & 0xFFFFFFFF);
    }

    Now what will compilers make of that?

    Compiler, or performance?

    I tried this function with the test program shown below. I used it to
    populate a hash table of 64K entries with pointers from successive calls
    to malloc.

    Results, in terms of clashes, for different numbers N of entries
    populated out of 64K were:

    10000 1100
    30000 12000
    50000 67000
    60000 216000
    65535 5500000 (largest N possible)

    Results were rather variable as malloc produces different patterns of
    pointers on different runs. These were simply the results from the
    first run.

    Was this good? I've no idea. But as a comparison, I used my own hash
    function, normally used to hash identifiers, shown below the main
    program as the function 'bchash'.

    If this is substituted instead, the results were:

    10000 230
    30000 3800
    50000 10300
    60000 50300
    65535 2700000

    Hash tables need a certain amount of free capacity to stay efficient, so
    3/4 full (about 50K/64K) is about right.

    Again I don't know if these figures are good, they are just better than
    from your hash() function, for the inputs I used, within this test, and
    for this size of table.

    No doubt there are much better ones.



    ------------------------------------------
    #include <stdio.h>
    #include <stdlib.h>

    static unsigned int hash(void *address)
    {
        int i;
        unsigned long answer = 2166136261;
        unsigned char *byte = (unsigned char *) &address;

        for (i = 0; i < sizeof(void *); i++)
        {
            answer *= 16777619;
            answer ^= byte[i];
        }
        return (unsigned int) (answer & 0xFFFFFFFF);
    }

    void* table[65536];

    int main(void) {
        void* p;

        int clashes=0, wrapped;
        int j;

        for (int i=0; i<30000; ++i) {
            p = malloc(1);
            j = hash(p) & 65535;

            wrapped=0;
            while (table[j]) {
                ++clashes;
                ++j;
                if (j>65535) {
                    if (wrapped) { puts("Table full"); exit(1);}
                    wrapped=1;
                    j=0;
                }
            }
            table[j] = p;
        }
        printf("Clashes %d\n", clashes);
    }


    ------------------------------------------

    static unsigned int bchash(void *address)
    {
        int i;
        unsigned long hsum = 0;
        unsigned char *byte = (unsigned char *) &address;

        for (i = 0; i < sizeof(void *); i++) {
            hsum = (hsum<<4) - hsum + byte[i];
        }
        return (hsum<<5) - hsum;
    }

  • From Tim Rentsch@21:1/5 to bart on Sat May 25 02:12:28 2024
    bart <bc@freeuk.com> writes:

    On 24/05/2024 19:57, Malcolm McLean wrote:

    On 24/05/2024 19:28, Bonita Montero wrote:

    On 23.05.2024 at 13:11, Malcolm McLean wrote:

    What is a good hash function for pointers to use in portable ANSI C?

    The pointers are nodes of a tree, which are read only, and I want
    to associate read/write data with them. So potentially a large
    number of pointers, and they might be consecutively ordered if they
    are taken from an array, or they might be returned from repeated
    calls to malloc() with small allocations. Obviously I have no
    control over pointer size or internal representation.

    Use FNV.

    Here's an attempt.

    /* FNV hash of a pointer */
    static unsigned int hash(void *address)
    {
        int i;
        unsigned long answer = 2166136261;
        unsigned char *byte = (unsigned char *) &address;

        for (i = 0; i < sizeof(void *); i++)
        {
            answer *= 16777619;
            answer ^= byte[i];
        }
        return (unsigned int) (answer & 0xFFFFFFFF);
    }

    Now what will compilers make of that?

    Compiler, or performance?

    I tried this function with the test program shown below. I used it to populate a hash table of 64K entries with pointers from successive
    calls to malloc.

    Results, in terms of clashes, for different numbers N of entries
    populated out of 64K were:

    10000 1100
    30000 12000
    50000 67000
    60000 216000
    65535 5500000 (largest N possible)

    Results were rather variable as malloc produces different patterns of
    pointers on different runs. These were simply the results from the
    first run.

    Was this good? I've no idea. But as a comparison, I used my own hash
    function, normally used to hash identifiers, shown below the main
    program as the function 'bchash'.

    If this is substituted instead, the results were:

    10000 230
    30000 3800
    50000 10300
    60000 50300
    65535 2700000

    Hash tables need a certain amount of free capacity to stay efficient,
    so 3/4 full (about 50K/64K) is about right.

    Again I don't know if these figures are good, they are just better
    than from your hash() function, for the inputs I used, within this
    test, and for this size of table.

    No doubt there are much better ones.



    [quoted test program and bchash function snipped; see the full listings above]

    It looks like your hash function was tuned for this testing
    setup. With different choices for testing it does much
    worse.

    The testing is done with malloc() blocks all of the same
    requested size, and that size is 1. Typical workloads
    are likely to be both larger and more variable.

    When adding a new entry finds a collision with an old
    entry, linear probing is used to find a free slot. It's
    well understood that linear probing suffers badly as the
    load factor goes up. Better to take a few high bits of
    the hash value -- as few as five or six is fine -- to
    have the reprobe sequences have different strides.

    Your hash function is expensive to compute, moreso even
    than the "FNV" function shown earlier. In a case like
    this one where the compares are cheap, it's better to
    have a dumb-but-fast hash function that might need a
    few more looks to find an open slot, because the cost
    of looking is so cheap compared to computing the hash
    function.
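    A sketch of the reprobing idea described above, assuming a power-of-two
    table and a 32-bit hash value (the names and sizes are illustrative):
    the probe stride is derived from the top bits of the hash and forced
    odd, so it is coprime with the table size and visits every slot.

    #define TABLE_SIZE 65536u                /* must be a power of two */

    static unsigned int first_slot(unsigned int h)
    {
        return h & (TABLE_SIZE - 1);
    }

    static unsigned int probe_stride(unsigned int h)
    {
        return ((h >> 26) << 1) | 1u;        /* top 6 bits, forced odd */
    }

    On a collision the next slot to try is then
    (slot + probe_stride(h)) & (TABLE_SIZE - 1).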

  • From Tim Rentsch@21:1/5 to Malcolm McLean on Sat May 25 02:49:00 2024
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 24/05/2024 14:18, Tim Rentsch wrote:

    bart <bc@freeuk.com> writes:

    On 24/05/2024 02:39, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 24/05/2024 00:52, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 23/05/2024 23:49, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    What is a good hash function for pointers to use in portable
    ANSI C?

    I have a preliminary question. Do you really mean ANSI C, or
    is C99 acceptable?

    C89 is better.
    But the pass has been sold.

    I'm not asking which you think is better. I'm asking about
    what your requirements are.

    C 89.
    I don't want to pull in C99 types and so on just for a hash
    function.

    In that case I think you are stuck with using a half-baked
    solution. The standard integer types available in C89 just
    aren't a good fit in a 64-bit world.

    I assume the C89 implementation is one that can target current 64
    bit machines.

    Then char, short, int, long long will almost certainly have widths
    of 8, 16, 32 and 64 bits respectively.

    C89 doesn't have long long.

    (I don't know if 'long long' was part of C89, but it sounds like
    Malcolm just doesn't want to be bothered with stdint.h, and any
    compiler used is likely to support it.

    What he said was C89. He didn't mention stdint.h. I take
    him at his word. If what he wants is something different,
    he should say clearly what it is, and not make people guess
    about it. (To be clear this recommendation is intended for
    every questioner, not just Malcolm.)

    cJSON.c is C89. And these additions to the resource compiler were
    inspired by a mention of cJSON.c here.

    So a C89 hash for a pointer to an unsigned int would be ideal. However
    it might be impossible to write one which is both efficient in terms
    of machine instructions and a good hash function in that it
    distributes the hashes evenly given an uneven distribution of
    keys. And pointers returned from repeated calls to malloc() are going
    to have an odd distribution of values.

    I suggest you look for a way to write and use a hash function that
    uses unsigned long long without that causing cJSON.c to need
    compiling by a post-C89 ruleset.
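    One hedged way to arrange that, sketched here: keep the wide type
    behind a typedef so a strict C89 build falls back to unsigned long
    while a C99 (or extended C89) build gets unsigned long long. The guard
    shown is only one possibility, and all the names are illustrative:

    #include <stddef.h>

    #if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
    typedef unsigned long long hashuint;     /* at least 64 bits */
    #else
    typedef unsigned long hashuint;          /* C89 fallback, >= 32 bits */
    #endif

    static unsigned int ptrhash(const void *p)
    {
        hashuint h = 0;
        const unsigned char *byte = (const unsigned char *) &p;
        size_t i;

        for (i = 0; i < sizeof p; i++)
            h = h * 257 + byte[i];           /* simple byte-by-byte mix */
        return (unsigned int) (h & 0xFFFFFFFFul);
    }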

  • From bart@21:1/5 to Tim Rentsch on Sat May 25 12:28:49 2024
    On 25/05/2024 10:12, Tim Rentsch wrote:
    bart <bc@freeuk.com> writes:


    It looks like your hash function was tuned for this testing
    setup. With different choices for testing it does much
    worse.

    It wasn't tuned at all; it was a copy of what I've long used within my lexers.

    It is designed to give a good spread when inputs are similar or change sequentially. (Well, not exactly designed; more trial and error.)

    The testing is done with malloc() blocks all of the same
    requested size, and that size is 1. Typical workloads
    are likely to be both larger and more variable.

    When adding a new entry finds a collision with an old
    entry, linear probing is used to find a free slot. It's
    well understood that linear probing suffers badly as the
    load factor goes up. Better to take a few high bits of
    the hash value -- as few as five or six is fine -- to
    have the reprobe sequences have different strides.

    Your hash function is expensive to compute, moreso even
    than the "FNV" function shown earlier.

    You mean Malcolm's version? That also uses a loop. Neither needs to
    process all 8 bytes of a 64-bit pointer; 6 will do, and probably just 4.
    And such a loop can be trivially unrolled.

    In a case like
    this one where the compares are cheap, it's better to
    have a dumb-but-fast hash function that might need a
    few more looks to find an open slot, because the cost
    of looking is so cheap compared to computing the hash
    function.

    My first thought in this thread was to just make use of the pointer
    value directly. My version was written only to compare against Malcolm's
    FNV code.

    If I try the other suggestions, RH's and KK's which simply shift the
    pointer value (similar to my idea), the results I got were interesting: sometimes there were zero clashes (almost as though bits 4 to 19 of the
    pointer were a perfect hash), but sometimes there were millions.

    But when I tried to mix up the input values, then all methods seemed to converge to similar results.

    In that case you might as well use the simplest and fastest method.

    However it really needs testing within the actual application.

  • From Tim Rentsch@21:1/5 to Bonita Montero on Sat May 25 10:40:22 2024
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 25.05.2024 at 11:12, Tim Rentsch wrote:

    Your hash function is expensive to compute, moreso even
    than the "FNV" function shown earlier. In a case like
    this one where the compares are cheap, it's better to
    have a dumb-but-fast hash function that might need a
    few more looks to find an open slot, because the cost
    of looking is so cheap compared to computing the hash
    function.

    A (size_t)pointer * LARGE_PRIME can be sufficient,
    ignoring the overflow.

    Plenty fast but the output quality is poor. I tested
    this scheme against four other hash functions, and in
    four out of five workloads it was always dead last, by
    a noticeable margin.

  • From Tim Rentsch@21:1/5 to bart on Sat May 25 11:12:43 2024
    bart <bc@freeuk.com> writes:

    On 25/05/2024 10:12, Tim Rentsch wrote:

    bart <bc@freeuk.com> writes:
    [...]
    It looks like your hash function was tuned for this testing
    setup. With different choices for testing it does much
    worse.

    It wasn't tuned at all; it was a copy of what I've long used
    within my lexers.

    It is designed to give a good spread when inputs are similar or
    change sequentially. (Well, not exactly designed; more trial
    and error.)

    The trial and error process you mention had the effect of tuning
    your function for the testing setup used, whether you meant it to
    or not.


    The testing is done with malloc() blocks all of the same
    requested size, and that size is 1. Typical workloads
    are likely to be both larger and more variable.

    When adding a new entry finds a collision with an old
    entry, linear probing is used to find a free slot. It's
    well understood that linear probing suffers badly as the
    load factor goes up. Better to take a few high bits of
    the hash value -- as few as five or six is fine -- to
    have the reprobe sequences have different strides.

    Your hash function is expensive to compute, moreso even
    than the "FNV" function shown earlier.

    You mean Malcolm's version? That also uses a loop. Neither
    needs to process all 8 bytes of a 64-bit pointer; 6 will do, and
    probably just 4.
    And such a loop can be trivially unrolled.

    Both your function and the function posted by Malcolm are slow.
    It's just that yours is slower. It isn't hard to produce a
    hash function of equal quality that runs more than twice as
    fast.

    In a case like
    this one where the compares are cheap, it's better to
    have a dumb-but-fast hash function that might need a
    few more looks to find an open slot, because the cost
    of looking is so cheap compared to computing the hash
    function.

    My first thought in this thread was to just make use of the
    pointer value directly. My version was written only to compare
    against Malcolm's FNV code.

    If I try the other suggestions, RH's and KK's which simply
    shift the pointer value (similar to my idea), the results I got
    were interesting: sometimes there were zero clashes (almost as
    though bits 4 to 19 of the pointer were a perfect hash), but
    sometimes there were millions.

    But when I tried to mix up the input values, then all methods
    seemed to converge to similar results.

    In that case you might as well use the simplest and fastest
    method.

    However it really needs testing within the actual application.

    Let me tell you a little secret. Hash functions are like random
    number generators: good ones never do exceptionally well or
    exceptionally poorly. The reason for this is simple - whenever
    there are scenarios where one does well then inevitably there are
    other scenarios where it does poorly. A really high quality hash
    function will never do badly, no matter what mix of inputs is
    given. The key takeaway here is to try as many different input
    sets as you can, but don't look for candidates that do well;
    rather, throw away any that do poorly on any input mix. The ones
    that are left are all, with high probability, just fine no matter
    what application they are used in. That is not to say there is
    no need to test within the intended application, it's always good
    to do a sanity check, but the point of the sanity check is just
    to make sure the earlier process didn't miss something; it isn't
    the right way to compare alternatives.

  • From Tim Rentsch@21:1/5 to Malcolm McLean on Sat May 25 11:23:55 2024
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 25/05/2024 18:40, Tim Rentsch wrote:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 25.05.2024 at 11:12, Tim Rentsch wrote:

    Your hash function is expensive to compute, moreso even
    than the "FNV" function shown earlier. In a case like
    this one where the compares are cheap, it's better to
    have a dumb-but-fast hash function that might need a
    few more looks to find an open slot, because the cost
    of looking is so cheap compared to computing the hash
    function.

    A (size_t)pointer * LARGE_PRIME can be sufficient,
    ignoring the overflow.

    Plenty fast but the output quality is poor. I tested
    this scheme against four other hash functions, and in
    four out of five workloads it was always dead last, by
    a noticeable margin.

    The lower bits of a pointer are often all zeroes. And multiplying
    by any integer will not set them. And that is catastrophic for a
    hash.

    It can be but it doesn't have to be. It depends on how the hash
    value is used to determine a place or places to look in the
    table.

  • From bart@21:1/5 to Tim Rentsch on Sat May 25 20:31:22 2024
    On 25/05/2024 19:12, Tim Rentsch wrote:
    bart <bc@freeuk.com> writes:

    On 25/05/2024 10:12, Tim Rentsch wrote:

    bart <bc@freeuk.com> writes:
    [...]
    It looks like your hash function was tuned for this testing
    setup. With different choices for testing it does much
    worse.

    It wasn't tuned at all; it was a copy of what I've long used
    within my lexers.

    It is designed to give a good spread when inputs are similar or
    change sequentially. (Well, not exactly designed; more trial
    and error.)

    The trial and error process you mention had the effect of tuning
    your function for the testing setup used, whether you meant it to
    or not.


    The testing is done with malloc() blocks all of the same
    requested size, and that size is 1. Typical workloads
    are likely to be both larger and more variable.

    When adding a new entry finds a collision with an old
    entry, linear probing is used to find a free slot. It's
    well understood that linear probing suffers badly as the
    load factor goes up. Better to take a few high bits of
    the hash value -- as few as five or six is fine -- to
    have the reprobe sequences have different strides.

    Your hash function is expensive to compute, moreso even
    than the "FNV" function shown earlier.

    You mean Malcolm's version? That also uses a loop. Neither
    needs to process all 8 bytes of a 64-bit pointer; 6 will do, and
    probably just 4.
    And such a loop can be trivially unrolled.

    Both your function and the function posted by Malcolm are slow.
    It's just that yours is slower.

    With unoptimised code, mine is up to 50% faster (suggesting an
    inherently faster method).

    Optimised via gcc -O3, it's 17% slower.

    If I duplicate the tests in my language, which has a meagre optimiser,
    then mine is 3 times the speed (I haven't looked at why; something about
    the MM code that keeps locals out of registers I guess):

                  Unopt    gcc-O3    My lang opt (uses 64 bits)

    MM hash        2.7      0.34        1.9    seconds
    BC hash        1.7      0.4         0.6

    In any case there are any number of tweaks that can be made; no need to do
    all 64 bits for a start, and the loops can be manually unrolled.
    Suggesting it is slower is premature.

    It isn't hard to produce a
    hash function of equal quality that runs more than twice as
    fast.

    With or without the aid of an optimising compiler? You can't always tell
    with the latter whether it's you who's clever, or the compiler.

    Let me tell you a little secret. Hash functions are like random
    number generators: good ones never do exceptionally well or
    exceptionally poorly. The reason for this is simple - whenever
    there are scenarios where one does well then inevitably there are
    other scenarios where it does poorly. [...] The ones
    that are left are all, with high probability, just fine no matter
    what application they are used in. That is not to say there is
    no need to test within the intended application, it's always good
    to do a sanity check, but the point of the sanity check is just
    to make sure the earlier process didn't miss something; it isn't
    the right way to compare alternatives.

    So what's a good one for use within the identifier lookup of a
    compiler's lexer?

  • From Janis Papanagnou@21:1/5 to Malcolm McLean on Sat May 25 23:13:31 2024
    On 25.05.2024 19:56, Malcolm McLean wrote:
    On 25/05/2024 18:40, Tim Rentsch wrote:
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 25.05.2024 at 11:12, Tim Rentsch wrote:

    Your hash function is expensive to compute, moreso even
    than the "FNV" function shown earlier. In a case like
    this one where the compares are cheap, it's better to
    have a dumb-but-fast hash function that might need a
    few more looks to find an open slot, because the cost
    of looking is so cheap compared to computing the hash
    function.

    A (size_t)pointer * LARGE_PRIME can be sufficient,
    ignoring the overflow.

    Plenty fast but the output quality is poor. I tested
    this scheme against four other hash functions, and in
    four out of five workloads it was always dead last, by
    a noticeable margin.


    The lower bits of a pointer are often all zeroes. And multiplying by
    any integer will not set them. And that is catastrophic for a hash.

    I haven't read all the thread, but if I were going to tackle that
    I'd first try to map the pointer onto a fitting integral type (if
    that's possible; sizeof()), or first "fold" (XOR) the upper/lower
    parts of the pointer (so that zero-sequences, if they appear, also
    won't affect the outcome too badly as a side effect) to reduce the
    size of the data, and finally use any established hash algorithm
    on the (folded/reduced) integral type.
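    A sketch of that folding step (the names are illustrative, and it
    assumes an unsigned 64-bit type is available): the upper and lower
    halves of the pointer value are XORed together before the result is
    fed to whatever established hash the table then uses.

    #include <stddef.h>

    /* Fold a pointer value down to 32 bits by XORing its halves, so
       runs of zero bits get mixed with the more variable bits. */
    static unsigned long fold_ptr(const void *p)
    {
        unsigned long long v = (unsigned long long)(size_t)p;

        return (unsigned long)((v ^ (v >> 32)) & 0xFFFFFFFFul);
    }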

    Shall the created hash codes also be portable across platforms?

    Janis

  • From bart@21:1/5 to Malcolm McLean on Sat May 25 23:42:13 2024
    On 25/05/2024 23:07, Malcolm McLean wrote:
    On 25/05/2024 22:13, Janis Papanagnou wrote:

    I haven't read all the thread, but if I'd be going to tackle that
    I'd first try to map the pointer onto a fitting integral type (if
    that's possible; sizeof()), or first "fold" (XOR) the upper/lower
    parts of the pointer (so that zero-sequences, if they appear, also
    won't affect the outcome too badly as a side effect) to reduce the
    size of the data, and finally use any established hash algorithm
    on the (folded/reduced) integral type.

    Shall the created hash codes also be portable across platforms?

    Yes, the Baby X resource compiler should be portable to any hosted
    platform with a reasonably up-to-date C compiler.

    But we also need portability in another sense. The hash table is hashing
    nodes of an XML document model created by the functions in the file
    xmlparser2.c. These make repeated calls to malloc() for small
    allocations of nodes. But I have no control over either pointer
    representation or the platform's implementation of malloc(). So the
    pointers could be at seemingly random locations in memory, or they could
    be evenly spaced at increments of 64 bytes, and thus differing in only a
    few bits.

    OK, parsing is a task I'm familiar with. Parsing a normal language into
    a syntax tree doesn't sound too different from parsing XML into a tree
    of nodes.

    Such a task can be done at up to several million lines per second for
    source code. Are you getting throughput considerably below that?

    It's not clear to me what the hashing is for on the nodes. What exactly
    is the input being looked up, and what is being compared? Because
    something sounds off: you don't normally look up a pointer to try and
    find it in a table; you already have the pointer pointing at the data!

    Then if someone wants to help out with this project, an
    obvious thing to do would be to speed up the xml parser by replacing the calls to malloc() with a custom allocator.

    If that's a bottleneck (you need to establish that; perhaps use a double
    call to malloc, and see if it slows down significantly), then the custom allocator is simple: just allocate from a pool. It's simpler still if
    you don't need to free that memory before the program ends
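    A minimal sketch of such a pool allocator, assuming the nodes are only
    ever released all at once when the program finishes (the names are
    illustrative, not taken from xmlparser2.c):

    #include <stdlib.h>

    #define POOL_CHUNK (64 * 1024)

    /* Bump allocator: carve small node allocations out of big chunks
       obtained from malloc.  Nothing is freed individually; leftover
       space in an old chunk is simply abandoned. */
    static char *pool_ptr;
    static size_t pool_left;

    static void *pool_alloc(size_t size)
    {
        void *p;

        size = (size + 15) & ~(size_t)15;    /* keep 16-byte alignment */
        if (size > pool_left) {
            size_t chunk = size > POOL_CHUNK ? size : POOL_CHUNK;
            pool_ptr = malloc(chunk);
            if (!pool_ptr)
                return NULL;
            pool_left = chunk;
        }
        p = pool_ptr;
        pool_ptr += size;
        pool_left -= size;
        return p;
    }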

  • From Tim Rentsch@21:1/5 to bart on Sat May 25 22:54:59 2024
    bart <bc@freeuk.com> writes:

    On 25/05/2024 19:12, Tim Rentsch wrote:

    bart <bc@freeuk.com> writes:

    On 25/05/2024 10:12, Tim Rentsch wrote:

    bart <bc@freeuk.com> writes:
    [...]
    It looks like your hash function was tuned for this testing
    setup. With different choices for testing it does much
    worse.

    It wasn't tuned at all; it was a copy of what I've long used
    within my lexers.

    It is designed to give a good spread when inputs are similar or
    change sequentially. (Well, not exactly designed; more trial
    and error.)

    The trial and error process you mention had the effect of tuning
    your function for the testing setup used, whether you meant it to
    or not.


    The testing is done with malloc() blocks all of the same
    requested size, and that size is 1. Typical workloads
    are likely to be both larger and more variable.

    When adding a new entry finds a collision with an old
    entry, linear probing is used to find a free slot. It's
    well understood that linear probing suffers badly as the
    load factor goes up. Better to take a few high bits of
    the hash value -- as few as five or six is fine -- to
    have the reprobe sequences have different strides.

    Your hash function is expensive to compute, moreso even
    than the "FNV" function shown earlier.

    You mean Malcolm's version? That also uses a loop. Neither
    needs to process all 8 bytes of a 64-bit pointer; 6 will do, and
    probably just 4.
    And such a loop can be trivially unrolled.

    Both your function and the function posted by Malcolm are slow.
    It's just that yours is slower.

    With unoptimised code, mine is up to 50% faster (suggesting an
    inherently faster method).

    Optimised via gcc -O3, it's 17% slower.

    On my test platform the Malcolm code runs between dead even up
    to about 44% faster, depending on optimization level, except
    for -O0 where it runs about 12% slower. Note that these results
    represent only one measurement of each case, run on only one
    target environment.

    If I duplicate the tests in my language, which has a meagre optimiser,
    then mine is 3 times the speed (I haven't looked at why; something
    about the MM code that keeps locals out of registers I guess):

                  Unopt    gcc-O3    My lang opt (uses 64 bits)

    MM hash        2.7      0.34        1.9    seconds
    BC hash        1.7      0.4         0.6

    There are anyway any number of tweaks that can be made; no need to do
    all 64 bits for a start, and the loops can be manually
    unrolled. Suggesting it is slower is premature.

    I am simply reporting results of empirical observations made on
    my test system. Obviously different environments might produce
    different results.

    It isn't hard to produce a
    hash function of equal quality that runs more than twice as
    fast.

    With or without the aid of an optimising compiler? You can't always
    tell with the latter whether it's you who's clever, or the compiler.

    As long as both functions are run at the same level of optimization
    I don't think it matters much. For the sake of concreteness we can
    stipulate code quality comparable to gcc -O1 if that helps you.

    Let me tell you a little secret. Hash functions are like random
    number generators: good ones never do exceptionally well or
    exceptionally poorly. The reason for this is simple - whenever
    there are scenarios where one does well then inevitably there are
    other scenarios where it does poorly. [...] The ones
    that are left are all, with high probability, just fine no matter
    what application they are used in. That is not to say there is
    no need to test within the intended application, it's always good
    to do a sanity check, but the point of the sanity check is just
    to make sure the earlier process didn't miss something; it isn't
    the right way to compare alternatives.

    So what's a good one for use within the identifier lookup of a
    compiler's lexer?

    That depends a lot on what sort of lookup is done and on the
    degree to which the hash computation is integrated into the
    lexing code. How many entries will the hash table have?
    That also makes a difference. If the hash table is never
    more than 50% full then we can expect most lookups will
    succeed on the first try and almost all will succeed in
    no more than two tries.
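    A sketch of the kind of arrangement being described, assuming open
    addressing with the table kept at no more than half full (all names
    here are illustrative):

    #include <string.h>

    #define IDENT_SLOTS 4096u                /* power of two */

    struct ident {
        char *name;
        /* per-identifier data would go here */
    };

    static struct ident *ident_table[IDENT_SLOTS];

    /* Look up (without inserting) an identifier whose hash is h.  With
       the table at most half full, most lookups hit on the first or
       second slot probed. */
    static struct ident *lookup_ident(const char *name, unsigned int h)
    {
        unsigned int j = h & (IDENT_SLOTS - 1);

        while (ident_table[j]) {
            if (strcmp(ident_table[j]->name, name) == 0)
                return ident_table[j];
            j = (j + 1) & (IDENT_SLOTS - 1);
        }
        return NULL;                         /* not present */
    }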

  • From Tim Rentsch@21:1/5 to Bonita Montero on Sun May 26 09:24:58 2024
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 25.05.2024 at 19:40, Tim Rentsch wrote:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 25.05.2024 at 11:12, Tim Rentsch wrote:

    Your hash function is expensive to compute, moreso even
    than the "FNV" function shown earlier. In a case like
    this one where the compares are cheap, it's better to
    have a dumb-but-fast hash function that might need a
    few more looks to find an open slot, because the cost
    of looking is so cheap compared to computing the hash
    function.

    A (size_t)pointer * LARGE_PRIME can be sufficient,
    ignoring the overflow.

    Plenty fast but the output quality is poor. ...

    If the prime is large enough there's no regular distribution.

    Your conjecture is contradicted by empirical facts.

  • From Tim Rentsch@21:1/5 to Bonita Montero on Sun May 26 10:20:55 2024
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 26.05.2024 at 18:24, Tim Rentsch wrote:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 25.05.2024 at 19:40, Tim Rentsch wrote:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 25.05.2024 at 11:12, Tim Rentsch wrote:

    Your hash function is expensive to compute, moreso even
    than the "FNV" function shown earlier. In a case like
    this one where the compares are cheap, it's better to
    have a dumb-but-fast hash function that might need a
    few more looks to find an open slot, because the cost
    of looking is so cheap compared to computing the hash
    function.

    A (size_t)pointer * LARGE_PRIME can be sufficient,
    ignoring the overflow.

    Plenty fast but the output quality is poor. ...

    If the prime is large enough there's no regular distribution.

    Your conjecture is contradicted by empirical facts.

    There are no empirical facts for that since two times the
    taken prime is beyond the 64 bit address space.

    You don't listen very well do you?

    I say the output quality is poor because I have run tests that
    show the poor output quality. I've done that with a prime of my
    own choosing and also with 18446744073709551557, the value you
    suggested. In both cases the test runs show results that are
    clearly worse than all the other hash functions tested, including
    bart's and malcolm's. Furthermore the results for your suggested
    calculation are worse across the board, on every variety of
    dynamic workload in my test suite.

    Your proposed hash function is too weak to be taken as a serious
    candidate.

  • From Richard Harnden@21:1/5 to bart on Sun May 26 19:58:31 2024
    On 25/05/2024 23:42, bart wrote:

    It's not clear to me what the hashing is for on the nodes. What exactly
    is the input being looked up, and what is being compared? Because
    something sounds off: you don't normally look up a pointer to try and
    find it in a table; you already have the pointer pointing at the data!

    This.

    Are you really hashing the pointer?

  • From Kaz Kylheku@21:1/5 to bart on Sun May 26 22:42:08 2024
    On 2024-05-25, bart <bc@freeuk.com> wrote:
    It's not clear to me what the hashing is for on the nodes. What exactly
    is the input being looked up, and what is being compared? Because
    something sounds off: you don't normally look up a pointer to try and
    find it in a table; you already have the pointer pointing at the data!

    This happens when you need to associate objects with external
    information which is not in those objects.

    One such situation is when you replace use of strings as hash keys with interned strings (atoms). Atoms are typically pointers and so hash that way.

    This is a smart thing to do because atoms are faster to work with.
    Comparing whether two values are the same atom is just a pointer
    comparison, and hashing the atom is just a pointer hash: much faster
    than a string hash, and with more easily assured good behavior.

    The interning mechanism maintains its own hash table which maps the
    strings to the atoms. By that mechanism, two identical strings are
    mapped to the same atom. That table is only used at the boundaries of
    the system when external data is read. The data contains identifiers
    which are interned, and then they are manipulated as atoms.

    Lisp languages have a symbol type, which is a kind of interned
    string. Symbols are regularly used as keys into dictionary structures
    like hash tables or assocation lists.

    Certain important properties are put into the symbol itself, so
    that no hashing is required to find them.

    In classical Lisps, the global value of a variable named by a symbol
    is stored in the symbol itself. The property list of a symbol is also
    stored in the symbol.
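    A sketch of the external-association idea from the start of this post,
    keyed directly on an object's address (an atom, or an XML node as in
    the rest of the thread); the names are illustrative, and the pointer
    hash h would come from one of the schemes discussed earlier:

    #define MAP_SLOTS 1024u                  /* power of two */

    /* Associate read/write data with read-only objects without touching
       the objects themselves.  Assumes the table never fills completely. */
    struct assoc {
        const void *key;                     /* the object's address */
        int value;                           /* the external data */
    };

    static struct assoc map[MAP_SLOTS];

    /* Returns the entry for key, or the empty slot where it belongs. */
    static struct assoc *map_find(const void *key, unsigned int h)
    {
        unsigned int j = h & (MAP_SLOTS - 1);

        while (map[j].key && map[j].key != key)
            j = (j + 1) & (MAP_SLOTS - 1);
        return &map[j];
    }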

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

  • From Ben Bacarisse@21:1/5 to Bonita Montero on Tue May 28 11:07:53 2024
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    This shows the perfect equal distribution if I take a 32 bit integer
    and multiply it by the largest 32 bit prime:

    There was no need to write this program. It just illustrates a specific example of a basic theorem where the multiplier and modulus are
    co-prime.

    #include <iostream>
    #include <vector>
    #include <random>

    using namespace std;

    int main()
    {
        vector<uint8_t> buckets( 1ull << 32 );
        for( size_t b = buckets.size(); b; )
            if( !++buckets[(uint32_t)--b * 0xFFFFFFFBu % buckets.size()] )
                return EXIT_FAILURE;
        vector<size_t> loads;
        for( uint8_t ld : buckets )
        {
            if( loads.size() <= ld )
                loads.resize( ld + 1 );
            ++loads[ld];
        }
        for( size_t l = 0; size_t ld : loads )
            cout << l++ << ": " << 100.0 * (ptrdiff_t)ld /
                    (ptrdiff_t)buckets.size() << "%" << endl;
    }


    Result:

    0: 0%
    1: 100%


    --
    Ben.

  • From Tim Rentsch@21:1/5 to Bonita Montero on Thu May 30 19:26:55 2024
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 28.05.2024 at 12:07, Ben Bacarisse wrote:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    This shows the perfect equal distribution if I take a 32 bit integer
    and multiply it by the largest 32 bit prime:

    There was no need to write this program. It just illustrates a specific
    example of a basic theorem where the multiplier and modulus are
    co-prime.

    For Tim it was necessary.

    No it was not. I already understood this before my posting
    of Sunday, May 26.

  • From Tim Rentsch@21:1/5 to Bonita Montero on Thu May 30 19:27:48 2024
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 26.05.2024 at 19:20, Tim Rentsch wrote:

    I say the output quality is poor because I have run tests that
    show the poor output quality.

    If you choose a prime whose double is beyond 64 bit there's an
    equal distribution among the 64 bit modulos.

    I've done that with a prime of my own choosing and also with
    18446744073709551557, the value you suggested.

    Show me your code.

    Oh get real. It's not my job to make up for your
    laziness.

  • From Michael S@21:1/5 to Tim Rentsch on Sun Jun 2 10:45:06 2024
    On Thu, 30 May 2024 19:27:48 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 26.05.2024 at 19:20, Tim Rentsch wrote:

    I say the output quality is poor because I have run tests that
    show the poor output quality.

    If you choose a prime whose double is beyond 64 bit there's an
    equal distribution among the 64 bit modulos.

    I've done that with a prime of my own choosing and also with
    18446744073709551557, the value you suggested.

    Show me your code.

    Oh get real. It's not my job to make up for your
    laziness.

    So, what were your conclusions?
    Ignoring the speed of computation, would something like a cryptographic
    hash scaled to bucket size be the best hash for this type of application?
    Or is some sort of less thorough grinding of the bits better?

  • From Tim Rentsch@21:1/5 to Michael S on Sun Jun 2 16:02:05 2024
    Michael S <already5chosen@yahoo.com> writes:

    On Thu, 30 May 2024 19:27:48 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 26.05.2024 at 19:20, Tim Rentsch wrote:

    I say the output quality is poor because I have run tests that
    show the poor output quality.

    If you choose a prime whose double is beyond 64 bit there's an
    equal distribution among the 64 bit modulos.

    I've done that with a prime of my own choosing and also with
    18446744073709551557, the value you suggested.

    Show me your code.

    Oh get real. It's not my job to make up for your
    laziness.

    So, what were your conclusions?
    Ignoring the speed of computation, would something like a
    cryptographic hash scaled to bucket size be the best hash for
    this type of application? Or is some sort of less thorough
    grinding of the bits better?

    Factors that matter:

        will hashing be done using rehashing or aggregates (lists
        or trees, etc) grouped under the first probe location?
        (or both?)

        can keys be compared for ordering or just for equality?

        how expensive is it to test for key equality?
            (a) when the keys are equal
            (b) when the keys are not equal

        32-bit hash values suffice for (a) re-hashing schemes with
        tables up to perhaps 2**30 entries, or (b) schemes that
        use direct hashing and bucket aggregation with tables up
        to 2**32 slots. Anything bigger should use 64-bit hash
        values.

        hash functions should always produce a full-size hash value
        (either 32 bits or 64 bits). Fitting the hash value to
        the size of the table is the responsibility of the hash
        table management code, not the hash function.

    The goal:
    An ideal hash function has the property that changing any
    single bit of the input key has a 50% chance of flipping
    any output bit, with no correlation between output bit
    m and output bit n (m != n) for changing a given input
    bit, and no correlation between changes to output bit n
    for changing input bit i and changing input bit j (i != j)
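
    For what it's worth, here is a bare-bones sketch -- not code posted
    in this thread, with hash_uut() standing for whatever candidate is
    being examined -- of measuring the per-bit half of that property:
    flip each input bit of many keys and count how often each output bit
    flips; an ideal hash shows every count near 50% of the trials.

    #include <stdio.h>
    #include <stdint.h>

    extern uint64_t hash_uut( uint64_t key );   /* candidate under test */

    int main( void ){
        enum { TRIALS = 100000 };
        static unsigned long flips[64][64];     /* [input bit][output bit] */
        for( int t = 0; t < TRIALS; t++ ){
            /* arbitrary 64-bit LCG just to spread the test keys around */
            uint64_t key = 6364136223846793005ull * (uint64_t)(t + 1)
                         + 1442695040888963407ull;
            uint64_t h0 = hash_uut( key );
            for( int i = 0; i < 64; i++ ){
                uint64_t d = h0 ^ hash_uut( key ^ ((uint64_t)1 << i) );
                for( int j = 0; j < 64; j++ )
                    flips[i][j] += (d >> j) & 1;
            }
        }
        for( int i = 0; i < 64; i++ )
            for( int j = 0; j < 64; j++ )
                printf( "in %2d out %2d : %.3f\n",
                        i, j, (double)flips[i][j] / TRIALS );
        return 0;
    }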

    Evaluating hash functions:
    One way to evaluate a hash function is to measure changes
    in output bits as a function of changing various input
    bits, either exhaustively or statistically, looking for an
    ideal distribution as described above. Usually doing this
    isn't very helpful, because most hash functions don't come
    close to the ideal, and it's hard to interpret results that
    are not close to the ideal, even though those hashes may be
    perfectly fine in practice

    So basically we are left with comparing a candidate hash
    function, performance-wise, to other hash functions that
    are going to have good performance results. Three obvious
    choices for those: a slow but high quality hash such as a
    cryptographic hash; using a good random number generator
    to produce keys where the hash value is equal to the key
    value; comparing against theoretical ideals. Looking at
    each in turn:

    High-quality-but-slow-hash: it isn't hard to make a hash
    function that approximates an ideal hash, as long as we
    are willing to have it run like a turtle. Or simply use
    one of the many known cryptographic hashes and fold the
    output down to a reasonably sized hash value (most of
    these cryptographic hash functions produce large hash
    values, 128 bits and up, which isn't a great fit for
    ordinary hash table use). In my recent testing I simply
    rolled my own high quality hash, which ran about 10 times
    slower than the slowest of the other hashes I looked at.
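
    For example -- an illustration only, not the hash I actually used,
    and digest16() below is a stand-in name rather than a real API --
    given any 128-bit digest routine the fold can be as simple as xoring
    the two 64-bit halves:

    #include <stdint.h>
    #include <string.h>

    void digest16( const void *key, size_t len, unsigned char out[16] );

    uint64_t hash_from_digest( const void *key, size_t len ){
        unsigned char d[16];
        uint64_t lo, hi;
        digest16( key, len, d );            /* e.g. an MD5-like digest */
        memcpy( &lo, d, 8 );
        memcpy( &hi, d + 8, 8 );
        return lo ^ hi;                     /* fold 128 bits down to 64 */
    }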

    Using a good RNG to produce keys the same as the hash
    value: I expect this method to be self-explanatory.

    Comparing against theoretical ideals: it isn't hard to
    do a calculation that computes a statistical average for
    the two kinds of hashing methods. More specifically,

    (a) for re-hashing, compute the average number of
    probes taken for a hash table with k out of n
    slots already filled, and use that to compute
    the average number of probes needed to fill the
    hash table up to, say, 85%

    (b) for direct hashing, compute the average number
    of slots occupied after inserting some number
    of values representing a fixed fraction of the
    table size. For example, if we have a table
    with 100,000 slots, if we insert 100,000 values
    then on the average about 63% of the slots
    should be occupied (very close to 1 - 1/e).
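
    A small sketch -- not code posted in this thread, just the standard
    closed forms for uniform hashing evaluated numerically -- of both
    comparison values described above:

    #include <stdio.h>
    #include <math.h>

    int main( void ){
        double alpha = 0.85;
        /* (a) average probes per insertion while filling to load factor
           alpha is about (1/alpha) * ln(1/(1-alpha)) for uniform hashing */
        double probes = (1.0 / alpha) * log( 1.0 / (1.0 - alpha) );
        /* (b) after n random insertions into n slots, a slot stays empty
           with probability about 1/e, so about 1 - 1/e are occupied */
        double occupied = 1.0 - exp( -1.0 );
        printf( "probes per insertion, filling to 85%%: %.2f\n", probes );
        printf( "fraction occupied, n inserts into n slots: %.3f\n", occupied );
        return 0;                           /* prints about 2.23 and 0.632 */
    }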

    Use one or more of the above comparison values (they
    should all be very close) to compare against the
    performance of the candidate hash function for each
    of the possible workloads. My workload set included:

    (a) results of malloc() with fixed sized blocks,
    with a size going from 1 to some modest value
    (50 or 100)

    (b) results of malloc() with randomly chosen sizes
    based on some distribution

    (c) sequential addresses in an array, varying the
    element size between 1 and some modest value
    (maybe 50)

    (d) results of malloc() with a deterministic set
    of sizes. I tried varying ones of these, such
    as size(i) == i/17 + 1 (filling a table of
    65536 entries varying percentages full).

    What we would like to see:
    for every workload tested, results within 10 or 20%
    of the comparison metric used

    What we don't want to see (this is important):
    any result far away from the comparison metric on any
    workload. Importantly this includes results that are
    significantly _better_ than the comparison metric.
    The reason for this is that doing a lot better on one
    test inevitably means doing worse on a different test.
    The same is true of random number generators: the
    best ones look "averagely random" all the time, never
    either "sub random" or "super random".

    Of course we always want high quality, but the standards
    are higher for (a) direct hashing, and/or (b) cases where
    comparing keys is expensive. Note that if comparing keys
    is expensive we might choose to store the full hash value
    with each table entry, to make "trivial reject" on key
    comparison be faster.

    Probably much of the above was obvious. I make no apologies for
    that. Also it may seem disjointed or unorganized. If so then
    sorry about that chief, it's a big topic and there's a lot of
    ground to cover, and I tried to hit the most important
    highlights.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Mon Jun 3 10:50:05 2024
    On Sun, 02 Jun 2024 16:02:05 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Michael S <already5chosen@yahoo.com> writes:

    On Thu, 30 May 2024 19:27:48 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    Am 26.05.2024 um 19:20 schrieb Tim Rentsch:

    I say the output quality is poor because I have run tests that
    show the poor output quality.

    If you chose a prime whose double is beyond 64 bit there's an
    equal distribution among the 64 bit modulos.

    I've done that with a prime of my own choosing and also with
    18446744073709551557, the value you suggested.

    Show me your code.

    Oh get real. It's not my job to make up for your
    laziness.

    So, what were your conclusions?
    Ignoring the speed of computation, would something like
    cryptographic hash scaled to bucket size be a best hash for
    this type of application? Or some sort of less thorough
    grinding of the bits is better?

    Factors that matter:

    <snip>


    Probably much of the above was obvious. I make no apologies for
    that. Also it may seem disjointed or unorganized. If so then
    sorry about that chief, it's a big topic and there's a lot of
    ground to cover, and I tried to hit the most important
    highlights.

    It sounds like you *postulate* that crypto is an ideal.
    I am less interested in axioms and more interested in your experimental findings.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Bonita Montero on Mon Jun 3 17:46:04 2024
    On Mon, 3 Jun 2024 16:34:37 +0200
    Bonita Montero <Bonita.Montero@gmail.com> wrote:

    Am 02.06.2024 um 09:45 schrieb Michael S:

    So, what were your conclusions?
    Ignoring the speed of computation, would something like
    cryptographic hash scaled to bucket size be a best hash for this
    type of application? Or some sort of less thorough grinding of the
    bits is better?

    There's no need for a crypto-hash here.


    Do you think I don't know?
    Crypto hash is just an example of near-ideal pseudo-random uniformity.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From bart@21:1/5 to Bonita Montero on Mon Jun 3 17:24:36 2024
    On 03/06/2024 16:54, Bonita Montero wrote:
    Am 03.06.2024 um 16:46 schrieb Michael S:
    On Mon, 3 Jun 2024 16:34:37 +0200
    Bonita Montero <Bonita.Montero@gmail.com> wrote:

    Am 02.06.2024 um 09:45 schrieb Michael S:

    So, what were your conclusions?
    Ignoring the speed of computation, would something like
    cryptographic hash scaled to bucket size be a best hash for this
    type of application? Or some sort of less thorough grinding of the
    bits is better?

    There's no need for a crypto-hash here.


    Do you think I don't know?
    Crypto hash is just an example of near-ideal pseudo-random uniformity.


    As I've shown, for pointers you get a perfect equal distribution by
    multiplying by an appropriate prime.


    A pointer with 8-byte or 16-byte alignment will have the bottom 3-4 bits
    zero.

    No matter what number you multiply them by, prime or otherwise, those
    3-4 bits will always be zero.

    If you mask the result to fit a table whose size is a power of two,
    then the resulting index can only ever refer to every 8th or every
    16th slot; there will be 8-16x as many clashes as there ought to be.

    So some extra work is needed to get around that, for example
    right-shifting before masking as some here have done, something you have
    never mentioned.
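
    A short illustration of the shift-before-mask fix described above --
    not bart's code, and the multiplier is just an arbitrary odd constant
    chosen for the example:

    #include <stdint.h>
    #include <stddef.h>

    size_t index_mask_only( void *p, size_t tab_sz )    /* tab_sz a power of two */
    {
        uint64_t h = (uint64_t)(uintptr_t)p * 0x9E3779B97F4A7C15ull;
        return (size_t)h & (tab_sz - 1);    /* low 4 bits still zero for 16-aligned p */
    }

    size_t index_shift_first( void *p, size_t tab_sz )
    {
        uint64_t h = (uint64_t)(uintptr_t)p * 0x9E3779B97F4A7C15ull;
        return (size_t)(h >> 4) & (tab_sz - 1);   /* discard the dead bits first */
    }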

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Bonita Montero on Mon Jun 3 19:50:22 2024
    On Mon, 3 Jun 2024 17:54:30 +0200
    Bonita Montero <Bonita.Montero@gmail.com> wrote:

    Am 03.06.2024 um 16:46 schrieb Michael S:
    On Mon, 3 Jun 2024 16:34:37 +0200
    Bonita Montero <Bonita.Montero@gmail.com> wrote:

    Am 02.06.2024 um 09:45 schrieb Michael S:

    So, what were your conclusions?
    Ignoring the speed of computation, would something like
    cryptographic hash scaled to bucket size be a best hash for this
    type of application? Or some sort of less thorough grinding of the
    bits is better?

    There's no need for a crypto-hash here.


    Do you think I don't know?
    Crypto hash is just an example of near-ideal pseudo-random
    uniformity.

    As I've shown, for pointers you get a perfect equal distribution by
    multiplying by an appropriate prime.


    What you have shown is, to borrow terminology from the characterization
    of analog-to-digital converters, integral uniformity for a ramp input.
    That is not sufficient. A good hash needs both integral and differential
    uniformity. According to Tim Rentsch, your method does not have the latter.

    What do you do exactly? Multiply by a big prime modulo 2**64?
    By intuition, without testing, I'd say that does not sound like enough.
    By intuition, without testing, doing it twice with two different primes,
    then adding (or xoring) the results together and doing the multiplication
    again with a third prime sounds like enough.
    By intuition, without testing, I'd rather use composite multiplication
    factors instead of primes, choosing each factor as a product of 3-4
    non-small primes, with all 9-12 primes involved being different.

    But there should be simpler procedures that are just as good.
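
    Spelled out, just to make the suggestion above concrete -- a sketch
    that is exactly as untested as the intuition behind it, with arbitrary
    odd 64-bit constants standing in for the primes:

    #include <stdint.h>

    uint64_t hash_two_then_one( const void *key ){
        uint64_t k = (uint64_t)(uintptr_t)key;
        uint64_t a = k * 0x9E3779B97F4A7C55ull;   /* first factor */
        uint64_t b = k * 0xC2B2AE3D27D4EB4Full;   /* second, different factor */
        return (a ^ b) * 0xD6E8FEB86659FD93ull;   /* third factor on the combined value */
    }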

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to bart on Mon Jun 3 20:16:46 2024
    On Mon, 3 Jun 2024 17:24:36 +0100
    bart <bc@freeuk.com> wrote:

    On 03/06/2024 16:54, Bonita Montero wrote:
    Am 03.06.2024 um 16:46 schrieb Michael S:
    On Mon, 3 Jun 2024 16:34:37 +0200
    Bonita Montero <Bonita.Montero@gmail.com> wrote:

    Am 02.06.2024 um 09:45 schrieb Michael S:

    So, what were your conclusions?
    Ignoring the speed of computation, would something like
    cryptographic hash scaled to bucket size be a best hash for this
    type of application? Or some sort of less thorough grinding of
    the bits is better?

    There's no need for a crypto-hash here.


    Do you think I don't know?
    Crypto hash is just an example of near-ideal pseudo-random
    uniformity.

    As I've shown, for pointers you get a perfect equal distribution by
    multiplying by an appropriate prime.


    A pointer with 8-byte or 16-byte alignment will have the bottom 3-4
    bits zero.

    No matter what number you multiply them by, prime or otherwise, those
    3-4 bits will always be zero.

    If you mask the result to fit a table whose size is a power of two,
    then the resulting index can only ever refer to every 8th or every
    16th slot; there will be 8-16x as many clashes as there ought to be.

    So some extra work is needed to get around that, for example
    right-shifting before masking as some here have done, something you
    have never mentioned.



    According to my understanding, Bonita and Tim are discussing a hash
    generator whose output is not used as is. They assume that the index
    of the slot will be calculated as (Hash(key)*bucket_size)/(Hash_max+1).
    For a big enough Hash_max (Bonita suggests 2**63-1), the poor quality
    of a few LS bits of Hash(key) does not matter.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From bart@21:1/5 to Michael S on Mon Jun 3 19:48:29 2024
    On 03/06/2024 18:16, Michael S wrote:
    On Mon, 3 Jun 2024 17:24:36 +0100
    bart <bc@freeuk.com> wrote:

    On 03/06/2024 16:54, Bonita Montero wrote:
    Am 03.06.2024 um 16:46 schrieb Michael S:
    On Mon, 3 Jun 2024 16:34:37 +0200
    Bonita Montero <Bonita.Montero@gmail.com> wrote:

    Am 02.06.2024 um 09:45 schrieb Michael S:

    So, what were your conclusions?
    Ignoring the speed of computation, would something like
    cryptographic hash scaled to bucket size be a best hash for this
    type of application? Or some sort of less thorough grinding of
    the bits is better?

    There's no need for a crypto-hash here.


    Do you think I don't know?
    Crypto hash is just an example of near-ideal pseudo-random
    uniformity.

    As I've shown, for pointers you get a perfect equal distribution by
    multiplying by an appropriate prime.


    A pointer with 8-byte or 16-byte alignment will have the bottom 3-4
    bits zero.

    No matter what number you multiply them by, prime or otherwise, those
    3-4 bits will always be zero.

    If you mask the result to fit a table whose size is a power of two,
    then the resulting index can only ever refer to every 8th or every
    16th slot; there will be 8-16x as many clashes as there ought to be.

    So some extra work is needed to get around that, for example
    right-shifting before masking as some here have done, something you
    have never mentioned.



    According to my understanding, Bonita and Tim are discussing a hash
    generator whose output is not used as is.

    For a big enough Hash_max (Bonita suggests 2**63-1), the poor quality
    of a few LS bits of Hash(key) does not matter.

    Tim might, but I'm not so sure about BM.

    This is an exchange between Malcolm and BM:

    MM:
    The lower bits of a pointer are often all zeroes. And multiplying by
    any integer will not set them. And that is catastrophic for a hash.


    BM:
    If you take a large integer they will be scrambled and if the
    prime is large enough there will be a good equal distribution.

    I've reiterated MM's point, which BM just ignored.

    They assume that index of
    the slot will be calculated as (Hash(key)*bucket_size)/(Hash_max+1).

    I don't remember seeing that anywhere. But, bucket_size is the size of
    the hash-table?

    And Hash_max+1 is likely to be a power of two?

    I prefer my hash values to be evaluated independently of table size.
    They usually don't have a meaningful maximum value, other than the 64
    bits of their type, and I'd rather they didn't have sequences of zero
    bits especially at the bottom end.

    If hash_max was 2**63-1, say, then dividing by hash_max+1 would probably
    give you zero anyway, especially if you first multiplied by a bucket
    size that was a power of two, as all the interesting bits would
    disappear past the top bit!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to bart on Mon Jun 3 22:51:27 2024
    On Mon, 3 Jun 2024 19:48:29 +0100
    bart <bc@freeuk.com> wrote:


    Tim might, but I'm not so sure about BM.


    You are probably right about Bonita. His/her last post(s) make no sense.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to bart on Mon Jun 3 22:41:00 2024
    On Mon, 3 Jun 2024 19:48:29 +0100
    bart <bc@freeuk.com> wrote:


    I don't remember seeing that anywhere. But, bucket_size is the size
    of the hash-table?


    Yes

    And Hash_max+1 is likely to be a power of two?


    Yes. At the very least 2**32.

    I prefer my hash values to be evaluated independently of table size.
    They usually don't have a meaningful maximum value, other than the 64
    bits of their type, and I'd rather they didn't have sequences of zero
    bits especially at the bottom end.


    Bottom is in the eye of beholder :-)

    If hash_max was 2**63-1,

    63 was my mistake. He suggests 2**64-1. It seems.

    say, then dividing by hash_max+1 would
    probably give you zero anyway, especially if you first multiplied by
    a bucket size that was a power of two, as all the interesting bits
    would disappear past the top bit!

    Huh?
    If bucket size is 2**n then
    index = Hash(key)*2**n/2**64 == Hash(key) >> (64-n);

    Maybe you are thinking about C language arithmetic? That's not what I
    had in mind. In the post above I used the equations in their integer
    math meaning rather than their C integer arithmetic meaning.
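
    In C that integer-math scaling can be written with a double-width
    product so that no bits are lost -- a sketch assuming gcc/clang's
    unsigned __int128 extension (the same one used in the test bench
    posted later in this thread):

    #include <stdint.h>
    #include <stddef.h>

    size_t scale_to_table( uint64_t hash, size_t tab_sz ){
        return (size_t)( ((unsigned __int128)hash * tab_sz) >> 64 );
    }
    /* for tab_sz == 2**n this is exactly hash >> (64 - n) */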

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to bart on Mon Jun 3 16:51:44 2024
    bart <bc@freeuk.com> writes:

    On 03/06/2024 18:16, Michael S wrote:
    [...]
    For a big enough Hash_max (Bonita suggests 2**63-1), the poor quality
    of a few LS bits of Hash(key) does not matter.

    Tim might, [...]

    I really don't know where you could have gotten that idea.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Mon Jun 3 17:01:26 2024
    Michael S <already5chosen@yahoo.com> writes:

    On Mon, 3 Jun 2024 17:24:36 +0100
    bart <bc@freeuk.com> wrote:

    On 03/06/2024 16:54, Bonita Montero wrote:

    Am 03.06.2024 um 16:46 schrieb Michael S:

    On Mon, 3 Jun 2024 16:34:37 +0200
    Bonita Montero <Bonita.Montero@gmail.com> wrote:

    Am 02.06.2024 um 09:45 schrieb Michael S:

    So, what were your conclusions?
    Ignoring the speed of computation, would something like
    cryptographic hash scaled to bucket size be a best hash for this
    type of application? Or some sort of less thorough grinding of
    the bits is better?

    There's no need for a crypto-hash here.

    Do you think I don't know?
    Crypto hash is just an example of near-ideal pseudo-random
    uniformity.

    As I've shown, for pointers you get a perfect equal distribution by
    multiplying by an appropriate prime.

    A pointer with 8-byte or 16-byte alignment will have the bottom 3-4
    bits zero.

    No matter what number you multiply them by, prime or otherwise, those
    3-4 bits will always be zero.

    If you mask the result to fit a table whose size is a power of two,
    then the resulting index can only ever refer to every 8th or every
    16th slot; there will be 8-16x as many clashes as there ought to be.

    So some extra work is needed to get around that, for example
    right-shifting before masking as some here have done, something you
    have never mentioned.

    According to my understanding, Bonita and Tim are discussing a hash
    generator whose output is not used as is. They assume that the index
    of the slot will be calculated as (Hash(key)*bucket_size)/(Hash_max+1).

    To clarify, I have been talking about hash functions that deliver
    a "full sized" hash value, either 32 or 64 bits, without regard
    to how hash table management code might make use of that value.
    I think that's sort of what you said, but it seemed reasonable
    to offer a clarification.

    As a point of information, I've decided to adopt a policy of
    not reading any further postings from Bonita Montero, who
    seems to have nothing useful or interesting to say.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Mon Jun 3 18:02:21 2024
    Michael S <already5chosen@yahoo.com> writes:

    On Sun, 02 Jun 2024 16:02:05 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Michael S <already5chosen@yahoo.com> writes:

    On Thu, 30 May 2024 19:27:48 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    [... concerning quality of hash functions ...]

    So, what were your conclusions?
    Ignoring the speed of computation, would something like
    cryptographic hash scaled to bucket size be a best hash for
    this type of application? Or some sort of less thorough
    grinding of the bits is better?

    Factors that matter:

    <snip>

    Probably much of the above was obvious. I make no apologies for
    that. Also it may seem disjointed or unorganized. If so then
    sorry about that chief, it's a big topic and there's a lot of
    ground to cover, and I tried to hit the most important
    highlights.

    It sounds like you *postulate* that crypto is an ideal.

    I think it's more accurate to say that I _expect_ an established
    crypto-quality hash function such as MD5 to be good approximation
    to an ideal hash function. I expect that because (a) the design
    criteria for such hash functions resembles or includes the design
    criteria for a near-ideal hash function, (b) the people who have
    done these are smart people with (usually) a fair amount of
    experience doing such things, and (c) the established ones have
    gone through a lot of use and testing by the user community, who
    likely would have found any serious weaknesses.

    I am less interested in axioms and more interested in your experimental
    findings.

    I'm not sure what you're looking for here. The testing I did for
    the various recently posted hash functions was simply running a
    few ad hoc test scenarios and comparing the results against each
    other and also against the results of hash function of my own
    devising likely to produce high quality results. One piece of
    empirical evidence: in all of the tests I ran, that hash function
    never gave results that varied by more than 10% or so, no matter
    what test was done. By contrast, most or all of the other hash
    functions sometimes gave results that varied by a factor of 4 or
    more, depending on what scenario was being tested. Let me say
    again that the test scenarios were ad hoc, and I was not being
    systematic but just eyeballing the numbers.

    I have a fair amount of experience (at least hundreds of hours)
    experimenting with various random number generators, including
    several of my own devising. The experience includes systematic
    testing using extensive test suites (done by people who really
    know what they are doing, which in this arena does not include
    myself). Random number generators and hash functions have a lot
    of principles in common. Suppose we have a hash function that
    takes a 64-bit key and produces a 32-bit hash value. If we use
    that hash function with simply successive 64-bit integer values,
    and the outputs pass the statistical tests of a good random
    number generator test suite, you can be sure that hash function
    has very high quality. A key observation about random number
    generators is this: if the output values are N bits, generally
    the amount of internal state should be at least 2N bits. The
    same thing is true of hash functions. If a hash function
    produces, say, 32-bit values, internally there should be at least
    64 bits of state that is mushed together in various ways, finally
    being combined in the final step to produce the 32-bit result.
    Another design principle is to replicate information. For
    example, if we are hashing a character string to produce a 32-bit
    hash value, we might combine each character in the string with
    two different 32-bit "accumulators", perhaps adding into one and
    xoring into the other, and rotating them by different amounts
    after each character scanned. Composing code on the fly (not
    compiled, and certainly not tested):

    unsigned
    string_hash( const char *s ){
        unsigned a = 12345, b = 0xabcdef;    /* two independent accumulators */
        char c;
        while( (c = *s++) ){
            a += c, b ^= c;                  /* feed each character into both */
            a = a << 23 | a >> (32-23);      /* rotate by different amounts */
            b = b << 10 | b >> (32-10);
            a -= b; // why not? ;)
        }
        return a ^ b;
    }

    This code has a fair chance of producing acceptable hash values
    for strings that have at least three characters. (Let me say
    again that I have done no testing at all.)

    So I hope this posting has some of what you're looking for. And
    if it doesn't, well, then I am as much in the dark as ever. :)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Tue Jun 4 11:38:39 2024
    On Mon, 03 Jun 2024 18:02:21 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Michael S <already5chosen@yahoo.com> writes:


    I am less interested in axioms and more interested in your experimental
    findings.

    I'm not sure what you're looking for here.

    I'd give an example.
    You said that some of the variants had 4x differences between cases.
    From my perspective, if you found a hash function that performs up to 3
    times better* than a "crypto-alike" hash in the majority of tests and is
    1.33x worse than "crypto-alike" in a few other tests, that is something
    I'd consider a valuable option.

    * - i.e. produces 3x less collisions at, say, occupation ratio of 0.7

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Wed Jun 5 00:59:16 2024
    On Sun, 26 May 2024 10:20:55 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    Am 26.05.2024 um 18:24 schrieb Tim Rentsch:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    Am 25.05.2024 um 19:40 schrieb Tim Rentsch:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    Am 25.05.2024 um 11:12 schrieb Tim Rentsch:

    Your hash function is expensive to compute, moreso even
    than the "FNV" function shown earlier. In a case like
    this one where the compares are cheap, it's better to
    have a dumb-but-fast hash function that might need a
    few more looks to find an open slot, because the cost
    of looking is so cheap compared to computing the hash
    function.

    A (size_t)pointer * LARGE_PRIME can be sufficient,
    ignoring the overflow.

    Plenty fast but the output quality is poor. ...

    If the prime is large enough there's no regular distribution.

    Your conjecture is contradicted by empirical facts.

    There are no empirical facts for that since two times the
    taken prime is beyond the 64 bit address space.

    You don't listen very well do you?

    I say the output quality is poor because I have run tests that
    show the poor output quality. I've done that with a prime of my
    own choosing and also with 18446744073709551557, the value you
    suggested. In both cases the test runs show results that are
    clearly worse than all the other hash functions tested, including
    bart's and malcolm's. Furthermore the results for your suggested
    calculation are worse across the board, on every variety of
    dynamic workload in my test suite.

    Your proposed hash function is too weak to be taken as a serious
    candidate.

    18446744073709551557 is indeed very bad (too close to 2**64).
    But I tried another prime, and at least in my simple tests it performs
    rather well.
    Maybe my tests are too simplistic; I don't pretend to understand much
    about it.

    // hash_aes4.c
    // Reference. Hopefully behaves similarly to crypto hash
    #include <stdint.h>
    #include <string.h>
    #include <x86intrin.h>

    uint64_t hash_ref(void* key)
    {
    uint64_t ukey = (uint64_t)key;
    __m128i xkey = _mm_set_epi64x(ukey, 42);
    static const uint64_t bar[8] = {
    0xBB09BBCC90B24BF2, 0x825C622FF2792A01,
    0x94F0535CB06D4060, 0x939C756246DBFD1D,
    0x5B835E01A7E14CA1, 0xAC2BDAFC023CDD06,
    0xE0B6A4735B774AEC, 0x9CAFB43E7DDE494C,
    };
    for (int i = 0; i < 4; ++i) {
    __m128i rkey;
    memcpy(&rkey, &bar[i*2], sizeof(rkey));
    xkey = _mm_aesenc_si128(xkey, rkey);
    }
    memcpy(&ukey, &xkey, sizeof(ukey));
    return ukey;
    }

    // hash_bonita.c
    // Bonita's with different prime
    #include <stdint.h>

    uint64_t hash_foo(void* key)
    {
    uint64_t ukey = (uint64_t)key;
    static const uint64_t LARGE_PRIME = 0xbb09bbcc90b24bcdull;
    return ukey * LARGE_PRIME;
    }


    // ptr_hash.c
    // test bench
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>

    uint64_t hash_foo(void*);
    uint64_t hash_ref(void*);

    static void test(
    void** pointers, size_t n_pointers,
    size_t* table, size_t tab_sz,
    uint64_t (*hash_function)(void*));

    static const char UsageStr[] =
    "ptr_hash - examine performance of pointer hash function\n"
    "Usage\n"
    "ptr_hash n m sz1 [sz2]\n"
    "where\n"
    " n - the size of hash table\n"
    " m - # of pointers to put in the table\n"
    " sz1 - minimal size of allocated blocks\n"
    " sz2 - [optional] maximal size of allocated blocks. Default = sz1\n"
    ;
    int main(int argz, char** argv)
    {
    if (argz < 4) {
    fprintf(stderr, "%s", UsageStr);
    return 1;
    }

    size_t params[4];
    for (int arg_i = 1; arg_i < 5 && arg_i < argz; ++arg_i) {
    long val = strtol(argv[arg_i], NULL, 0);
    if (val <= 0) {
    fprintf(stderr
    ,"Bad parameter %s. Please specify positive number\n"
    ,argv[arg_i]);
    return 1;
    }
    params[arg_i-1] = val;
    }
    size_t n = params[0];
    size_t m = params[1];
    size_t sz1 = params[2];
    size_t sz2 = params[3];
    if (argz == 4)
    sz2 = sz1;
    if (sz1 > sz2) {
    size_t tmp = sz1; sz1 = sz2; sz2 = tmp;
    }

    int ret = 1;
    void **pointers = malloc(m * sizeof(*pointers));
    if (pointers) {
    size_t* table = malloc(n * sizeof(*table));
    if (table) {
    ret = 0;
    srand(1);
    const unsigned __int128 RAND_SCALE = (unsigned __int128)RAND_MAX
    + 1;
    for (size_t i = 0; i < m; ++i) {
    size_t r = rand();
    size_t sz = (unsigned __int128)(sz2-sz1+1)*r / RAND_SCALE + sz1;
    void* block = malloc(sz);
    if (!block) {
    ret = 1;
    perror("malloc()");
    for (size_t k = 0; k < i; ++k)
    free(pointers[k]);
    break;
    }
    pointers[i] = block;
    }
    if (ret == 0) {
    for (size_t i = 0; i < m; ++i)
    free(pointers[i]); // we don't need the allocated blocks for the test
    printf("ref: ");
    test(pointers, m, table, n, hash_ref);
    printf("uut: ");
    test(pointers, m, table, n, hash_foo);
    }
    free(table);
    } else {
    perror("malloc()");
    }
    free(pointers);
    } else {
    perror("malloc()");
    }

    return ret;
    }

    static void test(
    void** pointers, size_t n_pointers,
    size_t* table, size_t tab_sz,
    uint64_t (*hash_function)(void*))
    {
    for (size_t i = 0; i < tab_sz; ++i)
    table[i] = 0;
    // simulate hash operation
    for (size_t i = 0; i < n_pointers; ++i) {
    uint64_t h = hash_function(pointers[i]);
    table[(size_t)(((unsigned __int128)tab_sz * h)>>64)] += 1;
    }
    // statistics
    size_t occupied = 0;
    size_t collisions = 0;
    size_t worst = 0;
    for (size_t i = 0; i < tab_sz; ++i) {
    size_t c = table[i];
    occupied += c != 0;
    collisions += c > 1;
    if (worst < c)
    worst = c;
    }
    printf(
    "%zu keys, %zu slots, %zu occupied %zu collisions. Worst collision: %zu\n"
    ,n_pointers
    , tab_sz
    , occupied
    , collisions
    , worst
    );
    }


    Compiled as:
    gcc -s -Wall -O2 -march=native ptr_hash.c hash_bonita.c hash_aes4.c


    Run as:
    ./a 100000 75000 80 1000

    Result:
    ref: 75000 keys, 100000 slots, 52799 occupied 17431 collisions. Worst collision: 7
    uut: 75000 keys, 100000 slots, 53201 occupied 17139 collisions. Worst collision: 6

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Tue Jun 4 22:10:41 2024
    Michael S <already5chosen@yahoo.com> writes:

    On Mon, 03 Jun 2024 18:02:21 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Michael S <already5chosen@yahoo.com> writes:


    I am less interested in axioms and more interested in your experimental
    findings.

    I'm not sure what you're looking for here.

    I'd give an example.
    You said that some of the variants had 4x differences between cases.
    From my perspective, if you found a hash function that performs up to 3
    times better* than "crypto-alike" hash in majority of tests and is 1.33x worse that "crypto-alike" in few other tests, it's something that I'd consider as valuable option.

    * - i.e. produces 3x less collisions at, say, occupation ratio of 0.7

    Okay. For the sake of discussion let me call a "crypto-alike" hash
    an "average hash" (meaning in some sense statistically average, or
    never very far from random hash values).

    Unfortunately the example you describe is something that won't
    happen and can't happen. I say it won't happen because for all
    the hash functions I've looked at, I've never seen one that is
    "better than average" most of the time and perhaps a little bit
    "worse than average" only occasionally. The reverse situation
    pops up fairly often: a hash function that is a little bit
    "better than average" in a small number of cases, and "no
    better than average or worse than average (sometimes by quite
    a lot)" in most cases.

    For the second part, I say it can't happen because there isn't
    enough headroom for the amount of performance improvement you
    mention. For an occupancy rate of 0.7, an average hash function
    using a rehashing approach uses only 1.7 probes per insertion
    (with a minimum of 1) to fill the table. There is no way to
    get a dramatic performance improvement. Even with an 85% load
    factor, an average hash function takes just a little over 2
    probes (roughly 2.2 probes) per value inserted.

    Going the other direction, I've seen examples of hash functions
    that in some circumstances are _worse_ than average by a factor of
    10 or more. The bad examples just come up - I don't especially go
    looking for them. The small possible upside gain is basically
    never worth the much larger potential downside risk.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Bonita Montero on Wed Jun 5 12:34:50 2024
    On Wed, 5 Jun 2024 11:10:24 +0200
    Bonita Montero <Bonita.Montero@gmail.com> wrote:

    Am 04.06.2024 um 23:59 schrieb Michael S:

    18446744073709551557 is indeed very bad (too close to 2**64).

    It doesn't matter how close the prime is to 2 ^ 64 since the
    results are chopped.



    I have no idea what you are talking about.
    18446744073709551557 == -59 (mod 2**64).
    That's a very small number. With a factor like that, for the typical
    case of pointers that are <= 2**40 and a table size in the 1000s or
    10,000s, all keys (== pointers) would be pushed into the same
    (uppermost) slot of the table with no hashing at all.
    You can run my test bench and see it yourself.
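
    A few lines that make the arithmetic concrete -- not part of the test
    bench itself, and the pointer values are made up -- using the same
    linear scaling to table size:

    #include <stdio.h>
    #include <stdint.h>

    int main( void ){
        uint64_t factor = 18446744073709551557ull;   /* == 2**64 - 59 */
        uint64_t tab_sz = 100000;
        uint64_t keys[3] = { 0x10000000ull, 0x10000040ull, 0x7fffffff00ull };
        for( int i = 0; i < 3; i++ ){
            /* product == -59*key mod 2**64, i.e. just below 2**64 */
            uint64_t h = keys[i] * factor;
            uint64_t idx = (uint64_t)( ((unsigned __int128)h * tab_sz) >> 64 );
            printf( "key %#llx -> slot %llu\n",
                    (unsigned long long)keys[i], (unsigned long long)idx );
        }
        return 0;        /* prints slot 99999 for every key */
    }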

    On the other hand, a prime in the range (2**63..2**64) that does not
    have too many bits set or cleared works very well. But the majority of
    non-primes with the same properties would work just as well.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Bonita Montero on Wed Jun 5 13:11:24 2024
    On Wed, 5 Jun 2024 12:05:07 +0200
    Bonita Montero <Bonita.Montero@gmail.com> wrote:

    Am 05.06.2024 um 11:34 schrieb Michael S:

    I have no idea what you are talking about.
    18446744073709551557 == -59.
    That's a very small number.

    You should take the value not signed. Addition and subtraction work
    the same signed and unsigned, but not multiplication and division.
    As I've shown you get a perfect equal distribution if the input
    also has an equal distribution.

    You had not shown shite.
    If you can not understand simple things on paper then run my test
    bench.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Wed Jun 5 08:58:46 2024
    Michael S <already5chosen@yahoo.com> writes:

    On Sun, 26 May 2024 10:20:55 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    [..comments on a hash function that simply multiplies the key
    by a large prime (18446744073709551557) to give the hash
    value, and noting that tests have shown poor performance..]

    18446744073709551557 is indeed very bad (too close to 2**64).
    But I tried other prime and at least in my simple tests it performs
    rather well.
    May be my tests are to simplistic, I don't pretend to understand much
    about it.

    [lots of C code]

    Compiled as:
    gcc -s -Wall -O2 -march=native ptr_hash.c hash_bonita.c hash_aes4.c

    I couldn't get the AES-based hash function to compile. There
    was a complaint about not being able to inline an 'always-inline'
    function. I played around with it for a while but didn't get
    anywhere.

    I did get your own hash function out and put it into my little
    test rig. There may be summary comments below.

    Run as:
    ./a 100000 75000 80 1000

    Result:
    ref: 75000 keys, 100000 slots, 52799 occupied 17431 collisions. [...]
    uut: 75000 keys, 100000 slots, 53201 occupied 17139 collisions. [...]

    Both of these occupancy rates are close to the expected theoretical
    average, which I calculated to be 52764. If I had been more
    ambitious I might have done some statistical sampling to get some
    idea of the standard deviation, but I wasn't that energetic
    today. :)

    If we're trying to write a general hash function, it should work
    well across a wide range of input key sets, and also for different
    kinds of hash table management. Let's look at the output side
    first. There are two independent axes for hash tables: table
    size, and direct mapping versus using rehashing. For table size,
    the two prominent choices are some prime in the appropriate range,
    or else a power of two. Clearly any size could be chosen, but
    primes have the nice property that they divide up the space of hash
    values very evenly, whereas powers of two allow a cheaper way of
    reducing a hash value to the range of table indices (anding versus a
    modulo operation). A good hash function should work well with both.
    The choice of direct mapping versus rehashing depends on expected
    occupancy rates (which I am used to calling "load factor"). For a
    load factor of 75 to 80 percent or less, generally rehashing is
    better. Direct mapping has the advantage that it can handle load
    factors above 100%, at a cost of dealing with whatever subsidiary
    data structure (usually linked lists) is used when two or more keys
    map to the same index in the hash table. (I expect you know all
    that, I wrote it mostly just to clarify my own thinking.)

    For the recent testing I did, I chose a power of two (65536) for the
    table size, and rehashing rather than direct mapping. Both of these
    choices make it easier to evaluate how well hash functions perform.
    Having a power-of-two table size reveals weaknesses in the hash
    function more readily than taking the hash value modulo a prime;
    using rehashing allows an easy computation of how many probes on
    average are needed to do an insertion, which is an easier metric
    to use for comparisons than occupancy rates (which typically need
    some statistical calculations to appreciate the meaning of different
    resulting occupancy rates).

    Let me emphasize that I don't mean to advocate any of the different
    hash table models over the others. The key point is that a good
    hash function should work with all of them.

    On the input side, it's important to have a wide range of input
    key sets, with different sorts of distributions. The input key
    sets might include: linear addresses out of a single array, so
    a+i*m with i in the range [0..k], for different values of m (to
    be thorough all m in the range [1..64], all multiples of 2 from
    64 to 128, all multiples of 4 from 128 to 256, all multiples of 8
    from 256 to 512, and so forth up to all multiples of 64 up from
    2048 to 4096); results of malloc() using fixed sizes (all of the
    m values from the last example); results of malloc() using a
    range of sizes, for several kinds of random distributions);
    results of malloc() using linearly increasing sizes; and other
    kinds of variations. In my testing I used several fixed sizes to
    malloc() all less than 60; rand()%997 + 43 for malloc() sizes;
    malloc( i/7 + 1 ) for each key value sub i; malloc( i/17 + 1 )
    for each key value sub i; linear addressing in an array, with
    various multipliers (ad hoc rather than systematic choices);
    quadratic addressing in an array (which means array + i*i*m, for
    index i and multiplier choice m) -- here again the choices for
    m were ad hoc rather than systematic.

    An interesting result under the "in array" choices is that they
    sometimes would change from run to run, presumably as the array was
    put at different starting positions. Sometimes the differences were surprisingly large.

    Like I said before, a good hash function should produce average
    or near-average values for all of these scenarios. It might be
    nice to see some above-average results under some workloads, but
    they don't make up for well below-average results under other
    workloads. The situation is pretty much the opposite of what
    happens with quicksort: the worst case behavior of quicksort is
    quadratic, but we know that in practice that almost never
    happens, so quicksort is used because its typical or average
    performance is better than that of other choices. With hashing, better-than-average performance is rare, and doesn't give much
    benefit, but worse-than-average performance is more likely, and
    can be disastrous. Obviously there can be other factors that
    might motivate other choices in some situations (eg, perfect
    hashing), but for a general-use hash function the first design
    criterion should be close-to-average behavior under all scenarios
    tested.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Wed Jun 5 19:59:05 2024
    On Wed, 05 Jun 2024 08:58:46 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:


    I did get your own hash function out and put it into my little
    test rig. There may be summary comments below.


    <snip>

    As you have probably already noticed, I like bottom lines.
    What is the bottom line here?
    Did you encounter cases in which the almost-Bonita's-but-not-quite
    hash function performed badly?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Wed Jun 5 21:40:27 2024
    Michael S <already5chosen@yahoo.com> writes:

    On Wed, 05 Jun 2024 08:58:46 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    I did get your own hash function out and put it into my little
    test rig. There may be summary comments below.

    <snip>

    As you have probably already noticed, I like bottom lines.
    What is the bottom line here?

    The bottom line is that both of the multiply-by-a-large-prime hashes
    should be avoided.

    Did you encounter cases in which the almost-Bonita's-but-not-quite
    hash function performed badly?

    Yes. A clear example is using directly mapped hash table that
    has a power of two elements, with inputs all from malloc( 48 )
    calls. All keys map to one of only 1024 entries, in a hash
    table that has 65536 slots.
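
    A small simulation consistent with that result -- the 64-byte stride
    is an assumption about what a typical 64-bit allocator returns for
    malloc( 48 ), and the multiplier is the one from hash_foo() earlier
    in the thread.  After multiplying by any odd constant the low 6 bits
    of the product never change, so masking into 65536 slots can reach
    only 65536/64 == 1024 of them:

    #include <stdio.h>
    #include <stdint.h>

    int main( void ){
        enum { TAB_SZ = 65536, KEYS = 50000 };
        static unsigned char hit[TAB_SZ];            /* zero-initialized */
        uint64_t base = 0x55500000000ull;            /* made-up heap base */
        unsigned distinct = 0;
        for( unsigned i = 0; i < KEYS; i++ ){
            uint64_t key = base + 64ull * i;         /* simulated malloc( 48 ) result */
            uint64_t h = key * 0xbb09bbcc90b24bcdull;
            unsigned idx = (unsigned)( h & (TAB_SZ - 1) );  /* direct mapping, mask */
            if( !hit[idx] ){ hit[idx] = 1; distinct++; }
        }
        printf( "distinct slots touched: %u of %u\n", distinct, (unsigned)TAB_SZ );
        return 0;                                    /* prints 1024 of 65536 */
    }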

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Thu Jun 6 11:00:09 2024
    On Wed, 05 Jun 2024 21:40:27 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Michael S <already5chosen@yahoo.com> writes:

    On Wed, 05 Jun 2024 08:58:46 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    I did get your own hash function out and put it into my little
    test rig. There may be summary comments below.

    <snip>

    As you have probably already noticed, I like bottom lines.
    What is the bottom line here?

    The bottom line is that both of the multiply-by-a-large-prime hashes
    should be avoided.


    I am not convinced.
    So far I have seen no disadvantages for the particular application
    mentioned by Malcolm.
    Of course, it is no good for Malcolm, because it can't be coded in what
    Malcolm erroneously calls "ANSI C". It is not easy to code even in
    ISO C11. But in my own practice that does not matter. I happily use
    language extensions, like gnu __int128 or Microsoft's __umulh(), when
    they are the best tool for the job.

    Did you encounter cases in which the almost-Bonita's-but-not-quite
    hash function performed badly?

    Yes. A clear example is using directly mapped hash table that
    has a power of two elements, with inputs all from malloc( 48 )
    calls. All keys map to one of only 1024 entries, in a hash
    table that has 65536 slots.

    I don't see it.
    $ ./a 0x10000 0xC000 48
    ref: 49152 keys, 65536 slots, 34590 occupied 11295 collisions. Worst collision: 7
    uut: 49152 keys, 65536 slots, 35791 occupied 11872 collisions. Worst collision: 3

    It sounds like in your tests you don't use 'linear scaling to table
    size' for translating the hash value to a table index. IMHO, you should.
    Maybe for some particular hash functions other methods of translation
    also work, but I see no reason to use other methods. This one is quite
    fast, robust, and looks to me like the most logical. For power-of-2
    sized tables it degenerates into a simple right shift.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Thu Jun 6 13:35:34 2024
    On Wed, 05 Jun 2024 08:58:46 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:


    If we're trying to write a general hash function, it should work
    well across a wide range of input key sets, and also for different
    kinds of hash table management. Let's look at the output side
    first. There are two independent axes for hash tables: table
    size, and direct mapping versus using rehashing. For table size,
    the two prominent choices are some prime in the appropriate range,
    or else a power of two. Clearly any size could be chosen, but
    primes have the nice property that they divide up the space of hash
    values very evenly, whereas powers of two allow a cheaper way of
    reducing a hash value to the range of table indices (anding versus a
    modulo operation). A good hash function should work well with both.
    The choice of direct mapping versus rehashing depends on expected
    occupancy rates (which I am used to calling "load factor"). For a
    load factor of 75 to 80 percent or less, generally rehashing is
    better. Direct mapping has the advantage that it can handle load
    factors above 100%, at a cost of dealing with whatever subsidiary
    data structure (usually linked lists) is used when two or more keys
    map to the same index in the hash table. (I expect you know all
    that, I wrote it mostly just to clarify my own thinking.)


    I never had much interest in the implementation of hash methods; I
    typically just took what the language or library provides and used it.
    All my knowledge of the internals is 50-year-old news (TAoCP, and no, I
    did not read it 50 years ago, but I did read it for the first time
    slightly less than 40 years ago, and I'm not sure I have re-read this
    particular topic since then).
    So I am not sure that I understood everything above.
    In particular, w.r.t. rehashing, IIRC, Knuth came to the conclusion that
    a simple circular increment of the index is superior to more elaborate
    methods. I don't know whether that conclusion was changed in the
    following years; I'm not even sure that I recollect it correctly.

    Despite (or due to?) my relative ignorance of the topic, I'd dare to
    disagree with a couple of your points:
    1. I don't think that a hash function should be evaluated in isolation
    from the rest of the algorithm. IMHO, they should be co-developed. There
    is nothing wrong with a hash function that works well with one particular
    "back end" and poorly with all others, as long as the limitations are
    clearly stated.
    2. I don't like the "modulo" method of reducing a hash value to a range.
    I don't like it for any chosen bucket size, not just for a power of two,
    but even for a prime. Of course, for a power-of-two size my dislike of
    it is deeper.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Harnden@21:1/5 to Bonita Montero on Sun Jun 9 12:40:27 2024
    On 09/06/2024 12:35, Bonita Montero wrote:
    With the code I've shown I proved that the AES hash function
    is totally stochastic and thereby a good hash function. But
    I was still discontented with the speed of the code.
    Usually you're not able to update the buckets of a hashtable with
    multiple threads without any locking. But with my code the buckets
    are only simple counters which can be updated atomically.
    So I could reduce the code and partition the input range to the
    hash function across the available number of cores. Here's the code:

    #include <iostream>

    This is not C




    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Malcolm McLean on Sun Jun 9 18:31:15 2024
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 09/06/2024 12:35, Bonita Montero wrote:

    uint64_t MichaelsHash( uint64_t key )
    {
    __m128i xkey = _mm_set_epi64x( key, 42 );
    using bar_t = pair<uint64_t, uint64_t>;
    static bar_t const bars[8] =
    {
    { 0xBB09BBCC90B24BF2, 0x825C622FF2792A01 },
    { 0x94F0535CB06D4060, 0x939C756246DBFD1D },
    { 0x5B835E01A7E14CA1, 0xAC2BDAFC023CDD06 },
    { 0xE0B6A4735B774AEC, 0x9CAFB43E7DDE494C },
    };
    for( bar_t const &bar : bars )
    xkey = _mm_aesenc_si128( xkey, _mm_set_epi64x( bar.second,
    bar.first ) );
    return xkey.m128i_u64[0];
    }

    Now the code is about six times faster and I get an eight times
    speedup over single-threaded processing with the same code. Of
    course the results are still the same.

    I have your permission to drop that in?

    Note that this code was cribbed from Michael S. If you
    think it's important to ask permission, I think he is
    the one you should be asking.

    By the way, I thought you were looking for code that works
    in standard C, and acceptable under C90 rules. Have you
    changed your mind about that? The code above is a far
    cry from C, let alone C90.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Mon Jun 10 15:14:53 2024
    On Sun, 09 Jun 2024 18:31:15 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 09/06/2024 12:35, Bonita Montero wrote:

    uint64_t MichaelsHash( uint64_t key )
    {
    __m128i xkey = _mm_set_epi64x( key, 42 );
    using bar_t = pair<uint64_t, uint64_t>;
    static bar_t const bars[8] =
    {
    { 0xBB09BBCC90B24BF2, 0x825C622FF2792A01 },
    { 0x94F0535CB06D4060, 0x939C756246DBFD1D },
    { 0x5B835E01A7E14CA1, 0xAC2BDAFC023CDD06 },
    { 0xE0B6A4735B774AEC, 0x9CAFB43E7DDE494C },
    };
    for( bar_t const &bar : bars )
    xkey = _mm_aesenc_si128( xkey, _mm_set_epi64x( bar.second,
    bar.first ) );
    return xkey.m128i_u64[0];
    }

    Now the code is about six times faster and I get an eight times
    speedup over single-threaded processing with the same code. Of
    course the results are still the same.

    I have your permission to drop that in?

    Note that this code was cribbed from Michael S.

    I don't permit the use of my name on the variant modified by Bonita.
    I would prefer that my name not be used even on my own variant.
    This code was never intended for production use. It is intended for use
    as a reference, as something that is "certainly random enough and then
    many billions of times more than enough". It is much, much slower than
    necessary for a non-crypto hash.
    On the other hand, it is probably billions of billions of times weaker
    than what is now considered "good enough" for a crypto hash.

    If you ask my personal opinion, I think that multiplication modulo
    2**64 by a prime (or even a non-prime) close in absolute value to
    2**63.5 is sufficient, as long as it is used with an appropriate method
    of converting the hash value to a table index, which, for this hash
    function, happens to be linear scaling by the size of the hash table.
    Ironically, that was the method originally proposed by Bonita. The only
    thing (s)he didn't understand was that factors that contain many
    consecutive '0' or '1' bits, especially in the more significant bits,
    should be avoided.

    If you
    think it's important to ask permission, I think he is
    the one you should be asking.

    By the way, I thought you were looking for code that works
    in standard C, and acceptable under C90 rules. Have you
    changed your mind about that? The code above is a far
    cry from C, let alone C90.

    Exactly.
    I think that all methods that are both good and fast rely on 64-bit
    unsigned integer arithmetic. And it seems to me that C90 does not
    provide it in portable manner.
    So, under portable C90 constraints one has to choose between good and
    fast.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Sat Jun 15 20:32:23 2024
    Michael S <already5chosen@yahoo.com> writes:

    On Sun, 09 Jun 2024 18:31:15 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 09/06/2024 12:35, Bonita Montero wrote:

    uint64_t MichaelsHash( uint64_t key )
    {
    __m128i xkey = _mm_set_epi64x( key, 42 );
    using bar_t = pair<uint64_t, uint64_t>;
    static bar_t const bars[8] =
    {
    { 0xBB09BBCC90B24BF2, 0x825C622FF2792A01 },
    { 0x94F0535CB06D4060, 0x939C756246DBFD1D },
    { 0x5B835E01A7E14CA1, 0xAC2BDAFC023CDD06 },
    { 0xE0B6A4735B774AEC, 0x9CAFB43E7DDE494C },
    };
    for( bar_t const &bar : bars )
    xkey = _mm_aesenc_si128( xkey, _mm_set_epi64x( bar.second,
    bar.first ) );
    return xkey.m128i_u64[0];
    }

    Now the code is about six times faster and I get an eight times
    speedup over single-threaded processing with the same code. Of
    course the results are still the same.

    I have your permission to drop that in?

    Note that this code was cribbed from Michael S.

    I don't permit the use of my name on the variant modified by Bonita.
    I would prefer that my name not be used even on my own variant.
    This code was never intended for production use. It is intended for use
    as a reference, as something that is "certainly random enough and then
    many billions of times more than enough". It is much, much slower than
    necessary for a non-crypto hash. [...]

    The problem with using the posted AES code (of either version) is
    that it is not sufficiently portable nor does it comply with C90
    rules. How fast or slow the code runs might be a reason not to
    use it, but it is not necessarily a reason not to use it - that
    depends on a variety of factors, not the least of which is how
    fast it actually is in the environments under consideration.

    If you
    think it's important to ask permission, I think he is
    the one you should be asking.

    By the way, I thought you were looking for code that works
    in standard C, and acceptable under C90 rules. Have you
    changed your mind about that? The code above is a far
    cry from C, let alone C90.

    Exactly.
    I think that all methods that are both good and fast rely on 64-bit
    unsigned integer arithmetic. And it seems to me that C90 does not
    provide it in a portable manner.
    So, under portable C90 constraints one has to choose between good and
    fast.

    I am confident that C90 can support defining a suitable hash
    function portable to any system that has a C90 implementation
    appropriate to that system's environment.
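
    One illustration of the kind of thing that can be done -- a sketch,
    not a tested recommendation, using only C90 features (unsigned long
    arithmetic and memcpy of the pointer's bytes) and the well-known
    32-bit FNV constants:

    #include <stddef.h>
    #include <string.h>

    unsigned long hash_ptr_c90( void *p ){
        unsigned char bytes[sizeof p];
        unsigned long h = 2166136261UL;      /* 32-bit FNV offset basis */
        size_t i;

        memcpy( bytes, &p, sizeof p );
        for( i = 0; i < sizeof p; i++ ){
            h ^= bytes[i];
            h *= 16777619UL;                 /* 32-bit FNV prime */
            h &= 0xffffffffUL;               /* keep 32 bits even if long is wider */
        }
        return h;
    }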

    That said, unless some sort of objective measure is given for the
    terms good and fast, the last quoted sentence is pretty much
    meaningless.
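
    A minimal sketch of what such a C90-portable function might look like,
    hashing the pointer's object representation byte by byte so that no
    pointer-to-integer conversion is needed; the byte-wise walk and the
    32-bit FNV-1a constants are illustrative choices, not code posted in
    this thread:

    #include <stddef.h>

    /* Hash the stored representation of a pointer one byte at a time,
     * using only unsigned long (at least 32 bits in C90).  The
     * constants are the 32-bit FNV-1a offset basis and prime; any
     * reasonable 32-bit mixer could be substituted.
     */
    unsigned long hash_pointer(const void *p)
    {
        const unsigned char *bytes = (const unsigned char *) &p;
        unsigned long h = 2166136261UL;
        size_t i;

        for (i = 0; i < sizeof p; i++) {
            h ^= (unsigned long) bytes[i];
            h *= 16777619UL;
            h &= 0xFFFFFFFFUL;   /* keep the value within 32 bits */
        }
        return h;
    }

    Whether something along these lines counts as "good" is, of course,
    exactly the kind of question the measurements discussed further down
    are meant to answer.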

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Sun Jun 16 04:38:58 2024
    Michael S <already5chosen@yahoo.com> writes:

    On Wed, 05 Jun 2024 08:58:46 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    If we're trying to write a general hash function, it should work
    well across a wide range of input key sets, and also for different
    kinds of hash table management. Let's look at the output side
    first. There are two independent axes for hash tables: table
    size, and direct mapping versus using rehashing. For table size,
    the two prominent choices are some prime in the appropriate range,
    or else a power of two. Clearly any size could be chosen, but
    primes have the nice property that they divide up the space of hash
    values very evenly, whereas powers of two allow a cheaper way of
    reducing a hash value to the range of table indices (anding versus a
    modulo operation). A good hash function should work well with both.
    The choice of direct mapping versus rehashing depends on expected
    occupancy rates (which I am used to calling "load factor"). For a
    load factor of 75 to 80 percent or less, generally rehashing is
    better. Direct mapping has the advantage that it can handle load
    factors above 100%, at a cost of dealing with whatever subsidiary
    data structure (usually linked lists) is used when two or more keys
    map to the same index in the hash table. (I expect you know all
    that, I wrote it mostly just to clarify my own thinking.)
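
    For concreteness, the two reductions just mentioned might look like
    this (a sketch only; the table sizes are arbitrary examples):

    /* Reducing a hash value to a table index, for the two common
     * choices of table size described above.
     */
    unsigned long index_prime(unsigned long h)
    {
        return h % 65521UL;          /* 65521 is prime */
    }

    unsigned long index_power_of_two(unsigned long h)
    {
        return h & (65536UL - 1);    /* size 2^16: a mask is cheaper */
    }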

    I never had much interest in the implementation of hash methods; I
    typically just took whatever the language or library provides and
    used it.
    All my knowledge of the internals is 50-year-old news (TAoCP - and
    no, I did not read it 50 years ago, but I read it for the first time
    slightly less than 40 years ago, and I am not sure I have re-read
    this particular topic since then).
    So, I am not sure that I understood everything above.
    In particular, w.r.t. rehashing, IIRC, Knuth came to the conclusion
    that a simple circular increment of the index is superior to more
    elaborate methods. I don't know whether that conclusion has changed
    in the years since, and I am not even sure that I recollect it
    correctly.

    Hashing is discussed in volume 3 of TAOCP. I first read volume 3
    just after it came out; I reviewed the section on hashing again
    after reading your message. Your remembrance doesn't match what
    is said in that section. Moreover some of the conclusions Knuth
    wrote in 1973 are certainly out of date now. General rehashing,
    also called open addressing, is overall a net win, especially for
    hash tables that have an above-average load factor. Also there
    are variations such as cuckoo hash that depend on using open
    addressing.
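
    For reference, the "simple circular increment" remembered above is
    what is now usually called linear probing, one form of open
    addressing. A minimal lookup sketch, assuming a power-of-two table
    of pointer keys with NULL marking an empty slot (the names and
    layout are illustrative, not taken from any posted code):

    #include <stddef.h>

    #define TABLE_SIZE 65536UL          /* must be a power of two here */

    /* Linear probing: on a collision, step circularly to the next slot
     * until either the key or an empty slot is found.  Assumes the
     * table is never completely full.
     */
    void **probe(void **table, const void *key, unsigned long hash)
    {
        unsigned long i = hash & (TABLE_SIZE - 1);

        while (table[i] != NULL && table[i] != key)
            i = (i + 1) & (TABLE_SIZE - 1);
        return &table[i];               /* matching slot, or the empty slot */
    }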

    Despite (or due to?) my relative ignorance of the topic, I'd dare
    to disagree with a couple of your points:
    1. I don't think that a hash function should be evaluated in
    isolation from the rest of the algorithm. IMHO, they should be
    co-developed. There is nothing wrong with a hash function that
    works well with one particular "back end" and poorly with all
    others, as long as the limitations are clearly stated.

    There are several reasons why a co-development approach usually
    isn't a good way to go. Malcolm started this thread by asking for
    a good hash function for pointers (to be used in a portable ANSI C
    setting). We don't know, nor do we have any control over, how
    Malcolm might use the function (and that goes double for any random
    reader of comp.lang.c who might see some posted hash function and
    decide to use it). Or we might want to use some sort of generic
    table manager that takes a pointer-to-function to do the hashing,
    but no way to affect how the hash values are used. Another reason
    has to do with total code size. If there are M data types to be
    hashed, and N applications that use hash values, coupling the code
    that does the hashing with the code that makes use of the hash
    values means there will be M*N parts to program, whereas separating
    the two aspects means there will be only M+N parts to program. The
    M+N path almost certainly means a big reduction in the overall
    amount of effort needed.

    Of course there are some specialized applications, such as perfect
    hashing, where the hashing method needs to be tailored to some sort
    of external considerations. In most cases though a general hashing
    function is a better choice.
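
    The M+N decoupling can be made concrete with an interface along the
    following lines (purely illustrative; nothing like it appears in the
    thread): the table code sees the hash only through a function
    pointer, so key types and table policies can be written, tested, and
    swapped independently.

    /* The table takes the hash function as a parameter, so M hash
     * functions and N table implementations stay independent of one
     * another instead of being combined M*N ways.
     */
    typedef unsigned long (*hash_fn)(const void *key);

    struct hash_table;   /* opaque; supplied by the table implementation */

    struct hash_table *table_create(unsigned long nslots, hash_fn hash);
    void *table_find(struct hash_table *t, const void *key);
    int table_insert(struct hash_table *t, const void *key, void *value);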

    As for how evaluation should be done, it's important to evaluate
    hash functions in isolation for the same reasons it's important to
    evaluate random number generators in isolation: the results need
    to be repeatable, and they need to give a quantitative measure of
    how "mixed up" the outputs are for similar inputs. Knuth talks
    about random numbers in volume 2 of TAOCP, spending more than 100
    pages on the subject, but he gives an essential precept before the
    end of page 5: "Random numbers should not be generated with a
    method chosen at random. Some theory should be used." The same is
    true of hash functions, and a repeatable quantitative evaluation of
    the function by itself is a key element of that process.

    2. I don't like the "modulo" method of reducing a hash value to a
    range. I don't like it for any chosen table size, not just for a
    power of two, but even for a prime. Of course, for a power-of-two
    size my dislike of it is deeper.

    I understand your reaction, but this question is a separate topic.
    Part of the reason for designing a high-quality hash function is so
    that it doesn't matter which bits are used: high order bits, low
    order bits, some of each, every other bit, or every third or fourth
    bit, they should all give good results (where "good" means nearby
    inputs give what appear to be uncorrelated outputs). If we design
    our hash function to do that then it doesn't matter whether a hash
    table does a mod or uses a different method. Obviously that is
    better than a hash function that works okay with some hash tables
    but works poorly with others.
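
    One cheap, repeatable way to test the "any bits should do" property
    is to bucket the same key set by several different bit slices of the
    hash and compare the occupancy. A sketch, assuming a hash with the
    shape of the earlier hash_pointer example (the slice width, key
    count, and key spacing are arbitrary choices):

    #include <stdio.h>
    #include <string.h>

    extern unsigned long hash_pointer(const void *p);  /* hash under test */

    #define NKEYS  10000
    #define STRIDE 64

    static char arena[NKEYS * STRIDE];  /* supplies NKEYS realistic addresses */

    /* Count how many of 256 buckets are hit when keys are bucketed by
     * the 8-bit slice of the hash that starts at bit 'shift'.  A
     * well-mixed hash gives similar counts for every slice.
     */
    static unsigned occupied(unsigned shift)
    {
        unsigned char used[256];
        unsigned count, i, k;

        memset(used, 0, sizeof used);
        for (k = 0; k < NKEYS; k++)
            used[(hash_pointer(arena + k * STRIDE) >> shift) & 0xFF] = 1;
        for (count = 0, i = 0; i < 256; i++)
            count += used[i];
        return count;
    }

    int main(void)
    {
        unsigned shift;

        for (shift = 0; shift <= 24; shift += 8)
            printf("bits %2u..%2u: %u of 256 buckets occupied\n",
                   shift, shift + 7, occupied(shift));
        return 0;
    }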

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Mon Jun 17 00:56:40 2024
    Michael S <already5chosen@yahoo.com> writes:

    On Wed, 05 Jun 2024 21:40:27 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Michael S <already5chosen@yahoo.com> writes:

    On Wed, 05 Jun 2024 08:58:46 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    I did get your own hash function out and put it into my little
    test rig. There may be summary comments below.

    <snip>

    As you have probably already noticed, I like bottom lines.
    What is the bottom line here?

    The bottom line is that both of the multiply-by-a-large-prime
    hashes should be avoided.

    I am not convinced.

    Let me amend my earlier statement. A hash function that simply
    multiplies its input by an integer constant should be avoided
    if the goal is to produce a hash function of decent quality
    rather than one of mediocre quality.
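
    For readers joining late, the kind of function being talked about
    here is the bare one-line multiplicative hash, something like the
    sketch below; the particular constant is the familiar Knuth-style
    choice and is only an illustration, not necessarily the one that was
    tested in this thread.

    /* A bare multiplicative hash: fast, but output bit i depends only
     * on input bits 0..i, so the low-order output bits mix poorly.
     */
    unsigned long mult_hash(unsigned long key)
    {
        return key * 2654435761UL;   /* odd constant near 2^32 / golden ratio */
    }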

    Of course, it is no good for Malcolm, because it can't be coded
    in what Malcolm erroneously calls "ANSI C".

    I don't know why you say that. C was an ANSI standard before it
    was an ISO standard. Or is it that you think that the language
    Malcolm is intent on using does not conform to C90/C89/ANSI C?


    Did you encounter cases in which the almost-Bonita's-but-not-quite
    hash function performed badly?

    Yes. A clear example is using a directly mapped hash table that
    has a power-of-two number of elements, with inputs all from
    malloc( 48 ) calls. All keys map to one of only 1024 entries, in
    a hash table that has 65536 slots.
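
    A small experiment along these lines can be sketched (an
    illustration only: the bare multiplicative hash and the assumption
    that the allocator hands back blocks on a fixed 64-byte stride are
    stand-ins for whatever was actually tested):

    #include <stdio.h>
    #include <stdlib.h>

    #define NKEYS 49152UL

    /* Hash each pointer returned by malloc(48) with a bare
     * multiplicative hash, index a 65536-slot table with the low 16
     * bits, and count how many distinct slots are ever used.  If the
     * allocator lays the blocks out on a 64-byte stride, at most
     * 65536/64 = 1024 slots are reachable.
     */
    int main(void)
    {
        static unsigned char used[65536];
        unsigned long occupied = 0, i;

        for (i = 0; i < NKEYS; i++) {
            void *p = malloc(48);

            if (p != NULL) {
                /* pointer-to-integer conversion is implementation-defined,
                 * which is acceptable for a measurement like this */
                unsigned long h = (unsigned long) (size_t) p * 2654435761UL;
                used[h & 0xFFFFUL] = 1;
            }
        }
        for (i = 0; i < 65536UL; i++)
            occupied += used[i];
        printf("%lu of 65536 slots occupied\n", occupied);
        return 0;
    }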

    I don't see it.
    $ ./a 0x10000 0xC000 48
    ref: 49152 keys, 65536 slots, 34590 occupied 11295 collisions.
    Worst collision: 7
    uut: 49152 keys, 65536 slots, 35791 occupied 11872 collisions.
    Worst collision: 3

    You have run a poor test. The point of quality evaluation
    testing is to look for potential problems, not to disguise
    them.

    It sounds like in your tests you don't use 'linear scaling to
    table size' for translating the hash value to a table index.
    IMHO, you should.

    Like I said in another response, the question of how hash tables
    should make use of hash values is a separate topic. The point
    of the tests I ran was to evaluate the quality of hash functions,
    not to look at how hash tables should be coded.
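
    For reference, "linear scaling to table size" presumably means
    spreading the full hash range linearly over the index range with a
    multiply-and-shift, rather than keeping the low bits or taking a
    remainder. A sketch, assuming 32-bit hash values (note the 64-bit
    intermediate product, which is exactly the kind of arithmetic C90
    does not portably supply):

    /* Map a 32-bit hash value h32 onto 0 .. nslots-1 by linear
     * scaling: take the high half of the 64-bit product.  Needs a
     * 64-bit unsigned type (here C99's unsigned long long).
     */
    unsigned long scale_to_index(unsigned long h32, unsigned long nslots)
    {
        return (unsigned long) (((unsigned long long) h32 * nslots) >> 32);
    }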

    Here are some statistics I gathered from a more comprehensive
    set of tests for several hash functions. The tests look at
    aggregates of measured values for all output bits (considered
    eight bits at a time) as a function of all input bits (considered
    sixteen bits at a time). The first line in each pair summarizes the
    low values in each set, and the second line summarizes the
    high values in each set, in each case showing the range, average,
    and standard deviation. The first set, hash 0, uses the AES
    code you posted. I think it's easy to see the quality get
    worse going from top to bottom.

    hash 0
    186 .. 201 avg: 195.39 dev: 3.71
    312 .. 339 avg: 322.31 dev: 5.85

    hash 1
    186 .. 203 avg: 195.48 dev: 4.01
    310 .. 344 avg: 322.45 dev: 6.31

    hash 2
    160 .. 202 avg: 193.56 dev: 7.70
    311 .. 348 avg: 322.22 dev: 7.91

    hash 3
    143 .. 202 avg: 185.52 dev: 10.15
    311 .. 387 avg: 329.52 dev: 14.10

    hash 4
    0 .. 212 avg: 9.94 dev: 45.16
    270 .. 65536 avg: 44546.56 dev: 28809.47
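
    The test harness itself is not shown, but the general flavor of such
    a measurement can be sketched: flip input bits one at a time and
    tally how often each output bit changes, then summarize the tallies.
    (This is a generic avalanche-style check with arbitrary key
    generation and sample count, not a reconstruction of the program
    that produced the numbers above; hash32 stands for whatever 32-bit
    hash is under test.)

    #include <stdio.h>
    #include <stdlib.h>

    extern unsigned long hash32(unsigned long key);  /* hash under test */

    #define TRIALS 10000

    /* For each input bit, flip it in a batch of keys and count how
     * often each output bit changes.  Ideally every tally is close to
     * TRIALS/2; a wide spread means some output bits barely depend on
     * some input bits.
     */
    int main(void)
    {
        static unsigned long flips[32][32];
        unsigned long lo = TRIALS, hi = 0;
        unsigned in, out, t;

        for (t = 0; t < TRIALS; t++) {
            unsigned long key = ((unsigned long) rand() << 16)
                                ^ (unsigned long) rand();
            unsigned long h = hash32(key);

            for (in = 0; in < 32; in++) {
                unsigned long d = h ^ hash32(key ^ (1UL << in));

                for (out = 0; out < 32; out++)
                    if (d & (1UL << out))
                        flips[in][out]++;
            }
        }
        for (in = 0; in < 32; in++)
            for (out = 0; out < 32; out++) {
                if (flips[in][out] < lo) lo = flips[in][out];
                if (flips[in][out] > hi) hi = flips[in][out];
            }
        printf("per-bit flip counts: %lu .. %lu (ideal is near %d)\n",
               lo, hi, TRIALS / 2);
        return 0;
    }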

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Mon Jun 17 12:39:26 2024
    On Mon, 17 Jun 2024 00:56:40 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:


    I don't know why you say that. C was an ANSI standard before it
    was an ISO standard. Or is it that you think that the language
    Malcolm is intent on using does not conform to C90/C89/ANSI C?



    All I wanted to point out with this comment is that ANSI recognizes
    ISO/IEC 9899:2018 as its current C Standard and will probably
    recognize the next ISO C Standard pretty soon. For that reason I
    think that names like C89 or C90 are preferable (to ANSI C) when we
    want to refer to this particular variant of the language.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Tue Jun 18 10:47:28 2024
    Michael S <already5chosen@yahoo.com> writes:

    On Mon, 17 Jun 2024 00:56:40 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    I don't know why you say that. C was an ANSI standard before it
    was an ISO standard. Or is it that you think that the language
    Malcolm is intent on using does not conform to C90/C89/ANSI C?

    All I wanted to point out with this comment is that ANSI recognizes
    ISO/IEC 9899:2018 as its current C Standard and will probably
    recognize the next ISO C Standard pretty soon. For that reason I
    think that names like C89 or C90 are preferable (to ANSI C) when we
    want to refer to this particular variant of the language.

    I see. So it isn't that you think "ANSI C" is wrong, just
    that it might be misleading or that C89 or C90 is preferable.
    Personally I would be surprised if someone used "ANSI C" to
    mean anything other than C89/C90, but certainly other people
    could have a different reaction.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Keith Thompson on Tue Jun 18 16:17:51 2024
    Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    Michael S <already5chosen@yahoo.com> writes:

    On Mon, 17 Jun 2024 00:56:40 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    I don't know why you say that. C was an ANSI standard before it
    was an ISO standard. Or is it that you think that the language
    Malcolm is intent on using does not conform to C90/C89/ANSI C?

    All I wanted to point out with this comment is that ANSI recognizes
    ISO/IEC 9899:2018 as its current C Standard and will probably
    recognize the next ISO C Standard pretty soon. For that reason I
    think that names like C89 or C90 are preferable (to ANSI C) when we
    want to refer to this particular variant of the language.

    I see. So it isn't that you think "ANSI C" is wrong, just
    that it might be misleading or that C89 or C90 is preferable.
    Personally I would be surprised if someone used "ANSI C" to
    mean anything other than C89/C90, but certainly other people
    could have a different reaction.

    [...] I don't necessarily complain when someone uses the phrase
    "ANSI C" to mean C89/C90, but I try to avoid it myself in favor
    of "C89" or "C90".

    I'm reminded that gcc accepts the option -ansi as a synonym for
    the option -std=c90.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From James Kuyper@21:1/5 to Tim Rentsch on Tue Jun 18 19:23:37 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    ...
    I see. So it isn't that you think "ANSI C" is wrong, just
    that it might be misleading or that C89 or C90 is preferable.

    ANSI's documentation is quite clear about the fact that there is, at any
    time, only one ANSI C standard, which is the version most recently
    approved - the older versions cease to be ANSI standards as soon as
    newer ones are approved. C17 is the current ANSI standard for C.
    Therefore, using "ANSI C" to mean specifically C89 is inaccurate, unless
    the wording makes it clear that it's referring to a time period when
    that was the current ANSI C standard.

    Personally I would be surprised if someone used "ANSI C" to
    mean anything other than C89/C90,

    I would expect the term to be used almost exclusively by people who
    incorrectly think that it means C89. Since it became an ISO standard
    with C90, few people who care about the latest version of the C standard
    worry about the parallel ANSI standard. Most people never got into the
    habit of using "ISO C" to mean specifically C90, so they didn't need to
    break that habit when it was superseded by C99.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to James Kuyper on Sun Jun 23 11:23:08 2024
    James Kuyper <jameskuyper@alumni.caltech.edu> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    ...

    I see. So it isn't that you think "ANSI C" is wrong, just
    that it might be misleading or that C89 or C90 is preferable.

    ANSI's documentation is quite clear about the fact that there is, at any time, only one ANSI C standard, [...]

    I'm not talking about ANSI's documentation. I'm talking about
    how people use the term ANSI C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)