• Re: Good hash for pointers

    From Richard Harnden@21:1/5 to Malcolm McLean on Thu May 23 15:37:44 2024
    On 23/05/2024 12:11, Malcolm McLean wrote:

    What is a good hash function for pointers to use in portable ANSI C?

    All your pointers from malloc are going to be unique.
    All of their low bits are going to be zero, because they need to align
    on some n-byte boundary.
    All of their high bits are going to be the same, because that's just
    how it works.

    So just take the middle: hash = ((intptr_t) ptr) >> 4 & 0xffff;

    Good enough?
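    A minimal sketch of that idea as a standalone function. It assumes
    <stdint.h> is available for uintptr_t (strict C89 does not provide it,
    as discussed later in the thread); the shift and mask are the values
    suggested above, and the function name is only illustrative:

    #include <stdint.h>

    /* Keep 16 bits from the middle of the pointer: drop the low
       alignment bits and ignore the high bits, which rarely vary. */
    static unsigned int hash_ptr_middle(const void *p)
    {
        return (unsigned int)(((uintptr_t)p >> 4) & 0xffffu);
    }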

  • From Kaz Kylheku@21:1/5 to Malcolm McLean on Thu May 23 20:34:13 2024
    On 2024-05-23, Malcolm McLean <malcolm.arthur.mclean@gmail.com> wrote:

    What is a good hash function for pointers to use in portable ANSI C?

    I don't think that it's practical to try to do this in some portable
    way, let alone to an old dialect of C from 1989.

    Let's say we have a type uintptr_t to which we can convert a void *
    pointer.

    In the TXR Lisp run time, here is what I did (paraphrased):

    switch (CHAR_BIT * sizeof (void *)) {
    case 32:
        return ((uintptr_t) ptr) >> 4;
    case 64:
        return ((uintptr_t) ptr) >> 5;
    }

    The idea is that on 32 bits, we suspect that pointers from malloc may be
    aligned on 16-byte boundaries, so the least significant four bits don't
    tell us anything useful; they are likely all zero, which is bad. On 64
    bits, we throw away five bits.

    This is actually an important hashing function because it's used
    for hashing objects that use eq equality even when compared using
    equal.

    For instance, all structs for which an equality substitute is not
    defined.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

  • From Tim Rentsch@21:1/5 to Malcolm McLean on Thu May 23 15:49:42 2024
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    What is a good hash function for pointers to use in portable
    ANSI C?

    I have a preliminary question. Do you really mean ANSI C, or
    is C99 acceptable?

  • From Tim Rentsch@21:1/5 to Malcolm McLean on Thu May 23 16:52:03 2024
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 23/05/2024 23:49, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    What is a good hash function for pointers to use in portable
    ANSI C?

    I have a preliminary question. Do you really mean ANSI C, or
    is C99 acceptable?

    C89 is better.
    But the pass has been sold.

    I'm not asking which you think is better. I'm asking about
    what your requirements are.

  • From Tim Rentsch@21:1/5 to Malcolm McLean on Thu May 23 18:39:56 2024
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 24/05/2024 00:52, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 23/05/2024 23:49, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    What is a good hash function for pointers to use in portable
    ANSI C?

    I have a preliminary question. Do you really mean ANSI C, or
    is C99 acceptable?

    C89 is better.
    But the pass has been sold.

    I'm not asking which you think is better. I'm asking about
    what your requirements are.

    C 89.
    I don't want to pull in C99 types and so on just for a hash function.

    In that case I think you are stuck with using a half-baked
    solution. The standard integer types available in C89 just
    aren't a good fit in a 64-bit world.

  • From jak@21:1/5 to All on Fri May 24 04:09:41 2024
    Chris M. Thomasson wrote:
    On 5/23/2024 5:32 PM, Chris M. Thomasson wrote:
    On 5/23/2024 4:11 AM, Malcolm McLean wrote:

    What is a good hash function for pointers to use in portable ANSI C?

    The pointers are nodes of a tree, which are read only, and I want to
    associate read/write data with them. So potentially a large number of
    pointers, and they might be consecutively ordered if they are taken
    from an array, or they might be returned from repeated calls to
    malloc() with small allocations. Obviously I have no control over
    pointer size or internal representation.


    I kind of remember an old one for 32-bit pointers. It was something
    like multiply by a prime and shift to the right by 7 or 8. Shit, I
    think. It was way back on comp.programming.threads.

    There was an old paper from microsoft that had a pointer hash. It was
    used to index into a table of locks. I need to find it.

    <https://wiki.osdev.org/Global_Descriptor_Table>

  • From David Brown@21:1/5 to Malcolm McLean on Fri May 24 09:41:54 2024
    On 24/05/2024 02:28, Malcolm McLean wrote:
    On 24/05/2024 00:52, Tim Rentsch wrote:
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 23/05/2024 23:49, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    What is a good hash function for pointers to use in portable
    ANSI C?

    I have a preliminary question.  Do you really mean ANSI C, or
    is C99 acceptable?

    C89 is better.
    But the pass has been sold.

    I'm not asking which you think is better.  I'm asking about
    what your requirements are.

    C 89.
    I don't want to pull in C99 types and so on just for a hash function.


    Tim is right that the <stdint.h> types are /vastly/ better for this sort
    of thing. And you won't find any C89/C90 compilers that don't support <stdint.h> even in C89/C90 mode. You might choose to avoid other C99
    features, in case you have any users that insist on C89/C90, but you
    can't get anywhere with just the standard C89/C90 types. On a platform
    such as 64-bit Windows, pointers are 64-bit but the biggest C89/C90
    integer type is "unsigned long", which is 32-bit on that platform. You
    can't get at the upper bits of the pointer without compiler-specific
    extensions - and if you allow extensions, you might as well allow
    <stdint.h> and make life far simpler for everyone.

    If you insist on using C89/C90, make sure that the code is compatible
    (with the same semantics) for more modern standards. There are few
    fanatics left who think C89/C90 is "better" than C99 or newer standards,
    so the majority of potential users will want C99, C11, or - perhaps most
    likely - a compiler-specific variation of one of these. It would be a
    shame if your code relied on pre-C99 integer promotion rules, or used
    implicit int, or had identifiers that conflict with newer standards.
    These would undoubtedly be far worse for users than any purely imagined problems caused by #include <stdint.h>.

    (As a pedantic quibble, I believe ANSI invariably re-publish ISO C
    standards as updated ANSI C standards. So "ANSI C" currently refers to
    C17. It is best, IMHO, to explicitly say C89 and/or C90 rather than
    "ANSI C".)

  • From bart@21:1/5 to Tim Rentsch on Fri May 24 11:14:36 2024
    On 24/05/2024 02:39, Tim Rentsch wrote:
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 24/05/2024 00:52, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 23/05/2024 23:49, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    What is a good hash function for pointers to use in portable
    ANSI C?

    I have a preliminary question. Do you really mean ANSI C, or
    is C99 acceptable?

    C89 is better.
    But the pass has been sold.

    I'm not asking which you think is better. I'm asking about
    what your requirements are.

    C 89.
    I don't want to pull in C99 types and so on just for a hash function.

    In that case I think you are stuck with using a half-baked
    solution. The standard integer types available in C89 just
    aren't a good fit in a 64-bit world.

    I assume the C89 implementation is one that can target current 64 bit
    machines.

    Then char, short, int, long long will almost certainly have widths of 8,
    16, 32 and 64 bits respectively.

    (I don't know if 'long long' was part of C89, but it sounds like Malcolm
    just doesn't want to be bothered with stdint.h, and any compiler used is
    likely to support it.

    I can't stand it either. Just today I wrote these two lines:

    typedef unsigned long long u64;
    typedef unsigned char byte;

    to avoid stdint.h and its ugly set of types.)

  • From Tim Rentsch@21:1/5 to bart on Fri May 24 06:18:15 2024
    bart <bc@freeuk.com> writes:

    On 24/05/2024 02:39, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 24/05/2024 00:52, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 23/05/2024 23:49, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    What is a good hash function for pointers to use in portable
    ANSI C?

    I have a preliminary question. Do you really mean ANSI C, or
    is C99 acceptable?

    C89 is better.
    But the pass has been sold.

    I'm not asking which you think is better. I'm asking about
    what your requirements are.

    C 89.
    I don't want to pull in C99 types and so on just for a hash
    function.

    In that case I think you are stuck with using a half-baked
    solution. The standard integer types available in C89 just
    aren't a good fit in a 64-bit world.

    I assume the C89 implementation is one that can target current 64
    bit machines.

    Then char, short, int, long long will almost certainly have widths
    of 8, 16, 32 and 64 bits respectively.

    C89 doesn't have long long.

    (I don't know if 'long long' was part of C89, but it sounds like
    Malcolm just doesn't want to be bothered with stdint.h, and any
    compiler used is likely to support it.

    What he said was C89. He didn't mention stdint.h. I take
    him at his word. If what he wants is something different,
    he should say clearly what it is, and not make people guess
    about it. (To be clear this recommendation is intended for
    every questioner, not just Malcolm.)

  • From Scott Lurndal@21:1/5 to Malcolm McLean on Fri May 24 14:51:47 2024
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
    On 24/05/2024 14:18, Tim Rentsch wrote:
    bart <bc@freeuk.com> writes:


    cJSON.c is C89. And these additions to the resource compiler were
    inspired by a mention of cJSON.c here.

    So a C89 hash for a pointer to an unsigned int would be ideal. However
    it might be impossible to write one which is both efficient in terms of
    machine instructions and a good hash function in that it distributes the
    hashes evenly given an uneven distribution of keys. And pointers
    returned from repeated calls to malloc() are going to have an odd
    distribution of values.

    Just stick the bool in your DOM[*] nodes. Forget about const.


    [*] Document Object Model (i.e. tree structure).
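    A sketch of that suggestion with a hypothetical node layout (the field
    and type names are illustrative, not taken from the actual cJSON or
    Baby X code): carry the per-node flag inside the node itself rather
    than keying an external hash table on the node's address.

    struct xml_node {
        struct xml_node *parent;
        struct xml_node *children;
        struct xml_node *next;
        char *tag;
        int seen;             /* the read/write data lives right here */
    };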

  • From David Brown@21:1/5 to bart on Fri May 24 17:00:16 2024
    On 24/05/2024 12:14, bart wrote:
    On 24/05/2024 02:39, Tim Rentsch wrote:
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 24/05/2024 00:52, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 23/05/2024 23:49, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    What is a good hash function for pointers to use in portable
    ANSI C?

    I have a preliminary question.  Do you really mean ANSI C, or
    is C99 acceptable?

    C89 is better.
    But the pass has been sold.

    I'm not asking which you think is better.  I'm asking about
    what your requirements are.

    C 89.
    I don't want to pull in C99 types and so on just for a hash function.

    In that case I think you are stuck with using a half-baked
    solution.  The standard integer types available in C89 just
    aren't a good fit in a 64-bit world.

    I assume the C89 implementation is one that can target current 64 bit machines.


    Targeting 64-bit machines does not imply support for 64-bit types using
    standard C89 integer types. You'll probably find that most, if not all,
    C89 compilers that target 64-bit processors but have 32-bit "long" will
    have extensions for a 64-bit type. (It won't be an "extended integer
    type", as defined by the C standards, but it will be good enough for the
    job.) But if you use such extensions then you are not using standard C89.

    Then char, short, int, long long will almost certainly have widths of 8,
    16, 32 and 64 bits respectively.

    (I don't know if 'long long' was part of C89, but it sounds like Malcolm
    just doesn't want to be bothered with stdint.h, and any compiler used is
    likely to support it.


    C89 does not have "long long". And "long" is 32-bit on 64-bit windows.
    There is no way in C89 to pick an integer type that is at least or
    exactly 64 bits. There is also no way to refer an integer type that is
    big enough to support an integer conversion from a pointer. (In C99, in <stdint.h>, you have "uintptr_t" and "intptr_t" which can be used if and
    only if the platform supports an integer type big enough for the integer conversion of pointers.)

  • From Michael S@21:1/5 to Malcolm McLean on Fri May 24 19:27:48 2024
    On Fri, 24 May 2024 17:10:45 +0100
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> wrote:

    So if we use uintptr_t the code might break.
    And if we use unsigned long we have to assume maximum 32 bits.
    However it's unlikely anyone will have an xml file with more than a
    billion nodes.


    It's not about the number of nodes. It's that good hash functions tend to use
    64-bit multiplication.
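    A sketch of the kind of 64-bit multiplicative hash being alluded to. It
    assumes an unsigned 64-bit type is available (unsigned long long here,
    so C99 or a common C89 extension); the constant is the usual
    golden-ratio multiplier and the function name is illustrative:

    #include <stddef.h>

    /* Multiply the pointer bits by a 64-bit odd constant (2^64 divided
       by the golden ratio) and keep the top k bits, which are the best
       mixed.  Assumes 1 <= k <= 63. */
    static unsigned long long hash_ptr_mult(void *p, unsigned k)
    {
        unsigned long long x = (unsigned long long)(size_t)p;

        x *= 0x9E3779B97F4A7C15ULL;
        return x >> (64 - k);
    }

    For a 64K-entry table, k would be 16.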

  • From bart@21:1/5 to Malcolm McLean on Sat May 25 00:54:34 2024
    On 24/05/2024 19:57, Malcolm McLean wrote:
    On 24/05/2024 19:28, Bonita Montero wrote:
    On 23.05.2024 at 13:11, Malcolm McLean wrote:

    What is a good hash function for pointers to use in portable ANSI C?

    The pointers are nodes of a tree, which are read only, and I want to
    associate read/write data with them. So potentially a large number of
    pointers, and they might be consecutively ordered if they are taken
    from an array, or they might be returned from repeated calls to
    malloc() with small allocations. Obviously I have no control over
    pointer size or internal representation.


    Use FNV.


    Here's an attempt.

    /* FNV hash of a pointer */
    static unsigned int hash(void *address)
    {
        int i;
        unsigned long answer = 2166136261;
        unsigned char *byte = (unsigned char *) &address;

        for (i = 0; i < sizeof(void *); i++)
        {
            answer *= 16777619;
            answer ^= byte[i];
        }
        return (unsigned int) (answer & 0xFFFFFFFF);
    }

    Now what will compilers make of that?

    Compiler, or performance?

    I tried this function with the test program shown below. I used it to
    populate a hash table of 64K entries with pointers from successive calls
    to malloc.

    Results, in terms of clashes, for different numbers N of entries
    populated out of 64K were:

    10000 1100
    30000 12000
    50000 67000
    60000 216000
    65535 5500000 (largest N possible)

    Results were rather variable as malloc produces different patterns of
    pointers on different runs. These were simply the results from the
    first run.

    Was this good? I've no idea. But as a comparison, I used my own hash
    function, normally used to hash identifiers, shown below the main
    program as the function 'bchash'.

    If this is substituted instead, the results were:

    10000 230
    30000 3800
    50000 10300
    60000 50300
    65535 2700000

    Hash tables need a certain amount of free capacity to stay efficient, so
    3/4 full (about 50K/64K) is about right.

    Again I don't know if these figures are good, they are just better than
    from your hash() function, for the inputs I used, within this test, and
    for this size of table.

    No doubt there are much better ones.



    ------------------------------------------
    #include <stdio.h>
    #include <stdlib.h>

    static unsigned int hash(void *address)
    {
        int i;
        unsigned long answer = 2166136261;
        unsigned char *byte = (unsigned char *) &address;

        for (i = 0; i < sizeof(void *); i++)
        {
            answer *= 16777619;
            answer ^= byte[i];
        }
        return (unsigned int) (answer & 0xFFFFFFFF);
    }

    void* table[65536];

    int main(void) {
        void* p;

        int clashes=0, wrapped;
        int j;

        for (int i=0; i<30000; ++i) {
            p = malloc(1);
            j = hash(p) & 65535;

            wrapped=0;
            while (table[j]) {
                ++clashes;
                ++j;
                if (j>65535) {
                    if (wrapped) { puts("Table full"); exit(1);}
                    wrapped=1;
                    j=0;
                }
            }
            table[j] = p;
        }
        printf("Clashes %d\n", clashes);
    }


    ------------------------------------------

    static unsigned int bchash(void *address)
    {
        int i;
        unsigned long hsum = 0;
        unsigned char *byte = (unsigned char *) &address;

        for (i = 0; i < sizeof(void *); i++) {
            hsum = (hsum<<4) - hsum + byte[i];
        }
        return (hsum<<5) - hsum;
    }

  • From Tim Rentsch@21:1/5 to bart on Sat May 25 02:12:28 2024
    bart <bc@freeuk.com> writes:

    On 24/05/2024 19:57, Malcolm McLean wrote:

    On 24/05/2024 19:28, Bonita Montero wrote:

    On 23.05.2024 at 13:11, Malcolm McLean wrote:

    What is a good hash function for pointers to use in portable ANSI C?

    The pointers are nodes of a tree, which are read only, and I want
    to associate read/write data with them. So potentially a large
    number of pointers, and they might be consecutively ordered if they
    are taken from an array, or they might be returned from repeated
    calls to malloc() with small allocations. Obviously I have no
    control over pointer size or internal representation.

    Use FNV.

    Here's an attempt.

    /* FNV hash of a pointer */
    static unsigned int hash(void *address)
    {
        int i;
        unsigned long answer = 2166136261;
        unsigned char *byte = (unsigned char *) &address;

        for (i = 0; i < sizeof(void *); i++)
        {
            answer *= 16777619;
            answer ^= byte[i];
        }
        return (unsigned int) (answer & 0xFFFFFFFF);
    }

    Now what will compilers make of that?

    Compiler, or performance?

    I tried this function with the test program shown below. I used it to populate a hash table of 64K entries with pointers from successive
    calls to malloc.

    Results, in terms of clashes, for different numbers N of entries
    populated out of 64K were:

    10000 1100
    30000 12000
    50000 67000
    60000 216000
    65535 5500000 (largest N possible)

    Results were rather variable as malloc produces different patterns of
    pointers on different runs. These were simply the results from the
    first run.

    Was this good? I've no idea. But as a comparison, I used my own hash
    function, normally used to hash identifiers, shown below the main
    program as the function 'bchash'.

    If this is substituted instead, the results were:

    10000 230
    30000 3800
    50000 10300
    60000 50300
    65535 2700000

    Hash tables need a certain amount of free capacity to stay efficient,
    so 3/4 full (about 50K/64K) is about right.

    Again I don't know if these figures are good, they are just better
    than from your hash() function, for the inputs I used, within this
    test, and for this size of table.

    No doubt there are much better ones.



    [quoted test program and bchash function snipped; see the full listings above]

    It looks like your hash function was tuned for this testing
    setup. With different choices for testing it does much
    worse.

    The testing is done with malloc() blocks all of the same
    requested size, and that size is 1. Typical workloads
    are likely to be both larger and more variable.

    When adding a new entry finds a collision with an old
    entry, linear probing is used to find a free slot. It's
    well understood that linear probing suffers badly as the
    load factor goes up. Better to take a few high bits of
    the hash value -- as few as five or six is fine -- to
    have the reprobe sequences have different strides.

    Your hash function is expensive to compute, moreso even
    than the "FNV" function shown earlier. In a case like
    this one where the compares are cheap, it's better to
    have a dumb-but-fast hash function that might need a
    few more looks to find an open slot, because the cost
    of looking is so cheap compared to computing the hash
    function.
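    A sketch of the reprobing idea described above, assuming a power-of-two
    table and a 32-bit hash value (the names and sizes are illustrative):
    the probe stride is derived from the top bits of the hash and forced
    odd, so it is coprime with the table size and visits every slot.

    #define TABLE_SIZE 65536u                /* must be a power of two */

    static unsigned int first_slot(unsigned int h)
    {
        return h & (TABLE_SIZE - 1);
    }

    static unsigned int probe_stride(unsigned int h)
    {
        return ((h >> 26) << 1) | 1u;        /* top 6 bits, forced odd */
    }

    On a collision the next slot to try is then
    (slot + probe_stride(h)) & (TABLE_SIZE - 1).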

  • From Tim Rentsch@21:1/5 to Malcolm McLean on Sat May 25 02:49:00 2024
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 24/05/2024 14:18, Tim Rentsch wrote:

    bart <bc@freeuk.com> writes:

    On 24/05/2024 02:39, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 24/05/2024 00:52, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 23/05/2024 23:49, Tim Rentsch wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    What is a good hash function for pointers to use in portable
    ANSI C?

    I have a preliminary question. Do you really mean ANSI C, or
    is C99 acceptable?

    C89 is better.
    But the pass has been sold.

    I'm not asking which you think is better. I'm asking about
    what your requirements are.

    C 89.
    I don't want to pull in C99 types and so on just for a hash
    function.

    In that case I think you are stuck with using a half-baked
    solution. The standard integer types available in C89 just
    aren't a good fit in a 64-bit world.

    I assume the C89 implementation is one that can target current 64
    bit machines.

    Then char, short, int, long long will almost certainly have widths
    of 8, 16, 32 and 64 bits respectively.

    C89 doesn't have long long.

    (I don't know if 'long long' was part of C89, but it sounds like
    Malcolm just doesn't want to be bothered with stdint.h, and any
    compiler used is likely to support it.

    What he said was C89. He didn't mention stdint.h. I take
    him at his word. If what he wants is something different,
    he should say clearly what it is, and not make people guess
    about it. (To be clear this recommendation is intended for
    every questioner, not just Malcolm.)

    cJSON.c is C89. And these additions to the resource compiler were
    inspired by a mention of cJSON.c here.

    So a C89 hash for a pointer to an unsigned int would be ideal. However
    it might be impossible to write one which is both efficient in terms
    of machine instructions and a good hash function in that it
    distributes the hashes evenly given an uneven distribution of
    keys. And pointers returned from repeated calls to malloc() are going
    to have an odd distribution of values.

    I suggest you look for a way to write and use a hash function that
    uses unsigned long long without that causing cJSON.c to need
    compiling by a post-C89 ruleset.
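    One hedged way to arrange that, sketched here: keep the wide type
    behind a typedef so a strict C89 build falls back to unsigned long
    while a C99 (or extended C89) build gets unsigned long long. The guard
    shown is only one possibility, and all the names are illustrative:

    #include <stddef.h>

    #if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
    typedef unsigned long long hashuint;     /* at least 64 bits */
    #else
    typedef unsigned long hashuint;          /* C89 fallback, >= 32 bits */
    #endif

    static unsigned int ptrhash(const void *p)
    {
        hashuint h = 0;
        const unsigned char *byte = (const unsigned char *) &p;
        size_t i;

        for (i = 0; i < sizeof p; i++)
            h = h * 257 + byte[i];           /* simple byte-by-byte mix */
        return (unsigned int) (h & 0xFFFFFFFFul);
    }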

  • From bart@21:1/5 to Tim Rentsch on Sat May 25 12:28:49 2024
    On 25/05/2024 10:12, Tim Rentsch wrote:
    bart <bc@freeuk.com> writes:


    It looks like your hash function was tuned for this testing
    setup. With different choices for testing it does much
    worse.

    It wasn't tuned at all; it was a copy of what I've long used within my lexers.

    It is designed to give a good spread when inputs are similar or change sequentially. (Well, not exactly designed; more trial and error.)

    The testing is done with malloc() blocks all of the same
    requested size, and that size is 1. Typical workloads
    are likely to be both larger and more variable.

    When adding a new entry finds a collision with an old
    entry, linear probing is used to find a free slot. It's
    well understood that linear probing suffers badly as the
    load factor goes up. Better to take a few high bits of
    the hash value -- as few as five or six is fine -- to
    have the reprobe sequences have different strides.

    Your hash function is expensive to compute, moreso even
    than the "FNV" function shown earlier.

    You mean Malcolm's version? That also uses a loop. Neither needs to
    process all 8 bytes of a 64-bit pointer; 6 will do, and probably just 4.
    And such a loop can be trivially unrolled.

    In a case like
    this one where the compares are cheap, it's better to
    have a dumb-but-fast hash function that might need a
    few more looks to find an open slot, because the cost
    of looking is so cheap compared to computing the hash
    function.

    My first thought in this thread was to just make use of the pointer
    value directly. My version was written only to compare against Malcolm's
    FNV code.

    If I try the other suggestions, RH's and KK's which simply shift the
    pointer value (similar to my idea), the results I got were interesting: sometimes there were zero clashes (almost as though bits 4 to 19 of the
    pointer were a perfect hash), but sometimes there were millions.

    But when I tried to mix up the input values, then all methods seemed to converge to similar results.

    In that case you might as well use the simplest and fastest method.

    However it really needs testing within the actual application.

  • From Tim Rentsch@21:1/5 to Bonita Montero on Sat May 25 10:40:22 2024
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 25.05.2024 at 11:12, Tim Rentsch wrote:

    Your hash function is expensive to compute, moreso even
    than the "FNV" function shown earlier. In a case like
    this one where the compares are cheap, it's better to
    have a dumb-but-fast hash function that might need a
    few more looks to find an open slot, because the cost
    of looking is so cheap compared to computing the hash
    function.

    A (size_t)pointer * LARGE_PRIME can be sufficient,
    ignoring the overflow.

    Plenty fast but the output quality is poor. I tested
    this scheme against four other hash functions, and in
    four out of five workloads it was always dead last, by
    a noticeable margin.

  • From Tim Rentsch@21:1/5 to bart on Sat May 25 11:12:43 2024
    bart <bc@freeuk.com> writes:

    On 25/05/2024 10:12, Tim Rentsch wrote:

    bart <bc@freeuk.com> writes:
    [...]
    It looks like your hash function was tuned for this testing
    setup. With different choices for testing it does much
    worse.

    It wasn't tuned at all; it was a copy of what I've long used
    within my lexers.

    It is designed to give a good spread when inputs are similar or
    change sequentially. (Well, not exactly designed; more trial
    and error.)

    The trial and error process you mention had the effect of tuning
    your function for the testing setup used, whether you meant it to
    or not.


    The testing is done with malloc() blocks all of the same
    requested size, and that size is 1. Typical workloads
    are likely to be both larger and more variable.

    When adding a new entry finds a collision with an old
    entry, linear probing is used to find a free slot. It's
    well understood that linear probing suffers badly as the
    load factor goes up. Better to take a few high bits of
    the hash value -- as few as five or six is fine -- to
    have the reprobe sequences have different strides.

    Your hash function is expensive to compute, moreso even
    than the "FNV" function shown earlier.

    You mean Malcolm's version? That also uses a loop. Neither
    needs to process all 8 bytes of a 64-bit pointer; 6 will do, and
    probably just 4.
    And such a loop can be trivially unrolled.

    Both your function and the function posted by Malcolm are slow.
    It's just that yours is slower. It isn't hard to produce a
    hash function of equal quality that runs more than twice as
    fast.

    In a case like
    this one where the compares are cheap, it's better to
    have a dumb-but-fast hash function that might need a
    few more looks to find an open slot, because the cost
    of looking is so cheap compared to computing the hash
    function.

    My first thought in this thread was to just make use of the
    pointer value directly. My version was written only to compare
    against Malcolm's FNV code.

    If I try the other suggestions, RH's and KK's which simply
    shift the pointer value (similar to my idea), the results I got
    were interesting: sometimes there were zero clashes (almost as
    though bits 4 to 19 of the pointer were a perfect hash), but
    sometimes there were millions.

    But when I tried to mix up the input values, then all methods
    seemed to converge to similar results.

    In that case you might as well use the simplest and fastest
    method.

    However it really needs testing within the actual application.

    Let me tell you a little secret. Hash functions are like random
    number generators: good ones never do exceptionally well or
    exceptionally poorly. The reason for this is simple - whenever
    there are scenarios where one does well then inevitably there are
    other scenarios where it does poorly. A really high quality hash
    function will never do badly, no matter what mix of inputs is
    given. The key takeaway here is to try as many different input
    sets as you can, but don't look for candidates that do well;
    rather, throw away any that do poorly on any input mix. The ones
    that are left are all, with high probability, just fine no matter
    what application they are used in. That is not to say there is
    no need to test within the intended application, it's always good
    to do a sanity check, but the point of the sanity check is just
    to make sure the earlier process didn't miss something; it isn't
    the right way to compare alternatives.

  • From Tim Rentsch@21:1/5 to Malcolm McLean on Sat May 25 11:23:55 2024
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 25/05/2024 18:40, Tim Rentsch wrote:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 25.05.2024 at 11:12, Tim Rentsch wrote:

    Your hash function is expensive to compute, moreso even
    than the "FNV" function shown earlier. In a case like
    this one where the compares are cheap, it's better to
    have a dumb-but-fast hash function that might need a
    few more looks to find an open slot, because the cost
    of looking is so cheap compared to computing the hash
    function.

    A (size_t)pointer * LARGE_PRIME can be sufficient,
    ignoring the overflow.

    Plenty fast but the output quality is poor. I tested
    this scheme against four other hash functions, and in
    four out of five workloads it was always dead last, by
    a noticeable margin.

    The lower bits of a pointer are often all zeroes. And multiplying
    by any integer will not set them. And that is catastrophic for a
    hash.

    It can be but it doesn't have to be. It depends on how the hash
    value is used to determine a place or places to look in the
    table.

  • From bart@21:1/5 to Tim Rentsch on Sat May 25 20:31:22 2024
    On 25/05/2024 19:12, Tim Rentsch wrote:
    bart <bc@freeuk.com> writes:

    On 25/05/2024 10:12, Tim Rentsch wrote:

    bart <bc@freeuk.com> writes:
    [...]
    It looks like your hash function was tuned for this testing
    setup. With different choices for testing it does much
    worse.

    It wasn't tuned at all; it was a copy of what I've long used
    within my lexers.

    It is designed to give a good spread when inputs are similar or
    change sequentially. (Well, not exactly designed; more trial
    and error.)

    The trial and error process you mention had the effect of tuning
    your function for the testing setup used, whether you meant it to
    or not.


    The testing is done with malloc() blocks all of the same
    requested size, and that size is 1. Typical workloads
    are likely to be both larger and more variable.

    When adding a new entry finds a collision with an old
    entry, linear probing is used to find a free slot. It's
    well understood that linear probing suffers badly as the
    load factor goes up. Better to take a few high bits of
    the hash value -- as few as five or six is fine -- to
    have the reprobe sequences have different strides.

    Your hash function is expensive to compute, moreso even
    than the "FNV" function shown earlier.

    You mean Malcolm's version? That also uses a loop. Neither
    needs to process all 8 bytes of a 64-bit pointer; 6 will do, and
    probably just 4.
    And such a loop can be trivially unrolled.

    Both your function and the function posted by Malcolm are slow.
    It's just that yours is slower.

    With unoptimised code, mine is up to 50% faster (suggesting an
    inherently faster method).

    Optimised via gcc -O3, it's 17% slower.

    If I duplicate the tests in my language, which has a meagre optimiser,
    then mine is 3 times the speed (I haven't looked at why; something about
    the MM code that keeps locals out of registers I guess):

                  Unopt    gcc-O3    My lang opt (uses 64 bits)

    MM hash        2.7      0.34        1.9    seconds
    BC hash        1.7      0.4         0.6

    In any case there are any number of tweaks that can be made; no need to do
    all 64 bits for a start, and the loops can be manually unrolled.
    Suggesting it is slower is premature.

    It isn't hard to produce a
    hash function of equal quality that runs more than twice as
    fast.

    With or without the aid of an optimising compiler? You can't always tell
    with the latter whether it's you who's clever, or the compiler.

    Let me tell you a little secret. Hash functions are like random
    number generators: good ones never do exceptionally well or
    exceptionally poorly. The reason for this is simple - whenever
    there are scenarios where one does well then inevitably there are
    other scenarios where it does poorly. [...] The ones
    that are left are all, with high probability, just fine no matter
    what application they are used in. That is not to say there is
    no need to test within the intended application, it's always good
    to do a sanity check, but the point of the sanity check is just
    to make sure the earlier process didn't miss something; it isn't
    the right way to compare alternatives.

    So what's a good one for use within the identifier lookup of a
    compiler's lexer?

  • From Janis Papanagnou@21:1/5 to Malcolm McLean on Sat May 25 23:13:31 2024
    On 25.05.2024 19:56, Malcolm McLean wrote:
    On 25/05/2024 18:40, Tim Rentsch wrote:
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 25.05.2024 at 11:12, Tim Rentsch wrote:

    Your hash function is expensive to compute, moreso even
    than the "FNV" function shown earlier. In a case like
    this one where the compares are cheap, it's better to
    have a dumb-but-fast hash function that might need a
    few more looks to find an open slot, because the cost
    of looking is so cheap compared to computing the hash
    function.

    A (size_t)pointer * LARGE_PRIME can be sufficient,
    ignoring the overflow.

    Plenty fast but the output quality is poor. I tested
    this scheme against four other hash functions, and in
    four out of five workloads it was always dead last, by
    a noticeable margin.


    The lower bits of a pointer are often all zeroes. And multiplying by
    any integer will not set them. And that is catastrophic for a hash.

    I haven't read all the thread, but if I were going to tackle that
    I'd first try to map the pointer onto a fitting integral type (if
    that's possible; sizeof()), or first "fold" (XOR) the upper/lower
    parts of the pointer (so that zero-sequences, if they appear, also
    won't affect the outcome too badly as a side effect) to reduce the
    size of the data, and finally use any established hash algorithm
    on the (folded/reduced) integral type.
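    A sketch of that folding step (the names are illustrative, and it
    assumes an unsigned 64-bit type is available): the upper and lower
    halves of the pointer value are XORed together before the result is
    fed to whatever established hash the table then uses.

    #include <stddef.h>

    /* Fold a pointer value down to 32 bits by XORing its halves, so
       runs of zero bits get mixed with the more variable bits. */
    static unsigned long fold_ptr(const void *p)
    {
        unsigned long long v = (unsigned long long)(size_t)p;

        return (unsigned long)((v ^ (v >> 32)) & 0xFFFFFFFFul);
    }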

    Shall the created hash codes also be portable across platforms?

    Janis

  • From bart@21:1/5 to Malcolm McLean on Sat May 25 23:42:13 2024
    On 25/05/2024 23:07, Malcolm McLean wrote:
    On 25/05/2024 22:13, Janis Papanagnou wrote:

    I haven't read all the thread, but if I'd be going to tackle that
    I'd first try to map the pointer onto a fitting integral type (if
    that's possible; sizeof()), or first "fold" (XOR) the upper/lower
    parts of the pointer (so that zero-sequences, if they appear, also
    won't affect the outcome too badly as a side effect) to reduce the
    size of the data, and finally use any established hash algorithm
    on the (folded/reduced) integral type.

    Shall the created hash codes also be portable across platforms?

    Yes, the Baby X resource compiler should be portable to any hosted
    platform with a reasonably up-to-date C compiler.

    But we also need portability in another sense. The hash table is hashing
    nodes of an XML document model created by the functions in the file
    xmlparser2.c. These make repeated calls to malloc() for small
    allocations of nodes. But I have no control over either pointer
    representation or the platform's implementation of malloc(). So the
    pointers could be at seemingly random locations in memory, or they could
    be evenly spaced at increments of 64 bytes, and thus differing in only a
    few bits.

    OK, parsing is a task I'm familiar with. Parsing a normal language into
    a syntax tree doesn't sound too different from parsing XML into a tree
    of nodes.

    Such a task can be done at up to several million lines per second for
    source code. Are you getting throughput considerably below that?

    It's not clear to me what the hashing is for on the nodes. What exactly
    is the input being looked up, and what is being compared? Because
    something sounds off: you don't normally look up a pointer to try and
    find it in a table; you already have the pointer pointing at the data!

    Then if someone wants to help out with this project, an
    obvious thing to do would be to speed up the xml parser by replacing the calls to malloc() with a custom allocator.

    If that's a bottleneck (you need to establish that; perhaps use a double
    call to malloc, and see if it slows down significantly), then the custom allocator is simple: just allocate from a pool. It's simpler still if
    you don't need to free that memory before the program ends
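    A minimal sketch of such a pool allocator, assuming the nodes are only
    ever released all at once when the program finishes (the names are
    illustrative, not taken from xmlparser2.c):

    #include <stdlib.h>

    #define POOL_CHUNK (64 * 1024)

    /* Bump allocator: carve small node allocations out of big chunks
       obtained from malloc.  Nothing is freed individually; leftover
       space in an old chunk is simply abandoned. */
    static char *pool_ptr;
    static size_t pool_left;

    static void *pool_alloc(size_t size)
    {
        void *p;

        size = (size + 15) & ~(size_t)15;    /* keep 16-byte alignment */
        if (size > pool_left) {
            size_t chunk = size > POOL_CHUNK ? size : POOL_CHUNK;
            pool_ptr = malloc(chunk);
            if (!pool_ptr)
                return NULL;
            pool_left = chunk;
        }
        p = pool_ptr;
        pool_ptr += size;
        pool_left -= size;
        return p;
    }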

  • From Tim Rentsch@21:1/5 to bart on Sat May 25 22:54:59 2024
    bart <bc@freeuk.com> writes:

    On 25/05/2024 19:12, Tim Rentsch wrote:

    bart <bc@freeuk.com> writes:

    On 25/05/2024 10:12, Tim Rentsch wrote:

    bart <bc@freeuk.com> writes:
    [...]
    It looks like your hash function was tuned for this testing
    setup. With different choices for testing it does much
    worse.

    It wasn't tuned at all; it was a copy of what I've long used
    within my lexers.

    It is designed to give a good spread when inputs are similar or
    change sequentially. (Well, not exactly designed; more trial
    and error.)

    The trial and error process you mention had the effect of tuning
    your function for the testing setup used, whether you meant it to
    or not.


    The testing is done with malloc() blocks all of the same
    requested size, and that size is 1. Typical workloads
    are likely to be both larger and more variable.

    When adding a new entry finds a collision with an old
    entry, linear probing is used to find a free slot. It's
    well understood that linear probing suffers badly as the
    load factor goes up. Better to take a few high bits of
    the hash value -- as few as five or six is fine -- to
    have the reprobe sequences have different strides.

    Your hash function is expensive to compute, moreso even
    than the "FNV" function shown earlier.

    You mean Malcolm's version? That also uses a loop. Neither
    needs to process all 8 bytes of a 64-bit pointer; 6 will do, and
    probably just 4.
    And such a loop can be trivially unrolled.

    Both your function and the function posted by Malcolm are slow.
    It's just that yours is slower.

    With unoptimised code, mine is up to 50% faster (suggesting an
    inherently faster method).

    Optimised via gcc -O3, it's 17% slower.

    On my test platform the Malcolm code runs between dead even up
    to about 44% faster, depending on optimization level, except
    for -O0 where it runs about 12% slower. Note that these results
    represent only one measurement of each case, run on only one
    target environment.

    If I duplicate the tests in my language, which has a meagre optimiser,
    then mine is 3 times the speed (I haven't looked at why; something
    about the MM code that keeps locals out of registers I guess):

                  Unopt    gcc-O3    My lang opt (uses 64 bits)

    MM hash        2.7      0.34        1.9    seconds
    BC hash        1.7      0.4         0.6

    There are anyway any number of tweaks that can be made; no need to do
    all 64 bits for a start, and the loops can be manually
    unrolled. Suggesting it is slower is premature.

    I am simply reporting results of empirical observations made on
    my test system. Obviously different environments might produce
    different results.

    It isn't hard to produce a
    hash function of equal quality that runs more than twice as
    fast.

    With or without the aid of an optimising compiler? You can't always
    tell with the latter whether it's you who's clever, or the compiler.

    As long as both functions are run at the same level of optimization
    I don't think it matters much. For the sake of concreteness we can
    stipulate code quality comparable to gcc -O1 if that helps you.

    Let me tell you a little secret. Hash functions are like random
    number generators: good ones never do exceptionally well or
    exceptionally poorly. The reason for this is simple - whenever
    there are scenarios where one does well then inevitably there are
    other scenarios where it does poorly. [...] The ones
    that are left are all, with high probability, just fine no matter
    what application they are used in. That is not to say there is
    no need to test within the intended application, it's always good
    to do a sanity check, but the point of the sanity check is just
    to make sure the earlier process didn't miss something; it isn't
    the right way to compare alternatives.

    So what's a good one for use within the identifier lookup of a
    compiler's lexer?

    That depends a lot on what sort of lookup is done and on the
    degree to which the hash computation is integrated into the
    lexing code. How many entries will the hash table have?
    That also makes a difference. If the hash table is never
    more than 50% full then we can expect most lookups will
    succeed on the first try and almost all will succeed in
    no more than two tries.
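    A sketch of the kind of arrangement being described, assuming open
    addressing with the table kept at no more than half full (all names
    here are illustrative):

    #include <string.h>

    #define IDENT_SLOTS 4096u                /* power of two */

    struct ident {
        char *name;
        /* per-identifier data would go here */
    };

    static struct ident *ident_table[IDENT_SLOTS];

    /* Look up (without inserting) an identifier whose hash is h.  With
       the table at most half full, most lookups hit on the first or
       second slot probed. */
    static struct ident *lookup_ident(const char *name, unsigned int h)
    {
        unsigned int j = h & (IDENT_SLOTS - 1);

        while (ident_table[j]) {
            if (strcmp(ident_table[j]->name, name) == 0)
                return ident_table[j];
            j = (j + 1) & (IDENT_SLOTS - 1);
        }
        return NULL;                         /* not present */
    }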

  • From Tim Rentsch@21:1/5 to Bonita Montero on Sun May 26 09:24:58 2024
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 25.05.2024 at 19:40, Tim Rentsch wrote:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 25.05.2024 at 11:12, Tim Rentsch wrote:

    Your hash function is expensive to compute, moreso even
    than the "FNV" function shown earlier. In a case like
    this one where the compares are cheap, it's better to
    have a dumb-but-fast hash function that might need a
    few more looks to find an open slot, because the cost
    of looking is so cheap compared to computing the hash
    function.

    A (size_t)pointer * LARGE_PRIME can be sufficient,
    ignoring the overflow.

    Plenty fast but the output quality is poor. ...

    If the prime is large enough there's no regular distribution.

    Your conjecture is contradicted by empirical facts.

  • From Tim Rentsch@21:1/5 to Bonita Montero on Sun May 26 10:20:55 2024
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 26.05.2024 at 18:24, Tim Rentsch wrote:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 25.05.2024 at 19:40, Tim Rentsch wrote:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 25.05.2024 at 11:12, Tim Rentsch wrote:

    Your hash function is expensive to compute, moreso even
    than the "FNV" function shown earlier. In a case like
    this one where the compares are cheap, it's better to
    have a dumb-but-fast hash function that might need a
    few more looks to find an open slot, because the cost
    of looking is so cheap compared to computing the hash
    function.

    A (size_t)pointer * LARGE_PRIME can be sufficient,
    ignoring the overflow.

    Plenty fast but the output quality is poor. ...

    If the prime is large enough there's no regular distribution.

    Your conjecture is contradicted by empirical facts.

    There are no empirical facts for that since two times the
    taken prime is beyond the 64 bit address space.

    You don't listen very well do you?

    I say the output quality is poor because I have run tests that
    show the poor output quality. I've done that with a prime of my
    own choosing and also with 18446744073709551557, the value you
    suggested. In both cases the test runs show results that are
    clearly worse than all the other hash functions tested, including
    bart's and malcolm's. Furthermore the results for your suggested
    calculation are worse across the board, on every variety of
    dynamic workload in my test suite.

    Your proposed hash function is too weak to be taken as a serious
    candidate.

  • From Richard Harnden@21:1/5 to bart on Sun May 26 19:58:31 2024
    On 25/05/2024 23:42, bart wrote:

    It's not clear to me what the hashing is for on the nodes. What exactly
    is the input being looked up, and what is being compared? Because
    something sounds off: you don't normally look up a pointer to try and
    find it in a table; you already have the pointer pointing at the data!

    This.

    Are you really hashing the pointer?

  • From Kaz Kylheku@21:1/5 to bart on Sun May 26 22:42:08 2024
    On 2024-05-25, bart <bc@freeuk.com> wrote:
    It's not clear to me what the hashing is for on the nodes. What exactly
    is the input being looked up, and what is being compared? Because
    something sounds off: you don't normally look up a pointer to try and
    find it in a table; you already have the pointer pointing at the data!

    This happens when you need to associate objects with external
    information which is not in those objects.

    One such situation is when you replace use of strings as hash keys with interned strings (atoms). Atoms are typically pointers and so hash that way.

    This is a smart thing to do because atoms are faster to work with.
    Comparing whether two values are the same atom is just a pointer
    comparison, and hashing the atom is just a pointer hash: much faster
    than a string hash, and with more easily assured good behavior.

    The interning mechanism maintains its own hash table which maps the
    strings to the atoms. By that mechanism, two identical strings are
    mapped to the same atom. That table is only used at the boundaries of
    the system when external data is read. The data contains identifiers
    which are interned, and then they are manipulated as atoms.

    Lisp languages have a symbol type, which is a kind of interned
    string. Symbols are regularly used as keys into dictionary structures
    like hash tables or assocation lists.

    Certain important properties are put into the symbol itself, so
    that no hashing is required to find them.

    In classical Lisps, the global value of a variable named by a symbol
    is stored in the symbol itself. The property list of a symbol is also
    stored in the symbol.
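    A sketch of the external-association idea from the start of this post,
    keyed directly on an object's address (an atom, or an XML node as in
    the rest of the thread); the names are illustrative, and the pointer
    hash h would come from one of the schemes discussed earlier:

    #define MAP_SLOTS 1024u                  /* power of two */

    /* Associate read/write data with read-only objects without touching
       the objects themselves.  Assumes the table never fills completely. */
    struct assoc {
        const void *key;                     /* the object's address */
        int value;                           /* the external data */
    };

    static struct assoc map[MAP_SLOTS];

    /* Returns the entry for key, or the empty slot where it belongs. */
    static struct assoc *map_find(const void *key, unsigned int h)
    {
        unsigned int j = h & (MAP_SLOTS - 1);

        while (map[j].key && map[j].key != key)
            j = (j + 1) & (MAP_SLOTS - 1);
        return &map[j];
    }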

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

  • From Ben Bacarisse@21:1/5 to Bonita Montero on Tue May 28 11:07:53 2024
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    This shows the perfect equal distribution if I take a 32 bit integer
    and multiply it by the largest 32 bit prime:

    There was no need to write this program. It just illustrates a specific example of a basic theorem where the multiplier and modulus are
    co-prime.

    #include <iostream>
    #include <vector>
    #include <random>

    using namespace std;

    int main()
    {
        vector<uint8_t> buckets( 1ull << 32 );
        for( size_t b = buckets.size(); b; )
            if( !++buckets[(uint32_t)--b * 0xFFFFFFFBu % buckets.size()] )
                return EXIT_FAILURE;
        vector<size_t> loads;
        for( uint8_t ld : buckets )
        {
            if( loads.size() <= ld )
                loads.resize( ld + 1 );
            ++loads[ld];
        }
        for( size_t l = 0; size_t ld : loads )
            cout << l++ << ": " << 100.0 * (ptrdiff_t)ld /
                    (ptrdiff_t)buckets.size() << "%" << endl;
    }


    Result:

    0: 0%
    1: 100%


    --
    Ben.

  • From Tim Rentsch@21:1/5 to Bonita Montero on Thu May 30 19:26:55 2024
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 28.05.2024 at 12:07, Ben Bacarisse wrote:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    This shows the perfect equal distribution if I take a 32 bit integer
    and multiply it by the largest 32 bit prime:

    There was no need to write this program. It just illustrates a specific
    example of a basic theorem where the multiplier and modulus are
    co-prime.

    For Tim it was necessary.

    No it was not. I already understood this before my posting
    of Sunday, May 26.

  • From Tim Rentsch@21:1/5 to Bonita Montero on Thu May 30 19:27:48 2024
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 26.05.2024 at 19:20, Tim Rentsch wrote:

    I say the output quality is poor because I have run tests that
    show the poor output quality.

    If you choose a prime whose double is beyond 64 bit there's an
    equal distribution among the 64 bit modulos.

    I've done that with a prime of my own choosing and also with
    18446744073709551557, the value you suggested.

    Show me your code.

    Oh get real. It's not my job to make up for your
    laziness.

  • From Michael S@21:1/5 to Tim Rentsch on Sun Jun 2 10:45:06 2024
    On Thu, 30 May 2024 19:27:48 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 26.05.2024 at 19:20, Tim Rentsch wrote:

    I say the output quality is poor because I have run tests that
    show the poor output quality.

    If you choose a prime whose double is beyond 64 bit there's an
    equal distribution among the 64 bit modulos.

    I've done that with a prime of my own choosing and also with
    18446744073709551557, the value you suggested.

    Show me your code.

    Oh get real. It's not my job to make up for your
    laziness.

    So, what were your conclusions?
    Ignoring the speed of computation, would something like a cryptographic
    hash scaled to bucket size be the best hash for this type of application?
    Or is some sort of less thorough grinding of the bits better?

  • From Tim Rentsch@21:1/5 to Michael S on Sun Jun 2 16:02:05 2024
    Michael S <already5chosen@yahoo.com> writes:

    On Thu, 30 May 2024 19:27:48 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 26.05.2024 at 19:20, Tim Rentsch wrote:

    I say the output quality is poor because I have run tests that
    show the poor output quality.

    If you choose a prime whose double is beyond 64 bit there's an
    equal distribution among the 64 bit modulos.

    I've done that with a prime of my own choosing and also with
    18446744073709551557, the value you suggested.

    Show me your code.

    Oh get real. It's not my job to make up for your
    laziness.

    So, what were your conclusions?
    Ignoring the speed of computation, would something like a
    cryptographic hash scaled to bucket size be the best hash for
    this type of application? Or is some sort of less thorough
    grinding of the bits better?

    Factors that matter:

        will hashing be done using rehashing or aggregates (lists
        or trees, etc) grouped under the first probe location?
        (or both?)

        can keys be compared for ordering or just for equality?

        how expensive is it to test for key equality?
            (a) when the keys are equal
            (b) when the keys are not equal

        32-bit hash values suffice for (a) re-hashing schemes with
        tables up to perhaps 2**30 entries, or (b) schemes that
        use direct hashing and bucket aggregation with tables up
        to 2**32 slots. Anything bigger should use 64-bit hash
        values.

        hash functions should always produce a full-size hash value
        (either 32 bits or 64 bits). Fitting the hash value to
        the size of the table is the responsibility of the hash
        table management code, not the hash function.

    The goal:
    An ideal hash function has the property that changing any
    single bit of the input key has a 50% chance of flipping
    any output bit, with no correlation between output bit
    m and output bit n (m != n) for changing a given input
    bit, and no correlation between changes to output bit n
    for changing input bit i and changing input bit j (i != j)
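
    For what it's worth, here is a bare-bones sketch -- not code posted
    in this thread, with hash_uut() standing for whatever candidate is
    being examined -- of measuring the per-bit half of that property:
    flip each input bit of many keys and count how often each output bit
    flips; an ideal hash shows every count near 50% of the trials.

    #include <stdio.h>
    #include <stdint.h>

    extern uint64_t hash_uut( uint64_t key );   /* candidate under test */

    int main( void ){
        enum { TRIALS = 100000 };
        static unsigned long flips[64][64];     /* [input bit][output bit] */
        for( int t = 0; t < TRIALS; t++ ){
            /* arbitrary 64-bit LCG just to spread the test keys around */
            uint64_t key = 6364136223846793005ull * (uint64_t)(t + 1)
                         + 1442695040888963407ull;
            uint64_t h0 = hash_uut( key );
            for( int i = 0; i < 64; i++ ){
                uint64_t d = h0 ^ hash_uut( key ^ ((uint64_t)1 << i) );
                for( int j = 0; j < 64; j++ )
                    flips[i][j] += (d >> j) & 1;
            }
        }
        for( int i = 0; i < 64; i++ )
            for( int j = 0; j < 64; j++ )
                printf( "in %2d out %2d : %.3f\n",
                        i, j, (double)flips[i][j] / TRIALS );
        return 0;
    }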

    Evaluating hash functions:
    One way to evaluate a hash function is to measure changes
    in output bits as a function of changing various input
    bits, either exhaustively or statistically, looking for an
    ideal distribution as described above. Usually doing this
    isn't very helpful, because most hash functions don't come
    close to the ideal, and it's hard to interpret results that
    are not close to the ideal, even though those hashes may be
    perfectly fine in practice

    So basically we are left with comparing a candidate hash
    function, performance-wise, to other hash functions that
    are going to have good performance results. Three obvious
    choices for those: a slow but high quality hash such as a
    cryptographic hash; using a good random number generator
    to produce keys where the hash value is equal to the key
    value; comparing against theoretical ideals. Looking at
    each in turn:

    High-quality-but-slow-hash: it isn't hard to make a hash
    function that approximates an ideal hash, as long as we
    are willing to have it run like a turtle. Or simply use
    one of the many known cryptographic hashes and fold the
    output down to a reasonably sized hash value (most of
    these cryptographic hash functions produce large hash
    values, 128 bits and up, which isn't a great fit for
    ordinary hash table use). In my recent testing I simply
    rolled my own high quality hash, which ran about 10 times
    slower than the slowest of the other hashes I looked at.
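
    For example -- an illustration only, not the hash I actually used,
    and digest16() below is a stand-in name rather than a real API --
    given any 128-bit digest routine the fold can be as simple as xoring
    the two 64-bit halves:

    #include <stdint.h>
    #include <string.h>

    void digest16( const void *key, size_t len, unsigned char out[16] );

    uint64_t hash_from_digest( const void *key, size_t len ){
        unsigned char d[16];
        uint64_t lo, hi;
        digest16( key, len, d );            /* e.g. an MD5-like digest */
        memcpy( &lo, d, 8 );
        memcpy( &hi, d + 8, 8 );
        return lo ^ hi;                     /* fold 128 bits down to 64 */
    }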

    Using a good RNG to produce keys the same as the hash
    value: I expect this method to be self-explanatory.

    Comparing against theoretical ideals: it isn't hard to
    do a calculation that computes a statistical average for
    the two kinds of hashing methods. More specifically,

    (a) for re-hashing, compute the average number of
    probes taken for a hash table with k out of n
    slots already filled, and use that to compute
    the average number of probes needed to fill the
    hash table up to, say, 85%

    (b) for direct hashing, compute the average number
    of slots occupied after inserting some number
    of values representing a fixed fraction of the
    table size. For example, if we have a table
    with 100,000 slots, if we insert 100,000 values
    then on the average about 63% of the slots
    should be occupied (very close to 1 - 1/e).
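
    A small sketch -- not code posted in this thread, just the standard
    closed forms for uniform hashing evaluated numerically -- of both
    comparison values described above:

    #include <stdio.h>
    #include <math.h>

    int main( void ){
        double alpha = 0.85;
        /* (a) average probes per insertion while filling to load factor
           alpha is about (1/alpha) * ln(1/(1-alpha)) for uniform hashing */
        double probes = (1.0 / alpha) * log( 1.0 / (1.0 - alpha) );
        /* (b) after n random insertions into n slots, a slot stays empty
           with probability about 1/e, so about 1 - 1/e are occupied */
        double occupied = 1.0 - exp( -1.0 );
        printf( "probes per insertion, filling to 85%%: %.2f\n", probes );
        printf( "fraction occupied, n inserts into n slots: %.3f\n", occupied );
        return 0;                           /* prints about 2.23 and 0.632 */
    }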

    Use one or more of the above comparison values (they
    should all be very close) to compare against the
    performance of the candidate hash function for each
    of the possible workloads. My workload set included:

    (a) results of malloc() with fixed sized blocks,
    with a size going from 1 to some modest value
    (50 or 100)

    (b) results of malloc() with randomly chosen sizes
    based on some distribution

    (c) sequential addresses in an array, varying the
    element size between 1 and some modest value
    (maybe 50)

    (d) results of malloc() with a deterministic set
    of sizes. I tried varying ones of these, such
    as size(i) == i/17 + 1 (filling a table of
    65536 entries varying percentages full).

    What we would like to see:
    for every workload tested, results within 10 or 20%
    of the comparison metric used

    What we don't want to see (this is important):
    any result far away from the comparison metric on any
    workload. Importantly this includes results that are
    significantly _better_ than the comparison metric.
    The reason for this is that doing a lot better on one
    test inevitably means doing worse on a different test.
    The same is true of random number generators: the
    best ones look "averagely random" all the time, never
    either "sub random" or "super random".

    Of course we always want high quality, but the standards
    are higher for (a) direct hashing, and/or (b) cases where
    comparing keys is expensive. Note that if comparing keys
    is expensive we might choose to store the full hash value
    with each table entry, to make "trivial reject" on key
    comparison be faster.

    Probably much of the above was obvious. I make no apologies for
    that. Also it may seem disjointed or unorganized. If so then
    sorry about that chief, it's a big topic and there's a lot of
    ground to cover, and I tried to hit the most important
    highlights.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Mon Jun 3 10:50:05 2024
    On Sun, 02 Jun 2024 16:02:05 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Michael S <already5chosen@yahoo.com> writes:

    On Thu, 30 May 2024 19:27:48 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    Am 26.05.2024 um 19:20 schrieb Tim Rentsch:

    I say the output quality is poor because I have run tests that
    show the poor output quality.

    If you chose a prime whose double is beyond 64 bit there's an
    equal distribution among the 64 bit modulos.

    I've done that with a prime of my own choosing and also with
    18446744073709551557, the value you suggested.

    Show me your code.

    Oh get real. It's not my job to make up for your
    laziness.

    So, what were your conclusions?
    Ignoring the speed of computation, would something like
    cryptographic hash scaled to bucket size be a best hash for
    this type of application? Or some sort of less thorough
    grinding of the bits is better?

    Factors that matter:

    <snip>


    Probably much of the above was obvious. I make no apologies for
    that. Also it may seem disjointed or unorganized. If so then
    sorry about that chief, it's a big topic and there's a lot of
    ground to cover, and I tried to hit the most important
    highlights.

    It sounds like you *postulate* that crypto is an ideal.
    I am less interested in axioms and more interested in your experimental findings.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Bonita Montero on Mon Jun 3 17:46:04 2024
    On Mon, 3 Jun 2024 16:34:37 +0200
    Bonita Montero <Bonita.Montero@gmail.com> wrote:

    Am 02.06.2024 um 09:45 schrieb Michael S:

    So, what were your conclusions?
    Ignoring the speed of computation, would something like
    cryptographic hash scaled to bucket size be a best hash for this
    type of application? Or some sort of less thorough grinding of the
    bits is better?

    There's no need for a crypto-hash here.


    Do you think I don't know?
    Crypto hash is just an example of near-ideal pseudo-random uniformity.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From bart@21:1/5 to Bonita Montero on Mon Jun 3 17:24:36 2024
    On 03/06/2024 16:54, Bonita Montero wrote:
    Am 03.06.2024 um 16:46 schrieb Michael S:
    On Mon, 3 Jun 2024 16:34:37 +0200
    Bonita Montero <Bonita.Montero@gmail.com> wrote:

    Am 02.06.2024 um 09:45 schrieb Michael S:

    So, what were your conclusions?
    Ignoring the speed of computation, would something like
    cryptographic hash scaled to bucket size be a best hash for this
    type of application? Or some sort of less thorough grinding of the
    bits is better?

    There's no need for a crypto-hash here.


    Do you think I don't know?
    Crypto hash is just an example of near-ideal pseudo-random uniformity.


    As I've shown, for pointers you get a perfect equal distribution by
    multiplying by an appropriate prime.


    A pointer with 8-byte or 16-byte alignment will have the bottom 3-4 bits
    zero.

    No matter what number you multiply them by, prime or otherwise, those
    3-4 bits will always be zero.

    If you mask the result to fit a table whose size is a power of two,
    then the resulting index can only ever refer to every 8th or every
    16th slot; there will be 8-16x as many clashes as there ought to be.

    So some extra work is needed to get around that, for example
    right-shifting before masking as some here have done, something you have
    never mentioned.
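
    A short illustration of the shift-before-mask fix described above --
    not bart's code, and the multiplier is just an arbitrary odd constant
    chosen for the example:

    #include <stdint.h>
    #include <stddef.h>

    size_t index_mask_only( void *p, size_t tab_sz )    /* tab_sz a power of two */
    {
        uint64_t h = (uint64_t)(uintptr_t)p * 0x9E3779B97F4A7C15ull;
        return (size_t)h & (tab_sz - 1);    /* low 4 bits still zero for 16-aligned p */
    }

    size_t index_shift_first( void *p, size_t tab_sz )
    {
        uint64_t h = (uint64_t)(uintptr_t)p * 0x9E3779B97F4A7C15ull;
        return (size_t)(h >> 4) & (tab_sz - 1);   /* discard the dead bits first */
    }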

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Bonita Montero on Mon Jun 3 19:50:22 2024
    On Mon, 3 Jun 2024 17:54:30 +0200
    Bonita Montero <Bonita.Montero@gmail.com> wrote:

    Am 03.06.2024 um 16:46 schrieb Michael S:
    On Mon, 3 Jun 2024 16:34:37 +0200
    Bonita Montero <Bonita.Montero@gmail.com> wrote:

    Am 02.06.2024 um 09:45 schrieb Michael S:

    So, what were your conclusions?
    Ignoring the speed of computation, would something like
    cryptographic hash scaled to bucket size be a best hash for this
    type of application? Or some sort of less thorough grinding of the
    bits is better?

    There's no need for a crypto-hash here.


    Do you think I don't know?
    Crypto hash is just an example of near-ideal pseudo-random
    uniformity.

    As I've shown, for pointers you get a perfect equal distribution by
    multiplying by an appropriate prime.


    What you have shown is, to borrow terminology from the characterization
    of analog-to-digital converters, integral uniformity for a ramp input.
    That is not sufficient. A good hash needs both integral and differential
    uniformity. According to Tim Rentsch, your method does not have the latter.

    What do you do exactly? Multiply by a big prime modulo 2**64?
    By intuition, without testing, I'd say that does not sound like enough.
    By intuition, without testing, doing it twice with two different primes,
    then adding (or xoring) the results together and doing the multiplication
    again with a third prime sounds like enough.
    By intuition, without testing, I'd rather use composite multiplication
    factors instead of primes, choosing each factor as a product of 3-4
    non-small primes, with all 9-12 primes involved being different.

    But there should be simpler procedures that are just as good.
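
    Spelled out, just to make the suggestion above concrete -- a sketch
    that is exactly as untested as the intuition behind it, with arbitrary
    odd 64-bit constants standing in for the primes:

    #include <stdint.h>

    uint64_t hash_two_then_one( const void *key ){
        uint64_t k = (uint64_t)(uintptr_t)key;
        uint64_t a = k * 0x9E3779B97F4A7C55ull;   /* first factor */
        uint64_t b = k * 0xC2B2AE3D27D4EB4Full;   /* second, different factor */
        return (a ^ b) * 0xD6E8FEB86659FD93ull;   /* third factor on the combined value */
    }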

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to bart on Mon Jun 3 20:16:46 2024
    On Mon, 3 Jun 2024 17:24:36 +0100
    bart <bc@freeuk.com> wrote:

    On 03/06/2024 16:54, Bonita Montero wrote:
    Am 03.06.2024 um 16:46 schrieb Michael S:
    On Mon, 3 Jun 2024 16:34:37 +0200
    Bonita Montero <Bonita.Montero@gmail.com> wrote:

    Am 02.06.2024 um 09:45 schrieb Michael S:

    So, what were your conclusions?
    Ignoring the speed of computation, would something like
    cryptographic hash scaled to bucket size be a best hash for this
    type of application? Or some sort of less thorough grinding of
    the bits is better?

    There's no need for a crypto-hash here.


    Do you think I don't know?
    Crypto hash is just an example of near-ideal pseudo-random
    uniformity.

    As I've shown, for pointers you get a perfect equal distribution by
    multiplying by an appropriate prime.


    A pointer with 8-byte or 16-byte alignment will have the bottom 3-4
    bits zero.

    No matter what number you multiply them by, prime or otherwise, those
    3-4 bits will always be zero.

    If you mask the result to fit a table whose size is a power of two,
    then the resulting index can only ever refer to every 8th or every
    16th slot; there will be 8-16x as many clashes as there ought to be.

    So some extra work is needed to get around that, for example
    right-shifting before masking as some here have done, something you
    have never mentioned.



    According to my understanding, Bonita and Tim are discussing a hash
    generator whose output is not used as is. They assume that the index
    of the slot will be calculated as (Hash(key)*bucket_size)/(Hash_max+1).
    For a big enough Hash_max (Bonita suggests 2**63-1), the poor quality
    of a few LS bits of Hash(key) does not matter.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From bart@21:1/5 to Michael S on Mon Jun 3 19:48:29 2024
    On 03/06/2024 18:16, Michael S wrote:
    On Mon, 3 Jun 2024 17:24:36 +0100
    bart <bc@freeuk.com> wrote:

    On 03/06/2024 16:54, Bonita Montero wrote:
    Am 03.06.2024 um 16:46 schrieb Michael S:
    On Mon, 3 Jun 2024 16:34:37 +0200
    Bonita Montero <Bonita.Montero@gmail.com> wrote:

    Am 02.06.2024 um 09:45 schrieb Michael S:

    So, what were your conclusions?
    Ignoring the speed of computation, would something like
    cryptographic hash scaled to bucket size be a best hash for this
    type of application? Or some sort of less thorough grinding of
    the bits is better?

    There's no need for a crypto-hash here.


    Do you think I don't know?
    Crypto hash is just an example of near-ideal pseudo-random
    uniformity.

    As I've shown, for pointers you get a perfect equal distribution by
    multiplying by an appropriate prime.


    A pointer with 8-byte or 16-byte alignment will have the bottom 3-4
    bits zero.

    No matter what number you multiply them by, prime or otherwise, those
    3-4 bits will always be zero.

    If you mask the result to fit a table whose size is a power of two,
    then the resulting index can only ever refer to every 8th or every
    16th slot; there will be 8-16x as many clashes as there ought to be.

    So some extra work is needed to get around that, for example
    right-shifting before masking as some here have done, something you
    have never mentioned.



    According to my understanding, Bonita and Tim are discussing a hash
    generator whose output is not used as is.

    For a big enough Hash_max (Bonita suggests 2**63-1), the poor quality
    of a few LS bits of Hash(key) does not matter.

    Tim might, but I'm not so sure about BM.

    This is an exchange between Malcolm and BM:

    MM:
    The lower bits of a pointer are often all zeroes. And multiplying by
    any integer will not set them. And that is catastrophic for a hash.


    BM:
    If you take a large integer they will be scrambled and if the
    prime is large enough there will be a good equal distribution.

    I've reiterated MM's point, which BM just ignored.

    They assume that index of
    the slot will be calculated as (Hash(key)*bucket_size)/(Hash_max+1).

    I don't remember seeing that anywhere. But, bucket_size is the size of
    the hash-table?

    And Hash_max+1 is likely to be a power of two?

    I prefer my hash values to be evaluated independently of table size.
    They usually don't have a meaningful maximum value, other than the 64
    bits of their type, and I'd rather they didn't have sequences of zero
    bits especially at the bottom end.

    If hash_max was 2**63-1, say, then dividing by hash_max+1 would probably
    give you zero anyway, especially if you first multiplied by a bucket
    size that was a power of two, as all the interesting bits would
    disappear past the top bit!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to bart on Mon Jun 3 22:51:27 2024
    On Mon, 3 Jun 2024 19:48:29 +0100
    bart <bc@freeuk.com> wrote:


    Tim might, but I'm not so sure about BM.


    You are probably right about Bonita. His/her last post(s) make no sense.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to bart on Mon Jun 3 22:41:00 2024
    On Mon, 3 Jun 2024 19:48:29 +0100
    bart <bc@freeuk.com> wrote:


    I don't remember seeing that anywhere. But, bucket_size is the size
    of the hash-table?


    Yes

    And Hash_max+1 is likely to be a power of two?


    Yes. At the very least 2**32.

    I prefer my hash values to be evaluated independently of table size.
    They usually don't have a meaningful maximum value, other than the 64
    bits of their type, and I'd rather they didn't have sequences of zero
    bits especially at the bottom end.


    Bottom is in the eye of beholder :-)

    If hash_max was 2**63-1,

    63 was my mistake. He suggests 2**64-1. It seems.

    say, then dividing by hash_max+1 would
    probably give you zero anyway, especially if you first multiplied by
    a bucket size that was a power of two, as all the interesting bits
    would disappear past the top bit!

    Huh?
    If bucket size is 2**n then
    index = Hash(key)*2**n/2**64 == Hash(key) >> (64-n);

    Maybe you are thinking about C language arithmetic? That's not what I
    had in mind. In the post above I used the equations in their integer
    math meaning rather than their C integer arithmetic meaning.
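
    In C that integer-math scaling can be written with a double-width
    product so that no bits are lost -- a sketch assuming gcc/clang's
    unsigned __int128 extension (the same one used in the test bench
    posted later in this thread):

    #include <stdint.h>
    #include <stddef.h>

    size_t scale_to_table( uint64_t hash, size_t tab_sz ){
        return (size_t)( ((unsigned __int128)hash * tab_sz) >> 64 );
    }
    /* for tab_sz == 2**n this is exactly hash >> (64 - n) */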

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to bart on Mon Jun 3 16:51:44 2024
    bart <bc@freeuk.com> writes:

    On 03/06/2024 18:16, Michael S wrote:
    [...]
    For a big enough Hash_max (Bonita suggests 2**63-1), the poor quality
    of a few LS bits of Hash(key) does not matter.

    Tim might, [...]

    I really don't know where you could have gotten that idea.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Mon Jun 3 17:01:26 2024
    Michael S <already5chosen@yahoo.com> writes:

    On Mon, 3 Jun 2024 17:24:36 +0100
    bart <bc@freeuk.com> wrote:

    On 03/06/2024 16:54, Bonita Montero wrote:

    Am 03.06.2024 um 16:46 schrieb Michael S:

    On Mon, 3 Jun 2024 16:34:37 +0200
    Bonita Montero <Bonita.Montero@gmail.com> wrote:

    Am 02.06.2024 um 09:45 schrieb Michael S:

    So, what were your conclusions?
    Ignoring the speed of computation, would something like
    cryptographic hash scaled to bucket size be a best hash for this
    type of application? Or some sort of less thorough grinding of
    the bits is better?

    There's no need for a crypto-hash here.

    Do you think I don't know?
    Crypto hash is just an example of near-ideal pseudo-random
    uniformity.

    As I've shown, for pointers you get a perfect equal distribution by
    multiplying by an appropriate prime.

    A pointer with 8-byte or 16-byte alignment will have the bottom 3-4
    bits zero.

    No matter what number you multiply them by, prime or otherwise, those
    3-4 bits will always be zero.

    If you mask the result to fit a table whose size is a power of two,
    then the resulting index can only ever refer to every 8th or every
    16th slot; there will be 8-16x as many clashes as there ought to be.

    So some extra work is needed to get around that, for example
    right-shifting before masking as some here have done, something you
    have never mentioned.

    According to my understanding, Bonita and Tim are discussing a hash
    generator whose output is not used as is. They assume that the index
    of the slot will be calculated as (Hash(key)*bucket_size)/(Hash_max+1).

    To clarify, I have been talking about hash functions that deliver
    a "full sized" hash value, either 32 or 64 bits, without regard
    to how hash table management code might make use of that value.
    I think that's sort of what you said, but it seemed reasonable
    to offer a clarification.

    As a point of information, I've decided to adopt a policy of
    not reading any further postings from Bonita Montero, who
    seems to have nothing useful or interesting to say.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Mon Jun 3 18:02:21 2024
    Michael S <already5chosen@yahoo.com> writes:

    On Sun, 02 Jun 2024 16:02:05 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Michael S <already5chosen@yahoo.com> writes:

    On Thu, 30 May 2024 19:27:48 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    [... concerning quality of hash functions ...]

    So, what were your conclusions?
    Ignoring the speed of computation, would something like
    cryptographic hash scaled to bucket size be a best hash for
    this type of application? Or some sort of less thorough
    grinding of the bits is better?

    Factors that matter:

    <snip>

    Probably much of the above was obvious. I make no apologies for
    that. Also it may seem disjointed or unorganized. If so then
    sorry about that chief, it's a big topic and there's a lot of
    ground to cover, and I tried to hit the most important
    highlights.

    It sounds like you *postulate* that crypto is an ideal.

    I think it's more accurate to say that I _expect_ an established
    crypto-quality hash function such as MD5 to be good approximation
    to an ideal hash function. I expect that because (a) the design
    criteria for such hash functions resembles or includes the design
    criteria for a near-ideal hash function, (b) the people who have
    done these are smart people with (usually) a fair amount of
    experience doing such things, and (c) the established ones have
    gone through a lot of use and testing by the user community, who
    likely would have found any serious weaknesses.

    I am less interested in axioms and more interested in your experimental
    findings.

    I'm not sure what you're looking for here. The testing I did for
    the various recently posted hash functions was simply running a
    few ad hoc test scenarios and comparing the results against each
    other and also against the results of hash function of my own
    devising likely to produce high quality results. One piece of
    empirical evidence: in all of the tests I ran, that hash function
    never gave results that varied by more than 10% or so, no matter
    what test was done. By contrast, most or all of the other hash
    functions sometimes gave results that varied by a factor of 4 or
    more, depending on what scenario was being tested. Let me say
    again that the test scenarios were ad hoc, and I was not being
    systematic but just eyeballing the numbers.

    I have a fair amount of experience (at least hundreds of hours)
    experimenting with various random number generators, including
    several of my own devising. The experience includes systematic
    testing using extensive test suites (done by people who really
    know what they are doing, which in this arena does not include
    myself). Random number generators and hash functions have a lot
    of principles in common. Suppose we have a hash function that
    takes a 64-bit key and produces a 32-bit hash value. If we use
    that hash function with simply successive 64-bit integer values,
    and the outputs pass the statistical tests of a good random
    number generator test suite, you can be sure that hash function
    has very high quality. A key observation about random number
    generators is this: if the output values are N bits, generally
    the amount of internal state should be at least 2N bits. The
    same thing is true of hash functions. If a hash function
    produces, say, 32-bit values, internally there should be at least
    64 bits of state that is mushed together in various ways, finally
    being combined in the final step to produce the 32-bit result.
    Another design principle is to replicate information. For
    example, if we are hashing a character string to produce a 32-bit
    hash value, we might combine each character in the string with
    two different 32-bit "accumulators", perhaps adding into one and
    xoring into the other, and rotating them by different amounts
    after each character scanned. Composing code on the fly (not
    compiled, and certainly not tested):

    unsigned
    string_hash( const char *s ){
        unsigned a = 12345, b = 0xabcdef;    /* two independent accumulators */
        char c;
        while( (c = *s++) ){
            a += c, b ^= c;                  /* feed each character into both */
            a = a << 23 | a >> (32-23);      /* rotate by different amounts */
            b = b << 10 | b >> (32-10);
            a -= b; // why not? ;)
        }
        return a ^ b;
    }

    This code has a fair chance of producing acceptable hash values
    for strings that have at least three characters. (Let me say
    again that I have done no testing at all.)

    So I hope this posting has some of what you're looking for. And
    if it doesn't, well, then I am as much in the dark as ever. :)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Tue Jun 4 11:38:39 2024
    On Mon, 03 Jun 2024 18:02:21 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Michael S <already5chosen@yahoo.com> writes:


    I am less interested in axioms and more interested in your experimental
    findings.

    I'm not sure what you're looking for here.

    I'd give an example.
    You said that some of the variants had 4x differences between cases.
    From my perspective, if you found a hash function that performs up to 3
    times better* than a "crypto-alike" hash in the majority of tests and is
    1.33x worse than "crypto-alike" in a few other tests, that is something
    I'd consider a valuable option.

    * - i.e. produces 3x less collisions at, say, occupation ratio of 0.7

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Wed Jun 5 00:59:16 2024
    On Sun, 26 May 2024 10:20:55 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    Am 26.05.2024 um 18:24 schrieb Tim Rentsch:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    Am 25.05.2024 um 19:40 schrieb Tim Rentsch:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    Am 25.05.2024 um 11:12 schrieb Tim Rentsch:

    Your hash function is expensive to compute, moreso even
    than the "FNV" function shown earlier. In a case like
    this one where the compares are cheap, it's better to
    have a dumb-but-fast hash function that might need a
    few more looks to find an open slot, because the cost
    of looking is so cheap compared to computing the hash
    function.

    A (size_t)pointer * LARGE_PRIME can be sufficient,
    ignoring the overflow.

    Plenty fast but the output quality is poor. ...

    If the prime is large enough there's no regular distribution.

    Your conjecture is contradicted by empirical facts.

    There are no empirical facts for that since two times the
    taken prime is beyond the 64 bit address space.

    You don't listen very well do you?

    I say the output quality is poor because I have run tests that
    show the poor output quality. I've done that with a prime of my
    own choosing and also with 18446744073709551557, the value you
    suggested. In both cases the test runs show results that are
    clearly worse than all the other hash functions tested, including
    bart's and malcolm's. Furthermore the results for your suggested
    calculation are worse across the board, on every variety of
    dynamic workload in my test suite.

    Your proposed hash function is too weak to be taken as a serious
    candidate.

    18446744073709551557 is indeed very bad (too close to 2**64).
    But I tried another prime, and at least in my simple tests it performs
    rather well.
    Maybe my tests are too simplistic; I don't pretend to understand much
    about it.

    // hash_aes4.c
    // Reference. Hopefully behaves similarly to crypto hash
    #include <stdint.h>
    #include <string.h>
    #include <x86intrin.h>

    uint64_t hash_ref(void* key)
    {
    uint64_t ukey = (uint64_t)key;
    __m128i xkey = _mm_set_epi64x(ukey, 42);
    static const uint64_t bar[8] = {
    0xBB09BBCC90B24BF2, 0x825C622FF2792A01,
    0x94F0535CB06D4060, 0x939C756246DBFD1D,
    0x5B835E01A7E14CA1, 0xAC2BDAFC023CDD06,
    0xE0B6A4735B774AEC, 0x9CAFB43E7DDE494C,
    };
    for (int i = 0; i < 4; ++i) {
    __m128i rkey;
    memcpy(&rkey, &bar[i*2], sizeof(rkey));
    xkey = _mm_aesenc_si128(xkey, rkey);
    }
    memcpy(&ukey, &xkey, sizeof(ukey));
    return ukey;
    }

    // hash_bonita.c
    // Bonita's with different prime
    #include <stdint.h>

    uint64_t hash_foo(void* key)
    {
    uint64_t ukey = (uint64_t)key;
    static const uint64_t LARGE_PRIME = 0xbb09bbcc90b24bcdull;
    return ukey * LARGE_PRIME;
    }


    // ptr_hash.c
    // test bench
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>

    uint64_t hash_foo(void*);
    uint64_t hash_ref(void*);

    static void test(
    void** pointers, size_t n_pointers,
    size_t* table, size_t tab_sz,
    uint64_t (*hash_function)(void*));

    static const char UsageStr[] =
    "ptr_hash - examine performance of pointer hash function\n"
    "Usage\n"
    "ptr_hash n m sz1 [sz2]\n"
    "where\n"
    " n - the size of hash table\n"
    " m - # of pointers to put in the table\n"
    " sz1 - minimal size of allocated blocks\n"
    " sz2 - [optional] maximal size of allocated blocks. Default = sz1\n"
    ;
    int main(int argz, char** argv)
    {
    if (argz < 4) {
    fprintf(stderr, "%s", UsageStr);
    return 1;
    }

    size_t params[4];
    for (int arg_i = 1; arg_i < 5 && arg_i < argz; ++arg_i) {
    long val = strtol(argv[arg_i], NULL, 0);
    if (val <= 0) {
    fprintf(stderr
    ,"Bad parameter %s. Please specify positive number\n"
    ,argv[arg_i]);
    return 1;
    }
    params[arg_i-1] = val;
    }
    size_t n = params[0];
    size_t m = params[1];
    size_t sz1 = params[2];
    size_t sz2 = params[3];
    if (argz == 4)
    sz2 = sz1;
    if (sz1 > sz2) {
    size_t tmp = sz1; sz1 = sz2; sz2 = tmp;
    }

    int ret = 1;
    void **pointers = malloc(m * sizeof(*pointers));
    if (pointers) {
    size_t* table = malloc(n * sizeof(*table));
    if (table) {
    ret = 0;
    srand(1);
    const unsigned __int128 RAND_SCALE = (unsigned __int128)RAND_MAX
    + 1;
    for (size_t i = 0; i < m; ++i) {
    size_t r = rand();
    size_t sz = (unsigned __int128)(sz2-sz1+1)*r / RAND_SCALE + sz1;
    void* block = malloc(sz);
    if (!block) {
    ret = 1;
    perror("malloc()");
    for (size_t k = 0; k < i; ++k)
    free(pointers[k]);
    break;
    }
    pointers[i] = block;
    }
    if (ret == 0) {
    for (size_t i = 0; i < m; ++i)
    free(pointers[i]); // we don't need the allocated blocks for the test
    printf("ref: ");
    test(pointers, m, table, n, hash_ref);
    printf("uut: ");
    test(pointers, m, table, n, hash_foo);
    }
    free(table);
    } else {
    perror("malloc()");
    }
    free(pointers);
    } else {
    perror("malloc()");
    }

    return ret;
    }

    static void test(
    void** pointers, size_t n_pointers,
    size_t* table, size_t tab_sz,
    uint64_t (*hash_function)(void*))
    {
    for (size_t i = 0; i < tab_sz; ++i)
    table[i] = 0;
    // simulate hash operation
    for (size_t i = 0; i < n_pointers; ++i) {
    uint64_t h = hash_function(pointers[i]);
    table[(size_t)(((unsigned __int128)tab_sz * h)>>64)] += 1;
    }
    // statistics
    size_t occupied = 0;
    size_t collisions = 0;
    size_t worst = 0;
    for (size_t i = 0; i < tab_sz; ++i) {
    size_t c = table[i];
    occupied += c != 0;
    collisions += c > 1;
    if (worst < c)
    worst = c;
    }
    printf(
    "%zu keys, %zu slots, %zu occupied %zu collisions. Worst collision: %zu\n"
    ,n_pointers
    , tab_sz
    , occupied
    , collisions
    , worst
    );
    }


    Compiled as:
    gcc -s -Wall -O2 -march=native ptr_hash.c hash_bonita.c hash_aes4.c


    Run as:
    ./a 100000 75000 80 1000

    Result:
    ref: 75000 keys, 100000 slots, 52799 occupied 17431 collisions. Worst collision: 7
    uut: 75000 keys, 100000 slots, 53201 occupied 17139 collisions. Worst collision: 6

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Tue Jun 4 22:10:41 2024
    Michael S <already5chosen@yahoo.com> writes:

    On Mon, 03 Jun 2024 18:02:21 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Michael S <already5chosen@yahoo.com> writes:


    I am less interested in axioms and more interested in your experimental
    findings.

    I'm not sure what you're looking for here.

    I'd give an example.
    You said that some of the variants had 4x differences between cases.
    From my perspective, if you found a hash function that performs up to 3
    times better* than "crypto-alike" hash in majority of tests and is 1.33x worse that "crypto-alike" in few other tests, it's something that I'd consider as valuable option.

    * - i.e. produces 3x less collisions at, say, occupation ratio of 0.7

    Okay. For the sake of discussion let me call a "crypto-alike" hash
    an "average hash" (meaning in some sense statistically average, or
    never very far from random hash values).

    Unfortunately the example you describe is something that won't
    happen and can't happen. I say it won't happen because for all
    the hash functions I've looked at, I've never seen one that is
    "better than average" most of the time and perhaps a little bit
    "worse than average" only occasionally. The reverse situation
    pops up fairly often: a hash function that is a little bit
    "better than average" in a small number of cases, and "no
    better than average or worse than average (sometimes by quite
    a lot)" in most cases.

    For the second part, I say it can't happen because there isn't
    enough headroom for the amount of performance improvement you
    mention. For an occupancy rate of 0.7, an average hash function
    using a rehashing approach uses only 1.7 probes per insertion
    (with a minimum of 1) to fill the table. There is no way to
    get a dramatic performance improvement. Even with an 85% load
    factor, an average hash function takes just a little over 2
    probes (roughly 2.2 probes) per value inserted.

    Going the other direction, I've seen examples of hash functions
    that in some circumstances are _worse_ than average by a factor of
    10 or more. The bad examples just come up - I don't especially go
    looking for them. The small possible upside gain is basically
    never worth the much larger potential downside risk.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Bonita Montero on Wed Jun 5 12:34:50 2024
    On Wed, 5 Jun 2024 11:10:24 +0200
    Bonita Montero <Bonita.Montero@gmail.com> wrote:

    Am 04.06.2024 um 23:59 schrieb Michael S:

    18446744073709551557 is indeed very bad (too close to 2**64).

    It doesn't matter how close the prime is to 2 ^ 64 since the
    results are chopped.



    I have no idea what you are talking about.
    18446744073709551557 == -59 (mod 2**64).
    That's a very small number. With a factor like that, for the typical
    case of pointers that are <= 2**40 and a table size in the 1000s or
    10,000s, all keys (== pointers) would be pushed into the same
    (uppermost) slot of the table with no hashing at all.
    You can run my test bench and see it yourself.
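
    A few lines that make the arithmetic concrete -- not part of the test
    bench itself, and the pointer values are made up -- using the same
    linear scaling to table size:

    #include <stdio.h>
    #include <stdint.h>

    int main( void ){
        uint64_t factor = 18446744073709551557ull;   /* == 2**64 - 59 */
        uint64_t tab_sz = 100000;
        uint64_t keys[3] = { 0x10000000ull, 0x10000040ull, 0x7fffffff00ull };
        for( int i = 0; i < 3; i++ ){
            /* product == -59*key mod 2**64, i.e. just below 2**64 */
            uint64_t h = keys[i] * factor;
            uint64_t idx = (uint64_t)( ((unsigned __int128)h * tab_sz) >> 64 );
            printf( "key %#llx -> slot %llu\n",
                    (unsigned long long)keys[i], (unsigned long long)idx );
        }
        return 0;        /* prints slot 99999 for every key */
    }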

    On the other hand, a prime in the range (2**63..2**64) that does not
    have too many bits set or cleared works very well. But the majority of
    non-primes with the same properties would work just as well.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Bonita Montero on Wed Jun 5 13:11:24 2024
    On Wed, 5 Jun 2024 12:05:07 +0200
    Bonita Montero <Bonita.Montero@gmail.com> wrote:

    Am 05.06.2024 um 11:34 schrieb Michael S:

    I have no idea what you are talking about.
    18446744073709551557 == -59.
    That's a very small number.

    You should take the value not signed. Addition and subtraction work
    the same signed and unsigned, but not multiplication and division.
    As I've shown you get a perfect equal distribution if the input
    also has an equal distribution.

    You had not shown shite.
    If you can not understand simple things on paper then run my test
    bench.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Wed Jun 5 08:58:46 2024
    Michael S <already5chosen@yahoo.com> writes:

    On Sun, 26 May 2024 10:20:55 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    [..comments on a hash function that simply multiplies the key
    by a large prime (18446744073709551557) to give the hash
    value, and noting that tests have shown poor performance..]

    18446744073709551557 is indeed very bad (too close to 2**64).
    But I tried other prime and at least in my simple tests it performs
    rather well.
    May be my tests are to simplistic, I don't pretend to understand much
    about it.

    [lots of C code]

    Compiled as:
    gcc -s -Wall -O2 -march=native ptr_hash.c hash_bonita.c hash_aes4.c

    I couldn't get the AES-based hash function to compile. There
    was a complaint about not being able to inline an 'always-inline'
    function. I played around with it for a while but didn't get
    anywhere.

    I did get your own hash function out and put it into my little
    test rig. There may be summary comments below.

    Run as:
    ./a 100000 75000 80 1000

    Result:
    ref: 75000 keys, 100000 slots, 52799 occupied 17431 collisions. [...]
    uut: 75000 keys, 100000 slots, 53201 occupied 17139 collisions. [...]

    Both of these occupancy rates are close to the expected theoretical
    average, which I calculated to be 52764. If I had been more
    ambitious I might have done some statistical sampling to get some
    idea of the standard deviation, but I wasn't that energetic
    today. :)

    If we're trying to write a general hash function, it should work
    well across a wide range of input key sets, and also for different
    kinds of hash table management. Let's look at the output side
    first. There are two independent axes for hash tables: table
    size, and direct mapping versus using rehashing. For table size,
    the two prominent choices are some prime in the appropriate range,
    or else a power of two. Clearly any size could be chosen, but
    primes have the nice property that they divide up the space of hash
    values very evenly, whereas powers of two allow a cheaper way of
    reducing a hash value to the range of table indices (anding versus a
    modulo operation). A good hash function should work well with both.
    The choice of direct mapping versus rehashing depends on expected
    occupancy rates (which I am used to calling "load factor"). For a
    load factor of 75 to 80 percent or less, generally rehashing is
    better. Direct mapping has the advantage that it can handle load
    factors above 100%, at a cost of dealing with whatever subsidiary
    data structure (usually linked lists) is used when two or more keys
    map to the same index in the hash table. (I expect you know all
    that, I wrote it mostly just to clarify my own thinking.)

    For the recent testing I did, I chose a power of two (65536) for the
    table size, and rehashing rather than direct mapping. Both of these
    choices make it easier to evaluate how well hash functions perform.
    Having a power-of-two table size reveals weaknesses in the hash
    function more readily than taking the hash value modulo a prime;
    using rehashing allows an easy computation of how many probes on
    average are needed to do an insertion, which is an easier metric
    to use for comparisons than occupancy rates (which typically need
    some statistical calculations to appreciate the meaning of different
    resulting occupancy rates).

    Let me emphasize that I don't mean to advocate any of the different
    hash table models over the others. The key point is that a good
    hash function should work with all of them.

    On the input side, it's important to have a wide range of input
    key sets, with different sorts of distributions. The input key
    sets might include: linear addresses out of a single array, so
    a+i*m with i in the range [0..k], for different values of m (to
    be thorough all m in the range [1..64], all multiples of 2 from
    64 to 128, all multiples of 4 from 128 to 256, all multiples of 8
    from 256 to 512, and so forth up to all multiples of 64 up from
    2048 to 4096); results of malloc() using fixed sizes (all of the
    m values from the last example); results of malloc() using a
    range of sizes, for several kinds of random distributions);
    results of malloc() using linearly increasing sizes; and other
    kinds of variations. In my testing I used several fixed sizes to
    malloc() all less than 60; rand()%997 + 43 for malloc() sizes;
    malloc( i/7 + 1 ) for each key value sub i; malloc( i/17 + 1 )
    for each key value sub i; linear addressing in an array, with
    various multipliers (ad hoc rather than systematic choices);
    quadratic addressing in an array (which means array + i*i*m, for
    index i and multiplier choice m) -- here again the choices for
    m were ad hoc rather than systematic.

    An interesting result under the "in array" choices is that they
    sometimes would change from run to run, presumably as the array was
    put at different starting positions. Sometimes the differences were surprisingly large.

    Like I said before, a good hash function should produce average
    or near-average values for all of these scenarios. It might be
    nice to see some above-average results under some workloads, but
    they don't make up for well below-average results under other
    workloads. The situation is pretty much the opposite of what
    happens with quicksort: the worst case behavior of quicksort is
    quadratic, but we know that in practice that almost never
    happens, so quicksort is used because its typical or average
    performance is better than that of other choices. With hashing, better-than-average performance is rare, and doesn't give much
    benefit, but worse-than-average performance is more likely, and
    can be disastrous. Obviously there can be other factors that
    might motivate other choices in some situations (eg, perfect
    hashing), but for a general-use hash function the first design
    criterion should be close-to-average behavior under all scenarios
    tested.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Wed Jun 5 19:59:05 2024
    On Wed, 05 Jun 2024 08:58:46 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:


    I did get your own hash function out and put it into my little
    test rig. There may be summary comments below.


    <snip>

    As you have probably already noticed, I like bottom lines.
    What is the bottom line here?
    Did you encounter cases in which the almost-Bonita's-but-not-quite
    hash function performed badly?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Wed Jun 5 21:40:27 2024
    Michael S <already5chosen@yahoo.com> writes:

    On Wed, 05 Jun 2024 08:58:46 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    I did get your own hash function out and put it into my little
    test rig. There may be summary comments below.

    <snip>

    As you have probably already noticed, I like bottom lines.
    What is the bottom line here?

    The bottom line is that both of the multiply-by-a-large-prime hashes
    should be avoided.

    Did you encounter cases in which the almost-Bonita's-but-not-quite
    hash function performed badly?

    Yes. A clear example is using directly mapped hash table that
    has a power of two elements, with inputs all from malloc( 48 )
    calls. All keys map to one of only 1024 entries, in a hash
    table that has 65536 slots.
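
    A small simulation consistent with that result -- the 64-byte stride
    is an assumption about what a typical 64-bit allocator returns for
    malloc( 48 ), and the multiplier is the one from hash_foo() earlier
    in the thread.  After multiplying by any odd constant the low 6 bits
    of the product never change, so masking into 65536 slots can reach
    only 65536/64 == 1024 of them:

    #include <stdio.h>
    #include <stdint.h>

    int main( void ){
        enum { TAB_SZ = 65536, KEYS = 50000 };
        static unsigned char hit[TAB_SZ];            /* zero-initialized */
        uint64_t base = 0x55500000000ull;            /* made-up heap base */
        unsigned distinct = 0;
        for( unsigned i = 0; i < KEYS; i++ ){
            uint64_t key = base + 64ull * i;         /* simulated malloc( 48 ) result */
            uint64_t h = key * 0xbb09bbcc90b24bcdull;
            unsigned idx = (unsigned)( h & (TAB_SZ - 1) );  /* direct mapping, mask */
            if( !hit[idx] ){ hit[idx] = 1; distinct++; }
        }
        printf( "distinct slots touched: %u of %u\n", distinct, (unsigned)TAB_SZ );
        return 0;                                    /* prints 1024 of 65536 */
    }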

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Thu Jun 6 11:00:09 2024
    On Wed, 05 Jun 2024 21:40:27 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Michael S <already5chosen@yahoo.com> writes:

    On Wed, 05 Jun 2024 08:58:46 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    I did get your own hash function out and put it into my little
    test rig. There may be summary comments below.

    <snip>

    As you have probably already noticed, I like bottom lines.
    What is the bottom line here?

    The bottom line is that both of the multiply-by-a-large-prime hashes
    should be avoided.


    I am not convinced.
    So far I have seen no disadvantages for the particular application
    mentioned by Malcolm.
    Of course, it is no good for Malcolm, because it can't be coded in what
    Malcolm erroneously calls "ANSI C". It is not easy to code even in
    ISO C11. But in my own practice that does not matter. I happily use
    language extensions, like gnu __int128 or Microsoft's __umulh(), when
    they are the best tool for the job.

    Did you encounter cases in which the almost-Bonita's-but-not-quite
    hash function performed badly?

    Yes. A clear example is using directly mapped hash table that
    has a power of two elements, with inputs all from malloc( 48 )
    calls. All keys map to one of only 1024 entries, in a hash
    table that has 65536 slots.

    I don't see it.
    $ ./a 0x10000 0xC000 48
    ref: 49152 keys, 65536 slots, 34590 occupied 11295 collisions. Worst collision: 7
    uut: 49152 keys, 65536 slots, 35791 occupied 11872 collisions. Worst collision: 3

    It sounds like in your tests you don't use 'linear scaling to table
    size' for translating the hash value to a table index. IMHO, you should.
    Maybe for some particular hash functions other methods of translation
    also work, but I see no reason to use other methods. This one is quite
    fast, robust, and looks to me like the most logical. For power-of-2
    sized tables it degenerates into a simple right shift.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Thu Jun 6 13:35:34 2024
    On Wed, 05 Jun 2024 08:58:46 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:


    If we're trying to write a general hash function, it should work
    well across a wide range of input key sets, and also for different
    kinds of hash table management. Let's look at the output side
    first. There are two independent axes for hash tables: table
    size, and direct mapping versus using rehashing. For table size,
    the two prominent choices are some prime in the appropriate range,
    or else a power of two. Clearly any size could be chosen, but
    primes have the nice property that they divide up the space of hash
    values very evenly, whereas powers of two allow a cheaper way of
    reducing a hash value to the range of table indices (anding versus a
    modulo operation). A good hash function should work well with both.
    The choice of direct mapping versus rehashing depends on expected
    occupancy rates (which I am used to calling "load factor"). For a
    load factor of 75 to 80 percent or less, generally rehashing is
    better. Direct mapping has the advantage that it can handle load
    factors above 100%, at a cost of dealing with whatever subsidiary
    data structure (usually linked lists) is used when two or more keys
    map to the same index in the hash table. (I expect you know all
    that, I wrote it mostly just to clarify my own thinking.)


    I never had much interest in the implementation of hash methods; I
    typically just took what the language or library provides and used it.
    All my knowledge of the internals is 50-year-old news (TAoCP, and no, I
    did not read it 50 years ago, but I did read it for the first time
    slightly less than 40 years ago, and I'm not sure I have re-read this
    particular topic since then).
    So I am not sure that I understood everything above.
    In particular, w.r.t. rehashing, IIRC, Knuth came to the conclusion that
    a simple circular increment of the index is superior to more elaborate
    methods. I don't know whether that conclusion was changed in the
    following years; I'm not even sure that I recollect it correctly.

    Despite (or due to?) my relative ignorance of the topic, I'd dare to
    disagree with a couple of your points:
    1. I don't think that a hash function should be evaluated in isolation
    from the rest of the algorithm. IMHO, they should be co-developed. There
    is nothing wrong with a hash function that works well with one particular
    "back end" and poorly with all others, as long as the limitations are
    clearly stated.
    2. I don't like the "modulo" method of reducing a hash value to a range.
    I don't like it for any chosen bucket size, not just for a power of two,
    but even for a prime. Of course, for a power-of-two size my dislike of
    it is deeper.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Harnden@21:1/5 to Bonita Montero on Sun Jun 9 12:40:27 2024
    On 09/06/2024 12:35, Bonita Montero wrote:
    With the code I've shown I proved that the AES hash function
    is totally stochastic and thereby a good hash function. But
    I was still discontented with the speed of the code.
    Usually you're not able to update the buckets of a hashtable with
    multiple threads without any locking. But with my code the buckets
    are only simple counters which can be updated atomically.
    So I could reduce the code and partition the input range to the
    hash function across the available number of cores. Here's the code:

    #include <iostream>

    This is not C




    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Malcolm McLean on Sun Jun 9 18:31:15 2024
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 09/06/2024 12:35, Bonita Montero wrote:

    uint64_t MichaelsHash( uint64_t key )
    {
    __m128i xkey = _mm_set_epi64x( key, 42 );
    using bar_t = pair<uint64_t, uint64_t>;
    static bar_t const bars[8] =
    {
    { 0xBB09BBCC90B24BF2, 0x825C622FF2792A01 },
    { 0x94F0535CB06D4060, 0x939C756246DBFD1D },
    { 0x5B835E01A7E14CA1, 0xAC2BDAFC023CDD06 },
    { 0xE0B6A4735B774AEC, 0x9CAFB43E7DDE494C },
    };
    for( bar_t const &bar : bars )
    xkey = _mm_aesenc_si128( xkey, _mm_set_epi64x( bar.second,
    bar.first ) );
    return xkey.m128i_u64[0];
    }

    Now the code is about six times faster and I get an eight times
    speedup over single-threaded processing with the same code. Of
    course the results are still the same.

    I have your permission to drop that in?

    Note that this code was cribbed from Michael S. If you
    think it's important to ask permission, I think he is
    the one you should be asking.

    By the way, I thought you were looking for code that works
    in standard C, and acceptable under C90 rules. Have you
    changed your mind about that? The code above is a far
    cry from C, let alone C90.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Mon Jun 10 15:14:53 2024
    On Sun, 09 Jun 2024 18:31:15 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 09/06/2024 12:35, Bonita Montero wrote:

    uint64_t MichaelsHash( uint64_t key )
    {
    __m128i xkey = _mm_set_epi64x( key, 42 );
    using bar_t = pair<uint64_t, uint64_t>;
    static bar_t const bars[8] =
    {
    { 0xBB09BBCC90B24BF2, 0x825C622FF2792A01 },
    { 0x94F0535CB06D4060, 0x939C756246DBFD1D },
    { 0x5B835E01A7E14CA1, 0xAC2BDAFC023CDD06 },
    { 0xE0B6A4735B774AEC, 0x9CAFB43E7DDE494C },
    };
    for( bar_t const &bar : bars )
    xkey = _mm_aesenc_si128( xkey, _mm_set_epi64x( bar.second,
    bar.first ) );
    return xkey.m128i_u64[0];
    }

    Now the code is about six times faster and I get an eight times
    speedup over single-threaded processing with the same code. Of
    course the results are still the same.

    I have your permission to drop that in?

    Note that this code was cribbed from Michael S.

    I don't permit the use of my name on the variant modified by Bonita.
    I would prefer that my name not be used even on my own variant.
    This code was never intended for production use. It is intended for use
    as a reference, as something that is "certainly random enough and then
    many billions of times more than enough". It is much, much slower than
    necessary for a non-crypto hash.
    On the other hand, it is probably billions of billions of times weaker
    than what is now considered "good enough" for a crypto hash.

    If you ask my personal opinion, I think that multiplication modulo
    2**64 by a prime (or even a non-prime) close in absolute value to
    2**63.5 is sufficient, as long as it is used with an appropriate method
    of converting the hash value to a table index, which, for this hash
    function, happens to be linear scaling by the size of the hash table.
    Ironically, that was the method originally proposed by Bonita. The only
    thing (s)he didn't understand was that factors that contain many
    consecutive '0' or '1' bits, especially in the more significant bits,
    should be avoided.

    If you
    think it's important to ask permission, I think he is
    the one you should be asking.

    By the way, I thought you were looking for code that works
    in standard C, and acceptable under C90 rules. Have you
    changed your mind about that? The code above is a far
    cry from C, let alone C90.

    Exactly.
    I think that all methods that are both good and fast rely on 64-bit
    unsigned integer arithmetic. And it seems to me that C90 does not
    provide it in portable manner.
    So, under portable C90 constraints one has to choose between good and
    fast.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Sat Jun 15 20:32:23 2024
    Michael S <already5chosen@yahoo.com> writes:

    On Sun, 09 Jun 2024 18:31:15 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 09/06/2024 12:35, Bonita Montero wrote:

    uint64_t MichaelsHash( uint64_t key )
    {
    __m128i xkey = _mm_set_epi64x( key, 42 );
    using bar_t = pair<uint64_t, uint64_t>;
    static bar_t const bars[8] =
    {
    { 0xBB09BBCC90B24BF2, 0x825C622FF2792A01 },
    { 0x94F0535CB06D4060, 0x939C756246DBFD1D },
    { 0x5B835E01A7E14CA1, 0xAC2BDAFC023CDD06 },
    { 0xE0B6A4735B774AEC, 0x9CAFB43E7DDE494C },
    };
    for( bar_t const &bar : bars )
    xkey = _mm_aesenc_si128( xkey, _mm_set_epi64x( bar.second,
    bar.first ) );
    return xkey.m128i_u64[0];
    }

    Now the code is about six times faster and I get an eight times
    speedup over single-threaded processing with the same code. Of
    course the results are still the same.

    I have your permission to drop that in?

    Note that this code was cribbed from Michael S.

    I don't permit the use of my name on the variant modified by Bonita.
    I would prefer that my name not be used even on my own variant.
    This code was never intended for production use. It is intended for use
    as a reference, as something that is "certainly random enough and then
    many billions of times more than enough". It is much, much slower than
    necessary for a non-crypto hash. [...]

    The problem with using the posted AES code (of either version) is
    that it is not sufficiently portable nor does it comply with C90
    rules. How fast or slow the code runs might be a reason not to
    use it, but it is not necessarily a reason not to use it - that
    depends on a variety of factors, not the least of which is how
    fast it actually is in the environments under consideration.

    If you
    think it's important to ask permission, I think he is
    the one you should be asking.

    By the way, I thought you were looking for code that works
    in standard C, and acceptable under C90 rules. Have you
    changed your mind about that? The code above is a far
    cry from C, let alone C90.

    Exactly.
    I think that all methods that are both good and fast rely on 64-bit
    unsigned integer arithmetic. And it seems to me that C90 does not
    provide it in a portable manner.
    So, under portable C90 constraints one has to choose between good and
    fast.

    I am confident that C90 can support defining a suitable hash
    function portable to any system that has a C90 implementation
    appropriate to that system's environment.
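
    One illustration of the kind of thing that can be done -- a sketch,
    not a tested recommendation, using only C90 features (unsigned long
    arithmetic and memcpy of the pointer's bytes) and the well-known
    32-bit FNV constants:

    #include <stddef.h>
    #include <string.h>

    unsigned long hash_ptr_c90( void *p ){
        unsigned char bytes[sizeof p];
        unsigned long h = 2166136261UL;      /* 32-bit FNV offset basis */
        size_t i;

        memcpy( bytes, &p, sizeof p );
        for( i = 0; i < sizeof p; i++ ){
            h ^= bytes[i];
            h *= 16777619UL;                 /* 32-bit FNV prime */
            h &= 0xffffffffUL;               /* keep 32 bits even if long is wider */
        }
        return h;
    }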

    That said, unless some sort of objective measure is given for the
    terms good and fast, the last quoted sentence is pretty much
    meaningless.
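
    A minimal sketch of what such a C90-portable function might look like,
    hashing the pointer's object representation byte by byte so that no
    pointer-to-integer conversion is needed; the byte-wise walk and the
    32-bit FNV-1a constants are illustrative choices, not code posted in
    this thread:

    #include <stddef.h>

    /* Hash the stored representation of a pointer one byte at a time,
     * using only unsigned long (at least 32 bits in C90).  The
     * constants are the 32-bit FNV-1a offset basis and prime; any
     * reasonable 32-bit mixer could be substituted.
     */
    unsigned long hash_pointer(const void *p)
    {
        const unsigned char *bytes = (const unsigned char *) &p;
        unsigned long h = 2166136261UL;
        size_t i;

        for (i = 0; i < sizeof p; i++) {
            h ^= (unsigned long) bytes[i];
            h *= 16777619UL;
            h &= 0xFFFFFFFFUL;   /* keep the value within 32 bits */
        }
        return h;
    }

    Whether something along these lines counts as "good" is, of course,
    exactly the kind of question the measurements discussed further down
    are meant to answer.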

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Sun Jun 16 04:38:58 2024
    Michael S <already5chosen@yahoo.com> writes:

    On Wed, 05 Jun 2024 08:58:46 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    If we're trying to write a general hash function, it should work
    well across a wide range of input key sets, and also for different
    kinds of hash table management. Let's look at the output side
    first. There are two independent axes for hash tables: table
    size, and direct mapping versus using rehashing. For table size,
    the two prominent choices are some prime in the appropriate range,
    or else a power of two. Clearly any size could be chosen, but
    primes have the nice property that they divide up the space of hash
    values very evenly, whereas powers of two allow a cheaper way of
    reducing a hash value to the range of table indices (anding versus a
    modulo operation). A good hash function should work well with both.
    The choice of direct mapping versus rehashing depends on expected
    occupancy rates (which I am used to calling "load factor"). For a
    load factor of 75 to 80 percent or less, generally rehashing is
    better. Direct mapping has the advantage that it can handle load
    factors above 100%, at a cost of dealing with whatever subsidiary
    data structure (usually linked lists) is used when two or more keys
    map to the same index in the hash table. (I expect you know all
    that, I wrote it mostly just to clarify my own thinking.)
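
    For concreteness, the two reductions just mentioned might look like
    this (a sketch only; the table sizes are arbitrary examples):

    /* Reducing a hash value to a table index, for the two common
     * choices of table size described above.
     */
    unsigned long index_prime(unsigned long h)
    {
        return h % 65521UL;          /* 65521 is prime */
    }

    unsigned long index_power_of_two(unsigned long h)
    {
        return h & (65536UL - 1);    /* size 2^16: a mask is cheaper */
    }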

    I never had much interest in the implementation of hash methods; I
    typically just took whatever the language or library provides and
    used it.
    All my knowledge of the internals is 50-year-old news (TAoCP - and
    no, I did not read it 50 years ago, but I read it for the first time
    slightly less than 40 years ago, and I am not sure I have re-read
    this particular topic since then).
    So, I am not sure that I understood everything above.
    In particular, w.r.t. rehashing, IIRC, Knuth came to the conclusion
    that a simple circular increment of the index is superior to more
    elaborate methods. I don't know whether that conclusion has changed
    in the years since, and I am not even sure that I recollect it
    correctly.

    Hashing is discussed in volume 3 of TAOCP. I first read volume 3
    just after it came out; I reviewed the section on hashing again
    after reading your message. Your remembrance doesn't match what
    is said in that section. Moreover some of the conclusions Knuth
    wrote in 1973 are certainly out of date now. General rehashing,
    also called open addressing, is overall a net win, especially for
    hash tables that have an above-average load factor. Also there
    are variations such as cuckoo hash that depend on using open
    addressing.
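
    For reference, the "simple circular increment" remembered above is
    what is now usually called linear probing, one form of open
    addressing. A minimal lookup sketch, assuming a power-of-two table
    of pointer keys with NULL marking an empty slot (the names and
    layout are illustrative, not taken from any posted code):

    #include <stddef.h>

    #define TABLE_SIZE 65536UL          /* must be a power of two here */

    /* Linear probing: on a collision, step circularly to the next slot
     * until either the key or an empty slot is found.  Assumes the
     * table is never completely full.
     */
    void **probe(void **table, const void *key, unsigned long hash)
    {
        unsigned long i = hash & (TABLE_SIZE - 1);

        while (table[i] != NULL && table[i] != key)
            i = (i + 1) & (TABLE_SIZE - 1);
        return &table[i];               /* matching slot, or the empty slot */
    }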

    Despite (or due to?) my relative ignorance of the topic, I'd dare
    to disagree with a couple of your points:
    1. I don't think that a hash function should be evaluated in
    isolation from the rest of the algorithm. IMHO, they should be
    co-developed. There is nothing wrong with a hash function that
    works well with one particular "back end" and poorly with all
    others, as long as the limitations are clearly stated.

    There are several reasons why a co-development approach usually
    isn't a good way to go. Malcolm started this thread by asking for
    a good hash function for pointers (to be used in a portable ANSI C
    setting). We don't know, nor do we have any control over, how
    Malcolm might use the function (and that goes double for any random
    reader of comp.lang.c who might see some posted hash function and
    decide to use it). Or we might want to use some sort of generic
    table manager that takes a pointer-to-function to do the hashing,
    but no way to affect how the hash values are used. Another reason
    has to do with total code size. If there are M data types to be
    hashed, and N applications that use hash values, coupling the code
    that does the hashing with the code that makes use of the hash
    values means there will be M*N parts to program, whereas separating
    the two aspects means there will be only M+N parts to program. The
    M+N path almost certainly means a big reduction in the overall
    amount of effort needed.

    Of course there are some specialized applications, such as perfect
    hashing, where the hashing method needs to be tailored to some sort
    of external considerations. In most cases though a general hashing
    function is a better choice.
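
    The M+N decoupling can be made concrete with an interface along the
    following lines (purely illustrative; nothing like it appears in the
    thread): the table code sees the hash only through a function
    pointer, so key types and table policies can be written, tested, and
    swapped independently.

    /* The table takes the hash function as a parameter, so M hash
     * functions and N table implementations stay independent of one
     * another instead of being combined M*N ways.
     */
    typedef unsigned long (*hash_fn)(const void *key);

    struct hash_table;   /* opaque; supplied by the table implementation */

    struct hash_table *table_create(unsigned long nslots, hash_fn hash);
    void *table_find(struct hash_table *t, const void *key);
    int table_insert(struct hash_table *t, const void *key, void *value);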

    As for how evaluation should be done, it's important to evaluate
    hash functions in isolation for the same reasons it's important to
    evaluate random number generators in isolation: the results need
    to be repeatable, and they need to give a quantitative measure of
    how "mixed up" the outputs are for similar inputs. Knuth talks
    about random numbers in volume 2 of TAOCP, spending more than 100
    pages on the subject, but he gives an essential precept before the
    end of page 5: "Random numbers should not be generated with a
    method chosen at random. Some theory should be used." The same is
    true of hash functions, and a repeatable quantitative evaluation of
    the function by itself is a key element of that process.

    2. I don't like the "modulo" method of reducing a hash value to a
    range. I don't like it for any chosen table size, not just for a
    power of two, but even for a prime. Of course, for a power-of-two
    size my dislike of it is deeper.

    I understand your reaction, but this question is a separate topic.
    Part of the reason for designing a high-quality hash function is so
    that it doesn't matter which bits are used: high order bits, low
    order bits, some of each, every other bit, or every third or fourth
    bit, they should all give good results (where "good" means nearby
    inputs give what appear to be uncorrelated outputs). If we design
    our hash function to do that then it doesn't matter whether a hash
    table does a mod or uses a different method. Obviously that is
    better than a hash function that works okay with some hash tables
    but works poorly with others.
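
    One cheap, repeatable way to test the "any bits should do" property
    is to bucket the same key set by several different bit slices of the
    hash and compare the occupancy. A sketch, assuming a hash with the
    shape of the earlier hash_pointer example (the slice width, key
    count, and key spacing are arbitrary choices):

    #include <stdio.h>
    #include <string.h>

    extern unsigned long hash_pointer(const void *p);  /* hash under test */

    #define NKEYS  10000
    #define STRIDE 64

    static char arena[NKEYS * STRIDE];  /* supplies NKEYS realistic addresses */

    /* Count how many of 256 buckets are hit when keys are bucketed by
     * the 8-bit slice of the hash that starts at bit 'shift'.  A
     * well-mixed hash gives similar counts for every slice.
     */
    static unsigned occupied(unsigned shift)
    {
        unsigned char used[256];
        unsigned count, i, k;

        memset(used, 0, sizeof used);
        for (k = 0; k < NKEYS; k++)
            used[(hash_pointer(arena + k * STRIDE) >> shift) & 0xFF] = 1;
        for (count = 0, i = 0; i < 256; i++)
            count += used[i];
        return count;
    }

    int main(void)
    {
        unsigned shift;

        for (shift = 0; shift <= 24; shift += 8)
            printf("bits %2u..%2u: %u of 256 buckets occupied\n",
                   shift, shift + 7, occupied(shift));
        return 0;
    }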

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Mon Jun 17 00:56:40 2024
    Michael S <already5chosen@yahoo.com> writes:

    On Wed, 05 Jun 2024 21:40:27 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Michael S <already5chosen@yahoo.com> writes:

    On Wed, 05 Jun 2024 08:58:46 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    I did get your own hash function out and put it into my little
    test rig. There may be summary comments below.

    <snip>

    As you have probably already noticed, I like bottom lines.
    What is the bottom line here?

    The bottom line is that both of the multiply-by-a-large-prime
    hashes should be avoided.

    I am not convinced.

    Let me amend my earlier statement. A hash function that simply
    multiplies its input by an integer constant should be avoided
    if the goal is to produce a hash function of decent quality
    rather than one of mediocre quality.
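
    For readers joining late, the kind of function being talked about
    here is the bare one-line multiplicative hash, something like the
    sketch below; the particular constant is the familiar Knuth-style
    choice and is only an illustration, not necessarily the one that was
    tested in this thread.

    /* A bare multiplicative hash: fast, but output bit i depends only
     * on input bits 0..i, so the low-order output bits mix poorly.
     */
    unsigned long mult_hash(unsigned long key)
    {
        return key * 2654435761UL;   /* odd constant near 2^32 / golden ratio */
    }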

    Of course, it is no good for Malcolm, because it can't be coded
    in what Malcolm erroneously calls "ANSI C".

    I don't know why you say that. C was an ANSI standard before it
    was an ISO standard. Or is it that you think that the language
    Malcolm is intent on using does not conform to C90/C89/ANSI C?


    Did you encounter cases in which the almost-Bonita's-but-not-quite
    hash function performed badly?

    Yes. A clear example is using a directly mapped hash table that
    has a power-of-two number of elements, with inputs all from
    malloc( 48 ) calls. All keys map to one of only 1024 entries, in
    a hash table that has 65536 slots.
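
    A small experiment along these lines can be sketched (an
    illustration only: the bare multiplicative hash and the assumption
    that the allocator hands back blocks on a fixed 64-byte stride are
    stand-ins for whatever was actually tested):

    #include <stdio.h>
    #include <stdlib.h>

    #define NKEYS 49152UL

    /* Hash each pointer returned by malloc(48) with a bare
     * multiplicative hash, index a 65536-slot table with the low 16
     * bits, and count how many distinct slots are ever used.  If the
     * allocator lays the blocks out on a 64-byte stride, at most
     * 65536/64 = 1024 slots are reachable.
     */
    int main(void)
    {
        static unsigned char used[65536];
        unsigned long occupied = 0, i;

        for (i = 0; i < NKEYS; i++) {
            void *p = malloc(48);

            if (p != NULL) {
                /* pointer-to-integer conversion is implementation-defined,
                 * which is acceptable for a measurement like this */
                unsigned long h = (unsigned long) (size_t) p * 2654435761UL;
                used[h & 0xFFFFUL] = 1;
            }
        }
        for (i = 0; i < 65536UL; i++)
            occupied += used[i];
        printf("%lu of 65536 slots occupied\n", occupied);
        return 0;
    }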

    I don't see it.
    $ ./a 0x10000 0xC000 48
    ref: 49152 keys, 65536 slots, 34590 occupied 11295 collisions.
    Worst collision: 7
    uut: 49152 keys, 65536 slots, 35791 occupied 11872 collisions.
    Worst collision: 3

    You have run a poor test. The point of quality evaluation
    testing is to look for potential problems, not to disguise
    them.

    It sounds like in your tests you don't use 'linear scaling to
    table size' for translating the hash value to a table index.
    IMHO, you should.

    Like I said in another response, the question of how hash tables
    should make use of hash values is a separate topic. The point
    of the tests I ran was to evaluate the quality of hash functions,
    not to look at how hash tables should be coded.
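
    For reference, "linear scaling to table size" presumably means
    spreading the full hash range linearly over the index range with a
    multiply-and-shift, rather than keeping the low bits or taking a
    remainder. A sketch, assuming 32-bit hash values (note the 64-bit
    intermediate product, which is exactly the kind of arithmetic C90
    does not portably supply):

    /* Map a 32-bit hash value h32 onto 0 .. nslots-1 by linear
     * scaling: take the high half of the 64-bit product.  Needs a
     * 64-bit unsigned type (here C99's unsigned long long).
     */
    unsigned long scale_to_index(unsigned long h32, unsigned long nslots)
    {
        return (unsigned long) (((unsigned long long) h32 * nslots) >> 32);
    }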

    Here are some statistics I gathered from a more comprehensive
    set of tests for several hash functions. The tests look at
    aggregates of measured values for all output bits (considered
    eight bits at a time) as a function of all input bits (considered
    sixteen bits at a time). The first line in each pair summarizes the
    low values in each set, and the second line summarizes the
    high values in each set, in each case showing the range, average,
    and standard deviation. The first set, hash 0, uses the AES
    code you posted. I think it's easy to see the quality get
    worse going from top to bottom.

    hash 0
    186 .. 201 avg: 195.39 dev: 3.71
    312 .. 339 avg: 322.31 dev: 5.85

    hash 1
    186 .. 203 avg: 195.48 dev: 4.01
    310 .. 344 avg: 322.45 dev: 6.31

    hash 2
    160 .. 202 avg: 193.56 dev: 7.70
    311 .. 348 avg: 322.22 dev: 7.91

    hash 3
    143 .. 202 avg: 185.52 dev: 10.15
    311 .. 387 avg: 329.52 dev: 14.10

    hash 4
    0 .. 212 avg: 9.94 dev: 45.16
    270 .. 65536 avg: 44546.56 dev: 28809.47
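
    The test harness itself is not shown, but the general flavor of such
    a measurement can be sketched: flip input bits one at a time and
    tally how often each output bit changes, then summarize the tallies.
    (This is a generic avalanche-style check with arbitrary key
    generation and sample count, not a reconstruction of the program
    that produced the numbers above; hash32 stands for whatever 32-bit
    hash is under test.)

    #include <stdio.h>
    #include <stdlib.h>

    extern unsigned long hash32(unsigned long key);  /* hash under test */

    #define TRIALS 10000

    /* For each input bit, flip it in a batch of keys and count how
     * often each output bit changes.  Ideally every tally is close to
     * TRIALS/2; a wide spread means some output bits barely depend on
     * some input bits.
     */
    int main(void)
    {
        static unsigned long flips[32][32];
        unsigned long lo = TRIALS, hi = 0;
        unsigned in, out, t;

        for (t = 0; t < TRIALS; t++) {
            unsigned long key = ((unsigned long) rand() << 16)
                                ^ (unsigned long) rand();
            unsigned long h = hash32(key);

            for (in = 0; in < 32; in++) {
                unsigned long d = h ^ hash32(key ^ (1UL << in));

                for (out = 0; out < 32; out++)
                    if (d & (1UL << out))
                        flips[in][out]++;
            }
        }
        for (in = 0; in < 32; in++)
            for (out = 0; out < 32; out++) {
                if (flips[in][out] < lo) lo = flips[in][out];
                if (flips[in][out] > hi) hi = flips[in][out];
            }
        printf("per-bit flip counts: %lu .. %lu (ideal is near %d)\n",
               lo, hi, TRIALS / 2);
        return 0;
    }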

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Mon Jun 17 12:39:26 2024
    On Mon, 17 Jun 2024 00:56:40 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:


    I don't know why you say that. C was an ANSI standard before it
    was an ISO standard. Or is it that you think that the language
    Malcolm is intent on using does not conform to C90/C89/ANSI C?



    All I wanted to point out with this comment is that ANSI recognizes
    ISO/IEC 9899:2018 as its current C Standard and will probably
    recognize the next ISO C Standard pretty soon. For that reason I
    think that names like C89 or C90 are preferable (to ANSI C) when we
    want to refer to this particular variant of the language.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Tue Jun 18 10:47:28 2024
    Michael S <already5chosen@yahoo.com> writes:

    On Mon, 17 Jun 2024 00:56:40 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    I don't know why you say that. C was an ANSI standard before it
    was an ISO standard. Or is it that you think that the language
    Malcolm is intent on using does not conform to C90/C89/ANSI C?

    All I wanted to point out with this comment is that ANSI recognizes
    ISO/IEC 9899:2018 as its current C Standard and will probably
    recognize the next ISO C Standard pretty soon. For that reason I
    think that names like C89 or C90 are preferable (to ANSI C) when we
    want to refer to this particular variant of the language.

    I see. So it isn't that you think "ANSI C" is wrong, just
    that it might be misleading or that C89 or C90 is preferable.
    Personally I would be surprised if someone used "ANSI C" to
    mean anything other than C89/C90, but certainly other people
    could have a different reaction.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Keith Thompson on Tue Jun 18 16:17:51 2024
    Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    Michael S <already5chosen@yahoo.com> writes:

    On Mon, 17 Jun 2024 00:56:40 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    I don't know why you say that. C was an ANSI standard before it
    was an ISO standard. Or is it that you think that the language
    Malcolm is intent on using does not conform to C90/C89/ANSI C?

    All I wanted to point out with this comment is that ANSI recognizes
    ISO/IEC 9899:2018 as its current C Standard and will probably
    recognize the next ISO C Standard pretty soon. For that reason I
    think that names like C89 or C90 are preferable (to ANSI C) when we
    want to refer to this particular variant of the language.

    I see. So it isn't that you think "ANSI C" is wrong, just
    that it might be misleading or that C89 or C90 is preferable.
    Personally I would be surprised if someone used "ANSI C" to
    mean anything other than C89/C90, but certainly other people
    could have a different reaction.

    [...] I don't necessarily complain when someone uses the phrase
    "ANSI C" to mean C89/C90, but I try to avoid it myself in favor
    of "C89" or "C90".

    I'm reminded that gcc accepts the option -ansi as a synonym for
    the option -std=c90.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From James Kuyper@21:1/5 to Tim Rentsch on Tue Jun 18 19:23:37 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    ...
    I see. So it isn't that you think "ANSI C" is wrong, just
    that it might be misleading or that C89 or C90 is preferable.

    ANSI's documentation is quite clear about the fact that there is, at any
    time, only one ANSI C standard, which is the version most recently
    approved - the older versions cease to be ANSI standards as soon as
    newer ones are approved. C17 is the current ANSI standard for C.
    Therefore, using "ANSI C" to mean specifically C89 is inaccurate, unless
    the wording makes it clear that it's referring to a time period when
    that was the current ANSI C standard.

    Personally I would be surprised if someone used "ANSI C" to
    mean anything other than C89/C90,

    I would expect the term to be used almost exclusively by people who
    incorrectly think that it means C89. Since it became an ISO standard
    with C90, few people who care about the latest version of the C standard
    worry about the parallel ANSI standard. Most people never got into the
    habit of using "ISO C" to mean specifically C90, so they didn't need to
    break that habit when it was superseded by C99.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to James Kuyper on Sun Jun 23 11:23:08 2024
    James Kuyper <jameskuyper@alumni.caltech.edu> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    ...

    I see. So it isn't that you think "ANSI C" is wrong, just
    that it might be misleading or that C89 or C90 is preferable.

    ANSI's documentation is quite clear about the fact that there is, at any time, only one ANSI C standard, [...]

    I'm not talking about ANSI's documentation. I'm talking about
    how people use the term ANSI C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)