I wanted to optimize fmod to be a bit faster. This is my C++20 solution. [snip]
double myFmod( double x, double y )
It's about six times faster than the glibc 2.31 solution in my
benchmark. The returned NaNs and the raised exceptions are MSVC-
and glibc-compatible.
On 24.02.2025 at 13:00, Muttley@DastardlyHQ.org wrote:
On Mon, 24 Feb 2025 11:48:08 +0100
Bonita Montero <Bonita.Montero@gmail.com> wibbled:
I wanted to optimize fmod to be a bit faster. This is my C++20 solution. [snip]
double myFmod( double x, double y )
It's about six times faster than the glibc 2.31 solution in my
benchmark. The returned NaNs and the raised exceptions are MSVC-
and glibc-compatible.
double myFmod(double x, double y)
{
    double div = x / y;
    return y * (div - (long)div);
}
If the exponent difference between x and y is large enough this returns results which are larger than y. glibc does it completely with integer operations also.
On Mon, 24 Feb 2025 13:09:50 -0000 (UTC)
Muttley@DastardlyHQ.org wrote:
If the values are so large or small that you start to get floating point based errors then you should probably be using integer arithmetic or a large number library like GMP anyway.
Your method will sometimes produce results that are 1 LSB off relative to the IEEE-754 prescription even when values are neither small nor large. And sometimes 1 LSB off means that the result is 2x off.
For example, for x=0.9999999999999999, y=0.9999999999999998 your method produces 2.2204460492503126e-16. The correct result is, of course, 1.1102230246251565e-16.
Also, I don't think that your method is any faster than correct methods.
On 24.02.2025 at 14:09, Muttley@DastardlyHQ.org wrote:
If the values are so large or small that you start to get floating point based errors then you should probably be using integer arithmetic or a large number library like GMP anyway.
There's no need for large integer arithmetic since each calculation step results in a mantissa with equal or fewer bits than the divisor. Even if the exponents are close enough that no integer bits are missing, the following multiplication is very likely to lose precision. All current solutions (MSVC, libstdc++) work with integer operations and are always 100% precise.
On Mon, 24 Feb 2025 13:48:02 +0100
Bonita Montero <Bonita.Montero@gmail.com> wibbled:
On 24.02.2025 at 13:00, Muttley@DastardlyHQ.org wrote:
On Mon, 24 Feb 2025 11:48:08 +0100
Bonita Montero <Bonita.Montero@gmail.com> wibbled:
I wanted to optimize fmod to be a bit faster. This is my C++20 solution. [snip]
double myFmod( double x, double y )
It's about six times faster than the glibc 2.31 solution in my
benchmark. The returned NaNs and the raised exceptions are MSVC-
and glibc-compatible.
double myFmod(double x, double y)
{
    double div = x / y;
    return y * (div - (long)div);
}
If the exponent difference between x and y is large enough this returns results which are larger than y. glibc does it completely with integer operations also.
If the values are so large or small that you start to get floating point based errors then you should probably be using integer arithmetic or a large number library like GMP anyway.
On 24.02.2025 at 16:13, Michael S wrote:
Do you have real application for fast fmod() or just playing?
I experimented with x87 FPREM and wanted to know whether it is precise; it isn't, and the results can be > y. So I developed my own routine which is always 100% precise.
As long as y is positive and abs(x/y) <= 2**53, a very simple formula will produce a precise result: fma(trunc(x/fabs(y)), -fabs(y), x).
The multiplication will mostly drop bits, so the difference might become larger than y.
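A minimal sketch of the quoted fma/trunc formula, under its stated preconditions (finite inputs, y != 0, |x/y| <= 2**53); the function name is mine and this is an illustration, not a full fmod replacement:

#include <cmath>

// Sketch of the formula above: x - q*|y| computed with a single rounding.
// Only valid while |x/y| <= 2**53; no NaN/Inf/zero handling.
double fmodFmaSketch( double x, double y )
{
    double ay = std::fabs( y );
    double q  = std::trunc( x / ay );  // integral quotient of the rounded division
    return std::fma( q, -ay, x );
}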
On Mon, 24 Feb 2025 16:22:46 +0200
Michael S <already5chosen@yahoo.com> wibbled:
On Mon, 24 Feb 2025 13:09:50 -0000 (UTC)
Muttley@DastardlyHQ.org wrote:
If the values are so large or small that you start to get floating point based errors then you should probably be using integer arithmetic or a large number library like GMP anyway.
Your method will sometimes produce results that are 1 LSB off relative to the IEEE-754 prescription even when values are neither small nor large.
And sometimes 1 LSB off means that result is 2x off.
For example, for x=0.9999999999999999, y=0.9999999999999998 your
method produces 2.2204460492503126e-16. A correct result is, of
course, 1.1102230246251565e-16
Frankly I doubt anyone would care, it's zero in all but name.
Also, I don't think that your method is any faster than correct
methods.
Don't know, but it's only 3 mathematical operations, all of which can be done by the hardware, so it's going to be pretty fast.
On Mon, 24 Feb 2025 15:10:53 -0000 (UTC)
Muttley@DastardlyHQ.org wrote:
Don't know, but it's only 3 mathematical operations, all of which can be done by the hardware, so it's going to be pretty fast.
Looks like 4 operations to me - division, truncation, subtraction, multiplication - if the compiler takes it literally, which it probably will.
Nevertheless, after a bit of thinking I concur that your formula is faster than 100% correct methods. Initially, I didn't take into account all the difficulties that correct methods have to face in cases of very large x to y ratios.
However, your method is approximately the same speed as the *mostly correct* method shown in my post above. Maybe yours is even a little slower, at least as long as we use a good optimizing compiler and target modern CPUs that support trunc() and fma() as fast hardware instructions.
I wanted to optimize fmod to be a bit faster. This is my C++20 solution.
double myFmod( double x, double y )
{
    constexpr uint64_t
        SIGN = 1ull << 63,
        IMPLICIT = 1ull << 52,
        MANT = IMPLICIT - 1,
        QBIT = 1ull << 51;
    uint64_t const
        binX = bit_cast<uint64_t>( x ),
        binY = bit_cast<uint64_t>( y );
    static auto abs = []( uint64_t m ) { return m & ~SIGN; };
    auto isNaN = []( uint64_t m ) { return abs( m ) >= 0x7FF0000000000001u; };
    auto isSig = []( uint64_t m ) { return !(m & QBIT); };
    if( isNaN( binX ) ) [[unlikely]] // x == NaN
#if defined(_MSC_VER)
        return bit_cast<double>( isNaN( binY ) ? binY | binX & binY & QBIT : binX );
#else
    {
        if( isSig( binX ) || isNaN( binY ) && isSig( binY ) ) [[unlikely]]
            feraiseexcept( FE_INVALID );
        return bit_cast<double>( binX | QBIT );
    }
#endif
    if( isNaN( binY ) ) [[unlikely]] // x != NaN || y == NaN
#if defined(_MSC_VER)
        return y;
#else
    {
        if( isSig( binY ) ) [[unlikely]]
            feraiseexcept( FE_INVALID );
        return bit_cast<double>( binY | QBIT );
    }
#endif
    auto isInf = []( uint64_t m ) { return abs( m ) == 0x7FF0000000000000u; };
    if( isInf( binX ) ) // x == Inf
    {
        feraiseexcept( FE_INVALID );
#if defined(_MSC_VER)
        return bit_cast<double>( binX & ~MANT | QBIT );
#else
        return -numeric_limits<double>::quiet_NaN();
#endif
    }
    if( !abs( binY ) ) [[unlikely]] // y == 0
    {
        feraiseexcept( FE_INVALID );
#if defined(_MSC_VER)
        return numeric_limits<double>::quiet_NaN();
#else
        return -numeric_limits<double>::quiet_NaN();
#endif
    }
    if( !abs( binX ) || isInf( binY ) ) [[unlikely]] // x == 0 || y == Inf
        return x;
    auto exp = []( uint64_t b ) -> int { return b >> 52 & 0x7FF; };
    int
        expX = exp( binX ),
        expY = exp( binY );
    auto mant = []( uint64_t b ) { return b & MANT; };
    uint64_t
        mantX = mant( binX ),
        mantY = mant( binY );
    // left-align the mantissa so that bit 52 is the leading one
    static auto normalize = []( int &exp, uint64_t &mant )
    {
        unsigned shift = countl_zero( mant ) - 11;
        mant <<= shift;
        exp -= shift;
    };
    // set the implicit bit, or normalize a denormal input
    auto build = []( int &exp, uint64_t &mant )
    {
        if( exp ) [[likely]]
            mant |= IMPLICIT;
        else
        {
            exp = 1;
            normalize( exp, mant );
        }
    };
    build( expX, mantX );
    build( expY, mantY );
    uint64_t signX = binX & SIGN;
    int expDiff;
    // reduce the exponent gap up to 11 bits per step: shift the remainder
    // left and take it modulo the divisor's mantissa
    while( (expDiff = expX - expY) > 0 )
    {
        unsigned bits = expDiff <= 11 ? expDiff : 11;
        if( !(mantX = (mantX << bits) % mantY) ) [[unlikely]]
            return bit_cast<double>( signX );
        expX -= bits;
        normalize( expX, mantX );
    }
    if( !expDiff && mantX >= mantY ) [[unlikely]]
        if( (mantX -= mantY) ) [[likely]]
            normalize( expX, mantX );
        else
            return bit_cast<double>( signX );
    if( expX <= 0 ) [[unlikely]] // denormal result
    {
        assert(expX >= -51);
        mantX = mantX >> (unsigned)(-expX + 1);
        expX = 0;
    }
    return bit_cast<double>( signX | (uint64_t)expX << 52 | mantX & MANT );
}
It's about six times faster than the glibc 2.31 solution in my
benchmark. The returned NaNs and the raised exceptions are MSVC-
and glibc-compatible.
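To make the reduction loop above easier to follow, here is a stripped-down demonstration with hypothetical values (it omits the normalization step, which the real code performs after every iteration):

#include <cstdint>
#include <cstdio>

// Stripped-down illustration of the reduction loop in myFmod: the 53-bit
// remainder is shifted left by at most 11 bits per step, so the shifted
// value still fits into 64 bits before the integer modulo.
int main()
{
    uint64_t mantX = 0x10000000000000u; // 2^52, a normalized mantissa
    uint64_t mantY = 0x1921FB54442D18u; // some 53-bit divisor mantissa
    int expDiff = 40;                   // pretend exponent difference
    while( expDiff > 0 && mantX )
    {
        int bits = expDiff <= 11 ? expDiff : 11;
        mantX = (mantX << bits) % mantY;
        expDiff -= bits;
    }
    std::printf( "remainder mantissa: %llx\n", (unsigned long long)mantX );
}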
On Mon, 24 Feb 2025 11:48:08 +0100, Bonita Montero
<Bonita.Montero@gmail.com> wrote:
It's about six times faster than the glibc 2.31 solution in my
benchmark. The returned NaNs and the raised exceptions are MSVC-
and glibc-compatible.
double myFmod(double x, double y)
{
    double div = x / y;
    return y * (div - std::round(div));
}
On 24.02.2025 at 16:52, Michael S wrote:
That is why I don't use multiplication. Did you ever ask yourself what the meaning of 'f' in 'fma' is?
The FMA-instructions produce the same results:
#include <iostream>
#include <random>
#include <bit>
#include <cmath>
#include <iomanip>
#include <intrin.h>
using namespace std;
int main()
{
    auto fma = []( double a, double b, double c )
    {
        __m128d mA, mB, mC;
        mA.m128d_f64[0] = a;
        mB.m128d_f64[0] = b;
        mC.m128d_f64[0] = c;
        return _mm_fmadd_pd( mA, mB, mC ).m128d_f64[0];
    };
    mt19937_64 mt;
    uniform_int_distribution<uint64_t> finites( 1, 0x7FEFFFFFFFFFFFFFu );
    auto rnd = [&]() -> double { return bit_cast<double>( finites( mt ) ); };
    ptrdiff_t nEQs = 0;
    for( ptrdiff_t r = 0; r != 1'000'000; ++r )
    {
        double
            a = rnd(), b = rnd(), c = rnd(),
            rA = fma( a, b, c ),
            rB = a * b + c;
        nEQs += rA != rB; // count mismatches between fused and unfused results
    }
    cout << hexfloat << nEQs / 1.0e6 << endl;
}
On 25.02.2025 at 16:26, Michael S wrote:
On Tue, 25 Feb 2025 09:09:21 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
On 24.02.2025 at 16:52, Michael S wrote:
That is why I don't use multiplication. Did you ever ask yourself what the meaning of 'f' in 'fma' is?
The FMA-instructions produce the same results:
[test program quoted above; snipped]
GIGO.
Do a proper test then you'd get a proper answer.
The test is proper with MSVC since MSVC doesn't replace the "a * b + c" operation with an FMA operation. With your code it isn't guaranteed that the CPU-specific FMA operations are used. I'm using the SSE FMA operation explicitly, and I'm using it for a million random finite double values.
On 24.02.2025 at 23:21, Mr Flibble wrote:
double myFmod(double x, double y)
{
    double div = x / y;
    return y * (div - std::round(div));
}
Doesn't work, not only for the reasons already mentioned.
On 25.02.2025 at 18:17, Michael S wrote:
Don't invent your own fma(). Use the one provided by the library. Then MSVC will do what it is prescribed to do by the standard.
I want to be sure that I'm using the SSE FMA operation and not a conventional substitute of two instructions.
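For what it's worth, there is a library-only way to check whether std::fma really fuses, without reaching for intrinsics: a true FMA can recover the rounding error of a product, while a mul-plus-add substitute cannot. A hedged sketch (my check, not either poster's test):

#include <cmath>
#include <iostream>

int main()
{
    double a = 1.0 + 0x1p-27;
    double p = a * a;                   // product rounded to double
    double err = std::fma( a, a, -p );  // exact a*a - p only with a real FMA
    // With a fused operation err == 2^-54; with mul+add it collapses to 0.
    std::cout << (err != 0.0 ? "fused\n" : "not fused\n");
}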
Originally, I didn't even try to investigate what garbage exactly you are feeding to your test. Now I took a look. It seems that you are doing something fundamentally stupid, like making all fma inputs positive.
I extended the test to incorporate all possible double bit representations (taken from mt()) - with no difference.
On Tue, 25 Feb 2025 07:37:23 +0100, Bonita Montero
<Bonita.Montero@gmail.com> wrote:
On 24.02.2025 at 23:21, Mr Flibble wrote:
double myFmod(double x, double y)
{
    double div = x / y;
    return y * (div - std::round(div));
}
Doesn't work, not only for the reasons already mentioned.
double myFmod(double x, double y)
{
    double div = x / y;
    return y * (div - std::trunc(div));
}
/Flibble
double myFmod(double x, double y)
{
    double div = x / y;
    return y * (div - std::trunc(div));
}
Here is an example of the program that prints 250508 on Intel Haswell
CPU, but prints 0 on Intel Ivy Bridge.
Compiled as 'cl -O1 -W4 -MD fma_tst0.c'.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
static unsigned long long rnd(void)
{
    unsigned long long x = 0;
    for (int i = 0; i < 5; ++i)
        x = (x << 15) + (rand() & 0x7FFF);
    return x;
}

int main(void)
{
    srand(1);
    int n = 0;
    for (int i = 0; i < 1000000; ++i) {
        double x = rnd() * 0x1p-64;
        double y = rnd() * 0x1p-64;
        double z = rnd() * 0x1p-114;
        double r1 = x*y + z;
        double r2 = fma(x, y, z);
        n += r1 != r2;
    }
    printf("%d\n", n);
    return 0;
}
It is certainly worth a bug report, but I am afraid that Microsoft will do nothing to fix it, likely claiming that they don't care about old hardware.
On Wed, 26 Feb 2025 00:36:19 +0200
Michael S <already5chosen@yahoo.com> wibbled:
Here is an example of the program that prints 250508 on an Intel Haswell CPU, but prints 0 on Intel Ivy Bridge. Compiled as 'cl -O1 -W4 -MD fma_tst0.c'. [program quoted above; snipped]
It is certainly worth a bug report, but I am afraid that Microsoft will do nothing to fix it, likely claiming that they don't care about old hardware.
Just FYI - it also returns 0 when compiled by Clang on an ARM Mac.
On Wed, 26 Feb 2025 08:16:01 -0000 (UTC)
Muttley@DastardlyHQ.org wrote:
I played a little on godbolt and it seems that the bug is relatively
new. clang 13 still generates correct code. clang 14 does not. I.e.
slightly less than 3 years.
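A hedged aside: if one wants the r1 = x*y + z in Michael's test to stay un-fused regardless of compiler version, contraction can be disabled explicitly. gcc/clang accept -ffp-contract=off; the standard pragma below is honored by clang (gcc may ignore it in C++):

#include <cmath>

// Keep the separately rounded product: block FMA contraction locally.
double unfused( double x, double y, double z )
{
#pragma STDC FP_CONTRACT OFF
    return x * y + z;           // product and sum each rounded once
}

// Explicitly fused for comparison: single rounding of x*y + z.
double fused( double x, double y, double z )
{
    return std::fma( x, y, z );
}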
I wanted to optimize fmod to be a bit faster. This is my C++20 solution. [myFmod code quoted above; snipped]
It's about six times faster than the glibc 2.31 solution in my benchmark. The returned NaNs and the raised exceptions are MSVC- and glibc-compatible.
double myFmod(double x, double y)
{
    return x / y - std::trunc(x / y) * y;
}
/Flibble
On Thu, 27 Feb 2025 21:16:50 +0000
Mr Flibble <leigh@i42.co.uk> wrote:
double myFmod(double x, double y)
{
    return x / y - std::trunc(x / y) * y;
}
/Flibble
Nonsense.
The one below is not nonsense, but still very bad.

double myFmod(double x, double y)
{
    return x - trunc(x / y) * y;
}

Even ignoring potential overflow during the division, this method is very imprecise:

(1e3/9 - trunc(1e3/9))*9 = 1.000000000000028
(1e6/9 - trunc(1e6/9))*9 = 0.999999999985448
(1e9/9 - trunc(1e9/9))*9 = 0.999999940395355
(1e12/9 - trunc(1e12/9))*9 = 1.000030517578125
(1e15/9 - trunc(1e15/9))*9 = 0.984375

OTOH:

1e3 - trunc(1e3/9)*9 = 1
1e6 - trunc(1e6/9)*9 = 1
1e9 - trunc(1e9/9)*9 = 1
1e12 - trunc(1e12/9)*9 = 1
1e15 - trunc(1e15/9)*9 = 1
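A small harness to reproduce the comparison above (my code, assuming the default round-to-nearest mode):

#include <cmath>
#include <cstdio>

// Reproduces the table above: x - trunc(x/y)*y stays exact for these
// inputs, while (x/y - trunc(x/y))*y accumulates rounding error.
int main()
{
    for( double x : { 1e3, 1e6, 1e9, 1e12, 1e15 } )
    {
        double q = std::trunc( x / 9 );
        std::printf( "x=%g: scaled fraction = %.17g, direct = %.17g\n",
                     x, (x / 9 - q) * 9, x - q * 9 );
    }
}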
I wanted to optimize fmod to be a bit faster. This is my C++20 solution. [myFmod code quoted above; snipped]
It's about six times faster than the glibc 2.31 solution in my benchmark. The returned NaNs and the raised exceptions are MSVC- and glibc-compatible.
On 28.02.2025 at 19:30, Mr Flibble wrote:
double my_fmod(double x, double y)
{
    if (y == 0.0)
        return x / y;
    return x - std::trunc(x / y) * y;
}
This still sucks. Try it with this test:
#include <iostream>
#include <cmath>
#include <random>
#include <bit>
using namespace std;
double trivialFmod( double a, double b );
int main()
{
    mt19937_64 mt;
    uniform_int_distribution<uint64_t> gen( 1, 0x7FEFFFFFFFFFFFFFu );
    size_t imprecise = 0, outOfRange = 0;
    for( size_t r = 1'000'000; r; --r )
    {
        double
            a = bit_cast<double>( gen( mt ) ),
            b = bit_cast<double>( gen( mt ) ),
            fm = fmod( a, b ),
            tfm = trivialFmod( a, b );
        imprecise += fm != tfm;
        outOfRange += tfm >= b;
    }
    auto print = []( char const *what, size_t n )
        { cout << what << (ptrdiff_t)n / (1.0e6 / 100) << "%" << endl; };
    print( "imprecise: ", imprecise );
    print( "out of range: ", outOfRange );
}

double trivialFmod( double a, double b )
{
    return a - trunc( a / b ) * b;
}
On 28.02.2025 at 20:47, Mr Flibble wrote:
IEEE 754 does not define how std::fmod should behave, only
std::remainder.
There's only one way to do it for finite numbers.
On 01.03.2025 at 15:37, Mr Flibble wrote:
On Fri, 28 Feb 2025 20:49:38 +0100, Bonita Montero
<Bonita.Montero@gmail.com> wrote:
On 28.02.2025 at 20:47, Mr Flibble wrote:
IEEE 754 does not define how std::fmod should behave, only std::remainder.
There's only one way to do it for finite numbers.
Not true as there is a fixed mantissa size, so finite precision, making your test case useless if x is sufficiently large.
The way to do a modulo calculation for every floating point value except inf or nan (finite numbers) is always the same for all implementations. And correct implementations are always without precision loss, i.e. exact.
As I've shown, solutions like yours are only 50% exact and in 2% of cases they generate out-of-range results.
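A one-paragraph sketch (my addition, not from the thread) of why the exact fmod result is always representable in binary64:

\text{If } e_x < e_y:\; |x| < 2^{e_x+1} \le 2^{e_y} \le |y| \;\implies\; \operatorname{fmod}(x,y) = x.
\text{Otherwise } x \text{ and every partial remainder } r = x - k\,y \text{ are integer multiples of } q = 2^{e_y-52},
\text{and } |r| < |y| = m_y\,q \text{ with } m_y < 2^{53}, \text{ so } |r|/q < 2^{53} \text{ fits the 53-bit significand: no rounding occurs.}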
On 01.03.2025 at 18:11, Mr Flibble wrote:
Thus you are asserting that all finite numbers have an exact IEEE 754 floating point representation ...
But the results of floating-point operations usually come with precision loss; fmod() is the exception - it is always exact when implemented properly. You didn't implement it correctly.
On 01.03.2025 at 21:32, Mr Flibble wrote:
On Sat, 1 Mar 2025 21:02:32 +0100, Bonita Montero
<Bonita.Montero@gmail.com> wrote:
On 01.03.2025 at 18:11, Mr Flibble wrote:
Thus you are asserting that all finite numbers have an exact IEEE 754 floating point representation ...
But the results of floating-point operations usually come with precision loss; fmod() is the exception - it is always exact when implemented properly. You didn't implement it correctly.
False, see my other post.
/Flibble
This:
#include <iostream>
#include <cmath>
#include <random>
#include <bit>
using namespace std;
double my_fmod( double x, double y );
int main()
{
    mt19937_64 mt;
    uniform_int_distribution<uint64_t> gen( 1, 0x7FEFFFFFFFFFFFFFu );
    size_t imprecise = 0, outOfRange = 0;
    for( size_t r = 1'000'000; r; --r )
    {
        double
            a = bit_cast<double>( gen( mt ) ),
            b = bit_cast<double>( gen( mt ) ),
            fm = fmod( a, b ),
            tfm = my_fmod( a, b );
        imprecise += fm != tfm;
        outOfRange += tfm >= b;
    }
    auto print = []( char const *what, size_t n )
        { cout << what << (ptrdiff_t)n / (1.0e6 / 100) << "%" << endl; };
    print( "imprecise: ", imprecise );
    print( "out of range: ", outOfRange );
}

double my_fmod( double x, double y )
{
    if( y == 0.0 )
        return x / y;
    return x - std::trunc( x / y ) * y;
}
... prints this ...
imprecise: 49.9096%
out of range: 2.0039%
So your solution is unusable.
I wanted to optimize fmod to be a bit faster. This is my C++20 solution. [myFmod code quoted above; snipped]
It's about six times faster than the glibc 2.31 solution in my benchmark. The returned NaNs and the raised exceptions are MSVC- and glibc-compatible.
I wanted to optimize fmod to be a bit faster. This is my C++20
solution.
It's about six times faster than the glibc 2.31 solution in my
benchmark. The returned NaNs and the raised exceptions are MSVC-
and glibc-compatible.
On Mon, 24 Feb 2025 11:48:08 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
I wanted to optimize fmod to be a bit faster. This is my C++20
solution.
How about that?
Pay attention, it's C rather than C++. So 5 times shorter :-)
It's not the fastest for big x/y ratios, but rather simple and not
*too* slow. At least as long as hardware supports FMA.
For small x/y ratios it should be pretty close to best possible.
This is my code, improved by the _udiv128 intrinsic of MSVC which provides a 128/64 division. With that my algorithm becomes nearly three times as fast as before. I'll provide a g++ / clang++ compatible version with inline assembly later.
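For what it's worth, g++ and clang++ don't need inline assembly for this: unsigned __int128 gives the same 128-by-64 remainder (the compilers lower it to __umodti3, as discussed later in the thread). A sketch:

#include <cstdint>

// Portable counterpart to MSVC's _udiv128 remainder: a 128-by-64 modulo
// via unsigned __int128 (gcc/clang extension). Sketch only.
uint64_t rem128( uint64_t hi, uint64_t lo, uint64_t divisor )
{
    unsigned __int128 n = ((unsigned __int128)hi << 64) | lo;
    return (uint64_t)(n % divisor);
}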
With MSVC and the fairer random numbers I chose I'm 2.5 times
faster.
On Sun, 2 Mar 2025 17:10:37 +0100, Bonita Montero
<Bonita.Montero@gmail.com> wrote:
This is my code, improved by the _udiv128 intrinsic of MSVC which provides a 128/64 division. With that my algorithm becomes nearly three times as fast as before. I'll provide a g++ / clang++ compatible version with inline assembly later.
Still slow tho.
/Flibble
On 02.03.2025 at 18:09, Bonita Montero wrote:
On 02.03.2025 at 18:07, Michael S wrote:
On Sun, 2 Mar 2025 17:26:18 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
With MSVC and the fairer random numbers I chose I'm 2.5 times
faster.
I don't trust your benchmarking skills.
I don't prefer your fast-path value combinations; I chose 75% random finite combinations instead.
This generates the random values for me:
mt19937_64 mt;
uniform_int_distribution<uint64_t>
    genType( 0, 15 ),
    genFinite( 0x0010000000000000u, 0x7FEFFFFFFFFFFFFFu ),
    genDen( 1, 0x000FFFFFFFFFFFFFu ),
    genNaN( 0x7FF0000000000001u, 0x7FFFFFFFFFFFFFFFu );
auto get = [&]()
{
    constexpr uint64_t
        FINITE_THRESH = 4, // 75% finites
        ZERO = 3,          // 6.25% zeroes
        DENORMALS = 2,     // 6.25% denormals
        INF = 1,           // 6.25% Infs
        NAN_ = 0;          // 6.25% NaNs
    uint64_t
        sign = mt() & -numeric_limits<int64_t>::min(),
        type = genType( mt );
    if( type >= FINITE_THRESH )
        return bit_cast<double>( sign | genFinite( mt ) );
    if( type == ZERO )
        return bit_cast<double>( sign );
    if( type == DENORMALS )
        return bit_cast<double>( sign | genDen( mt ) );
    if( type == INF )
        return bit_cast<double>( sign | 0x7FF0000000000000u );
    assert(type == NAN_);
    return bit_cast<double>( sign | genNaN( mt ) );
};
Your coding-style is horrible.
Mine is "too beautiful" (my employer).
On 02.03.2025 at 18:52, Michael S wrote:
Your distribution is very different from what one would expect in real-world usage.
There's no real-world distribution, so I chose all finites to be equally likely.
In real-world usage, apart from the debugging stage, there are no inf, nan or y=zero. x=zero happens, but with a lower probability than 6%. Denormals also happen, but with an even lower probability than x=zero.
As I said, if I choose 100% finites from the 1 to 0x7FEFFFFFFFFFFFFFu range I'm still 2.4 times faster.
Also, in the majority of real-world scenarios huge x/y ratios either don't happen at all or are extremely rare.
On 02.03.2025 at 19:09, Michael S wrote:
You didn't answer the second point, which is critical. In your fully random scenario 48.7% of cases are huge x/y. That is completely unrealistic. I can easily improve the speed of huge x/y at the cost of less simple code and a small slowdown of the more typical case, but I consider it counterproductive. It seems the authors of standard libraries agree with my judgment.
And you use fesetround, which takes about 40 clock cycles on my CPU under Linux (WSL2). Better to choose _mm_getcsr() and _mm_setcsr() for that, which directly set the FPU control word for SSE / AVX* / AVX-512. This is multiple times faster. For the x87-FPU you'd have to choose different code, but the x87-FPU is totally broken anyway.
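A minimal sketch of the MXCSR switch suggested here (SSE/AVX state only; bits 13-14 of MXCSR select the rounding mode, 0x6000 = round toward zero). Whether the compiler keeps the division between the two _mm_setcsr calls is not guaranteed without care, so treat this as an illustration:

#include <xmmintrin.h>

// Temporarily force round-toward-zero via MXCSR, as suggested above.
double divTowardZero( double x, double y )
{
    unsigned oldCsr = _mm_getcsr();
    _mm_setcsr( oldCsr | 0x6000 );  // RC bits 13-14 = 11b: chop
    volatile double q = x / y;      // volatile: keep the op inside the window
    _mm_setcsr( oldCsr );           // restore the caller's rounding mode
    return q;
}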
On 02.03.2025 at 18:52, Michael S wrote:
Your distribution is very different from what one would expect in real-world usage.
There's no real-world distribution, so I chose all finites to be equally likely.
In real-world usage, apart from the debugging stage, there are no inf, nan or y=zero. x=zero happens, but with a lower probability than 6%. Denormals also happen, but with an even lower probability than x=zero.
As I said, if I choose 100% finites from the 1 to 0x7FEFFFFFFFFFFFFFu range I'm still 2.4 times faster.
On 02.03.2025 at 21:58, Michael S wrote:
If it was on the fast path, I'd consider it. But improving the speed of an unimportant slow path at the cost of portability? Nah.
For the 75% random finites case I've shown, your code becomes about 28% faster with _mm_getcsr() and _mm_setcsr().
For the x87-FPU you'd have to choose different code, but the x87-FPU is totally broken anyway.
On 03.03.2025 at 17:20, Michael S wrote:
x87 is not broken relative to its own specifications. It just happens to be slightly different from the IEEE-754 specifications. Which is not surprising considering that it predates the IEEE-754 standard by several years.
You can reduce the width of the mantissa to 53 or 24 bits, but the exponent is always 15 bits; that's not up to any specification.
Today there are very few reasons to still use x87 in new software. However, back in its time x87 was an astonishingly good piece of work, less so in performance (it was not fast, even by the standards of its time), more so for features, precision and especially for the consistency of its arithmetic.
There are compiler settings which enforce consistency by storing values with reduced precision and re-loading them to give predictable results when you use values < long double. That's a mess.
On Mon, 3 Mar 2025 12:52:51 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
On 02.03.2025 at 21:58, Michael S wrote:
If it was on the fast path, I'd consider it. But improving the speed of an unimportant slow path at the cost of portability? Nah.
For the 75% random finites case I've shown, your code becomes about 28% faster with _mm_getcsr() and _mm_setcsr().
It seems that the major slowdown is specific to the combination of msys2 libraries with Zen3/4 CPUs. I see even worse slowness of get/set rounding mode on msys2/Zen3. The same msys-compiled binary on Intel CPUs is o.k., at least relative to the other heavy things going on on the slow path. On Zen3 with Microsoft's compiler/library it is also o.k. As long as it only affects the slow path there is nothing to get agitated about.
This is my code, improved by the _udiv128 intrinsic of MSVC which provides a 128/64 division. With that my algorithm becomes nearly three times as fast as before. I'll provide a g++ / clang++ compatible version with inline assembly later.
template<bool _32 = false>
double xMyFmod( double x, double y )
{
    constexpr uint64_t
        SIGN = 1ull << 63,
        IMPLICIT = 1ull << 52,
        MANT = IMPLICIT - 1,
        QBIT = 1ull << 51;
    uint64_t const
        binX = bit_cast<uint64_t>( x ),
        binY = bit_cast<uint64_t>( y );
    static auto abs = []( uint64_t m ) { return m & ~SIGN; };
    auto isNaN = []( uint64_t m ) { return abs( m ) >= 0x7FF0000000000001u; };
    auto isSig = []( uint64_t m ) { return !(m & QBIT); };
    if( isNaN( binX ) ) [[unlikely]] // x == NaN
#if defined(_MSC_VER)
        return bit_cast<double>( isNaN( binY ) ? binY | binX & binY & QBIT : binX );
#else
    {
        if( isSig( binX ) || isNaN( binY ) && isSig( binY ) ) [[unlikely]]
            feraiseexcept( FE_INVALID );
        return bit_cast<double>( binX | QBIT );
    }
#endif
    auto isInf = []( uint64_t m ) { return abs( m ) == 0x7FF0000000000000u; };
    if( isNaN( binY ) ) [[unlikely]] // x != NaN || y == NaN
#if defined(_MSC_VER)
    {
        if constexpr( _32 )
            if( isInf( binX ) )
                feraiseexcept( FE_INVALID );
        return y;
    }
#else
    {
        if( isSig( binY ) ) [[unlikely]]
            feraiseexcept( FE_INVALID );
        return bit_cast<double>( binY | QBIT );
    }
#endif
    if( isInf( binX ) ) // x == Inf
    {
        feraiseexcept( FE_INVALID );
#if defined(_MSC_VER)
        return bit_cast<double>( binX & ~MANT | QBIT );
#else
        return -numeric_limits<double>::quiet_NaN();
#endif
    }
    if( !abs( binY ) ) [[unlikely]] // y == 0
    {
        feraiseexcept( FE_INVALID );
#if defined(_MSC_VER)
        return numeric_limits<double>::quiet_NaN();
#else
        return -numeric_limits<double>::quiet_NaN();
#endif
    }
    if( !abs( binX ) || isInf( binY ) ) [[unlikely]] // x == 0 || y == Inf
        return x;
    auto exp = []( uint64_t b ) -> int { return b >> 52 & 0x7FF; };
    int
        expX = exp( binX ),
        expY = exp( binY );
    auto mant = []( uint64_t b ) { return b & MANT; };
    uint64_t
        mantX = mant( binX ),
        mantY = mant( binY );
    int headBits = 11;
    auto normalize = [&]( int &exp, uint64_t &mant )
    {
        unsigned shift = countl_zero( mant ) - headBits;
        mant <<= shift;
        exp -= shift;
    };
    auto build = [&]( int &exp, uint64_t &mant )
    {
        if( exp ) [[likely]]
            mant |= IMPLICIT;
        else
        {
            exp = 1;
            normalize( exp, mant );
        }
    };
    build( expX, mantX );
    build( expY, mantY );
    // strip common trailing zero bits so more head bits are free per step
    int
        tailX = countr_zero( mantX ),
        tailY = countr_zero( mantY ),
        tailBits = tailX <= tailY ? tailX : tailY;
    headBits += tailBits;
    mantX >>= tailBits;
    mantY >>= tailBits;
    uint64_t signX = binX & SIGN;
    int expDiff;
#if defined(_MSC_VER)
    // _udiv128 reduces the exponent gap by 63 bits per iteration
    while( (expDiff = expX - expY) > 63 )
    {
        unsigned long long hi = mantX >> 1, lo = mantX << 63, remainder;
        (void)_udiv128( hi, lo, mantY, &remainder );
        expX -= 63;
        mantX = remainder;
        normalize( expX, mantX );
    }
#endif
    while( (expDiff = expX - expY) > 0 )
    {
        unsigned bits = expDiff <= headBits ? expDiff : headBits;
        if( !(mantX = (mantX << bits) % mantY) ) [[unlikely]]
            return bit_cast<double>( signX );
        expX -= bits;
        normalize( expX, mantX );
    }
    if( !expDiff && mantX >= mantY ) [[unlikely]]
        if( (mantX -= mantY) ) [[likely]]
            normalize( expX, mantX );
        else
            return bit_cast<double>( signX );
    mantX <<= tailBits;
    mantY <<= tailBits;
    if( expX <= 0 ) [[unlikely]]
    {
        assert(expX >= -51);
        mantX = mantX >> (unsigned)(-expX + 1);
        expX = 0;
    }
    return bit_cast<double>( signX | (uint64_t)expX << 52 | mantX & MANT );
}

double myFmod( double x, double y )
{
    return xMyFmod( x, y );
}

inline float myFmod( float x, float y )
{
    return (float)xMyFmod<true>( (double)x, (double)y );
}
On 09.03.2025 at 00:31, Michael S wrote:
This code does not work in plenty of cases. It seems your test vectors have poor coverage. Try, for example, x=1.8037919852882307, y=2.22605637008665934e-194

cout << hexfloat << myFmod( 1.8037919852882307, 2.22605637008665934e-194 ) << endl;
cout << hexfloat << fmod( 1.8037919852882307, 2.22605637008665934e-194 ) << endl;
Prints the same result under Linux and Windows.
This prints the same result (0.0) under Windows and Linux:
On 09.03.2025 at 10:46, Michael S wrote:
This prints the same result (0.0) under Windows and Linux:
I am no longer going to look at your code until you start posting full files, with all includes and using directives.
You could simply replace the single function I've shown.
On 09.03.2025 at 11:09, Michael S wrote:
On Sun, 9 Mar 2025 10:54:40 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
On 09.03.2025 at 10:46, Michael S wrote:
This prints the same result (0.0) under Windows and Linux:
I am no longer going to look at your code until you start posting full files, with all includes and using directives.
You could simply replace the single function I've shown.
I can. I don't want to do it. You want me to look at/test your code? You post full code. Simple, isn't it?
I've read you don't trust my tests, so use your own with myFmod.
On 09.03.2025 at 11:51, wij wrote:
From the point of view of implementing myFmod, I think using a C-like coding style would be better, but it all depends on what you want to achieve.
A C coding style would result in about twice the code.
On 09.03.2025 at 13:23, Michael S wrote:
So far all we have seen from you is 2-3 times longer than C code (real C, not C-style C++) ...
Not true, since I save a lot of redundant code with [&]-lambdas.
On Sun, 9 Mar 2025 11:21:55 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
On 09.03.2025 at 11:09, Michael S wrote:
On Sun, 9 Mar 2025 10:54:40 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
On 09.03.2025 at 10:46, Michael S wrote:
This prints the same result (0.0) under Windows and Linux:
I am no longer going to look at your code until you start posting full files, with all includes and using directives.
You could simply replace the single function I've shown.
I can. I don't want to do it. You want me to look at/test your code? You post full code. Simple, isn't it?
I've read you don't trust my tests, so use your own with myFmod.
ok. I was too curious :(
This version produces correct results both when compiled under MSVC and when compiled with other compilers. It is a little faster too. With MSVC on an old Intel CPU it is only 2.5 times slower than the standard library in the relevant range of x/y. The previous version was 3.4 times slower. With gcc and clang it is still more than 6 times slower than the standard library. The coding style is now less insane.
Measurements in nsec.
First result - Intel Skylake at 4.25 GHz
Second result - AMD Zen3 at 3.7 GHz

abs(x/y) in range that matters [0.5:2**53]:
Standard MSVC library - 11.1 10.4
Standard glibc library - 5.4 10.7
Yours (MSVC) - 27.6 11.5
Yours (gcc) - 36.4 23.7
Yours (clang) - 37.4 24.3

abs(x/y) in full range [2**-2090:2**2090]:
Standard MSVC library - 109.4 153.5
Standard glibc library - 102.3 155.5
Yours (MSVC) - 134.9 52.6
Yours (gcc) - 284.7 151.8
Yours (clang) - 285.2 156.5
On 09.03.2025 at 16:26, Mr Flibble wrote:
So it is slow, ergo a pointless alternative to what we already have.
glibc does it nearly the same way I do because the FMA solution isn't portable. If fma( a, b, c ) is substituted with a * b + c because there's no proper CPU instruction, the whole approach doesn't work. And with support for _udiv128 my solution has about the same performance as Michael's solution with clang++ 18.1.7.
On 09.03.2025 at 13:09, Michael S wrote:
[benchmark table quoted above; snipped]
With MSVC and an arbitrary combination of finite x and y on my
Zen4-machine:
your fmod: 77.1214
my: 38.4486
With MSVC and an arbitrary combination of finite x with exponents
ranging from 0x3FF to 0x433 (close exponents) on my Zen4-machine:
your fmod: 23.6423
my: 9.79146
This is a nearly proper implementation of your idea with FMA intrinsics and SSE/AVX control-register access (xtrunc() is my faster trunc() replacement, not shown here):

double fmody( double x, double y )
{
    if( isnan( x ) ) [[unlikely]]
        return x;
    if( isnan( y ) ) [[unlikely]]
        return y;
    if( isinf( x ) || !y ) [[unlikely]]
    {
        feraiseexcept( FE_INVALID );
        return numeric_limits<double>::quiet_NaN();
    }
    if( !x || isinf( y ) ) [[unlikely]]
        return x;
    uint64_t sign = bit_cast<uint64_t>( x ) & numeric_limits<int64_t>::min();
    x = abs( x );
    y = abs( y );
    int oldCsr = _mm_getcsr();
    constexpr int CHOP = 0x6000; // MXCSR rounding-control bits: round toward zero
    _mm_setcsr( oldCsr | CHOP );
    constexpr uint64_t
        EXP = -(1ll << 52),
        MANT = ~EXP;
    uint64_t binY = bit_cast<uint64_t>( y );
    int64_t expY = binY & EXP;
    if( !expY ) [[unlikely]]
        expY = (uint64_t)(0 - (countl_zero( binY & MANT ) - 12)) << 52;
    while( x >= y )
    {
        uint64_t yExpAdd = 0;
        double div = x / y;
        if( div < 0x1.FFFFFFFFFFFFFp+1023 ) [[likely]]
            div = xtrunc( div );
        else
        {
            // the quotient would overflow: scale x down to 54 bits above y
            uint64_t
                binX = bit_cast<uint64_t>( x ),
                newExp = expY + (54ull << 52);
            yExpAdd = (binX & EXP) - newExp;
            div = xtrunc( bit_cast<double>( newExp | binX & MANT ) / y );
        }
        __m128d mult1, mult2, add;
#if defined(_MSC_VER)
        mult1.m128d_f64[0] = div;
        mult2.m128d_f64[0] = -bit_cast<double>( binY + yExpAdd );
        add.m128d_f64[0] = x;
        x = _mm_fmadd_sd( mult1, mult2, add ).m128d_f64[0];
#else
        mult1[0] = div;
        mult2[0] = -bit_cast<double>( binY + yExpAdd );
        add[0] = x;
        x = _mm_fmadd_sd( mult1, mult2, add )[0];
#endif
        if( !x ) [[unlikely]]
        {
            _mm_setcsr( oldCsr ); // restore the rounding mode on this exit path too
            return bit_cast<double>( sign );
        }
    }
    _mm_setcsr( oldCsr );
    return bit_cast<double>( sign | bit_cast<uint64_t>( x ) );
}
The only thing that doesn't work currently is the support for denormal values.
On 09.03.2025 at 19:04, Michael S wrote:
I already said that I don't approve of non-portable constructs like _mm_getcsr()/_mm_setcsr() except when they help important cases and help A LOT. Neither applies here.
I dropped getting and setting the MXCSR register to set the rounding mode. Now I set the rounding mode directly with the intrinsics _mm_div_round_sd and _mm_fmadd_round_sd. Now this more manual code does help "A LOT", i.e. the solution is 2/3 faster than your initial solution with clang++-18 under Linux. But there's still a problem with the denormals.
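For reference, a sketch of the embedded-rounding intrinsics mentioned here; they are AVX-512 only (_mm_div_round_sd and _mm_fmadd_round_sd require AVX512F), which is the portability catch:

#include <immintrin.h>

// Per-instruction rounding override (AVX-512): divide toward zero
// without touching MXCSR at all.
double divChopAvx512( double x, double y )
{
    __m128d a = _mm_set_sd( x ), b = _mm_set_sd( y );
    __m128d q = _mm_div_round_sd( a, b, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC );
    return _mm_cvtsd_f64( q );
}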
I tried to use an unsigned __int128 / uint64_t division and I expected that the compiler would call a library function which does the subtract and shift manually. But g++ as well as clang++ somehow handle this as a true 128-by-64 division.
And now my original solution is about 23% faster than your solution for close exponents (exponent difference <= 53), and for arbitrary exponent differences your solution is about 3% faster.
double fmodO( double x, double y )
{
    constexpr uint64_t
        SIGN = 1ull << 63,
        IMPLICIT = 1ull << 52,
        MANT = IMPLICIT - 1,
        QBIT = 1ull << 51;
    uint64_t const
        binX = bit_cast<uint64_t>( x ),
        binY = bit_cast<uint64_t>( y );
    static auto abs = []( uint64_t m ) { return m & ~SIGN; };
    auto isNaN = []( uint64_t m ) { return abs( m ) >= 0x7FF0000000000001u; };
    auto isSig = []( uint64_t m ) { return !(m & QBIT); };
    if( isNaN( binX ) ) [[unlikely]] // x == NaN
#if defined(_MSC_VER)
        return bit_cast<double>( isNaN( binY ) ? binY | binX & binY & QBIT : binX );
#else
    {
        if( isSig( binX ) || isNaN( binY ) && isSig( binY ) ) [[unlikely]]
            feraiseexcept( FE_INVALID );
        return bit_cast<double>( binX | QBIT );
    }
#endif
    if( isNaN( binY ) ) [[unlikely]] // x != NaN || y == NaN
#if defined(_MSC_VER)
        return y;
#else
    {
        if( isSig( binY ) ) [[unlikely]]
            feraiseexcept( FE_INVALID );
        return bit_cast<double>( binY | QBIT );
    }
#endif
    auto isInf = []( uint64_t m ) { return abs( m ) == 0x7FF0000000000000u; };
    if( isInf( binX ) ) // x == Inf
    {
        feraiseexcept( FE_INVALID );
#if defined(_MSC_VER)
        return bit_cast<double>( binX & ~MANT | QBIT );
#else
        return -numeric_limits<double>::quiet_NaN();
#endif
    }
    if( !abs( binY ) ) [[unlikely]] // y == 0
    {
        feraiseexcept( FE_INVALID );
#if defined(_MSC_VER)
        return numeric_limits<double>::quiet_NaN();
#else
        return -numeric_limits<double>::quiet_NaN();
#endif
    }
    if( !abs( binX ) || isInf( binY ) ) [[unlikely]] // x == 0 || y == Inf
        return x;
    auto exp = []( uint64_t b ) -> int { return b >> 52 & 0x7FF; };
    int
        expX = exp( binX ),
        expY = exp( binY );
    auto mant = []( uint64_t b ) { return b & MANT; };
    uint64_t
        mantX = mant( binX ),
        mantY = mant( binY );
    int headBits = 11;
    auto normalize = [&]( int &exp, uint64_t &mant )
    {
        unsigned shift = countl_zero( mant ) - headBits;
        mant <<= shift;
        exp -= shift;
    };
    auto build = [&]( int &exp, uint64_t &mant )
    {
        if( exp ) [[likely]]
            mant |= IMPLICIT;
        else
        {
            exp = 1;
            normalize( exp, mant );
        }
    };
    build( expX, mantX );
    build( expY, mantY );
    int
        tailX = countr_zero( mantX ),
        tailY = countr_zero( mantY ),
        tailBits = tailX <= tailY ? tailX : tailY;
    mantX >>= tailBits;
    mantY >>= tailBits;
    headBits += tailBits;
    uint64_t signX = binX & SIGN;
    int expDiff;
#if defined(_MSC_VER) && !defined(__llvm__) && defined(_M_X64)
    while( (expDiff = expX - expY) > 0 )
    {
        unsigned bits = expDiff <= 63 ? expDiff : 63;
        unsigned long long hi = mantX >> (64 - bits), lo = mantX << bits, remainder;
        (void)_udiv128( hi, lo, mantY, &remainder );
        if( !remainder ) [[unlikely]]
            return bit_cast<double>( signX );
        mantX = remainder;
        expX -= bits;
        normalize( expX, mantX );
    }
#elif defined(__GNUC__) || defined(__clang__)
    while( (expDiff = expX - expY) > 0 )
    {
        unsigned bits = expDiff <= 63 ? expDiff : 63;
        unsigned __int128 dividend = (unsigned __int128)mantX << bits;
        mantX = (uint64_t)(dividend % mantY);
        if( !mantX ) [[unlikely]]
            return bit_cast<double>( signX );
        expX -= bits;
        normalize( expX, mantX );
    }
#else
    while( (expDiff = expX - expY) > 0 )
    {
        unsigned bits = expDiff <= headBits ? expDiff : headBits;
        if( !(mantX = (mantX << bits) % mantY) ) [[unlikely]]
            return bit_cast<double>( signX );
        expX -= bits;
        normalize( expX, mantX );
    }
#endif
    if( !expDiff && mantX >= mantY ) [[unlikely]]
        if( (mantX -= mantY) ) [[likely]]
            normalize( expX, mantX );
        else
            return bit_cast<double>( signX );
    mantX <<= tailBits;
    mantY <<= tailBits;
    if( expX <= 0 ) [[unlikely]]
    {
        assert(expX >= -51);
        mantX = mantX >> (unsigned)(-expX + 1);
        expX = 0;
    }
    return bit_cast<double>( signX | (uint64_t)expX << 52 | mantX & MANT );
}
On 10.03.2025 at 15:33, Michael S wrote:
That's what I also guessed, but maybe we don't have the same glibc version. Or the code just runs more efficiently on my Zen4 CPU.
For my code it's strangely slow. On a 4+ GHz Zen4 I would expect ~5 nsec.
I want your crystal ball.
On 10.03.2025 at 13:26, Michael S wrote:
It should be an obvious thing to anybody who cared to think for 15 seconds.
I was wrong, and it absolutely isn't obvious. The compiler calls the glibc function __umodti3, which has a shortcut for results that fit into 64 bits. Although there's an additional call on Linux, the code with clang++-18 is still a bit faster than my Windows solution with the _udiv128 intrinsic; that's really surprising.
A library or compiler starts with hi1 = rem(0:x_hi, y). Now hi1 is guaranteed to be smaller than y, so it's safe to do rem(hi1:x_lo, y). It is not *very* slow, but still there are 2 dependent division operations.
Both parameters are variable, so there can be no static evaluation at compile time.
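Michael's two-step scheme, spelled out (a hedged, conceptual sketch of the decomposition __umodti3 effectively performs; the first remainder guarantees the second step cannot overflow, but there are two dependent divisions):

#include <cstdint>

// rem(0:x_hi, y) first; hi1 < y then makes rem(hi1:x_lo, y) safe.
uint64_t umod128by64( uint64_t x_hi, uint64_t x_lo, uint64_t y )
{
    uint64_t hi1 = x_hi % y;
    unsigned __int128 n = ((unsigned __int128)hi1 << 64) | x_lo;
    return (uint64_t)(n % y);
}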
Measurements in nsec.
First result - Intel Skylake at 4.25 GHz
Second result - AMD Zen3 at 3.7 GHz

abs(x/y) in range that matters [0.5:2**53]:
Standard MSVC library - 11.1 10.4
Standard glibc library - 5.4 10.7
Yours (MSVC) - 27.6 11.5
Yours (gcc) - 36.4 23.7
Yours (clang) - 37.4 24.3
My (MSVC) - 10.7 11.3
My (gcc) - 7.7 7.6
My (clang) - 6.3 7.5
Your last (MSVC) - 27.6 11.6
Your last (gcc) - 33.8 28.6
Your last (clang) - 32.6 26.9

abs(x/y) in full range [2**-2090:2**2090]:
Standard MSVC library - 109.4 153.5
Standard glibc library - 102.3 155.5
Yours (MSVC) - 134.9 52.6
Yours (gcc) - 284.7 151.8
Yours (clang) - 285.2 156.5
My (MSVC) - 62.1 61.1
My (gcc) - 60.8 59.1
My (clang) - 59.9 59.3
Your last (MSVC) - 135.0 52.5
Your last (gcc) - 172.1 137.3
Your last (clang) - 167.7 126.3
These are the clang++-18 results on my Zen4 computer under WSL2 with close exponents (0x3FF to 0x433):
fmodO: 9.42929
fmodM: 11.7907
So my code is about 23% faster on my computer. These are the results for arbitrary exponents:
fmodO: 41.9115
fmodM: 41.2062
Exactly what I already mentioned.
Maybe that depends on the glibc version, because a different glibc version might have a differently efficient __umodti3 function.
Your code would be much, much cleaner and more readable if you replaced the lambdas with proper functions. ...
For me that doesn't make a difference.
On 10.03.2025 at 15:59, Michael S wrote:
One does not need a crystal ball to extrapolate the speed of simple integer code from a 3.7 GHz Zen3 to a 4+ GHz Zen4 (probably 4.7 or 4.8 GHz).
If only one core computes, the clock is even 5.7 GHz. But the results aren't better than shown.
But your repeated avoidance of answering my direct questions about what exactly you are using as "my code" makes me even more suspicious.
I've shown the latest code of fmodO; you can easily integrate it into your own benchmark. I don't use uniform_real_distribution for the random numbers but uniform_int_distribution with bounds of 0x3FFull << 52 and 0x433ull << 52 for the close-exponent case. The whole test code has hardly changed since my initial post.
On 10.03.2025 at 16:43, Michael S wrote:
That's not the question I have been asking for the 4th or 5th time. My question is what *exactly* is fmodM.
fmodM is your code, M like Michael.
I did. And presented the results. I am fully willing to believe that the difference between our clang results is explained by a difference in the compiler support library.
Or maybe the different CPU.
But so far I find no explanation for why the results for what you claim to be *my* code are so much slower in your measurements, despite your faster CPU.
I compiled with -O2 and -march=native; that should be sufficient.
BTW, the number you did not publish at all was the speed of the fmod() routine from the standard library. ...
I have the fmod code of glibc 2.31, which is rather slow since it does the subtract and shifts manually - code from Sun from the 90s.
Assuming that the exponent of y is fixed at 1023, that is approximately the same as my own test.
Yes, but as you said earlier, close exponents are more relevant.
On 10.03.2025 at 18:51, Michael S wrote:
What my code? Post it.
I just changed the function name, and the code uses xtrunc() instead of trunc() since trunc() is slow with MSVC. I removed my improvement with _mm_getcsr() / _mm_setcsr() since the speedup was noticeable but not significant, unlike the xtrunc() optimization, which gave a speedup of about +100% with MSVC.
double fmodM( double x, double y )
{
    if( isnan( x ) )
        return x;
    // pre-process y
    if( isless( y, 0 ) )
        y = -y;
    else if( isgreater( y, 0 ) )
        ;
    else {
        if( isnan( y ) )
            return y;
        // y == 0
        feraiseexcept( FE_INVALID );
        return nan( "y0" );
    }
    // y in (0:+inf]
    // Quick path
    double xx = x * 0x1p-53;
    if( xx > -y && xx < y ) {
        // among other things, x guaranteed to be finite
        if( x > -y && x < y )
            return x; // case y=+-inf covered here
        double d = xtrunc( x / y );
        double res = fma( -y, d, x );
        if( signbit( x ) != signbit( res ) ) {
            // overshoot because of unfortunate division rounding;
            // it is extremely rare for small x/y,
            // but not rare when x/y is close to 2**53
            res = fma( -y, d + (signbit( x ) * 2 - 1), x );
        }
        return res;
    }
    // slow path
    if( isinf( x ) ) {
        feraiseexcept( FE_INVALID );
        return nan( "xinf" );
    }
    int oldRnd = fegetround();
    fesetround( FE_TOWARDZERO );
    double ax = fabs( x );
    do {
        double yy = y;
        while( yy < ax * 0x1p-1022 )
            yy *= 0x1p1021;
        do
            ax = fma( -yy, xtrunc( ax / yy ), ax );
        while( ax >= yy );
    } while( ax >= y );
    ax = copysign( ax, x );
    fesetround( oldRnd );
    return ax;
}
Your idea is really elegant, and as I've shown it could be significantly improved with SSE 4.1 along with FMA3. But at the point where I noticed how performant a 128:64 modulo division through glibc is, and as this is superior to the FMA solution, I dropped the whole idea and removed the SSE FMA code from my test program.
Partially false alarm: I forgot to convert my xtrunc() function into an xfloor() function. These are the accuracy results now compared to fmod() of MSVC:

53 bits shared accuracy
equal results: 100%
equal exceptions: 91.017%
equal NaN signs: 96.475%
equal NaN types: 99.78%
equal NaNs: 96.253%

These are the accuracy results compared to glibc:

53 bits shared accuracy
equal results: 100%
equal exceptions: 99.901%
equal NaN signs: 87.224%
equal NaN types: 93.181%
equal NaNs: 80.405%
On 10.03.2025 at 22:34, Bonita Montero wrote:
And for arbitrary exponents (0x1 to 0x7FE):
fmodO: 9.29622
fmodM: 11.4518
Sorry, the copy buffer wasn't refreshed with the new results:
fmodO: 40.4702
fmodM: 40.1652
On 11.03.2025 at 11:29, Michael S wrote:
Pay attention that fmod() has no requirements w.r.t. such exceptions as FE_INEXACT, FE_UNDERFLOW and the non-standard FE_DENORMAL.
Yes, that's why I evaluate FE_INVALID only. But your code can also set FE_INEXACT due to your "rounding" with sign change. MSVC also seems to try to do the math with the FPU before an integer fallback, because with exponent differences <= 53 MSVC's fmod() often sets FE_INEXACT; but I ignore that because it shouldn't be part of fmod().
Strictly speaking, even raising FE_OVERFLOW is not illegal, but doing so would be bad quality of implementation.
Couldn't FE_OVERFLOW happen with your implementation when the exponents are so far apart that you get inf from the division?
Also the spec does not say what happens to FE_INVALID when one of the inputs is a signalling NaN.
See my code; I return MSVC- and glibc-compatible NaNs and I raise the same exceptions. MSVC sets FE_INVALID only when x is inf or y is zero; glibc in addition raises FE_INVALID when either operand is a signalling NaN.
On 11.03.2025 at 11:29, Michael S wrote:
Pay attention that fmod() has no requirements w.r.t. such exceptions as FE_INEXACT, FE_UNDERFLOW and the non-standard FE_DENORMAL.
Yes, that's why I evaluate FE_INVALID only.
On 11.03.2025 at 12:34, Michael S wrote:
Exactly. Both options are legal. MS's decision to not set FE_INVALID is as good as glibc's decision to set it.
If I do an SSE/AVX operation where either operand is a signalling NaN I get FE_INVALID; since the FPU behaves this way, the MSVC runtime should do so as well.
BTW, what is the output of the MS library in that case? SNAN or QNAN?
Results with SNaN parameters are always QNaN; that should be common to any FPU.
On Tue, 11 Mar 2025 13:10:31 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
On 11.03.2025 at 12:34, Michael S wrote:
Exactly. Both options are legal. MS's decision to not set FE_INVALID is as good as glibc's decision to set it.
If I do an SSE/AVX operation where either operand is a signalling NaN I get FE_INVALID; since the FPU behaves this way, the MSVC runtime should do so as well.
BTW, what is the output of the MS library in that case? SNAN or QNAN?
Results with SNaN parameters are always QNaN; that should be common to any FPU.
But not when the library routine does not use the FPU, or uses the FPU only for comparison ops. The point is, it does not sound right if an SNAN is *silently* converted to a QNAN. That type of conversion has to be loud, i.e. accompanied by the setting of FE_INVALID.
On Mon, 10 Mar 2025 19:00:06 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
Your idea is really elegant
I'd rather call it "simple" or "straightforward". "Elegant" in my book is something else. For example, the code above is closer to what I consider elegant. Maybe later today or tomorrow I'll show you a solution that I consider bright. Bright, but impractical.
Pay attention that fmod() has no requirements w.r.t. such exceptions as FE_INEXACT, FE_UNDERFLOW and the non-standard FE_DENORMAL. Strictly speaking, even raising FE_OVERFLOW is not illegal, but doing so would be bad quality of implementation. Also the spec does not say what happens to FE_INVALID when one of the inputs is a signalling NaN.
On Mon, 10 Mar 2025 20:38:18 +0200
Michael S <already5chosen@yahoo.com> wrote:
On Mon, 10 Mar 2025 19:00:06 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
Your idea is really elegant
I'd rather call it "simple" or "straightforward". "Elegant" in my book is something else. For example, the code above is closer to what I consider elegant. Maybe later today or tomorrow I'll show you a solution that I consider bright. Bright, but impractical.
Here, here!
The bright part is in lines 18 to 29. The rest are hopefully competent technicalities.