I wanted to optimize fmod to be a bit faster. This is my C++20 solution. [snip]
double myFmod( double x, double y )
It's about six times faster than the glibc 2.31 solution in my
benchmark. The returned NaNs and the raised exceptions are MSVC-
and glibc-compatible.
On 24.02.2025 at 13:00, Muttley@DastardlyHQ.org wrote:
On Mon, 24 Feb 2025 11:48:08 +0100
Bonita Montero <Bonita.Montero@gmail.com> wibbled:
I wanted to optimize fmod to be a bit faster. This is my C++20 solution. [snip]
double myFmod( double x, double y )
It's about six times faster than the glibc 2.31 solution in my
benchmark. The returned NaNs and the raised exceptions are MSVC-
and glibc-compatible.
double myFmod(double x, double y)
{
    double div = x / y;
    return y * (div - (long)div);
}
If the exponent difference between x and y is large enough this returns results which are larger than y. glibc does it completely with integer operations also.
On Mon, 24 Feb 2025 13:09:50 -0000 (UTC)
Muttley@DastardlyHQ.org wrote:
If the values are so large or small that you start to get floating point based errors then you should probably be using integer arithmetic or a large number library like GMP anyway.
Your method will sometimes produce results that are 1 LSB off relative to the IEEE-754 prescription even when values are neither small nor large. And sometimes 1 LSB off means that the result is 2x off.
For example, for x=0.9999999999999999, y=0.9999999999999998 your method produces 2.2204460492503126e-16. The correct result is, of course, 1.1102230246251565e-16.
Also, I don't think that your method is any faster than correct methods.
On 24.02.2025 at 14:09, Muttley@DastardlyHQ.org wrote:
If the values are so large or small that you start to get floating point based errors then you should probably be using integer arithmetic or a large number library like GMP anyway.
There's no need for large integer arithmetic since each calculation step results in a mantissa with equal or fewer bits than the divisor. Even if the exponents are close enough that no integer bits are missing, the following multiplication is very likely to lose precision. All current solutions (MSVC, libstdc++) work with integer operations and are always 100% precise.
On Mon, 24 Feb 2025 13:48:02 +0100
Bonita Montero <Bonita.Montero@gmail.com> wibbled:
On 24.02.2025 at 13:00, Muttley@DastardlyHQ.org wrote:
On Mon, 24 Feb 2025 11:48:08 +0100
Bonita Montero <Bonita.Montero@gmail.com> wibbled:
I wanted to optimize fmod to be a bit faster. This is my C++20 solution. [snip]
double myFmod( double x, double y )
It's about six times faster than the glibc 2.31 solution in my
benchmark. The returned NaNs and the raised exceptions are MSVC-
and glibc-compatible.
double myFmod(double x, double y)
{
    double div = x / y;
    return y * (div - (long)div);
}
If the exponent difference between x and y is large enough this returns results which are larger than y. glibc does it completely with integer operations also.
If the values are so large or small that you start to get floating point based errors then you should probably be using integer arithmetic or a large number library like GMP anyway.
On 24.02.2025 at 16:13, Michael S wrote:
Do you have real application for fast fmod() or just playing?
I experimented with x87 FPREM and wanted to know whether it is precise; it isn't, and the results can be > y. So I developed my own routine which is always 100% precise.
As long as y is positive and abs(x/y) <= 2**53, a very simple formula will produce a precise result: fma(trunc(x/fabs(y)), -fabs(y), x).
The multiplication will mostly drop bits, so the difference might become larger than y.
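A minimal sketch of the quoted fma/trunc formula, under its stated preconditions (finite inputs, y != 0, |x/y| <= 2**53); the function name is mine and this is an illustration, not a full fmod replacement:

#include <cmath>

// Sketch of the formula above: x - q*|y| computed with a single rounding.
// Only valid while |x/y| <= 2**53; no NaN/Inf/zero handling.
double fmodFmaSketch( double x, double y )
{
    double ay = std::fabs( y );
    double q  = std::trunc( x / ay );  // integral quotient of the rounded division
    return std::fma( q, -ay, x );
}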
On Mon, 24 Feb 2025 16:22:46 +0200
Michael S <already5chosen@yahoo.com> wibbled:
On Mon, 24 Feb 2025 13:09:50 -0000 (UTC)
Muttley@DastardlyHQ.org wrote:
If the values are so large or small that you start to get floating point based errors then you should probably be using integer arithmetic or a large number library like GMP anyway.
Your method will sometimes produce results that are 1 LSB off relative to the IEEE-754 prescription even when values are neither small nor large.
And sometimes 1 LSB off means that result is 2x off.
For example, for x=0.9999999999999999, y=0.9999999999999998 your
method produces 2.2204460492503126e-16. A correct result is, of
course, 1.1102230246251565e-16
Frankly I doubt anyone would care, it's zero in all but name.
Also, I don't think that your method is any faster than correct
methods.
Don't know, but it's only 3 mathematical operations, all of which can be done by the hardware, so it's going to be pretty fast.
On Mon, 24 Feb 2025 15:10:53 -0000 (UTC)
Muttley@DastardlyHQ.org wrote:
Don't know, but it's only 3 mathematical operations, all of which can be done by the hardware, so it's going to be pretty fast.
Looks like 4 operations to me - division, truncation, subtraction, multiplication - if the compiler takes it literally, which it probably will.
Nevertheless, after a bit of thinking I concur that your formula is faster than 100% correct methods. Initially, I didn't take into account all the difficulties that correct methods have to face in cases of very large x to y ratios.
However, your method is approximately the same speed as the *mostly correct* method shown in my post above. Maybe yours is even a little slower, at least as long as we use a good optimizing compiler and target modern CPUs that support trunc() and fma() as fast hardware instructions.
I wanted to optimize fmod to be a bit faster. This is my C++20 solution.
double myFmod( double x, double y )
{
    constexpr uint64_t
        SIGN = 1ull << 63,
        IMPLICIT = 1ull << 52,
        MANT = IMPLICIT - 1,
        QBIT = 1ull << 51;
    uint64_t const
        binX = bit_cast<uint64_t>( x ),
        binY = bit_cast<uint64_t>( y );
    static auto abs = []( uint64_t m ) { return m & ~SIGN; };
    auto isNaN = []( uint64_t m ) { return abs( m ) >= 0x7FF0000000000001u; };
    auto isSig = []( uint64_t m ) { return !(m & QBIT); };
    if( isNaN( binX ) ) [[unlikely]] // x == NaN
#if defined(_MSC_VER)
        return bit_cast<double>( isNaN( binY ) ? binY | binX & binY & QBIT : binX );
#else
    {
        if( isSig( binX ) || isNaN( binY ) && isSig( binY ) ) [[unlikely]]
            feraiseexcept( FE_INVALID );
        return bit_cast<double>( binX | QBIT );
    }
#endif
    if( isNaN( binY ) ) [[unlikely]] // x != NaN || y == NaN
#if defined(_MSC_VER)
        return y;
#else
    {
        if( isSig( binY ) ) [[unlikely]]
            feraiseexcept( FE_INVALID );
        return bit_cast<double>( binY | QBIT );
    }
#endif
    auto isInf = []( uint64_t m ) { return abs( m ) == 0x7FF0000000000000u; };
    if( isInf( binX ) ) // x == Inf
    {
        feraiseexcept( FE_INVALID );
#if defined(_MSC_VER)
        return bit_cast<double>( binX & ~MANT | QBIT );
#else
        return -numeric_limits<double>::quiet_NaN();
#endif
    }
    if( !abs( binY ) ) [[unlikely]] // y == 0
    {
        feraiseexcept( FE_INVALID );
#if defined(_MSC_VER)
        return numeric_limits<double>::quiet_NaN();
#else
        return -numeric_limits<double>::quiet_NaN();
#endif
    }
    if( !abs( binX ) || isInf( binY ) ) [[unlikely]] // x == 0 || y == Inf
        return x;
    auto exp = []( uint64_t b ) -> int { return b >> 52 & 0x7FF; };
    int
        expX = exp( binX ),
        expY = exp( binY );
    auto mant = []( uint64_t b ) { return b & MANT; };
    uint64_t
        mantX = mant( binX ),
        mantY = mant( binY );
    // left-align the mantissa so that bit 52 is the leading one
    static auto normalize = []( int &exp, uint64_t &mant )
    {
        unsigned shift = countl_zero( mant ) - 11;
        mant <<= shift;
        exp -= shift;
    };
    // set the implicit bit, or normalize a denormal input
    auto build = []( int &exp, uint64_t &mant )
    {
        if( exp ) [[likely]]
            mant |= IMPLICIT;
        else
        {
            exp = 1;
            normalize( exp, mant );
        }
    };
    build( expX, mantX );
    build( expY, mantY );
    uint64_t signX = binX & SIGN;
    int expDiff;
    // reduce the exponent gap up to 11 bits per step: shift the remainder
    // left and take it modulo the divisor's mantissa
    while( (expDiff = expX - expY) > 0 )
    {
        unsigned bits = expDiff <= 11 ? expDiff : 11;
        if( !(mantX = (mantX << bits) % mantY) ) [[unlikely]]
            return bit_cast<double>( signX );
        expX -= bits;
        normalize( expX, mantX );
    }
    if( !expDiff && mantX >= mantY ) [[unlikely]]
        if( (mantX -= mantY) ) [[likely]]
            normalize( expX, mantX );
        else
            return bit_cast<double>( signX );
    if( expX <= 0 ) [[unlikely]] // denormal result
    {
        assert(expX >= -51);
        mantX = mantX >> (unsigned)(-expX + 1);
        expX = 0;
    }
    return bit_cast<double>( signX | (uint64_t)expX << 52 | mantX & MANT );
}
It's about six times faster than the glibc 2.31 solution in my
benchmark. The returned NaNs and the raised exceptions are MSVC-
and glibc-compatible.
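To make the reduction loop above easier to follow, here is a stripped-down demonstration with hypothetical values (it omits the normalization step, which the real code performs after every iteration):

#include <cstdint>
#include <cstdio>

// Stripped-down illustration of the reduction loop in myFmod: the 53-bit
// remainder is shifted left by at most 11 bits per step, so the shifted
// value still fits into 64 bits before the integer modulo.
int main()
{
    uint64_t mantX = 0x10000000000000u; // 2^52, a normalized mantissa
    uint64_t mantY = 0x1921FB54442D18u; // some 53-bit divisor mantissa
    int expDiff = 40;                   // pretend exponent difference
    while( expDiff > 0 && mantX )
    {
        int bits = expDiff <= 11 ? expDiff : 11;
        mantX = (mantX << bits) % mantY;
        expDiff -= bits;
    }
    std::printf( "remainder mantissa: %llx\n", (unsigned long long)mantX );
}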
On Mon, 24 Feb 2025 11:48:08 +0100, Bonita Montero
<Bonita.Montero@gmail.com> wrote:
It's about six times faster than the glibc 2.31 solution in my
benchmark. The returned NaNs and the raised exceptions are MSVC-
and glibc-compatible.
double myFmod(double x, double y)
{
    double div = x / y;
    return y * (div - std::round(div));
}
On 24.02.2025 at 16:52, Michael S wrote:
That is why I don't use multiplication. Did you ever ask yourself what the meaning of 'f' in 'fma' is?
The FMA-instructions produce the same results:
#include <iostream>
#include <random>
#include <bit>
#include <cmath>
#include <iomanip>
#include <intrin.h>
using namespace std;
int main()
{
    auto fma = []( double a, double b, double c )
    {
        __m128d mA, mB, mC;
        mA.m128d_f64[0] = a;
        mB.m128d_f64[0] = b;
        mC.m128d_f64[0] = c;
        return _mm_fmadd_pd( mA, mB, mC ).m128d_f64[0];
    };
    mt19937_64 mt;
    uniform_int_distribution<uint64_t> finites( 1, 0x7FEFFFFFFFFFFFFFu );
    auto rnd = [&]() -> double { return bit_cast<double>( finites( mt ) ); };
    ptrdiff_t nEQs = 0;
    for( ptrdiff_t r = 0; r != 1'000'000; ++r )
    {
        double
            a = rnd(), b = rnd(), c = rnd(),
            rA = fma( a, b, c ),
            rB = a * b + c;
        nEQs += rA != rB; // count mismatches between fused and unfused results
    }
    cout << hexfloat << nEQs / 1.0e6 << endl;
}
On 25.02.2025 at 16:26, Michael S wrote:
On Tue, 25 Feb 2025 09:09:21 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
On 24.02.2025 at 16:52, Michael S wrote:
That is why I don't use multiplication. Did you ever ask yourself what the meaning of 'f' in 'fma' is?
The FMA-instructions produce the same results:
[test program quoted above; snipped]
GIGO.
Do a proper test then you'd get a proper answer.
The test is proper with MSVC since MSVC doesn't replace the "a * b + c" operation with an FMA operation. With your code it isn't guaranteed that the CPU-specific FMA operations are used. I'm using the SSE FMA operation explicitly, and I'm using it for a million random finite double values.
On 24.02.2025 at 23:21, Mr Flibble wrote:
double myFmod(double x, double y)
{
    double div = x / y;
    return y * (div - std::round(div));
}
Doesn't work, not only for the reasons already mentioned.
On 25.02.2025 at 18:17, Michael S wrote:
Don't invent your own fma(). Use the one provided by the library. Then MSVC will do what it is prescribed to do by the standard.
I want to be sure that I'm using the SSE FMA operation and not a conventional substitute of two instructions.
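For what it's worth, there is a library-only way to check whether std::fma really fuses, without reaching for intrinsics: a true FMA can recover the rounding error of a product, while a mul-plus-add substitute cannot. A hedged sketch (my check, not either poster's test):

#include <cmath>
#include <iostream>

int main()
{
    double a = 1.0 + 0x1p-27;
    double p = a * a;                   // product rounded to double
    double err = std::fma( a, a, -p );  // exact a*a - p only with a real FMA
    // With a fused operation err == 2^-54; with mul+add it collapses to 0.
    std::cout << (err != 0.0 ? "fused\n" : "not fused\n");
}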
Originally, I didn't even try to investigate what garbage exactly you are feeding to your test. Now I took a look. It seems that you are doing something fundamentally stupid, like making all fma inputs positive.
I extended the test to incorporate all possible double bit representations (taken from mt()) - with no difference.
On Tue, 25 Feb 2025 07:37:23 +0100, Bonita Montero
<Bonita.Montero@gmail.com> wrote:
On 24.02.2025 at 23:21, Mr Flibble wrote:
double myFmod(double x, double y)
{
    double div = x / y;
    return y * (div - std::round(div));
}
Doesn't work, not only for the reasons already mentioned.
double myFmod(double x, double y)
{
    double div = x / y;
    return y * (div - std::trunc(div));
}
/Flibble
double myFmod(double x, double y)
{
    double div = x / y;
    return y * (div - std::trunc(div));
}
Here is an example of the program that prints 250508 on Intel Haswell
CPU, but prints 0 on Intel Ivy Bridge.
Compiled as 'cl -O1 -W4 -MD fma_tst0.c'.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
static unsigned long long rnd(void)
{
    unsigned long long x = 0;
    for (int i = 0; i < 5; ++i)
        x = (x << 15) + (rand() & 0x7FFF);
    return x;
}

int main(void)
{
    srand(1);
    int n = 0;
    for (int i = 0; i < 1000000; ++i) {
        double x = rnd() * 0x1p-64;
        double y = rnd() * 0x1p-64;
        double z = rnd() * 0x1p-114;
        double r1 = x*y + z;
        double r2 = fma(x, y, z);
        n += r1 != r2;
    }
    printf("%d\n", n);
    return 0;
}
It is certainly worth a bug report, but I am afraid that Microsoft will do nothing to fix it, likely claiming that they don't care about old hardware.
On Wed, 26 Feb 2025 00:36:19 +0200
Michael S <already5chosen@yahoo.com> wibbled:
Here is an example of the program that prints 250508 on an Intel Haswell CPU, but prints 0 on Intel Ivy Bridge. Compiled as 'cl -O1 -W4 -MD fma_tst0.c'. [program quoted above; snipped]
It is certainly worth a bug report, but I am afraid that Microsoft will do nothing to fix it, likely claiming that they don't care about old hardware.
Just FYI - it also returns 0 when compiled by Clang on an ARM Mac.
On Wed, 26 Feb 2025 08:16:01 -0000 (UTC)
Muttley@DastardlyHQ.org wrote:
I played a little on godbolt and it seems that the bug is relatively
new. clang 13 still generates correct code. clang 14 does not. I.e.
slightly less than 3 years.
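A hedged aside: if one wants the r1 = x*y + z in Michael's test to stay un-fused regardless of compiler version, contraction can be disabled explicitly. gcc/clang accept -ffp-contract=off; the standard pragma below is honored by clang (gcc may ignore it in C++):

#include <cmath>

// Keep the separately rounded product: block FMA contraction locally.
double unfused( double x, double y, double z )
{
#pragma STDC FP_CONTRACT OFF
    return x * y + z;           // product and sum each rounded once
}

// Explicitly fused for comparison: single rounding of x*y + z.
double fused( double x, double y, double z )
{
    return std::fma( x, y, z );
}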
I wanted to optimize fmod to be a bit faster. This is my C++20 solution. [myFmod code quoted above; snipped]
It's about six times faster than the glibc 2.31 solution in my benchmark. The returned NaNs and the raised exceptions are MSVC- and glibc-compatible.
double myFmod(double x, double y)
{
    return x / y - std::trunc(x / y) * y;
}
/Flibble
On Thu, 27 Feb 2025 21:16:50 +0000
Mr Flibble <leigh@i42.co.uk> wrote:
double myFmod(double x, double y)
{
    return x / y - std::trunc(x / y) * y;
}
/Flibble
Nonsense.
The one below is not nonsense, but still very bad.

double myFmod(double x, double y)
{
    return x - trunc(x / y) * y;
}

Even ignoring potential overflow during the division, this method is very imprecise:

(1e3/9 - trunc(1e3/9))*9 = 1.000000000000028
(1e6/9 - trunc(1e6/9))*9 = 0.999999999985448
(1e9/9 - trunc(1e9/9))*9 = 0.999999940395355
(1e12/9 - trunc(1e12/9))*9 = 1.000030517578125
(1e15/9 - trunc(1e15/9))*9 = 0.984375

OTOH:

1e3 - trunc(1e3/9)*9 = 1
1e6 - trunc(1e6/9)*9 = 1
1e9 - trunc(1e9/9)*9 = 1
1e12 - trunc(1e12/9)*9 = 1
1e15 - trunc(1e15/9)*9 = 1
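A small harness to reproduce the comparison above (my code, assuming the default round-to-nearest mode):

#include <cmath>
#include <cstdio>

// Reproduces the table above: x - trunc(x/y)*y stays exact for these
// inputs, while (x/y - trunc(x/y))*y accumulates rounding error.
int main()
{
    for( double x : { 1e3, 1e6, 1e9, 1e12, 1e15 } )
    {
        double q = std::trunc( x / 9 );
        std::printf( "x=%g: scaled fraction = %.17g, direct = %.17g\n",
                     x, (x / 9 - q) * 9, x - q * 9 );
    }
}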
I wanted to optimize fmod to be a bit faster. This is my C++20 solution. [myFmod code quoted above; snipped]
It's about six times faster than the glibc 2.31 solution in my benchmark. The returned NaNs and the raised exceptions are MSVC- and glibc-compatible.
On 28.02.2025 at 19:30, Mr Flibble wrote:
double my_fmod(double x, double y)
{
    if (y == 0.0)
        return x / y;
    return x - std::trunc(x / y) * y;
}
This still sucks. Try it with this test:
#include <iostream>
#include <cmath>
#include <random>
#include <bit>
using namespace std;
double trivialFmod( double a, double b );
int main()
{
    mt19937_64 mt;
    uniform_int_distribution<uint64_t> gen( 1, 0x7FEFFFFFFFFFFFFFu );
    size_t imprecise = 0, outOfRange = 0;
    for( size_t r = 1'000'000; r; --r )
    {
        double
            a = bit_cast<double>( gen( mt ) ),
            b = bit_cast<double>( gen( mt ) ),
            fm = fmod( a, b ),
            tfm = trivialFmod( a, b );
        imprecise += fm != tfm;
        outOfRange += tfm >= b;
    }
    auto print = []( char const *what, size_t n )
        { cout << what << (ptrdiff_t)n / (1.0e6 / 100) << "%" << endl; };
    print( "imprecise: ", imprecise );
    print( "out of range: ", outOfRange );
}

double trivialFmod( double a, double b )
{
    return a - trunc( a / b ) * b;
}
On 28.02.2025 at 20:47, Mr Flibble wrote:
IEEE 754 does not define how std::fmod should behave, only
std::remainder.
There's only one way to do it for finite numbers.
On 01.03.2025 at 15:37, Mr Flibble wrote:
On Fri, 28 Feb 2025 20:49:38 +0100, Bonita Montero
<Bonita.Montero@gmail.com> wrote:
On 28.02.2025 at 20:47, Mr Flibble wrote:
IEEE 754 does not define how std::fmod should behave, only std::remainder.
There's only one way to do it for finite numbers.
Not true as there is a fixed mantissa size, so finite precision, making your test case useless if x is sufficiently large.
The way to do a modulo calculation for every floating point value except inf or nan (finite numbers) is always the same for all implementations. And correct implementations are always without precision loss, i.e. exact.
As I've shown, solutions like yours are only 50% exact and in 2% of cases they generate out-of-range results.
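A one-paragraph sketch (my addition, not from the thread) of why the exact fmod result is always representable in binary64:

\text{If } e_x < e_y:\; |x| < 2^{e_x+1} \le 2^{e_y} \le |y| \;\implies\; \operatorname{fmod}(x,y) = x.
\text{Otherwise } x \text{ and every partial remainder } r = x - k\,y \text{ are integer multiples of } q = 2^{e_y-52},
\text{and } |r| < |y| = m_y\,q \text{ with } m_y < 2^{53}, \text{ so } |r|/q < 2^{53} \text{ fits the 53-bit significand: no rounding occurs.}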
On 01.03.2025 at 18:11, Mr Flibble wrote:
Thus you are asserting that all finite numbers have an exact IEEE 754 floating point representation ...
But the results of floating-point operations usually come with precision loss; fmod() is the exception - it is always exact when implemented properly. You didn't implement it correctly.
On 01.03.2025 at 21:32, Mr Flibble wrote:
On Sat, 1 Mar 2025 21:02:32 +0100, Bonita Montero
<Bonita.Montero@gmail.com> wrote:
On 01.03.2025 at 18:11, Mr Flibble wrote:
Thus you are asserting that all finite numbers have an exact IEEE 754 floating point representation ...
But the results of floating-point operations usually come with precision loss; fmod() is the exception - it is always exact when implemented properly. You didn't implement it correctly.
False, see my other post.
/Flibble
This:
#include <iostream>
#include <cmath>
#include <random>
#include <bit>
using namespace std;
double my_fmod( double x, double y );
int main()
{
    mt19937_64 mt;
    uniform_int_distribution<uint64_t> gen( 1, 0x7FEFFFFFFFFFFFFFu );
    size_t imprecise = 0, outOfRange = 0;
    for( size_t r = 1'000'000; r; --r )
    {
        double
            a = bit_cast<double>( gen( mt ) ),
            b = bit_cast<double>( gen( mt ) ),
            fm = fmod( a, b ),
            tfm = my_fmod( a, b );
        imprecise += fm != tfm;
        outOfRange += tfm >= b;
    }
    auto print = []( char const *what, size_t n )
        { cout << what << (ptrdiff_t)n / (1.0e6 / 100) << "%" << endl; };
    print( "imprecise: ", imprecise );
    print( "out of range: ", outOfRange );
}

double my_fmod( double x, double y )
{
    if( y == 0.0 )
        return x / y;
    return x - std::trunc( x / y ) * y;
}
... prints this ...
imprecise: 49.9096%
out of range: 2.0039%
So your solution is unusable.
I wanted to optimize fmod to be a bit faster. This is my C++20 solution. [myFmod code quoted above; snipped]
It's about six times faster than the glibc 2.31 solution in my benchmark. The returned NaNs and the raised exceptions are MSVC- and glibc-compatible.
I wanted to optimize fmod to be a bit faster. This is my C++20
solution.
It's about six times faster than the glibc 2.31 solution in my
benchmark. The returned NaNs and the raised exceptions are MSVC-
and glibc-compatible.
On Mon, 24 Feb 2025 11:48:08 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
I wanted to optimize fmod to be a bit faster. This is my C++20
solution.
How about that?
Pay attention, it's C rather than C++. So 5 times shorter :-)
It's not the fastest for big x/y ratios, but rather simple and not
*too* slow. At least as long as hardware supports FMA.
For small x/y ratios it should be pretty close to best possible.
This is my code, improved by the _udiv128 intrinsic of MSVC which provides a 128/64 division. With that my algorithm becomes nearly three times as fast as before. I'll provide a g++ / clang++ compatible version with inline assembly later.
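For what it's worth, g++ and clang++ don't need inline assembly for this: unsigned __int128 gives the same 128-by-64 remainder (the compilers lower it to __umodti3, as discussed later in the thread). A sketch:

#include <cstdint>

// Portable counterpart to MSVC's _udiv128 remainder: a 128-by-64 modulo
// via unsigned __int128 (gcc/clang extension). Sketch only.
uint64_t rem128( uint64_t hi, uint64_t lo, uint64_t divisor )
{
    unsigned __int128 n = ((unsigned __int128)hi << 64) | lo;
    return (uint64_t)(n % divisor);
}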
With MSVC and the fairer random numbers I chose I'm 2.5 times
faster.
On Sun, 2 Mar 2025 17:10:37 +0100, Bonita Montero
<Bonita.Montero@gmail.com> wrote:
This is my code, improved by the _udiv128 intrinsic of MSVC which provides a 128/64 division. With that my algorithm becomes nearly three times as fast as before. I'll provide a g++ / clang++ compatible version with inline assembly later.
Still slow tho.
/Flibble
On 02.03.2025 at 18:09, Bonita Montero wrote:
On 02.03.2025 at 18:07, Michael S wrote:
On Sun, 2 Mar 2025 17:26:18 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
With MSVC and the fairer random numbers I chose I'm 2.5 times
faster.
I don't trust your benchmarking skills.
I don't prefer your fast-path value combinations; I chose 75% random finite combinations instead.
This generates the random values for me:
mt19937_64 mt;
uniform_int_distribution<uint64_t>
    genType( 0, 15 ),
    genFinite( 0x0010000000000000u, 0x7FEFFFFFFFFFFFFFu ),
    genDen( 1, 0x000FFFFFFFFFFFFFu ),
    genNaN( 0x7FF0000000000001u, 0x7FFFFFFFFFFFFFFFu );
auto get = [&]()
{
    constexpr uint64_t
        FINITE_THRESH = 4, // 75% finites
        ZERO = 3,          // 6.25% zeroes
        DENORMALS = 2,     // 6.25% denormals
        INF = 1,           // 6.25% Infs
        NAN_ = 0;          // 6.25% NaNs
    uint64_t
        sign = mt() & -numeric_limits<int64_t>::min(),
        type = genType( mt );
    if( type >= FINITE_THRESH )
        return bit_cast<double>( sign | genFinite( mt ) );
    if( type == ZERO )
        return bit_cast<double>( sign );
    if( type == DENORMALS )
        return bit_cast<double>( sign | genDen( mt ) );
    if( type == INF )
        return bit_cast<double>( sign | 0x7FF0000000000000u );
    assert(type == NAN_);
    return bit_cast<double>( sign | genNaN( mt ) );
};
Your coding-style is horrible.
Mine is "too beautiful" (my employer).
On 02.03.2025 at 18:52, Michael S wrote:
Your distribution is very different from what one would expect in real-world usage.
There's no real-world distribution, so I chose all finites to be equally likely.
In real-world usage, apart from the debugging stage, there are no inf, nan or y=zero. x=zero happens, but with a lower probability than 6%. Denormals also happen, but with an even lower probability than x=zero.
As I said, if I choose 100% finites from the 1 to 0x7FEFFFFFFFFFFFFFu range I'm still 2.4 times faster.
Also, in the majority of real-world scenarios huge x/y ratios either don't happen at all or are extremely rare.
On 02.03.2025 at 19:09, Michael S wrote:
You didn't answer the second point, which is critical. In your fully random scenario 48.7% of cases are huge x/y. That is completely unrealistic. I can easily improve the speed of huge x/y at the cost of less simple code and a small slowdown of the more typical case, but I consider it counterproductive. It seems the authors of standard libraries agree with my judgment.
And you use fesetround, which takes about 40 clock cycles on my CPU under Linux (WSL2). Better to choose _mm_getcsr() and _mm_setcsr() for that, which directly set the FPU control word for SSE / AVX* / AVX-512. This is multiple times faster. For the x87-FPU you'd have to choose different code, but the x87-FPU is totally broken anyway.
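A minimal sketch of the MXCSR switch suggested here (SSE/AVX state only; bits 13-14 of MXCSR select the rounding mode, 0x6000 = round toward zero). Whether the compiler keeps the division between the two _mm_setcsr calls is not guaranteed without care, so treat this as an illustration:

#include <xmmintrin.h>

// Temporarily force round-toward-zero via MXCSR, as suggested above.
double divTowardZero( double x, double y )
{
    unsigned oldCsr = _mm_getcsr();
    _mm_setcsr( oldCsr | 0x6000 );  // RC bits 13-14 = 11b: chop
    volatile double q = x / y;      // volatile: keep the op inside the window
    _mm_setcsr( oldCsr );           // restore the caller's rounding mode
    return q;
}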
On 02.03.2025 at 18:52, Michael S wrote:
Your distribution is very different from what one would expect in real-world usage.
There's no real-world distribution, so I chose all finites to be equally likely.
In real-world usage, apart from the debugging stage, there are no inf, nan or y=zero. x=zero happens, but with a lower probability than 6%. Denormals also happen, but with an even lower probability than x=zero.
As I said, if I choose 100% finites from the 1 to 0x7FEFFFFFFFFFFFFFu range I'm still 2.4 times faster.
On 02.03.2025 at 21:58, Michael S wrote:
If it was on the fast path, I'd consider it. But improving the speed of an unimportant slow path at the cost of portability? Nah.
For the 75% random finites case I've shown, your code becomes about 28% faster with _mm_getcsr() and _mm_setcsr().
For the x87-FPU you'd have to choose different code, but the x87-FPU is totally broken anyway.
On 03.03.2025 at 17:20, Michael S wrote:
x87 is not broken relative to its own specifications. It just happens to be slightly different from the IEEE-754 specifications. Which is not surprising considering that it predates the IEEE-754 standard by several years.
You can reduce the width of the mantissa to 53 or 24 bits, but the exponent is always 15 bits; that's not up to any specification.
Today there are very few reasons to still use x87 in new software. However, back in its time x87 was an astonishingly good piece of work, less so in performance (it was not fast, even by the standards of its time), more so for features, precision and especially for the consistency of its arithmetic.
There are compiler settings which enforce consistency by storing values with reduced precision and re-loading them to give predictable results when you use values < long double. That's a mess.
On Mon, 3 Mar 2025 12:52:51 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
On 02.03.2025 at 21:58, Michael S wrote:
If it was on the fast path, I'd consider it. But improving the speed of an unimportant slow path at the cost of portability? Nah.
For the 75% random finites case I've shown, your code becomes about 28% faster with _mm_getcsr() and _mm_setcsr().
It seems that the major slowdown is specific to the combination of msys2 libraries with Zen3/4 CPUs. I see even worse slowness of get/set rounding mode on msys2/Zen3. The same msys-compiled binary on Intel CPUs is o.k., at least relative to the other heavy things going on on the slow path. On Zen3 with Microsoft's compiler/library it is also o.k. As long as it only affects the slow path there is nothing to get agitated about.
This is my code, improved by the _udiv128 intrinsic of MSVC which provides a 128/64 division. With that my algorithm becomes nearly three times as fast as before. I'll provide a g++ / clang++ compatible version with inline assembly later.
template<bool _32 = false>
double xMyFmod( double x, double y )
{
    constexpr uint64_t
        SIGN = 1ull << 63,
        IMPLICIT = 1ull << 52,
        MANT = IMPLICIT - 1,
        QBIT = 1ull << 51;
    uint64_t const
        binX = bit_cast<uint64_t>( x ),
        binY = bit_cast<uint64_t>( y );
    static auto abs = []( uint64_t m ) { return m & ~SIGN; };
    auto isNaN = []( uint64_t m ) { return abs( m ) >= 0x7FF0000000000001u; };
    auto isSig = []( uint64_t m ) { return !(m & QBIT); };
    if( isNaN( binX ) ) [[unlikely]] // x == NaN
#if defined(_MSC_VER)
        return bit_cast<double>( isNaN( binY ) ? binY | binX & binY & QBIT : binX );
#else
    {
        if( isSig( binX ) || isNaN( binY ) && isSig( binY ) ) [[unlikely]]
            feraiseexcept( FE_INVALID );
        return bit_cast<double>( binX | QBIT );
    }
#endif
    auto isInf = []( uint64_t m ) { return abs( m ) == 0x7FF0000000000000u; };
    if( isNaN( binY ) ) [[unlikely]] // x != NaN || y == NaN
#if defined(_MSC_VER)
    {
        if constexpr( _32 )
            if( isInf( binX ) )
                feraiseexcept( FE_INVALID );
        return y;
    }
#else
    {
        if( isSig( binY ) ) [[unlikely]]
            feraiseexcept( FE_INVALID );
        return bit_cast<double>( binY | QBIT );
    }
#endif
    if( isInf( binX ) ) // x == Inf
    {
        feraiseexcept( FE_INVALID );
#if defined(_MSC_VER)
        return bit_cast<double>( binX & ~MANT | QBIT );
#else
        return -numeric_limits<double>::quiet_NaN();
#endif
    }
    if( !abs( binY ) ) [[unlikely]] // y == 0
    {
        feraiseexcept( FE_INVALID );
#if defined(_MSC_VER)
        return numeric_limits<double>::quiet_NaN();
#else
        return -numeric_limits<double>::quiet_NaN();
#endif
    }
    if( !abs( binX ) || isInf( binY ) ) [[unlikely]] // x == 0 || y == Inf
        return x;
    auto exp = []( uint64_t b ) -> int { return b >> 52 & 0x7FF; };
    int
        expX = exp( binX ),
        expY = exp( binY );
    auto mant = []( uint64_t b ) { return b & MANT; };
    uint64_t
        mantX = mant( binX ),
        mantY = mant( binY );
    int headBits = 11;
    auto normalize = [&]( int &exp, uint64_t &mant )
    {
        unsigned shift = countl_zero( mant ) - headBits;
        mant <<= shift;
        exp -= shift;
    };
    auto build = [&]( int &exp, uint64_t &mant )
    {
        if( exp ) [[likely]]
            mant |= IMPLICIT;
        else
        {
            exp = 1;
            normalize( exp, mant );
        }
    };
    build( expX, mantX );
    build( expY, mantY );
    // strip common trailing zero bits so more head bits are free per step
    int
        tailX = countr_zero( mantX ),
        tailY = countr_zero( mantY ),
        tailBits = tailX <= tailY ? tailX : tailY;
    headBits += tailBits;
    mantX >>= tailBits;
    mantY >>= tailBits;
    uint64_t signX = binX & SIGN;
    int expDiff;
#if defined(_MSC_VER)
    // _udiv128 reduces the exponent gap by 63 bits per iteration
    while( (expDiff = expX - expY) > 63 )
    {
        unsigned long long hi = mantX >> 1, lo = mantX << 63, remainder;
        (void)_udiv128( hi, lo, mantY, &remainder );
        expX -= 63;
        mantX = remainder;
        normalize( expX, mantX );
    }
#endif
    while( (expDiff = expX - expY) > 0 )
    {
        unsigned bits = expDiff <= headBits ? expDiff : headBits;
        if( !(mantX = (mantX << bits) % mantY) ) [[unlikely]]
            return bit_cast<double>( signX );
        expX -= bits;
        normalize( expX, mantX );
    }
    if( !expDiff && mantX >= mantY ) [[unlikely]]
        if( (mantX -= mantY) ) [[likely]]
            normalize( expX, mantX );
        else
            return bit_cast<double>( signX );
    mantX <<= tailBits;
    mantY <<= tailBits;
    if( expX <= 0 ) [[unlikely]]
    {
        assert(expX >= -51);
        mantX = mantX >> (unsigned)(-expX + 1);
        expX = 0;
    }
    return bit_cast<double>( signX | (uint64_t)expX << 52 | mantX & MANT );
}

double myFmod( double x, double y )
{
    return xMyFmod( x, y );
}

inline float myFmod( float x, float y )
{
    return (float)xMyFmod<true>( (double)x, (double)y );
}
On 09.03.2025 at 00:31, Michael S wrote:
This code does not work in plenty of cases. It seems your test vectors have poor coverage. Try, for example, x=1.8037919852882307, y=2.22605637008665934e-194

cout << hexfloat << myFmod( 1.8037919852882307, 2.22605637008665934e-194 ) << endl;
cout << hexfloat << fmod( 1.8037919852882307, 2.22605637008665934e-194 ) << endl;
Prints the same result under Linux and Windows.
This prints the same result (0.0) under Windows and Linux:
On 09.03.2025 at 10:46, Michael S wrote:
This prints the same result (0.0) under Windows and Linux:
I am no longer going to look at your code until you start posting full files, with all includes and using directives.
You could simply replace the single function I've shown.
On 09.03.2025 at 11:09, Michael S wrote:
On Sun, 9 Mar 2025 10:54:40 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
On 09.03.2025 at 10:46, Michael S wrote:
This prints the same result (0.0) under Windows and Linux:
I am no longer going to look at your code until you start posting full files, with all includes and using directives.
You could simply replace the single function I've shown.
I can. I don't want to do it. You want me to look at/test your code? You post full code. Simple, isn't it?
I've read you don't trust my tests, so use your own with myFmod.
On 09.03.2025 at 11:51, wij wrote:
From the point of view of implementing myFmod, I think using a C-like coding style would be better, but it all depends on what you want to achieve.
A C coding style would result in about twice the code.
On 09.03.2025 at 13:23, Michael S wrote:
So far all we have seen from you is 2-3 times longer than C code (real C, not C-style C++) ...
Not true, since I save a lot of redundant code with [&]-lambdas.
On Sun, 9 Mar 2025 11:21:55 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
On 09.03.2025 at 11:09, Michael S wrote:
On Sun, 9 Mar 2025 10:54:40 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
On 09.03.2025 at 10:46, Michael S wrote:
This prints the same result (0.0) under Windows and Linux:
I am no longer going to look at your code until you start posting full files, with all includes and using directives.
You could simply replace the single function I've shown.
I can. I don't want to do it. You want me to look at/test your code? You post full code. Simple, isn't it?
I've read you don't trust my tests, so use your own with myFmod.
ok. I was too curious :(
This version produces correct results both when compiled under MSVC and when compiled with other compilers. It is a little faster too. With MSVC on an old Intel CPU it is only 2.5 times slower than the standard library in the relevant range of x/y. The previous version was 3.4 times slower. With gcc and clang it is still more than 6 times slower than the standard library. The coding style is now less insane.
Measurements in nsec.
First result - Intel Skylake at 4.25 GHz
Second result - AMD Zen3 at 3.7 GHz

abs(x/y) in range that matters [0.5:2**53]:
Standard MSVC library - 11.1 10.4
Standard glibc library - 5.4 10.7
Yours (MSVC) - 27.6 11.5
Yours (gcc) - 36.4 23.7
Yours (clang) - 37.4 24.3

abs(x/y) in full range [2**-2090:2**2090]:
Standard MSVC library - 109.4 153.5
Standard glibc library - 102.3 155.5
Yours (MSVC) - 134.9 52.6
Yours (gcc) - 284.7 151.8
Yours (clang) - 285.2 156.5
On 09.03.2025 at 16:26, Mr Flibble wrote:
So it is slow, ergo a pointless alternative to what we already have.
glibc does it nearly the same way I do because the FMA solution isn't portable. If fma( a, b, c ) is substituted with a * b + c because there's no proper CPU instruction, the whole approach doesn't work. And with support for _udiv128 my solution has about the same performance as Michael's solution with clang++ 18.1.7.
On 09.03.2025 at 13:09, Michael S wrote:
[benchmark table quoted above; snipped]
With MSVC and an arbitrary combination of finite x and y on my
Zen4-machine:
your fmod: 77.1214
my: 38.4486
With MSVC and an arbitrary combination of finite x with exponents
ranging from 0x3FF to 0x433 (close exponents) on my Zen4-machine:
your fmod: 23.6423
my: 9.79146
This is a nearly proper implementation of your idea with FMA intrinsics and SSE/AVX control-register access (xtrunc() is my faster trunc() replacement, not shown here):

double fmody( double x, double y )
{
    if( isnan( x ) ) [[unlikely]]
        return x;
    if( isnan( y ) ) [[unlikely]]
        return y;
    if( isinf( x ) || !y ) [[unlikely]]
    {
        feraiseexcept( FE_INVALID );
        return numeric_limits<double>::quiet_NaN();
    }
    if( !x || isinf( y ) ) [[unlikely]]
        return x;
    uint64_t sign = bit_cast<uint64_t>( x ) & numeric_limits<int64_t>::min();
    x = abs( x );
    y = abs( y );
    int oldCsr = _mm_getcsr();
    constexpr int CHOP = 0x6000; // MXCSR rounding-control bits: round toward zero
    _mm_setcsr( oldCsr | CHOP );
    constexpr uint64_t
        EXP = -(1ll << 52),
        MANT = ~EXP;
    uint64_t binY = bit_cast<uint64_t>( y );
    int64_t expY = binY & EXP;
    if( !expY ) [[unlikely]]
        expY = (uint64_t)(0 - (countl_zero( binY & MANT ) - 12)) << 52;
    while( x >= y )
    {
        uint64_t yExpAdd = 0;
        double div = x / y;
        if( div < 0x1.FFFFFFFFFFFFFp+1023 ) [[likely]]
            div = xtrunc( div );
        else
        {
            // the quotient would overflow: scale x down to 54 bits above y
            uint64_t
                binX = bit_cast<uint64_t>( x ),
                newExp = expY + (54ull << 52);
            yExpAdd = (binX & EXP) - newExp;
            div = xtrunc( bit_cast<double>( newExp | binX & MANT ) / y );
        }
        __m128d mult1, mult2, add;
#if defined(_MSC_VER)
        mult1.m128d_f64[0] = div;
        mult2.m128d_f64[0] = -bit_cast<double>( binY + yExpAdd );
        add.m128d_f64[0] = x;
        x = _mm_fmadd_sd( mult1, mult2, add ).m128d_f64[0];
#else
        mult1[0] = div;
        mult2[0] = -bit_cast<double>( binY + yExpAdd );
        add[0] = x;
        x = _mm_fmadd_sd( mult1, mult2, add )[0];
#endif
        if( !x ) [[unlikely]]
        {
            _mm_setcsr( oldCsr ); // restore the rounding mode on this exit path too
            return bit_cast<double>( sign );
        }
    }
    _mm_setcsr( oldCsr );
    return bit_cast<double>( sign | bit_cast<uint64_t>( x ) );
}
The only thing that doesn't work currently is the support for denormal values.
On 09.03.2025 at 19:04, Michael S wrote:
I already said that I don't approve of non-portable constructs like _mm_getcsr()/_mm_setcsr() except when they help important cases and help A LOT. Neither applies here.
I dropped getting and setting the MXCSR register to set the rounding mode. Now I set the rounding mode directly with the intrinsics _mm_div_round_sd and _mm_fmadd_round_sd. Now this more manual code does help "A LOT", i.e. the solution is 2/3 faster than your initial solution with clang++-18 under Linux. But there's still a problem with the denormals.
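For reference, a sketch of the embedded-rounding intrinsics mentioned here; they are AVX-512 only (_mm_div_round_sd and _mm_fmadd_round_sd require AVX512F), which is the portability catch:

#include <immintrin.h>

// Per-instruction rounding override (AVX-512): divide toward zero
// without touching MXCSR at all.
double divChopAvx512( double x, double y )
{
    __m128d a = _mm_set_sd( x ), b = _mm_set_sd( y );
    __m128d q = _mm_div_round_sd( a, b, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC );
    return _mm_cvtsd_f64( q );
}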
I tried to use an unsigned __int128 / uint64_t division and I expected that the compiler would call a library function which does the subtract and shift manually. But g++ as well as clang++ somehow handle this as a true 128-by-64 division.
And now my original solution is about 23% faster than your solution for close exponents (exponent difference <= 53), and for arbitrary exponent differences your solution is about 3% faster.
double fmodO( double x, double y )
{
    constexpr uint64_t
        SIGN = 1ull << 63,
        IMPLICIT = 1ull << 52,
        MANT = IMPLICIT - 1,
        QBIT = 1ull << 51;
    uint64_t const
        binX = bit_cast<uint64_t>( x ),
        binY = bit_cast<uint64_t>( y );
    static auto abs = []( uint64_t m ) { return m & ~SIGN; };
    auto isNaN = []( uint64_t m ) { return abs( m ) >= 0x7FF0000000000001u; };
    auto isSig = []( uint64_t m ) { return !(m & QBIT); };
    if( isNaN( binX ) ) [[unlikely]] // x == NaN
#if defined(_MSC_VER)
        return bit_cast<double>( isNaN( binY ) ? binY | binX & binY & QBIT : binX );
#else
    {
        if( isSig( binX ) || isNaN( binY ) && isSig( binY ) ) [[unlikely]]
            feraiseexcept( FE_INVALID );
        return bit_cast<double>( binX | QBIT );
    }
#endif
    if( isNaN( binY ) ) [[unlikely]] // x != NaN || y == NaN
#if defined(_MSC_VER)
        return y;
#else
    {
        if( isSig( binY ) ) [[unlikely]]
            feraiseexcept( FE_INVALID );
        return bit_cast<double>( binY | QBIT );
    }
#endif
    auto isInf = []( uint64_t m ) { return abs( m ) == 0x7FF0000000000000u; };
    if( isInf( binX ) ) // x == Inf
    {
        feraiseexcept( FE_INVALID );
#if defined(_MSC_VER)
        return bit_cast<double>( binX & ~MANT | QBIT );
#else
        return -numeric_limits<double>::quiet_NaN();
#endif
    }
    if( !abs( binY ) ) [[unlikely]] // y == 0
    {
        feraiseexcept( FE_INVALID );
#if defined(_MSC_VER)
        return numeric_limits<double>::quiet_NaN();
#else
        return -numeric_limits<double>::quiet_NaN();
#endif
    }
    if( !abs( binX ) || isInf( binY ) ) [[unlikely]] // x == 0 || y == Inf
        return x;
    auto exp = []( uint64_t b ) -> int { return b >> 52 & 0x7FF; };
    int
        expX = exp( binX ),
        expY = exp( binY );
    auto mant = []( uint64_t b ) { return b & MANT; };
    uint64_t
        mantX = mant( binX ),
        mantY = mant( binY );
    int headBits = 11;
    auto normalize = [&]( int &exp, uint64_t &mant )
    {
        unsigned shift = countl_zero( mant ) - headBits;
        mant <<= shift;
        exp -= shift;
    };
    auto build = [&]( int &exp, uint64_t &mant )
    {
        if( exp ) [[likely]]
            mant |= IMPLICIT;
        else
        {
            exp = 1;
            normalize( exp, mant );
        }
    };
    build( expX, mantX );
    build( expY, mantY );
    int
        tailX = countr_zero( mantX ),
        tailY = countr_zero( mantY ),
        tailBits = tailX <= tailY ? tailX : tailY;
    mantX >>= tailBits;
    mantY >>= tailBits;
    headBits += tailBits;
    uint64_t signX = binX & SIGN;
    int expDiff;
#if defined(_MSC_VER) && !defined(__llvm__) && defined(_M_X64)
    while( (expDiff = expX - expY) > 0 )
    {
        unsigned bits = expDiff <= 63 ? expDiff : 63;
        unsigned long long hi = mantX >> (64 - bits), lo = mantX << bits, remainder;
        (void)_udiv128( hi, lo, mantY, &remainder );
        if( !remainder ) [[unlikely]]
            return bit_cast<double>( signX );
        mantX = remainder;
        expX -= bits;
        normalize( expX, mantX );
    }
#elif defined(__GNUC__) || defined(__clang__)
    while( (expDiff = expX - expY) > 0 )
    {
        unsigned bits = expDiff <= 63 ? expDiff : 63;
        unsigned __int128 dividend = (unsigned __int128)mantX << bits;
        mantX = (uint64_t)(dividend % mantY);
        if( !mantX ) [[unlikely]]
            return bit_cast<double>( signX );
        expX -= bits;
        normalize( expX, mantX );
    }
#else
    while( (expDiff = expX - expY) > 0 )
    {
        unsigned bits = expDiff <= headBits ? expDiff : headBits;
        if( !(mantX = (mantX << bits) % mantY) ) [[unlikely]]
            return bit_cast<double>( signX );
        expX -= bits;
        normalize( expX, mantX );
    }
#endif
    if( !expDiff && mantX >= mantY ) [[unlikely]]
        if( (mantX -= mantY) ) [[likely]]
            normalize( expX, mantX );
        else
            return bit_cast<double>( signX );
    mantX <<= tailBits;
    mantY <<= tailBits;
    if( expX <= 0 ) [[unlikely]]
    {
        assert(expX >= -51);
        mantX = mantX >> (unsigned)(-expX + 1);
        expX = 0;
    }
    return bit_cast<double>( signX | (uint64_t)expX << 52 | mantX & MANT );
}
On 10.03.2025 at 15:33, Michael S wrote:
That's what I also guessed, but maybe we don't have the same glibc version. Or the code just runs more efficiently on my Zen4 CPU.
For my code it's strangely slow. On a 4+ GHz Zen4 I would expect ~5 nsec.
I want your crystal ball.
On 10.03.2025 at 13:26, Michael S wrote:
It should be an obvious thing to anybody who cared to think for 15 seconds.
I was wrong, and it absolutely isn't obvious. The compiler calls the glibc function __umodti3, which has a shortcut for results that fit into 64 bits. Although there's an additional call on Linux, the code with clang++-18 is still a bit faster than my Windows solution with the _udiv128 intrinsic; that's really surprising.
A library or compiler starts with hi1 = rem(0:x_hi, y). Now hi1 is guaranteed to be smaller than y, so it's safe to do rem(hi1:x_lo, y). It is not *very* slow, but still there are 2 dependent division operations.
Both parameters are variable, so there can be no static evaluation at compile time.
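Michael's two-step scheme, spelled out (a hedged, conceptual sketch of the decomposition __umodti3 effectively performs; the first remainder guarantees the second step cannot overflow, but there are two dependent divisions):

#include <cstdint>

// rem(0:x_hi, y) first; hi1 < y then makes rem(hi1:x_lo, y) safe.
uint64_t umod128by64( uint64_t x_hi, uint64_t x_lo, uint64_t y )
{
    uint64_t hi1 = x_hi % y;
    unsigned __int128 n = ((unsigned __int128)hi1 << 64) | x_lo;
    return (uint64_t)(n % y);
}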
Measurements in nsec.
First result - Intel Skylake at 4.25 GHz
Second result - AMD Zen3 at 3.7 GHz

abs(x/y) in range that matters [0.5:2**53]:
Standard MSVC library - 11.1 10.4
Standard glibc library - 5.4 10.7
Yours (MSVC) - 27.6 11.5
Yours (gcc) - 36.4 23.7
Yours (clang) - 37.4 24.3
My (MSVC) - 10.7 11.3
My (gcc) - 7.7 7.6
My (clang) - 6.3 7.5
Your last (MSVC) - 27.6 11.6
Your last (gcc) - 33.8 28.6
Your last (clang) - 32.6 26.9

abs(x/y) in full range [2**-2090:2**2090]:
Standard MSVC library - 109.4 153.5
Standard glibc library - 102.3 155.5
Yours (MSVC) - 134.9 52.6
Yours (gcc) - 284.7 151.8
Yours (clang) - 285.2 156.5
My (MSVC) - 62.1 61.1
My (gcc) - 60.8 59.1
My (clang) - 59.9 59.3
Your last (MSVC) - 135.0 52.5
Your last (gcc) - 172.1 137.3
Your last (clang) - 167.7 126.3
These are the clang++-18 results on my Zen4 computer under WSL2 with close exponents (0x3FF to 0x433):
fmodO: 9.42929
fmodM: 11.7907
So my code is about 23% faster on my computer. These are the results for arbitrary exponents:
fmodO: 41.9115
fmodM: 41.2062
Exactly what I already mentioned.
Maybe that depends on the glibc version, because a different glibc version might have a differently efficient __umodti3 function.
Your code would be much, much cleaner and more readable if you replaced the lambdas with proper functions. ...
For me that doesn't make a difference.
On 10.03.2025 at 15:59, Michael S wrote:
One does not need a crystal ball to extrapolate the speed of simple integer code from a 3.7 GHz Zen3 to a 4+ GHz Zen4 (probably 4.7 or 4.8 GHz).
If only one core computes, the clock is even 5.7 GHz. But the results aren't better than shown.
But your repeated avoidance of answering my direct questions about what exactly you are using as "my code" makes me even more suspicious.
I've shown the latest code of fmodO; you can easily integrate it into your own benchmark. I don't use uniform_real_distribution for the random numbers but uniform_int_distribution with bounds of 0x3FFull << 52 and 0x433ull << 52 for the close-exponent case. The whole test code has hardly changed since my initial post.
On 10.03.2025 at 16:43, Michael S wrote:
That's not the question I have been asking for the 4th or 5th time. My question is what *exactly* is fmodM.
fmodM is your code, M like Michael.
I did. And presented the results. I am fully willing to believe that the difference between our clang results is explained by a difference in the compiler support library.
Or maybe the different CPU.
But so far I find no explanation for why the results for what you claim to be *my* code are so much slower in your measurements, despite your faster CPU.
I compiled with -O2 and -march=native; that should be sufficient.
BTW, the number you did not publish at all was the speed of the fmod() routine from the standard library. ...
I have the fmod code of glibc 2.31, which is rather slow since it does the subtract and shifts manually - code from Sun from the 90s.
Assuming that the exponent of y is fixed at 1023, that is approximately the same as my own test.
Yes, but as you said earlier, close exponents are more relevant.
On 10.03.2025 at 18:51, Michael S wrote:
What my code? Post it.
I just changed the function name, and the code uses xtrunc() instead of trunc() since trunc() is slow with MSVC. I removed my improvement with _mm_getcsr() / _mm_setcsr() since the speedup was noticeable but not significant, unlike the xtrunc() optimization, which gave a speedup of about +100% with MSVC.
double fmodM( double x, double y )
{
    if( isnan( x ) )
        return x;
    // pre-process y
    if( isless( y, 0 ) )
        y = -y;
    else if( isgreater( y, 0 ) )
        ;
    else {
        if( isnan( y ) )
            return y;
        // y == 0
        feraiseexcept( FE_INVALID );
        return nan( "y0" );
    }
    // y in (0:+inf]
    // Quick path
    double xx = x * 0x1p-53;
    if( xx > -y && xx < y ) {
        // among other things, x guaranteed to be finite
        if( x > -y && x < y )
            return x; // case y=+-inf covered here
        double d = xtrunc( x / y );
        double res = fma( -y, d, x );
        if( signbit( x ) != signbit( res ) ) {
            // overshoot because of unfortunate division rounding;
            // it is extremely rare for small x/y,
            // but not rare when x/y is close to 2**53
            res = fma( -y, d + (signbit( x ) * 2 - 1), x );
        }
        return res;
    }
    // slow path
    if( isinf( x ) ) {
        feraiseexcept( FE_INVALID );
        return nan( "xinf" );
    }
    int oldRnd = fegetround();
    fesetround( FE_TOWARDZERO );
    double ax = fabs( x );
    do {
        double yy = y;
        while( yy < ax * 0x1p-1022 )
            yy *= 0x1p1021;
        do
            ax = fma( -yy, xtrunc( ax / yy ), ax );
        while( ax >= yy );
    } while( ax >= y );
    ax = copysign( ax, x );
    fesetround( oldRnd );
    return ax;
}
Your idea is really elegant, and as I've shown it could be significantly improved with SSE 4.1 along with FMA3. But at the point where I noticed how performant a 128:64 modulo division through glibc is, and as this is superior to the FMA solution, I dropped the whole idea and removed the SSE FMA code from my test program.
Partially false alarm: I forgot to convert my xtrunc() function into an xfloor() function. These are the accuracy results now compared to fmod() of MSVC:

53 bits shared accuracy
equal results: 100%
equal exceptions: 91.017%
equal NaN signs: 96.475%
equal NaN types: 99.78%
equal NaNs: 96.253%

These are the accuracy results compared to glibc:

53 bits shared accuracy
equal results: 100%
equal exceptions: 99.901%
equal NaN signs: 87.224%
equal NaN types: 93.181%
equal NaNs: 80.405%
On 10.03.2025 at 22:34, Bonita Montero wrote:
And for arbitrary exponents (0x1 to 0x7FE):
fmodO: 9.29622
fmodM: 11.4518
Sorry, the copy buffer wasn't refreshed with the new results:
fmodO: 40.4702
fmodM: 40.1652
On 11.03.2025 at 11:29, Michael S wrote:
Pay attention that fmod() has no requirements w.r.t. such exceptions as FE_INEXACT, FE_UNDERFLOW and the non-standard FE_DENORMAL.
Yes, that's why I evaluate FE_INVALID only. But your code can also set FE_INEXACT due to your "rounding" with sign change. MSVC also seems to try to do the math with the FPU before an integer fallback, because with exponent differences <= 53 MSVC's fmod() often sets FE_INEXACT; but I ignore that because it shouldn't be part of fmod().
Strictly speaking, even raising FE_OVERFLOW is not illegal, but doing so would be bad quality of implementation.
Couldn't FE_OVERFLOW happen with your implementation when the exponents are so far apart that you get inf from the division?
Also the spec does not say what happens to FE_INVALID when one of the inputs is a signalling NaN.
See my code; I return MSVC- and glibc-compatible NaNs and I raise the same exceptions. MSVC sets FE_INVALID only when x is inf or y is zero; glibc in addition raises FE_INVALID when either operand is a signalling NaN.
On 11.03.2025 at 11:29, Michael S wrote:
Pay attention that fmod() has no requirements w.r.t. such exceptions as FE_INEXACT, FE_UNDERFLOW and the non-standard FE_DENORMAL.
Yes, that's why I evaluate FE_INVALID only.
On 11.03.2025 at 12:34, Michael S wrote:
Exactly. Both options are legal. MS's decision to not set FE_INVALID is as good as glibc's decision to set it.
If I do an SSE/AVX operation where either operand is a signalling NaN I get FE_INVALID; since the FPU behaves this way, the MSVC runtime should do so as well.
BTW, what is the output of the MS library in that case? SNAN or QNAN?
Results with SNaN parameters are always QNaN; that should be common to any FPU.
On Tue, 11 Mar 2025 13:10:31 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
On 11.03.2025 at 12:34, Michael S wrote:
Exactly. Both options are legal. MS's decision to not set FE_INVALID is as good as glibc's decision to set it.
If I do an SSE/AVX operation where either operand is a signalling NaN I get FE_INVALID; since the FPU behaves this way, the MSVC runtime should do so as well.
BTW, what is the output of the MS library in that case? SNAN or QNAN?
Results with SNaN parameters are always QNaN; that should be common to any FPU.
But not when the library routine does not use the FPU, or uses the FPU only for comparison ops. The point is, it does not sound right if an SNAN is *silently* converted to a QNAN. That type of conversion has to be loud, i.e. accompanied by the setting of FE_INVALID.
On Mon, 10 Mar 2025 19:00:06 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
Your idea is really elegant
I'd rather call it "simple" or "straightforward". "Elegant" in my book is something else. For example, the code above is closer to what I consider elegant. Maybe later today or tomorrow I'll show you a solution that I consider bright. Bright, but impractical.
Pay attention that fmod() has no requirements w.r.t. such exceptions as FE_INEXACT, FE_UNDERFLOW and the non-standard FE_DENORMAL. Strictly speaking, even raising FE_OVERFLOW is not illegal, but doing so would be bad quality of implementation. Also the spec does not say what happens to FE_INVALID when one of the inputs is a signalling NaN.
On Mon, 10 Mar 2025 20:38:18 +0200
Michael S <already5chosen@yahoo.com> wrote:
On Mon, 10 Mar 2025 19:00:06 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
Your idea is really elegant
I'd rather call it "simple" or "straightforward". "Elegant" in my book is something else. For example, the code above is closer to what I consider elegant. Maybe later today or tomorrow I'll show you a solution that I consider bright. Bright, but impractical.
Here, here!
The bright part is in lines 18 to 29. The rest are hopefully competent technicalities.