Forum: >>> Magnum BBS <<<

FDIV and microcode

From Stefan Monnier@21:1/5 to All on Fri Jul 11 18:05:45 2025

I was just wondering: seeing how most/all recent hardware problems in
Intel CPUs seem to come together with discussions about "microcode
update to avoid the problematic case", I was wondering if the
introduction of such microcode and ability to update it "in the field"
was made as a result of lessons learned from the famous FDIV bug (for
which Intel had to actually replace the CPUs, which seems a lot more
costly).

Or is it just a coincidence?

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Stefan Monnier on Fri Jul 11 22:42:15 2025

On Fri, 11 Jul 2025 22:05:45 +0000, Stefan Monnier wrote:

I was just wondering: seeing how most/all recent hardware problems in
Intel CPUs seem to come together with discussions about "microcode
update to avoid the problematic case", I was wondering if the
introduction of such microcode and ability to update it "in the field"
was made as a result of lessons learned from the famous FDIV bug (for
which Intel had to actually replace the CPUs, which seems a lot more
costly).

Or is it just a coincidence?

The update part dates from ~1999. AMD put in a dozen-odd "words"
of writeable microcode--a lot of that was because we got wind that
Intel was putting in writeable microcode.

It is STILL one of the reasons I lean towards programmable sequencers
in my designs--not in DECODE, but over in the FUs.

Writeable microcode saved VAX big time when they finally got around
to pasting 2 VAXen into one box and found the Enqueue and Dequeue
instructions had not been made ATOMIC.

At CMU, I wrote PDP-11/45 floating point in microcode for the
PDP-11/40 (circa 1974-5)



From Lawrence D'Oliveiro@21:1/5 to All on Fri Jul 11 23:16:11 2025

On Fri, 11 Jul 2025 22:42:15 +0000, MitchAlsup1 wrote:

Writeable microcode saved VAX big time when they finally got around
to pasting 2 VAXen into one box and found the Enqueue and Dequeue
instructions had not been made ATOMIC.

Nothing that had to be “found”. INSQUE and REMQUE were never guaranteed
to have bus interlocks, and they used absolute addresses for queue links.
The same original VAX architecture spec (I’m looking at the one from
1979) also defines instructions for “self-relative” queues, which are
clearly designed to handle being located at different places in
different address spaces. Operations on those are bus-interlocked and
require all queue entries to be quadword-aligned: INSQHI, INSQTI,
REMQHI, REMQTI.


From Lawrence D'Oliveiro@21:1/5 to Stefan Monnier on Fri Jul 11 23:17:23 2025

On Fri, 11 Jul 2025 18:05:45 -0400, Stefan Monnier wrote:

... I was wondering if the introduction of such microcode and
ability to update it "in the field" was made as a result of lessons
learned from the famous FDIV bug (for which Intel had to actually
replace the CPUs, which seems a lot more costly).

It seems to me this would depend a lot on how flexible the
microarchitecture really was. How far can you go in redefining the
operation of a fixed set of hardware function units?


From Anton Ertl@21:1/5 to Stefan Monnier on Sat Jul 12 09:40:42 2025

Stefan Monnier <monnier@iro.umontreal.ca> writes:

I was just wondering: seeing how most/all recent hardware problems in
Intel CPUs seem to come together with discussions about "microcode
update to avoid the problematic case", I was wondering if the
introduction of such microcode and ability to update it "in the field"
was made as a result of lessons learned from the famous FDIV bug (for
which Intel had to actually replace the CPUs, which seems a lot more
costly).

I have read histories of the development of several Intel CPUs (I
know I have read about the 386, P6 (Pentium Pro ff.), and maybe also
the Pentium 4), and IIRC one of these histories discussed this.
They put in extra RAM for storing some microcode updates, and of
course they had to argue for it against those who said that it was
unnecessary because it had not been needed before. IIRC the FDIV bug did
not play a role in the development of the feature (for P6 it would
have been late in the game); if it was on the P6, it might have
vindicated the decision after it was made.

In any case, chicken bits for disabling certain features that might
turn out to be buggy are common practice in hardware development, the
extra microcode space is just the microcode variant of a bunch of
chicken bits.

From what I have heard, this typically works by saying which
instruction is replaced, and then if that instruction occurs, the
replacement microcode is run.
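
A minimal sketch of that mechanism in Python, assuming a fixed ROM of routines plus a small number of writable patch slots that take priority when an opcode matches. The names (PatchableSequencer, install_patch, the opcode strings) are hypothetical and only illustrate the idea; real patch formats are not public and certainly differ.

```python
from typing import Callable, Dict

Routine = Callable[[int, int], int]


class PatchableSequencer:
    """Toy model: a fixed ROM of routines plus a few writable patch slots."""

    def __init__(self, rom: Dict[str, Routine], patch_slots: int = 4):
        self.rom = dict(rom)
        self.patch_slots = patch_slots          # size of the patch RAM (assumed)
        self.patches: Dict[str, Routine] = {}   # opcode -> replacement routine

    def install_patch(self, opcode: str, routine: Routine) -> None:
        """Register a replacement routine for one opcode, if a slot is free."""
        if opcode not in self.rom:
            raise KeyError(f"unknown opcode {opcode!r}")
        if opcode not in self.patches and len(self.patches) >= self.patch_slots:
            raise RuntimeError("patch RAM is full")
        self.patches[opcode] = routine

    def execute(self, opcode: str, a: int, b: int) -> int:
        """Run the patched routine if the opcode matches, else the ROM one."""
        routine = self.patches.get(opcode, self.rom[opcode])
        return routine(a, b)


# A ROM with a deliberately buggy SUB, then a field patch that replaces it.
rom = {"ADD": lambda a, b: a + b, "SUB": lambda a, b: a - b - 1}
cpu = PatchableSequencer(rom, patch_slots=1)
before = cpu.execute("SUB", 5, 3)             # buggy ROM routine
cpu.install_patch("SUB", lambda a, b: a - b)
after = cpu.execute("SUB", 5, 3)              # patched routine
```

The limited number of slots is the point of the model: a patch can only redirect a handful of opcodes, which is why it behaves like a set of chicken bits rather than a general reprogramming facility.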

There was this bug in IIRC Skylake about decoding branches just before
or crossing I-cache-line boundaries that was only discovered after
several years (so it also affected Kaby Lake etc.). This was fixed
with a microcode update that prevented putting such branches in the
micro-op cache; the bug does not appear when the branch is coming
through the decoder. I wonder how Intel managed to influence the
allocation of the micro-op cache in that specific way.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>


From Terje Mathisen@21:1/5 to Stefan Monnier on Sat Jul 12 16:40:25 2025

Stefan Monnier wrote:

I was just wondering: seeing how most/all recent hardware problems in
Intel CPUs seem to come together with discussions about "microcode
update to avoid the problematic case", I was wondering if the
introduction of such microcode and ability to update it "in the field"
was made as a result of lessons learned from the famous FDIV bug (for
which Intel had to actually replace the CPUs, which seems a lot more
costly).

Or is it just a coincidence?

Possibly not:

We did of course discuss this a lot during the FDIV software replacement
work; at that time it was impossible to trap on any unprivileged
opcode, so FDIV could never have been fixed that way.

If the microcode patch facility includes a mask to trap any arbitrary
opcode, then it would also be able to fix FDIV-type bugs, but unless
they have significant space in that writeable microcode store, it
might not have been possible to make it fit:

A minimal patch working the same way as our SW would need to start by
inspecting the top 10 bits of the divisor mantissa, and then fall
through for all except 5 different hits.

So at least 1024 bits for a lookup table, or relatively complicated
logic if implemented directly with gates.
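
To make the 1024-bit figure concrete, here is a rough Python sketch of such a table: one bit per possible 10-bit pattern, with only five bits set. The names (AT_RISK_PREFIXES, build_table, needs_fixup) are hypothetical. The five patterns themselves are an assumption taken from published analyses of the bug (four fraction bits after the leading 1, followed by six 1 bits), not from this thread, and the indexing convention (ten fraction bits below the leading 1) is likewise assumed.

```python
# Assumed at-risk patterns: the four fraction bits after the leading 1.
# Each must be followed by six 1 bits to hit the bad table entries.
# Illustrative only; not stated anywhere in this thread.
AT_RISK_PREFIXES = (0b0001, 0b0100, 0b0111, 0b1010, 0b1101)


def build_table() -> int:
    """Return a 1024-bit mask with one bit set per at-risk 10-bit pattern."""
    table = 0
    for prefix in AT_RISK_PREFIXES:
        index = (prefix << 6) | 0b111111   # 4 prefix bits + six 1 bits
        table |= 1 << index
    return table


TABLE = build_table()


def needs_fixup(top10: int) -> bool:
    """top10: the ten fraction bits just below the leading 1 of the divisor."""
    if not 0 <= top10 < 1024:
        raise ValueError("expected a 10-bit value")
    return bool((TABLE >> top10) & 1)


# Every index that would trigger the slow path; all others fall through.
hits = [i for i in range(1024) if needs_fixup(i)]
```

Under these assumptions 1019 of the 1024 entries are zero, which is why the table is cheap in software (it stays cached) but wasteful as raw bits in a small writable microcode store.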

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Thomas Koenig@21:1/5 to Terje Mathisen on Sat Jul 12 16:52:16 2025

Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

Stefan Monnier wrote:

I was just wondering: seeing how most/all recent hardware problems in
Intel CPUs seem to come together with discussions about "microcode
update to avoid the problematic case", I was wondering if the
introduction of such microcode and ability to update it "in the field"
was made as a result of lessons learned from the famous FDIV bug (for
which Intel had to actually replace the CPUs, which seems a lot more
costly).

Or is it just a coincidence?

Possibly not:

We did of course discuss this a lot during the FDIV software replacement
work; at that time it was impossible to trap on any unprivileged
opcode, so FDIV could never have been fixed that way.

If the microcode patch facility includes a mask to trap any arbitrary
opcode, then it would also be able to fix FDIV-type bugs, but unless
they have significant space in that writeable microcode store, it
might not have been possible to make it fit:

A minimal patch working the same way as our SW would need to start by
inspecting the top 10 bits of the divisor mantissa, and then fall
through for all except 5 different hits.

So at least 1024 bits for a lookup table, or relatively complicated
logic if implemented directly with gates.

Gates are not possible in Microcode...

https://www.righto.com/2024/12/this-die-photo-of-pentium-shows.html

has a nice explanation of the bug and how they fixed it in silicon
(and simplified the PLA they used as a lookup table, as well -
apparently, it had not been optimized with don't cares, or
the bug might never have bitten).

--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.


From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sat Jul 12 19:34:44 2025

On Fri, 11 Jul 2025 23:17:23 +0000, Lawrence D'Oliveiro wrote:

On Fri, 11 Jul 2025 18:05:45 -0400, Stefan Monnier wrote:

... I was wondering if the introduction of such microcode and
ability to update it "in the field" was made as a result of lessons
learned from the famous FDIV bug (for which Intel had to actually
replace the CPUs, which seems a lot more costly).

It seems to me this would depend a lot on how flexible the
microarchitecture really was. How far can you go in redefining the
operation of a fixed set of hardware function units?

A lot of microcode escape sequences are like ::

    Check for special operand
    if found trap to software
    else run in hardware

Which can be "done" in one microcode "word"


From MitchAlsup1@21:1/5 to Thomas Koenig on Sat Jul 12 19:39:55 2025

On Sat, 12 Jul 2025 16:52:16 +0000, Thomas Koenig wrote:

Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

Stefan Monnier wrote:

I was just wondering: seeing how most/all recent hardware problems in
Intel CPUs seem to come together with discussions about "microcode
update to avoid the problematic case", I was wondering if the
introduction of such microcode and ability to update it "in the field"
was made as a result of lessons learned from the famous FDIV bug (for
which Intel had to actually replace the CPUs, which seems a lot more
costly).

Or is it just a coincidence?

Possibly not:

We did of course discuss this a lot during the FDIV software replacement
work; at that time it was impossible to trap on any unprivileged
opcode, so FDIV could never have been fixed that way.

If the microcode patch facility includes a mask to trap any arbitrary
opcode, then it would also be able to fix FDIV-type bugs, but unless
they have significant space in that writeable microcode store, it
might not have been possible to make it fit:

A minimal patch working the same way as our SW would need to start by
inspecting the top 10 bits of the divisor mantissa, and then fall
through for all except 5 different hits.

So at least 1024 bits for a lookup table, or relatively complicated
logic if implemented directly with gates.

Gates are not possible in Microcode...

Errrrrrrrrrr, no.

In the MC68020 there was an XOR gate between the 2 NOR planes of microcode.
We used this to perform logic that was not "small" when implemented in
pure PLA logic forms (saving space in ROM). In addition, both planes
could assert multiple "read" lines and the accessed "words" would be
ORed together. This was especially powerful in the address modes and
split A&D register files.

You can call this a microcode space optimizer, or you can call it gates.
It is a lot closer to gates in the modern sense of the word.
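
As a rough illustration of the two tricks described above, here is a toy Python model: reading several rows of one plane at once ORs their words together, and the outputs of the two planes are then combined with XOR. The function names and the 4-bit example words are hypothetical, and the model ignores signal polarity and everything else about the real circuit.

```python
from typing import Iterable, List


def read_plane(rows: List[int], selected: Iterable[int]) -> int:
    """Assert several read lines at once; the selected words are ORed."""
    word = 0
    for r in selected:
        word |= rows[r]
    return word


def control_word(plane_a: List[int], sel_a: Iterable[int],
                 plane_b: List[int], sel_b: Iterable[int]) -> int:
    """Combine the two plane outputs through the XOR stage."""
    return read_plane(plane_a, sel_a) ^ read_plane(plane_b, sel_b)


# Made-up 4-bit words, purely for illustration.
plane_a = [0b0001, 0b0110, 0b1000]
plane_b = [0b0011, 0b0100]

merged = read_plane(plane_a, [0, 1])                  # 0b0001 | 0b0110
word = control_word(plane_a, [0, 1], plane_b, [0])    # merged ^ 0b0011
```

The space saving comes from composition: a word that would otherwise need its own ROM row can be produced by selecting rows that already exist and letting the OR and XOR stages build it.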

https://www.righto.com/2024/12/this-die-photo-of-pentium-shows.html

has a nice explanation of the bug and how they fixed it in silicon
(and simplified the PLA they used as a lookup table, as well -
apparently, it had not been optimized with don't cares, or
the bug might never have bitten).


From Lawrence D'Oliveiro@21:1/5 to All on Mon Jul 14 04:18:12 2025

On Sat, 12 Jul 2025 19:39:55 +0000, MitchAlsup1 wrote:

You can call this a microcode space optimizer, or you can call it gates.
It is a lot closer to gates in the modern sense of the word.

Another example of the blurred boundary between “hardware” and “software”.
I seem to come across someone who is convinced there is a clear separation
between the two about once every week or two ...


From Terje Mathisen@21:1/5 to All on Mon Jul 14 16:59:04 2025

MitchAlsup1 wrote:

On Fri, 11 Jul 2025 23:17:23 +0000, Lawrence D'Oliveiro wrote:

On Fri, 11 Jul 2025 18:05:45 -0400, Stefan Monnier wrote:

... I was wondering if the introduction of such microcode and
ability to update it "in the field" was made as a result of lessons
learned from the famous FDIV bug (for which Intel had to actually
replace the CPUs, which seems a lot more costly).

It seems to me this would depend a lot on how flexible the
microarchitecture really was. How far can you go in redefining the
operation of a fixed set of hardware function units?

A lot of microcode escape sequences are like ::

    Check for special operand
    if found trap to software
    else run in hardware

Which can be "done" in one microcode "word"

That would actually work perfectly for the original FDIV bug:

Check top 10 mantissa bits, trap if one of the five patterns.

Without the call, store to memory & reload overhead, the actual cost of
such a workaround would have been just a cycle or two for a 40-cycle
operation that is performed very rarely in optimized sw.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From MitchAlsup1@21:1/5 to Terje Mathisen on Tue Jul 15 20:20:00 2025

On Mon, 14 Jul 2025 14:59:04 +0000, Terje Mathisen wrote:

MitchAlsup1 wrote:

On Fri, 11 Jul 2025 23:17:23 +0000, Lawrence D'Oliveiro wrote:

On Fri, 11 Jul 2025 18:05:45 -0400, Stefan Monnier wrote:

... I was wondering if the introduction of such microcode and
ability to update it "in the field" was made as a result of lessons
learned from the famous FDIV bug (for which Intel had to actually
replace the CPUs, which seems a lot more costly).

It seems to me this would depend a lot on how flexible the
microarchitecture really was. How far can you go in redefining the
operation of a fixed set of hardware function units?

A lot of microcode escape sequences are like ::

    Check for special operand
    if found trap to software
    else run in hardware

Which can be "done" in one microcode "word"

That would actually work perfectly for the original FDIV bug:

Check top 10 mantissa bits, trap if one of the five patterns.

Without the call, store to memory & reload overhead, the actual cost of
such a workaround would have been just a cycle or two for a 40-cycle
operation that is performed very rarely in optimized sw.

At some_small_percent it is rare, but not "very rare" due to the
latency multiplier of ~40 cycles. 0.1% corresponds to 4% of CPU
time being FDIV. At 40 cycles of latency there simply HAS TO BE
some other instruction waiting for its result.

{All I am complaining about is the word "very"}
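
The arithmetic behind the 0.1% and 4% figures can be sketched as follows, assuming every other instruction costs one cycle on average and nothing overlaps with the FDIV latency. The function name and its parameters are hypothetical.

```python
def fdiv_time_fraction(fdiv_share: float,
                       fdiv_cycles: float = 40.0,
                       other_cycles: float = 1.0) -> float:
    """Fraction of total cycles spent in FDIV.

    fdiv_share:   fraction of executed instructions that are FDIV
    fdiv_cycles:  latency of one FDIV (about 40 in the post above)
    other_cycles: assumed average cost of every other instruction
    """
    fdiv_time = fdiv_share * fdiv_cycles
    other_time = (1.0 - fdiv_share) * other_cycles
    return fdiv_time / (fdiv_time + other_time)


# 0.1% of instructions being FDIV, with the default assumptions.
share = fdiv_time_fraction(0.001)
```

With these defaults the result is a little under 4% of all cycles, which matches the rough figure in the post. If independent work can overlap the divide, the real share is smaller.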



From Terje Mathisen@21:1/5 to All on Wed Jul 16 14:37:51 2025

MitchAlsup1 wrote:

On Mon, 14 Jul 2025 14:59:04 +0000, Terje Mathisen wrote:

MitchAlsup1 wrote:

On Fri, 11 Jul 2025 23:17:23 +0000, Lawrence D'Oliveiro wrote:

On Fri, 11 Jul 2025 18:05:45 -0400, Stefan Monnier wrote:

... I was wondering if the introduction of such microcode and
ability to update it "in the field" was made as a result of lessons
learned from the famous FDIV bug (for which Intel had to actually
replace the CPUs, which seems a lot more costly).

It seems to me this would depend a lot on how flexible the
microarchitecture really was. How far can you go in redefining the
operation of a fixed set of hardware function units?

A lot of microcode escape sequences are like ::

    Check for special operand
    if found trap to software
    else run in hardware

Which can be "done" in one microcode "word"

That would actually work perfectly for the original FDIV bug:

Check top 10 mantissa bits, trap if one of the five patterns.

Without the call, store to memory & reload overhead, the actual cost of
such a workaround would have been just a cycle or two for a 40-cycle
operation that is performed very rarely in optimized sw.

At some_small_percent it is rare, but not "very rare" due to the
latency multiplier of ~40 cycles. 0.1% corresponds to 4% of CPU
time being FDIV. At 40 cycles of latency there simply HAS TO BE
some other instruction waiting for its result.

{All I am complaining about is the word "very"}

That's fine.

I was comparing to the real-world measurements we did in 1994/95, where
we found that the actual number of FDIVs in performance-sensitive inner
loops was in fact quite low.

The main outlier was probably perspective-correct 3D graphics, like
Quake which did an FDIV for every 16 pixels, but regular Fortran/C(++)
code had a measured slowdown in the range from fractional percent to
near zero.

My sw code needed to load the top mantissa bits and look them up, then
in 1019 out of 1024 cases just drop down at a cost of well under 10
clock cycles, since the lookup table would mostly stay in $L1.

In the bad case of a hit, the code would save the current precision &
rounding mode, then switch to 80-bit, multiply both divisor and dividend
by 15/16 (guaranteed to be exact!), then restore the original precision
before falling into the FDIV, so in this case we added 5-6 FMUL clock
cycles, plus two resets of the FPU precision mode which are typically
more costly.

In the absolute worst case of doing nothing but chained FDIVs, with bad
divisors, we would add ~20 (afair) clock cycles to the basic 40, so a
50% slowdown.
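
A small Python check of the two claims above, using exact rational arithmetic rather than real x87 code. It assumes the operands started as doubles with 53-bit significands, so multiplying by 15 needs at most four extra bits and still fits the 64-bit significand of the 80-bit format. The helper names are hypothetical, and the sample operands are the well-known FDIV test pair, not values from this thread.

```python
from fractions import Fraction

SCALE = Fraction(15, 16)


def scaled_operands(dividend: Fraction, divisor: Fraction):
    """Scale both operands by 15/16; the quotient is unchanged."""
    return dividend * SCALE, divisor * SCALE


def fits_in_significand(mantissa: int, bits: int) -> bool:
    """True if an integer significand needs no more than `bits` bits."""
    return mantissa.bit_length() <= bits


# Largest 53-bit significand (double precision), as an integer.
m53 = (1 << 53) - 1
# Numerator of m53 * 15/16; dividing by 16 only changes the exponent.
scaled = 15 * m53
exact_in_extended = fits_in_significand(scaled, 64)

# Scaling numerator and denominator alike leaves the quotient untouched.
a, b = Fraction(4195835), Fraction(3145727)
sa, sb = scaled_operands(a, b)
same_quotient = (sa / sb == a / b)

# Worst case from the post: about 20 extra cycles on top of 40.
worst_case_slowdown = 20 / 40
```

The same check fails for an operand that already uses all 64 significand bits, so the exactness guarantee depends on where the operands came from.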

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
