• FDIV and microcode

    From Stefan Monnier@21:1/5 to All on Fri Jul 11 18:05:45 2025
    I was just wondering: seeing how most/all recent hardware problems in
    Intel CPUs seem to come together with discussions about "microcode
    update to avoid the problematic case", I was wondering if the
    introduction of such microcode and ability to update it "in the field"
    was made as a result of lessons learned from the famous FDIV bug (for
    which Intel had to actually replace the CPUs, which seems a lot more
    costly).

    Or is it just a coincidence?


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Fri Jul 11 22:42:15 2025
    On Fri, 11 Jul 2025 22:05:45 +0000, Stefan Monnier wrote:

    I was just wondering: seeing how most/all recent hardware problems in
    Intel CPUs seem to come together with discussions about "microcode
    update to avoid the problematic case", I was wondering if the
    introduction of such microcode and ability to update it "in the field"
    was made as a result of lessons learned from the famous FDIV bug (for
    which Intel had to actually replace the CPUs, which seems a lot more
    costly).

    Or is it just a coincidence?

    The update part dates from ~1999. AMD put in a dozen-odd "words"
    of writeable microcode--a lot of that was because we got wind that
    Intel was putting in writeable microcode.

    It is STILL one of the reasons I lean towards programmable sequencers
    in my designs--not in DECODE, but over in the FUs.

    Writeable microcode saved VAX big time when they finally got around
    to pasting 2 VAXen into one box and found the Enqueue and Dequeue
    instructions had not been made ATOMIC.

    At CMU, I wrote PDP-11/45 floating point in microcode for the
    PDP-11/40 (circa 1974-5)


  • From Lawrence D'Oliveiro@21:1/5 to All on Fri Jul 11 23:16:11 2025
    On Fri, 11 Jul 2025 22:42:15 +0000, MitchAlsup1 wrote:

    Writeable microcode saved VAX big time when they finally got around
    to pasting 2 VAXen into one box and found the Enqueue and Dequeue
    instructions had not been made ATOMIC.

    Nothing that had to be “found”. INSQUE and REMQUE were never guaranteed
    to have bus interlocks, and they used absolute addresses for queue links.
    Whereas the same original VAX architecture spec (I’m looking at the one
    from 1979) also defines instructions for “self-relative” queues, which
    are clearly designed to handle being located at different places in
    different address spaces. Operations on these are bus-interlocked and
    require all queue entries to be quadword-aligned: INSQHI, INSQTI,
    REMQHI, REMQTI.

  • From Lawrence D'Oliveiro@21:1/5 to Stefan Monnier on Fri Jul 11 23:17:23 2025
    On Fri, 11 Jul 2025 18:05:45 -0400, Stefan Monnier wrote:

    ... I was wondering if the introduction of such microcode and
    ability to update it "in the field" was made as a result of lessons
    learned from the famous FDIV bug (for which Intel had to actually
    replace the CPUs, which seems a lot more costly).

    It seems to me this would depend a lot on how flexible the
    microarchitecture really was. How far can you go in redefining the
    operation of a fixed set of hardware function units?

  • From Anton Ertl@21:1/5 to Stefan Monnier on Sat Jul 12 09:40:42 2025
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    I was just wondering: seeing how most/all recent hardware problems in
    Intel CPUs seem to come together with discussions about "microcode
    update to avoid the problematic case", I was wondering if the
    introduction of such microcode and ability to update it "in the field"
    was made as a result of lessons learned from the famous FDIV bug (for
    which Intel had to actually replace the CPUs, which seems a lot more
    costly).

    I read a history of the development of an Intel CPU (I
    know I have read about the 386, P6 (Pentium Pro ff.), and maybe also
    the Pentium 4), and IIRC in one of these histories it was discussed.
    They put in extra RAM for storing some microcode updates, and of
    course they had to argue for it against those who said that it's
    unnecessary because we did not need it before. IIRC the FDIV bug did
    not play a role in the development of the feature (for P6 it would
    have been late in the game); if it was on the P6, it might have
    vindicated the decision after it was made.

    In any case, chicken bits for disabling certain features that might
    turn out to be buggy are common practice in hardware development, the
    extra microcode space is just the microcode variant of a bunch of
    chicken bits.

    From what I have heard, this typically works by saying which
    instruction is replaced, and then if that instruction occurs, the
    replacement microcode is run.
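
A minimal sketch of that "match an instruction, run the replacement" idea, as a toy model only. The class name `PatchableSequencer`, the string opcodes and the callable "routines" are all hypothetical illustration; nothing here describes how any real CPU stores or matches patches.

```python
from typing import Callable, Dict

Routine = Callable[[int], int]


class PatchableSequencer:
    """Toy model: fixed ROM routines plus a small writeable patch store."""

    def __init__(self, rom: Dict[str, Routine]) -> None:
        self.rom = dict(rom)                      # fixed at "fabrication"
        self.patch_ram: Dict[str, Routine] = {}   # loadable in the field

    def load_patch(self, opcode: str, routine: Routine) -> None:
        """Register a replacement routine for an existing opcode."""
        if opcode not in self.rom:
            raise KeyError(f"unknown opcode: {opcode}")
        self.patch_ram[opcode] = routine

    def execute(self, opcode: str, operand: int) -> int:
        """Run the patch if one matches this opcode, else the ROM routine."""
        routine = self.patch_ram.get(opcode, self.rom[opcode])
        return routine(operand)


# Toy usage: "INC" ships with a bug (adds 2) and is patched afterwards.
seq = PatchableSequencer({"INC": lambda x: x + 2, "NEG": lambda x: -x})
before = seq.execute("INC", 5)
seq.load_patch("INC", lambda x: x + 1)
after = seq.execute("INC", 5)
```

Unpatched opcodes such as `NEG` keep using the ROM routine, which is the property that makes a small patch store sufficient in this model.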

    There was this bug in IIRC Skylake about decoding branches just before
    or crossing I-cache-line boundaries that was only discovered after
    several years (so it also affected Kaby Lake etc.). This was fixed
    with a firmware update that prevented putting such branches in the
    micro-op cache; the bug does not appear when the branch is coming
    through the decoder. I wonder how Intel managed to influence the
    allocation of the micro-op cache in that specific way.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Terje Mathisen@21:1/5 to Stefan Monnier on Sat Jul 12 16:40:25 2025
    Stefan Monnier wrote:
    I was just wondering: seeing how most/all recent hardware problems in
    Intel CPUs seem to come together with discussions about "microcode
    update to avoid the problematic case", I was wondering if the
    introduction of such microcode and ability to update it "in the field"
    was made as a result of lessons learned from the famous FDIV bug (for
    which Intel had to actually replace the CPUs, which seems a lot more
    costly).

    Or is it just a coincidence?

    Possibly not:

    We did of course discuss this a lot during the FDIV software replacement
    work; at that time it was impossible to trap on any unprivileged
    opcode, so FDIV could never have been fixed that way.

    If the microcode patch facility includes a mask to trap any arbitrary
    opcode, then it would also be able to fix FDIV-type bugs, but unless
    there is significant space in that writeable microcode store, it
    might not have been possible to make the fix fit:

    A minimal patch working the same way as our SW would need to start by
    inspecting the top 10 bits of the divisor mantissa, and then fall
    through for all except 5 different hits.

    So at least 1024 bits for a lookup table, or relatively complicated
    logic if implemented directly with gates.
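
A rough sketch of what such a 1024-bit table could look like, written in Python for illustration. The five bit patterns are an assumption taken from the commonly cited description of the flaw (fraction bits 0001, 0100, 0111, 1010 or 1101 followed by six 1 bits); the thread itself only says "5 different hits". The sketch also reads the bits from an IEEE-754 double rather than the 80-bit x87 format, and all names are hypothetical.

```python
import struct

# Assumption (not from the thread): commonly cited at-risk divisor patterns,
# given as the first four fraction bits; the next six bits must all be 1.
RISKY_PREFIXES = (0b0001, 0b0100, 0b0111, 0b1010, 0b1101)


def build_table() -> int:
    """Return a 1024-bit integer with one bit set per at-risk pattern."""
    table = 0
    for prefix in RISKY_PREFIXES:
        index = (prefix << 6) | 0b111111   # 4 prefix bits + six 1 bits
        table |= 1 << index
    return table


RISK_TABLE = build_table()


def top10_fraction_bits(x: float) -> int:
    """Top 10 bits of the 52-bit fraction field of an IEEE-754 double."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return (bits >> 42) & 0x3FF


def divisor_at_risk(x: float) -> bool:
    """One table lookup: True for 5 of the 1024 possible indices."""
    return (RISK_TABLE >> top10_fraction_bits(x)) & 1 == 1
```

With this table, `divisor_at_risk(3145727.0)` is true (the divisor from the widely quoted 4195835/3145727 example), while ordinary divisors such as 3.0 fall through.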

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Thomas Koenig@21:1/5 to Terje Mathisen on Sat Jul 12 16:52:16 2025
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Stefan Monnier wrote:
    I was just wondering: seeing how most/all recent hardware problems in
    Intel CPUs seem to come together with discussions about "microcode
    update to avoid the problematic case", I was wondering if the
    introduction of such microcode and ability to update it "in the field"
    was made as a result of lessons learned from the famous FDIV bug (for
    which Intel had to actually replace the CPUs, which seems a lot more
    costly).

    Or is it just a coincidence?

    Possibly not:

    We did of course discuss this a lot during the FDIV software replacement
    work; at that time it was impossible to trap on any unprivileged
    opcode, so FDIV could never have been fixed that way.

    If the microcode patch facility includes a mask to trap any arbitrary
    opcode, then it would also be able to fix FDIV-type bugs, but unless
    there is significant space in that writeable microcode store, it
    might not have been possible to make the fix fit:

    A minimal patch working the same way as our SW would need to start by
    inspecting the top 10 bits of the divisor mantissa, and then fall
    through for all except 5 different hits.

    So at least 1024 bits for a lookup table, or relatively complicated
    logic if implemented directly with gates.

    Gates are not possible in Microcode...

    https://www.righto.com/2024/12/this-die-photo-of-pentium-shows.html

    has a nice explanation of the bug and how they fixed it in silicon
    (and simplified the PLA they used as a lookup table, as well -
    apparently, it had not been optimized with don't cares, or
    the bug might never have bitten).

    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sat Jul 12 19:34:44 2025
    On Fri, 11 Jul 2025 23:17:23 +0000, Lawrence D'Oliveiro wrote:

    On Fri, 11 Jul 2025 18:05:45 -0400, Stefan Monnier wrote:

    ... I was wondering if the introduction of such microcode and
    ability to update it "in the field" was made as a result of lessons
    learned from the famous FDIV bug (for which Intel had to actually
    replace the CPUs, which seems a lot more costly).

    It seems to me this would depend a lot on how flexible the
    microarchitecture really was. How far can you go in redefining the
    operation of a fixed set of hardware function units?

    A lot of microcode escape sequences are like ::

    Check for special operand
    if found trap to software
    else run in hardware

    Which can be "done" in one microcode "word"
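
As a software analogy only, the check-then-dispatch shape of such an escape sequence might be sketched like this. The function and parameter names are hypothetical, and the "hardware" here is a toy stand-in that is deliberately wrong for one divisor; a real sequencer would do the check inside a microcode word rather than via callbacks.

```python
from typing import Callable

BinaryOp = Callable[[float, float], float]


def run_with_escape(a: float, b: float,
                    is_special: Callable[[float, float], bool],
                    hardware_op: BinaryOp,
                    software_trap: BinaryOp) -> float:
    """Check for a special operand; trap to software only if one is found."""
    if is_special(a, b):
        return software_trap(a, b)   # rare slow path
    return hardware_op(a, b)         # common fast path


# Toy stand-ins for illustration (all hypothetical).
BAD_DIVISOR = 3145727.0


def toy_hardware_div(a: float, b: float) -> float:
    # Deliberately wrong for one divisor, to mimic a flawed unit.
    return 0.0 if b == BAD_DIVISOR else a / b


def toy_software_div(a: float, b: float) -> float:
    return a / b


def toy_is_special(a: float, b: float) -> bool:
    return b == BAD_DIVISOR


fixed = run_with_escape(4195835.0, BAD_DIVISOR,
                        toy_is_special, toy_hardware_div, toy_software_div)
normal = run_with_escape(1.0, 2.0,
                         toy_is_special, toy_hardware_div, toy_software_div)
```

The fast path costs one extra predicate evaluation, which is the software counterpart of the single microcode "word" mentioned above.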

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Sat Jul 12 19:39:55 2025
    On Sat, 12 Jul 2025 16:52:16 +0000, Thomas Koenig wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Stefan Monnier wrote:
    I was just wondering: seeing how most/all recent hardware problems in
    Intel CPUs seem to come together with discussions about "microcode
    update to avoid the problematic case", I was wondering if the
    introduction of such microcode and ability to update it "in the field"
    was made as a result of lessons learned from the famous FDIV bug (for
    which Intel had to actually replace the CPUs, which seems a lot more
    costly).

    Or is it just a coincidence?

    Possibly not:

    We did of course discuss this a lot during the FDIV software replacement
    work; at that time it was impossible to trap on any unprivileged
    opcode, so FDIV could never have been fixed that way.

    If the microcode patch facility includes a mask to trap any arbitrary
    opcode, then it would also be able to fix FDIV-type bugs, but unless
    there is significant space in that writeable microcode store, it
    might not have been possible to make the fix fit:

    A minimal patch working the same way as our SW would need to start by
    inspecting the top 10 bits of the divisor mantissa, and then fall
    through for all except 5 different hits.

    So at least 1024 bits for a lookup table, or relatively complicated
    logic if implemented directly with gates.

    Gates are not possible in Microcode...

    Errrrrrrrrrr, no.

    In the MC68020 there was an XOR gate between the 2 NOR planes of microcode.
    We used this to perform logic that was not "small" when implemented in
    pure PLA logic forms (saving space in ROM). In addition, both planes
    could assert multiple "read" lines and the accessed "words" would be
    ORed together. This was especially powerful in the address modes and
    split A&D register files.

    You can call this microcode space optimizer, or you can call it gates.
    It is a lot closer to gates in the modern sense of the word.

    https://www.righto.com/2024/12/this-die-photo-of-pentium-shows.html

    has a nice explanation of the bug and how they fixed it in silicon
    (and simplified the PLA they used as a lookup table, as well -
    apparently, it had not been optimized with don't cares, or
    the bug might never have bitten).

  • From Lawrence D'Oliveiro@21:1/5 to All on Mon Jul 14 04:18:12 2025
    On Sat, 12 Jul 2025 19:39:55 +0000, MitchAlsup1 wrote:

    You can call this microcode space optimizer, or you can call it gates.
    It is a lot closer to gates in the modern sense of the word.

    Another example of the blurred boundary between “hardware” and “software”.
    I seem to come across someone who is convinced there is a clear
    separation between the two about once every week or two ...

  • From Terje Mathisen@21:1/5 to All on Mon Jul 14 16:59:04 2025
    MitchAlsup1 wrote:
    On Fri, 11 Jul 2025 23:17:23 +0000, Lawrence D'Oliveiro wrote:

    On Fri, 11 Jul 2025 18:05:45 -0400, Stefan Monnier wrote:

    ... I was wondering if the introduction of such microcode and
    ability to update it "in the field" was made as a result of lessons
    learned from the famous FDIV bug (for which Intel had to actually
    replace the CPUs, which seems a lot more costly).

    It seems to me this would depend a lot on how flexible the
    microarchitecture really was. How far can you go in redefining the
    operation of a fixed set of hardware function units?

    A lot of microcode escape sequences are like ::

        Check for special operand
        if found trap to software
        else run in hardware

    Which can be "done" in one microcode "word"

    That would actually work perfectly for the original FDIV bug:

    Check top 10 mantissa bits, trap if one of the five patterns.

    Without the call, store to memory & reload overhead, the actual cost of
    such a workaround would have been just a cycle or two for a 40-cycle
    operation that is performed very rarely in optimized sw.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From MitchAlsup1@21:1/5 to Terje Mathisen on Tue Jul 15 20:20:00 2025
    On Mon, 14 Jul 2025 14:59:04 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Fri, 11 Jul 2025 23:17:23 +0000, Lawrence D'Oliveiro wrote:

    On Fri, 11 Jul 2025 18:05:45 -0400, Stefan Monnier wrote:

    ... I was wondering if the introduction of such microcode and
    ability to update it "in the field" was made as a result of lessons
    learned from the famous FDIV bug (for which Intel had to actually
    replace the CPUs, which seems a lot more costly).

    It seems to me this would depend a lot on how flexible the
    microarchitecture really was. How far can you go in redefining the
    operation of a fixed set of hardware function units?

    A lot of microcode escape sequences are like ::

        Check for special operand
        if found trap to software
        else run in hardware

    Which can be "done" in one microcode "word"

    That would actually work perfectly for the original FDIV bug:

    Check top 10 mantissa bits, trap if one of the five patterns.

    Without the call, store to memory & reload overhead, the actual cost of
    such a workaround would have been just a cycle or two for a 40-cycle
    operation that is performed very rarely in optimized sw.

    At some_small_percent it is rare, but not "very rare" due to the
    latency multiplier of ~40 cycles. 0.1% corresponds to 4% of CPU
    time being FDIV. At 40 cycles of latency there simply HAS TO BE
    some other instruction waiting for its result.

    {All I am complaining about is the word "very"}
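
The 0.1% to 4% step can be checked with a back-of-the-envelope model. This is a sketch under a stated assumption that is not in the post: every non-FDIV instruction costs one cycle and nothing overlaps with the divide. The function name `time_share` is made up for the example.

```python
def time_share(freq: float, latency: float, other_cpi: float = 1.0) -> float:
    """Fraction of total cycles spent in an instruction that makes up
    `freq` of the dynamic instruction count and costs `latency` cycles,
    assuming every other instruction costs `other_cpi` cycles."""
    slow_cycles = freq * latency
    other_cycles = (1.0 - freq) * other_cpi
    return slow_cycles / (slow_cycles + other_cycles)


# 0.1% of instructions at ~40 cycles each: 0.04 / (0.04 + 0.999)
fdiv_share = time_share(0.001, 40.0)
```

Under these assumptions the share comes out a little below 4%, since the divide cycles also enlarge the total; the simpler product 0.001 * 40 = 0.04 is the figure quoted above.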

  • From Terje Mathisen@21:1/5 to All on Wed Jul 16 14:37:51 2025
    MitchAlsup1 wrote:
    On Mon, 14 Jul 2025 14:59:04 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Fri, 11 Jul 2025 23:17:23 +0000, Lawrence D'Oliveiro wrote:

    On Fri, 11 Jul 2025 18:05:45 -0400, Stefan Monnier wrote:

    ... I was wondering if the introduction of such microcode and
    ability to update it "in the field" was made as a result of lessons
    learned from the famous FDIV bug (for which Intel had to actually
    replace the CPUs, which seems a lot more costly).

    It seems to me this would depend a lot on how flexible the
    microarchitecture really was. How far can you go in redefining the
    operation of a fixed set of hardware function units?

    A lot of microcode escape sequences are like ::

        Check for special operand
        if found trap to software
        else run in hardware

    Which can be "done" in one microcode "word"

    That would actually work perfectly for the original FDIV bug:

    Check top 10 mantissa bits, trap if one of the five patterns.

    Without the call, store to memory & reload overhead, the actual cost of
    such a workaround would have been just a cycle or two for a 40-cycle
    operation that is performed very rarely in optimized sw.

    At some_small_percent it is rare, but not "very rare" due to the
    latency multiplier of ~40 cycles. 0.1% corresponds to 4% of CPU
    time being FDIV. At 40 cycles of latency there simply HAS TO BE
    some other instruction waiting for its result.

    {All I am complaining about is the word "very"}

    That's fine.

    I was comparing to the real-world measurements we did in 1994/95, where
    we found that the actual number of FDIVs in performance-sensitive inner
    loops was in fact quite low.

    The main outlier was probably perspective-correct 3D graphics, like
    Quake which did an FDIV for every 16 pixels, but regular Fortran/C(++)
    code had a measured slowdown in the range from fractional percent to
    near zero.

    My sw code needed to load the top mantissa bits and look them up, then
    in 1019 out of 1024 cases just drop down at a cost of well under 10
    clock cycles, since the lookup table would mostly stay in $L1.

    In the bad case of a hit, the code would save the current precision &
    rounding mode, then switch to 80-bit, multiply both divisor and dividend
    by 15/16 (guaranteed to be exact!), then restore the original precision
    before falling into the FDIV, so in this case we added 5-6 FMUL clock
    cycles, plus two resets of the FPU precision mode which are typically
    more costly.
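
The two paths described above (cheap fall-through, 15/16 scaling on a hit) might be sketched as follows. This is not the original code: it uses Python doubles instead of the 80-bit x87 format, so the scaling is exact only when the operands have a few spare low bits, and it omits the precision and rounding mode save/restore. The five patterns are an assumption from the commonly cited description of the flaw, and all names are hypothetical.

```python
import struct
from typing import Callable

SCALE = 15.0 / 16.0   # 0.9375, exactly representable in binary

# Assumption (not from the thread): at-risk divisors have fraction bits
# 0001, 0100, 0111, 1010 or 1101 followed by six 1 bits.
RISKY = {(p << 6) | 0x3F for p in (0b0001, 0b0100, 0b0111, 0b1010, 0b1101)}


def top10_fraction_bits(x: float) -> int:
    """Top 10 bits of the fraction field of an IEEE-754 double."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return (bits >> 42) & 0x3FF


def safe_div(a: float, b: float,
             raw_div: Callable[[float, float], float]) -> float:
    """Divide via raw_div, first scaling both operands on a table hit."""
    if top10_fraction_bits(b) in RISKY:
        # Scaling both operands leaves the quotient unchanged but moves
        # the divisor to a different bit pattern.
        a, b = a * SCALE, b * SCALE
    return raw_div(a, b)


# Usage: record which operands actually reach the (stand-in) divider.
seen = []


def recording_div(a: float, b: float) -> float:
    seen.append((a, b))
    return a / b


q = safe_div(4195835.0, 3145727.0, recording_div)
```

In this example both operands are small integers, so the scaled values are exact in double precision and the quotient is unchanged; the divider sees 2949119.0625 as the divisor, which no longer matches any of the assumed patterns.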

    In the absolute worst case of doing nothing but chained FDIVs, with bad
    divisors, we would add ~20 (afair) clock cycles to the basic 40, so a
    50% slowdown.

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
