• Interesting hardware bug in 13th/14th generation Intel chips

    From Thomas Koenig@21:1/5 to All on Fri Jul 11 20:42:29 2025
    Here's an interesting blog post about a hardware problem on Intel
    chips:

    https://fgiesen.wordpress.com/2025/05/21/oodle-2-9-14-and-intel-13th-14th-gen-cpus/

    It seems that, after some time and wear, the CPUs start randomly
    confusing the ch and cl registers and generating wrong code
    by storing cl instead of ch.
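
    As a rough illustration of what that failure mode means, here is a toy
    model. It is not taken from the blog post: the helper names and the
    `faulty` flag are made up for this sketch, and it is only an assumption
    that the affected operation behaves like a plain byte store from ch.
    On x86-64, cl is bits 0-7 of rcx and ch is bits 8-15.

```python
def cl(rcx: int) -> int:
    """Low byte (bits 0-7) of the register value."""
    return rcx & 0xFF


def ch(rcx: int) -> int:
    """Second byte (bits 8-15) of the register value."""
    return (rcx >> 8) & 0xFF


def store_ch(memory: bytearray, addr: int, rcx: int, faulty: bool = False) -> None:
    """Toy model of a byte store from ch.

    With faulty=True the store writes cl instead, which is the kind of
    mix-up described above (hypothetical flag, for illustration only).
    """
    memory[addr] = cl(rcx) if faulty else ch(rcx)


rcx = 0x12345678
good = bytearray(1)
bad = bytearray(1)
store_ch(good, 0, rcx)               # healthy CPU: stores ch
store_ch(bad, 0, rcx, faulty=True)   # degraded CPU: stores cl
print(hex(good[0]), hex(bad[0]))
```

    The point of the sketch is only that the wrong byte is still a
    plausible-looking value, so nothing crashes at the store itself; the
    damage shows up later, wherever the stored byte is consumed.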

    Nice piece of debugging by the Oodle folks.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to All on Fri Jul 11 22:02:34 2025
    In the 1µ to 0.25µ era, effects such as described could be attributed
    to transistor degradation via hot carriers. The low voltage designs of
    today have essentially eliminated those problems.

    We also had the problem of "stringers" where wires at minimum pitch
    would re-connect themselves with minute fibers of aluminum and lead
    to higher drive loads and thus slower edge speeds.

    But, since others are not seeing similar long term degradation,
    Intel must be doing design differently. {AMD, ARM, ASIC, ...}

    Nor do I see why they attempt to solve this problem with microcode
    patches, unless they understand which kinds of instruction
    execution exacerbate the underlying issue. In which case, it
    should not have shown up in more than 1 generation.

  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Sat Jul 12 09:57:15 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    > In the 1µ to 0.25µ era, effects such as described could be attributed
    > to transistor degradation via hot carriers. The low voltage designs of
    > today have essentially eliminated those problems.

    From what I read about Intel's 13th/14th generation problems, they are
    due to Intel's firmware driving the voltage too high. It's not clear
    to me if they wanted the voltage that high, and thought that that
    would not harm the hardware, or if this was unintended.

    > Nor do I see why they attempt to solve this problem with microcode
    > patches

    The firmware is changed to not drive the voltage too high; that's not
    a change to the microcode of the P-Cores or E-Cores. This fix does
    not help hardware that has already degraded so far that it exhibits
    crashes and wrong execution, but it should stop or at least slow down
    the degradation of those CPUs that still work reliably.

    > In which case, it
    > should not have shown up in more than 1 generation.

    Intel releases a new "generation" every year. The "generation" can be
    a refresh of an existing design (with minimal changes), something
    new, or some refresh and some new stuff. In particular, the 13th and
    14th generations consist of CPUs on the low end that look like they
    are just Alder Lake ("12th Generation") CPUs in every way, with a
    slightly higher clock rate, and higher-end CPUs that have a larger L2
    cache but otherwise look very similar to Alder Lake. The difference
    between the 13th and the 14th Generation is a slight clock rate
    increase and some additional E-Cores enabled in some CPUs.

    Apparently there are differences between the small-L2 CPUs (Alder Lake
    variants renamed to Raptor Lake) and large-L2 CPUs (Raptor Lake
    proper and refresh) beyond the cache size, maybe just in the firmware,
    or maybe the clock tree implementation was changed in some way that
    makes it more vulnerable to overvoltage. In any case, it took them a
    long time to find the cause of the problem, and by that time
    the "14th generation" was out. And given that the 13th and 14th
    "generation" are pretty much the same designs, it's not surprising that
    they are both affected.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Thomas Koenig@21:1/5 to Anton Ertl on Sat Jul 12 12:05:07 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    > mitchalsup@aol.com (MitchAlsup1) writes:
    >> In the 1µ to 0.25µ era, effects such as described could be attributed
    >> to transistor degradation via hot carriers. The low voltage designs of
    >> today have essentially eliminated those problems.

    > From what I read about Intel's 13th/14th generation problems, they are
    > due to Intel's firmware driving the voltage too high.

    This may be related, or it may be something else; the issue
    apparently was analyzed a few months ago.

  • From EricP@21:1/5 to All on Sat Jul 12 10:05:39 2025
    MitchAlsup1 wrote:
    > In the 1µ to 0.25µ era, effects such as described could be attributed
    > to transistor degradation via hot carriers. The low voltage designs of
    > today have essentially eliminated those problems.
    >
    > We also had the problem of "stringers" where wires at minimum pitch
    > would re-connect themselves with minute fibers of aluminum and lead
    > to higher drive loads and thus slower edge speeds.
    >
    > But, since others are not seeing similar long term degradation,
    > Intel must be doing design differently. {AMD, ARM, ASIC, ...}
    >
    > Nor do I see why they attempt to solve this problem with microcode
    > patches, unless they understand which kinds of instruction
    > execution exacerbate the underlying issue. In which case, it
    > should not have shown up in more than 1 generation.

    Intel Core 13th and 14th Gen Desktop Instability Root Cause Update
    09-25-2024 https://community.intel.com/t5/Blogs/Tech-Innovation/Client/Intel-Core-13th-and-14th-Gen-Desktop-Instability-Root-Cause/post/1633239

  • From MitchAlsup1@21:1/5 to EricP on Sat Jul 12 19:41:48 2025
    On Sat, 12 Jul 2025 14:05:39 +0000, EricP wrote:

    > MitchAlsup1 wrote:
    >> In the 1µ to 0.25µ era, effects such as described could be attributed
    >> to transistor degradation via hot carriers. The low voltage designs of
    >> today have essentially eliminated those problems.
    >>
    >> We also had the problem of "stringers" where wires at minimum pitch
    >> would re-connect themselves with minute fibers of aluminum and lead
    >> to higher drive loads and thus slower edge speeds.
    >>
    >> But, since others are not seeing similar long term degradation,
    >> Intel must be doing design differently. {AMD, ARM, ASIC, ...}
    >>
    >> Nor do I see why they attempt to solve this problem with microcode
    >> patches, unless they understand which kinds of instruction
    >> execution exacerbate the underlying issue. In which case, it
    >> should not have shown up in more than 1 generation.
    >
    > Intel Core 13th and 14th Gen Desktop Instability Root Cause Update
    > 09-25-2024 https://community.intel.com/t5/Blogs/Tech-Innovation/Client/Intel-Core-13th-and-14th-Gen-Desktop-Instability-Root-Cause/post/1633239

    Turning off Boost mode ?!?

  • From Vir Campestris@21:1/5 to Thomas Koenig on Sat Jul 12 21:09:29 2025
    On 11/07/2025 21:42, Thomas Koenig wrote:
    > Here's an interesting blog post about a hardware problem on Intel
    > chips:
    >
    > https://fgiesen.wordpress.com/2025/05/21/oodle-2-9-14-and-intel-13th-14th-gen-cpus/
    >
    > It seems that, after some time and wear, the CPUs start randomly
    > confusing the ch and cl registers and generating wrong code
    > by storing cl instead of ch.
    >
    > Nice piece of debugging by the Oodle folks.

    Fascinating. Thank you for sharing.

    Andy

    --
    Do not listen to rumour, but, if you do, do not believe it.
    Gandhi.
