• Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd

    From Harlan Stenn via questions Mailing@21:1/5 to MOUHOUNE Samir on Wed May 28 01:08:00 2025
    To: questions@lists.ntp.org
    Copy: davehart@gmail.com

    Hi Samir,

    It looks like you are aware of the complexity here.

    As for is it a bug or a feature, I think the answer is "it depends", and
    it really depends on each situation.

    In no particular order:

    - what other sources of time does ntpd see? Just because a time source
    goes dark for a while shouldn't matter. PHI marches on, and at some
    point the root dispersion will grow and knock the time source out of the running.
    - It seems like your scenario has an insufficient number of time sources.
    - If ntpd will call tsync often enough, what is the actual difference
    between letting it hit S16 faster/slower, and re-establishing tsync as a
    source when it is available again?
    - How robust is your set of orphan servers?

    The list goes on.
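
    As a rough sketch of the kind of setup those points assume (the host
    names and the orphan stratum below are placeholders, not a
    recommendation):

        # ntp.conf sketch: several independent sources plus orphan mode
        server 0.pool.ntp.org iburst
        server 1.pool.ntp.org iburst
        server ntp1.example.net iburst
        server ntp2.example.net iburst
        # If every source is lost, fall back to orphan mode at this stratum.
        tos orphan 6
        # Seconds to wait before entering orphan mode (300 is the default).
        tos orphanwait 300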

    And from my POV, as long as ntpd has sufficiently robust "mechanism" to
    allow you to tune its behavior to implement your "local policy" choices,
    that's sufficient. As long as the default behavior is sufficient for
    enough people, that seems OK.

    H

    On 5/27/2025 5:13 AM, MOUHOUNE Samir wrote:
    Dear NTP Community,

    We have observed a potentially unexpected behavior with |ntpd| version *4.2.8p18* concerning the delay in transitioning to stratum 16 when a
    local reference clock (tsync) loses synchronization.


    Issue Summary

    When our local reference clock (tsync) becomes unsynchronized, we expect |ntpd| to stop selecting it and switch the system to stratum 16
    relatively quickly, indicating the system is no longer a valid time source.

    However, on systems running *ntpd 4.2.8p18*, this transition appears
    *delayed by up to one hour*. During this time, |ntpd| continues to treat tsync as a valid source and *reports stratum 1*, even though
    synchronization is no longer valid.

    In comparison, this behavior does *not occur* in older versions like *4.2.8p15*, where the system transitions to stratum 16 within a few minutes.

    We suspect this may be due to internal changes in source selection and
    trust logic introduced in later versions, possibly making |ntpd| more conservative about declassifying known sources — even when they become unreliable.


    Temporary Workaround

    We experimented with the following configuration adjustments, which
    appear to mitigate the issue by making |ntpd| more responsive:
    tos orphanwait 1
    tos mindist 0.05
    tinker stepout 10
    tinker panic 0
    minpoll 3
    maxpoll 4

    These parameters seem to accelerate response to changes in sync status.
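
    One way to watch that transition while testing (assuming ntpq from the
    same ntp build is available on the host):

        # System variables (leap, stratum, rootdisp, ...) plus the peer
        # billboard:
        ntpq -c rv -c peers
        # or, on systems with watch(1), refresh every 16 seconds:
        watch -n 16 'ntpq -c rv -c peers'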


    Questions

    1. Is this delay in downgrading to stratum 16 in 4.2.8p18 an *intended
       behavior*, or is it considered a *regression* compared to earlier
       versions?

    2. Are there *recommended configuration settings* or best practices to
       ensure timely transition to stratum 16 when a local reference
       becomes unreliable?

    3. Would it be appropriate to submit this as a *bug report*?

    We would appreciate any clarification or guidance you can provide on
    this matter.

    Best regards,

    --
    Kind regards,
    Samir MOUHOUNE

    --
    Harlan Stenn <stenn@ntp.org>
    NTP Project Lead. The NTP Project is part of
    https://www.nwtime.org/ - be a member!

  • From Jakob Bohm@21:1/5 to Harlan Stenn via questions Mailing on Mon Jun 23 20:50:24 2025
    On 2025-05-28 03:08, Harlan Stenn via questions Mailing List wrote:
    Hi Samir,

    It looks like you are aware of the complexity here.

    As for is it a bug or a feature, I think the answer is "it depends", and
    it really depends on each situation.

    In no particular order:

    - what other sources of time does ntpd see?  Just because a time source
    goes dark for a while shouldn't matter.  PHI marches on, and at some
    point the root dispersion will grow and knock the time source out of the running.
    - It seems like your scenario has an insufficient number of time sources.
    - If ntpd will call tsync often enough, what is the actual difference
    between letting it hit S16 faster/slower, and re-establishing tsync as a source when it is available again?
    - How robust is your set of orphan servers?

    The list goes on.

    And from my POV, as long as ntpd has sufficiently robust "mechanism" to
    allow you to tune its behavior to implement your "local policy" choices, that's sufficient.  As long as the default behavior is sufficient for
    enough people, that seems OK.

    H



    You seem to have missed the point. This is clearly about the behavior
    on a STRATUM 1 server when the hardware(-ish) time source fails, such as
    a maser failing. In these cases, ntpd needs to quickly lower its
    announced stratum so the stratum 2 NTP servers will select a different
    upstream until the issue is fixed. Orphan mode should rarely apply,
    except in small sites with no Internet sources and no backup hardware
    sources. However, the affected stratum 1 server might or might not
    generally benefit from using NTP time sources as a fallback for
    disciplining the local clock.
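
    A minimal sketch of that kind of stratum 1 setup, with NTP sources kept
    only as a fallback (the refclock driver number is an assumption for the
    Spectracom TSync hardware; check the driver list for your build, and
    treat the host names as placeholders):

        # Local hardware reference (tsync); "prefer" biases selection back
        # to it whenever it is healthy.  Driver number 45 is an assumption;
        # verify against your build's refclock driver list.
        server 127.127.45.0 prefer
        # Network sources as a fallback for disciplining the local clock
        # if the hardware reference goes away.
        server ntp1.example.net iburst
        server ntp2.example.net iburst
        server ntp3.example.net iburst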

    Enjoy

    Jakob

    --
    Jakob Bohm, MSc.Eng., I speak only for myself, not my company
    This public discussion message is non-binding and may contain errors
    All trademarks and other things belong to their owners, if any.

  • From William Unruh@21:1/5 to Jakob Bohm on Wed Jun 25 23:36:35 2025
    On 2025-06-23, Jakob Bohm <egenagwemdimtapsar@jbohm.dk> wrote:
    On 2025-05-28 03:08, Harlan Stenn via questions Mailing List wrote:
    Hi Samir,

    It looks like you are aware of the complexity here.

    As for is it a bug or a feature, I think the answer is "it depends", and
    it really depends on each situation.

    In no particular order:

    - what other sources of time does ntpd see?  Just because a time source
    goes dark for a while shouldn't matter.  PHI marches on, and at some
    point the root dispersion will grow and knock the time source out of the
    running.
    - It seems like your scenario has an insufficient number of time sources.
    - If ntpd will call tsync often enough, what is the actual difference
    between letting it hit S16 faster/slower, and re-establishing tsync as a
    source when it is available again?
    - How robust is your set of orphan servers?

    The list goes on.

    And from my POV, as long as ntpd has sufficiently robust "mechanism" to
    allow you to tune its behavior to implement your "local policy" choices,
    that's sufficient.  As long as the default behavior is sufficient for
    enough people, that seems OK.

    H


    And how does ntp know that a maser has failed? The only information it
    has is the return ntp packets (or lack thereof). And a return packet can
    fail to arrive if your computer's connection to the net has failed,
    temporarily or permanently, or if the outside server has been
    disconnected or is temporarily bad. ntp waits a while and tries again.
    An hour is no time at all. If you know something has happened to the
    source, you need to go in and reconfigure ntp to take that into account.


    Make sure you have 5 or 7 independent sources, so the majority can vote
    out the misbehaving source.
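
    A minimal sketch of that advice (the pool name and the threshold are
    only illustrative):

        # Enough independent sources that a falseticker or dead source can
        # be voted out by the majority (names are placeholders).
        pool 0.pool.ntp.org iburst
        server ntp1.example.net iburst
        server ntp2.example.net iburst
        # Require at least 4 agreeing sources before ntpd will set the clock.
        tos minsane 4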


    You seem to have missed the point. This is clearly about the behavior
    on a STRATUM 1 server when the hardware(-ish) time source fails, such as
    a maser failing. In these cases, ntpd needs to quickly lower its
    announced stratum so the stratum 2 NTP servers will select a different
    upstream until the issue is fixed. Orphan mode should rarely apply
    except in small sites with no Internet sources and no backup hardware
    sources. However, the affected stratum 1 server might or might not
    generally benefit from using NTP time sources as a fallback for
    disciplining the local clock.

    Enjoy

    Jakob


  • From Jakob Bohm@21:1/5 to William Unruh on Mon Jun 30 05:57:47 2025
    On 2025-06-26 01:36, William Unruh wrote:

    And how does ntp know that a maser has failed? The only information it
    has is the return ntp packets (or lack thereof). And a return packet can
    fail to arrive if your computer's connection to the net has failed,
    temporarily or permanently, or if the outside server has been
    disconnected or is temporarily bad. ntp waits a while and tries again.
    An hour is no time at all. If you know something has happened to the
    source, you need to go in and reconfigure ntp to take that into account.


    In the scenario as I understand the OP, NTPD would know the difference
    because it connects to the hardware/outside time source via a protocol
    other than NTP, specifically any of the non-NTP protocols, such as PPS,
    GPS, WWVB, DCF77, modem, etc. The thread seems to be about how an NTPD
    instance tied exclusively to such a time source via appropriate hardware
    reacts to said hardware going offline. The OP seems to complain that a
    recent patch-level NTPD update made it react much more slowly to such a
    situation.


    Make sure you have 5 or 7 independent sources, so the majority can vote
    out the misbehaving source.


    Typically, the stratum 1 server in question would be one of those 4 to 7
    NTP time sources feeding another NTPD instance, and the problem is
    making sure NTP data sent to the stratum 2 server stops advertising
    stratum 1 as soon as the hardware time source goes away.




    Enjoy

    Jakob

    --
    Jakob Bohm, MSc.Eng., I speak only for myself, not my company
    This public discussion message is non-binding and may contain errors
    All trademarks and other things belong to their owners, if any.

  • From William Unruh@21:1/5 to Jakob Bohm on Mon Jun 30 16:09:30 2025
    On 2025-06-30, Jakob Bohm <egenagwemdimtapsar@jbohm.dk> wrote:

    In the scenario as I understand the OP, NTPD would know the difference
    because it connects to the hardware/outside time source via a protocol
    other than NTP, specifically any of the non-NTP protocols, such as PPS,
    GPS, WWVB, DCF77, modem, etc. The thread seems to be about how an NTPD
    instance tied exclusively to such a time source via appropriate hardware
    reacts to said hardware going offline. The OP seems to complain that a
    recent patch-level NTPD update made it react much more slowly to such a
    situation.


    Make sure you have 5 or 7 independent sources, so the majority can vote
    out the misbehaving source.


    Typically, the stratum 1 server in question would be one of those 4 to 7
    NTP time sources feeding another NTPD instance, and the problem is
    making sure NTP data sent to the stratum 2 server stops advertising
    stratum 1 as soon as the hardware time source goes away.

    Well, I am not sure about that. It would take a while for the stratum 1
    to forget its training by the stratum 0 source. All ntpd knows is that
    this time the stratum 0 did not respond (properly). It has no idea if
    next time it will respond again. And having the stratum change rapidly
    is probably also not good. Again, it has no knowledge of whether the
    hardware-ish time source did not respond because the maser failed, or
    because someone briefly and accidentally (or purposely) interrupted the
    connection.

    And if that source comes online again, how long should ntp wait
    before it announces itself as stratum 1 again? If its other sources are
    poor, it might take a while before the stratum 1 shakes off the
    disciplining by the stratum 2 or stratum 15 sources and tracks the
    stratum 0 again.






  • From Harlan Stenn via questions Mailing@21:1/5 to Dave Hart on Tue Jul 1 03:48:00 2025
    To: samir.mouhoune@gmail.com (MOUHOUNE Samir)
    Copy: questions@lists.ntp.org

    As I've said before, just because the behavior is different does not
    mean it's broken.

    The NTP algorithms do ongoing evaluations of established associations.

    If an association becomes non-responsive, it auto-degrades.

    At some point the association will drop.
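
    One hedged way to watch that degradation on a running ntpd (the
    association ID below is a placeholder taken from the "as" listing):

        # List associations, note an assocID, then read that peer's
        # variables; watch fields such as reach and dispersion as polls
        # go unanswered.
        ntpq -c as
        ntpq -c "rv <assocID>"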

    Dave Mills was, to the best of my recollection, very hesitant to throw
    out an established association "too soon".

    Let's get more information and understanding around what you're seeing.

    H

    On 6/30/2025 12:00 PM, Dave Hart wrote:

    On Tue, May 27, 2025 at 12:13 UTC MOUHOUNE Samir
    <samir.mouhoune@gmail.com> wrote:

    Dear NTP Community,

    We have observed a potentially unexpected behavior with |ntpd|
    version *4.2.8p18* concerning the delay in transitioning to stratum
    16 when a local reference clock (tsync) loses synchronization.


    Issue Summary

    When our local reference clock (tsync) becomes unsynchronized, we
    expect |ntpd| to stop selecting it and switch the system to stratum
    16 relatively quickly, indicating the system is no longer a valid
    time source.

    However, on systems running *ntpd 4.2.8p18*, this transition appears
    *delayed by up to one hour*. During this time, |ntpd| continues to
    treat tsync as a valid source and *reports stratum 1*, even though
    synchronization is no longer valid.

    In comparison, this behavior does *not occur* in older versions like
    *4.2.8p15*, where the system transitions to stratum 16 within a few
    minutes.

    We suspect this may be due to internal changes in source selection
    and trust logic introduced in later versions, possibly making |ntpd|
    more conservative about declassifying known sources — even when they
    become unreliable.

    There have been no changes to ntpd/refclock_tsyncpci.c since 4.2.8p5 in
    2016, so the issue might well affect other refclocks.  It sounds like something we need to fix, or at a minimum understand and justify as an improvement.


    Temporary Workaround

    We experimented with the following configuration adjustments, which
    appear to mitigate the issue by making |ntpd| more responsive:
    tos orphanwait 1
    tos mindist 0.05
    tinker stepout 10
    tinker panic 0
    minpoll 3
    maxpoll 4

    These parameters seem to accelerate response to changes in sync status.


    That's a lot of different knobs turned.  Did you make all 6 changes at
    once and observe improvement, or one at a time, or ?
    I'm glad you found something to help out, and those might help point to
    code change(s) responsible, but given so many knobs changed, not as
    helpful as I might hope.


    Questions

    1. Is this delay in downgrading to stratum 16 in 4.2.8p18 an
       *intended behavior*, or is it considered a *regression* compared
       to earlier versions?

    It's hard to see why it would be intended, but we're far from
    understanding the issue well enough to be definitive.

    1. Are there *recommended configuration settings* or best practices
    to ensure timely transition to stratum 16 when a local reference
    becomes unreliable?

    Interesting renumbering of your questions is happening in the GMail web editor using Chrome on Windows.
    I think it's fair to say we're pretty weak on documented best practices
    or recommended configuration settings, but try https://doc.ntp.org/.
    You could also look at the archives of this list
    and its onetime evil twin newsgroup comp.protocols.time.ntp.

    1. Would it be appropriate to submit this as a *bug report*?

    Yes, please, by all means.  That's generally true if you think you've
    found a misbehavior, regression, suboptimal behavior, or just have a
    request to improve.  The only thing we don't welcome reports to
    https://bugs.ntp.org/ about are reports of a security nature, such as a
    remote crash of ntpd based on a port 123 query, or nontrivial
    information disclosure or elevation of privileges, things that might
    merit a CVE.  In that case, please submit the report to
    security@ntp.org to ensure the information is not made public before
    remediation can be done.

    I apologize for taking so long to respond.  I've had a lot going on in
    my non-NTP life and I choose to have a relative firehose of email.
    Thanks to Jakob for bubbling this up to my attention again.  Bug reports
    can be ignored too, but much less likely than email.

    Cheers,
    Dave Hart


    --
    Harlan Stenn <stenn@ntp.org>
    NTP Project Lead. The NTP Project is part of
    https://www.nwtime.org/ - be a member!

  • From Miroslav Lichvar@21:1/5 to questions@lists.ntp.org on Tue Jul 1 11:00:21 2025
    On 2025-07-01, Harlan Stenn via questions Mailing List <questions@lists.ntp.org> wrote:
    The NTP algorithms do ongoing evaluations of established associations.

    If an association becomes non-responsive, it auto-degrades.

    At some point the association will drop.

    Dave Mills was, to the best of my recollection, very hesitant to throw
    out an established association "too soon".

    Yes, the behavior reported in this thread for both 4.2.8p15 and
    4.2.8p18 sounds wrong to me. NTPv4 servers are not supposed to
    claim they are unsynchronized (switch to stratum 16) when they lose a
    working association (it doesn't matter whether it's a refclock or an
    NTP server/peer). Quoting from RFC 5905 section 10:

    It is important to note that, unlike NTPv3, NTPv4 associations do not
    show a timeout condition by setting the stratum to 16 and leap
    indicator to 3. The association variables retain the values
    determined upon arrival of the last packet. In NTPv4, lambda
    increases with time, so eventually the synchronization distance
    exceeds the distance threshold MAXDIST, in which case the association
    is considered unfit for synchronization.
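
    As a rough illustration of that mechanism (assuming ntpd's nominal PHI
    of 15 ppm and its default distance threshold of roughly 1.5 s; this
    ignores the reach register and the other checks, so it is not a
    prediction of the one-hour delay reported above):

        dispersion growth rate:   PHI = 15e-6 s/s (15 ppm)
        distance threshold:       tos maxdist, default about 1.5 s
        time for that growth alone to exceed it:
                                  1.5 / 15e-6 = 100000 s, roughly 28 hours

    In practice a source that stops responding altogether is usually
    discarded much sooner, once its 8-bit reach register empties after
    eight unanswered polls, which may be part of why the shorter
    minpoll/maxpoll in the workaround above speeds up the transition.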

    It seems this changed between 4.2.8p14 and 4.2.8p15 as a result of
    fixing this bug:
    https://bugs.ntp.org/show_bug.cgi?id=3644

    The problem reported in that bug doesn't look like a bug to me.
    I think it was working as intended in NTPv4. The current behavior is
    a regression towards NTPv3.

    --
    Miroslav Lichvar
