• Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd

    From Harlan Stenn via questions Mailing@21:1/5 to MOUHOUNE Samir on Wed May 28 01:08:00 2025
    To: questions@lists.ntp.org
    Copy: davehart@gmail.com

    Hi Samir,

    It looks like you are aware of the complexity here.

    As for is it a bug or a feature, I think the answer is "it depends", and
    it really depends on each situation.

    In no particular order:

    - what other sources of time does ntpd see? Just because a time source
    goes dark for a while shouldn't matter. PHI marches on, and at some
    point the root dispersion will grow and knock the time source out of the running.
    - It seems like your scenario has an insufficient number of time sources.
    - If ntpd will call tsync often enough, what is the actual difference
    between letting it hit S16 faster/slower, and re-establishing tsync as a
    source when it is available again?
    - How robust is your set of orphan servers?

    The list goes on.
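
    As a rough sketch of the kind of setup those points assume (the host
    names and the orphan stratum below are placeholders, not a
    recommendation):

        # ntp.conf sketch: several independent sources plus orphan mode
        server 0.pool.ntp.org iburst
        server 1.pool.ntp.org iburst
        server ntp1.example.net iburst
        server ntp2.example.net iburst
        # If every source is lost, fall back to orphan mode at this stratum.
        tos orphan 6
        # Seconds to wait before entering orphan mode (300 is the default).
        tos orphanwait 300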

    And from my POV, as long as ntpd has sufficiently robust "mechanism" to
    allow you to tune its behavior to implement your "local policy" choices,
    that's sufficient. As long as the default behavior is sufficient for
    enough people, that seems OK.

    H

    On 5/27/2025 5:13 AM, MOUHOUNE Samir wrote:
    Dear NTP Community,

    We have observed a potentially unexpected behavior with |ntpd| version *4.2.8p18* concerning the delay in transitioning to stratum 16 when a
    local reference clock (tsync) loses synchronization.


    Issue Summary

    When our local reference clock (tsync) becomes unsynchronized, we expect |ntpd| to stop selecting it and switch the system to stratum 16
    relatively quickly, indicating the system is no longer a valid time source.

    However, on systems running *ntpd 4.2.8p18*, this transition appears
    *delayed by up to one hour*. During this time, |ntpd| continues to treat tsync as a valid source and *reports stratum 1*, even though
    synchronization is no longer valid.

    In comparison, this behavior does *not occur* in older versions like *4.2.8p15*, where the system transitions to stratum 16 within a few minutes.

    We suspect this may be due to internal changes in source selection and
    trust logic introduced in later versions, possibly making |ntpd| more conservative about declassifying known sources — even when they become unreliable.


    Temporary Workaround

    We experimented with the following configuration adjustments, which
    appear to mitigate the issue by making |ntpd| more responsive:
    tos orphanwait 1
    tos mindist 0.05
    tinker stepout 10
    tinker panic 0
    minpoll 3
    maxpoll 4

    These parameters seem to accelerate response to changes in sync status.
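
    One way to watch that transition while testing (assuming ntpq from the
    same ntp build is available on the host):

        # System variables (leap, stratum, rootdisp, ...) plus the peer
        # billboard:
        ntpq -c rv -c peers
        # or, on systems with watch(1), refresh every 16 seconds:
        watch -n 16 'ntpq -c rv -c peers'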


    Questions

    1. Is this delay in downgrading to stratum 16 in 4.2.8p18 an *intended
       behavior*, or is it considered a *regression* compared to earlier
       versions?

    2. Are there *recommended configuration settings* or best practices to
       ensure timely transition to stratum 16 when a local reference
       becomes unreliable?

    3. Would it be appropriate to submit this as a *bug report*?

    We would appreciate any clarification or guidance you can provide on
    this matter.

    Best regards,

    --
    Kind regards,
    Samir MOUHOUNE

    --
    Harlan Stenn <stenn@ntp.org>
    NTP Project Lead. The NTP Project is part of
    https://www.nwtime.org/ - be a member!

  • From Jakob Bohm@21:1/5 to Harlan Stenn via questions Mailing on Mon Jun 23 20:50:24 2025
    On 2025-05-28 03:08, Harlan Stenn via questions Mailing List wrote:
    Hi Samir,

    It looks like you are aware of the complexity here.

    As for is it a bug or a feature, I think the answer is "it depends", and
    it really depends on each situation.

    In no particular order:

    - what other sources of time does ntpd see?  Just because a time source
    goes dark for a while shouldn't matter.  PHI marches on, and at some
    point the root dispersion will grow and knock the time source out of the running.
    - It seems like your scenario has an insufficient number of time sources.
    - If ntpd will call tsync often enough, what is the actual difference
    between letting it hit S16 faster/slower, and re-establishing tsync as a source when it is available again?
    - How robust is your set of orphan servers?

    The list goes on.

    And from my POV, as long as ntpd has sufficiently robust "mechanism" to
    allow you to tune its behavior to implement your "local policy" choices, that's sufficient.  As long as the default behavior is sufficient for
    enough people, that seems OK.

    H



    You seem to have missed the point. This is clearly about the behavior
    on a STRATUM 1 server when the hardware(-ish) time source fails, such as
    a maser failing. In these cases, ntpd needs to quickly lower its
    announced stratum so the stratum 2 NTP servers will select a different
    upstream until the issue is fixed. Orphan mode should rarely apply,
    except in small sites with no Internet sources and no backup hardware
    sources. However, the affected stratum 1 server might or might not
    generally benefit from using NTP time sources as a fallback for
    disciplining the local clock.
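
    A minimal sketch of that kind of stratum 1 setup, with NTP sources kept
    only as a fallback (the refclock driver number is an assumption for the
    Spectracom TSync hardware; check the driver list for your build, and
    treat the host names as placeholders):

        # Local hardware reference (tsync); "prefer" biases selection back
        # to it whenever it is healthy.  Driver number 45 is an assumption;
        # verify against your build's refclock driver list.
        server 127.127.45.0 prefer
        # Network sources as a fallback for disciplining the local clock
        # if the hardware reference goes away.
        server ntp1.example.net iburst
        server ntp2.example.net iburst
        server ntp3.example.net iburst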

    Enjoy

    Jakob

    --
    Jakob Bohm, MSc.Eng., I speak only for myself, not my company
    This public discussion message is non-binding and may contain errors
    All trademarks and other things belong to their owners, if any.

  • From William Unruh@21:1/5 to Jakob Bohm on Wed Jun 25 23:36:35 2025
    On 2025-06-23, Jakob Bohm <egenagwemdimtapsar@jbohm.dk> wrote:
    On 2025-05-28 03:08, Harlan Stenn via questions Mailing List wrote:
    Hi Samir,

    It looks like you are aware of the complexity here.

    As for is it a bug or a feature, I think the answer is "it depends", and
    it really depends on each situation.

    In no particular order:

    - what other sources of time does ntpd see?  Just because a time source
    goes dark for a while shouldn't matter.  PHI marches on, and at some
    point the root dispersion will grow and knock the time source out of the
    running.
    - It seems like your scenario has an insufficient number of time sources.
    - If ntpd will call tsync often enough, what is the actual difference
    between letting it hit S16 faster/slower, and re-establishing tsync as a
    source when it is available again?
    - How robust is your set of orphan servers?

    The list goes on.

    And from my POV, as long as ntpd has sufficiently robust "mechanism" to
    allow you to tune its behavior to implement your "local policy" choices,
    that's sufficient.  As long as the default behavior is sufficient for
    enough people, that seems OK.

    H


    And how does ntp know that a maser has failed? The only information it
    has is the return ntp packets (or lack thereof). And a return packet can
    fail to arrive if your computer's connection to the net has failed,
    temporarily or permanently, or if the outside server has been
    disconnected or is temporarily bad. ntp waits a while and tries again.
    An hour is no time at all. If you know something has happened to the
    source, you need to go in and reconfigure ntp to take that into account.


    Make sure you have 5 or 7 independent sources, so the majority can vote
    out the misbehaving source.
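
    A minimal sketch of that advice (the pool name and the threshold are
    only illustrative):

        # Enough independent sources that a falseticker or dead source can
        # be voted out by the majority (names are placeholders).
        pool 0.pool.ntp.org iburst
        server ntp1.example.net iburst
        server ntp2.example.net iburst
        # Require at least 4 agreeing sources before ntpd will set the clock.
        tos minsane 4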


    You seem to have missed the point. This is clearly about the behavior
    on a STRATUM 1 server when the hardware(-ish) time source fails, such as
    a maser failing. In these cases, ntpd needs to quickly lower its
    announced stratum so the stratum 2 NTP servers will select a different
    upstream until the issue is fixed. Orphan mode should rarely apply
    except in small sites with no Internet sources and no backup hardware
    sources. However, the affected stratum 1 server might or might not
    generally benefit from using NTP time sources as a fallback for
    disciplining the local clock.

    Enjoy

    Jakob


  • From Jakob Bohm@21:1/5 to William Unruh on Mon Jun 30 05:57:47 2025
    On 2025-06-26 01:36, William Unruh wrote:

    And how does ntp know that a maser has failed? The only information it
    has is the return ntp packets (or lack thereof). And a return packet can
    fail to arrive if your computer's connection to the net has failed,
    temporarily or permanently, or if the outside server has been
    disconnected or is temporarily bad. ntp waits a while and tries again.
    An hour is no time at all. If you know something has happened to the
    source, you need to go in and reconfigure ntp to take that into account.


    In the scenario as I understand the OP, NTPD would know the difference
    because it connects to the hardware/outside time source via a protocol
    other than NTP, specifically any of the non-NTP protocols, such as PPS,
    GPS, WWVB, DCF77, modem, etc. The thread seems to be about how an NTPD
    instance tied exclusively to such a time source via appropriate hardware
    reacts to said hardware going offline. The OP seems to complain that a
    recent patch-level NTPD update made it react much more slowly to such a
    situation.


    Make sure you have 5 or 7 independent sources, so the majority can vote
    out the misbehaving source.


    Typically, the stratum 1 server in question would be one of those 4 to 7
    NTP time sources feeding another NTPD instance, and the problem is
    making sure NTP data sent to the stratum 2 server stops advertising
    stratum 1 as soon as the hardware time source goes away.




    Enjoy

    Jakob

    --
    Jakob Bohm, MSc.Eng., I speak only for myself, not my company
    This public discussion message is non-binding and may contain errors
    All trademarks and other things belong to their owners, if any.

  • From William Unruh@21:1/5 to Jakob Bohm on Mon Jun 30 16:09:30 2025
    On 2025-06-30, Jakob Bohm <egenagwemdimtapsar@jbohm.dk> wrote:

    In the scenario as I understand the OP, NTPD would know the difference
    because it connects to the hardware/outside time source via a protocol
    other than NTP, specifically any of the non-NTP protocols, such as PPS,
    GPS, WWVB, DCF77, modem, etc. The thread seems to be about how an NTPD
    instance tied exclusively to such a time source via appropriate hardware
    reacts to said hardware going offline. The OP seems to complain that a
    recent patch-level NTPD update made it react much more slowly to such a
    situation.


    Make sure you have 5 or 7 independent sources, so the majority can vote
    out the misbehaving source.


    Typically, the stratum 1 server in question would be one of those 4 to 7
    NTP time sources feeding another NTPD instance, and the problem is
    making sure NTP data sent to the stratum 2 server stops advertising
    stratum 1 as soon as the hardware time source goes away.

    Well, I am not sure about that. It would take a while for the stratum 1
    to forget its training by the stratum 0 source. All ntpd knows is that
    this time the stratum 0 did not respond (properly). It has no idea if
    next time it will respond again. And having the stratum change rapidly
    is probably also not good. Again, it has no knowledge of whether the
    hardware-ish time source did not respond because the maser failed, or
    because someone briefly and accidentally (or purposely) interrupted the
    connection.

    And if that source comes online again, how long should ntp wait
    before it announces itself as stratum 1 again? If its other sources are
    poor, it might take a while before the stratum 1 shakes off the
    disciplining by the stratum 2 or stratum 15 sources and tracks the
    stratum 0 again.






  • From Harlan Stenn via questions Mailing@21:1/5 to Dave Hart on Tue Jul 1 03:48:00 2025
    To: samir.mouhoune@gmail.com (MOUHOUNE Samir)
    Copy: questions@lists.ntp.org

    As I've said before, just because the behavior is different does not
    mean it's broken.

    The NTP algorithms do ongoing evaluations of established associations.

    If an association becomes non-responsive, it auto-degrades.

    At some point the association will drop.
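
    One hedged way to watch that degradation on a running ntpd (the
    association ID below is a placeholder taken from the "as" listing):

        # List associations, note an assocID, then read that peer's
        # variables; watch fields such as reach and dispersion as polls
        # go unanswered.
        ntpq -c as
        ntpq -c "rv <assocID>"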

    Dave Mills was, to the best of my recollection, very hesitant to throw
    out an established association "too soon".

    Let's get more information and understanding around what you're seeing.

    H

    On 6/30/2025 12:00 PM, Dave Hart wrote:

    On Tue, May 27, 2025 at 12:13 UTC MOUHOUNE Samir
    <samir.mouhoune@gmail.com> wrote:

    Dear NTP Community,

    We have observed a potentially unexpected behavior with |ntpd|
    version *4.2.8p18* concerning the delay in transitioning to stratum
    16 when a local reference clock (tsync) loses synchronization.


    Issue Summary

    When our local reference clock (tsync) becomes unsynchronized, we
    expect |ntpd| to stop selecting it and switch the system to stratum
    16 relatively quickly, indicating the system is no longer a valid
    time source.

    However, on systems running *ntpd 4.2.8p18*, this transition appears
    *delayed by up to one hour*. During this time, |ntpd| continues to
    treat tsync as a valid source and *reports stratum 1*, even though
    synchronization is no longer valid.

    In comparison, this behavior does *not occur* in older versions like
    *4.2.8p15*, where the system transitions to stratum 16 within a few
    minutes.

    We suspect this may be due to internal changes in source selection
    and trust logic introduced in later versions, possibly making |ntpd|
    more conservative about declassifying known sources — even when they
    become unreliable.

    There have been no changes to ntpd/refclock_tsyncpci.c since 4.2.8p5 in
    2016, so the issue might well affect other refclocks.  It sounds like something we need to fix, or at a minimum understand and justify as an improvement.


    Temporary Workaround

    We experimented with the following configuration adjustments, which
    appear to mitigate the issue by making |ntpd| more responsive:
    tos orphanwait 1
    tos mindist 0.05
    tinker stepout 10
    tinker panic 0
    minpoll 3
    maxpoll 4

    These parameters seem to accelerate response to changes in sync status.


    That's a lot of different knobs turned.  Did you make all 6 changes at
    once and observe improvement, or one at a time, or ?
    I'm glad you found something to help out, and those might help point to
    code change(s) responsible, but given so many knobs changed, not as
    helpful as I might hope.


    Questions

    1. Is this delay in downgrading to stratum 16 in 4.2.8p18 an
       *intended behavior*, or is it considered a *regression* compared
       to earlier versions?

    It's hard to see why it would be intended, but we're far from
    understanding the issue well enough to be definitive.

    1. Are there *recommended configuration settings* or best practices
    to ensure timely transition to stratum 16 when a local reference
    becomes unreliable?

    Interesting renumbering of your questions is happening in the GMail web editor using Chrome on Windows.
    I think it's fair to say we're pretty weak on documented best practices
    or recommended configuration settings, but try https://doc.ntp.org/.
    You could also look at the archives of this list
    and its onetime evil twin newsgroup comp.protocols.time.ntp.

    1. Would it be appropriate to submit this as a *bug report*?

    Yes, please, by all means.  That's generally true if you think you've
    found a misbehavior, regression, suboptimal behavior, or just have a
    request to improve.  The only thing we don't welcome reports to
    https://bugs.ntp.org/ about are reports of a security nature, such as a
    remote crash of ntpd based on a port 123 query, or nontrivial
    information disclosure or elevation of privileges, things that might
    merit a CVE.  In that case, please submit the report to
    security@ntp.org to ensure the information is not made public before
    remediation can be done.

    I apologize for taking so long to respond.  I've had a lot going on in
    my non-NTP life and I choose to have a relative firehose of email.
    Thanks to Jakob for bubbling this up to my attention again.  Bug reports
    can be ignored too, but much less likely than email.

    Cheers,
    Dave Hart


    --
    Harlan Stenn <stenn@ntp.org>
    NTP Project Lead. The NTP Project is part of
    https://www.nwtime.org/ - be a member!

  • From Miroslav Lichvar@21:1/5 to questions@lists.ntp.org on Tue Jul 1 11:00:21 2025
    On 2025-07-01, Harlan Stenn via questions Mailing List <questions@lists.ntp.org> wrote:
    The NTP algorithms do ongoing evaluations of established associations.

    If an association becomes non-responsive, it auto-degrades.

    At some point the association will drop.

    Dave Mills was, to the best of my recollection, very hesitant to throw
    out an established association "too soon".

    Yes, the behavior reported in this thread for both 4.2.8p15 and
    4.2.8p18 sounds wrong to me. NTPv4 servers are not supposed to
    claim they are unsynchronized (switch to stratum 16) when they lose a
    working association (it doesn't matter whether it's a refclock or an
    NTP server/peer). Quoting from RFC 5905 section 10:

    It is important to note that, unlike NTPv3, NTPv4 associations do not
    show a timeout condition by setting the stratum to 16 and leap
    indicator to 3. The association variables retain the values
    determined upon arrival of the last packet. In NTPv4, lambda
    increases with time, so eventually the synchronization distance
    exceeds the distance threshold MAXDIST, in which case the association
    is considered unfit for synchronization.
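
    As a rough illustration of that mechanism (assuming ntpd's nominal PHI
    of 15 ppm and its default distance threshold of roughly 1.5 s; this
    ignores the reach register and the other checks, so it is not a
    prediction of the one-hour delay reported above):

        dispersion growth rate:   PHI = 15e-6 s/s (15 ppm)
        distance threshold:       tos maxdist, default about 1.5 s
        time for that growth alone to exceed it:
                                  1.5 / 15e-6 = 100000 s, roughly 28 hours

    In practice a source that stops responding altogether is usually
    discarded much sooner, once its 8-bit reach register empties after
    eight unanswered polls, which may be part of why the shorter
    minpoll/maxpoll in the workaround above speeds up the transition.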

    It seems this changed between 4.2.8p14 and 4.2.8p15 as a result of
    fixing this bug:
    https://bugs.ntp.org/show_bug.cgi?id=3644

    The problem reported in that bug doesn't look like a bug to me.
    I think it was working as intended in NTPv4. The current behavior is
    a regression towards NTPv3.

    --
    Miroslav Lichvar
