Dear NTP Community,
We have observed a potentially unexpected behavior with ntpd version *4.2.8p18* concerning the delay in transitioning to stratum 16 when a local reference clock (tsync) loses synchronization.
Issue Summary
When our local reference clock (tsync) becomes unsynchronized, we expect ntpd to stop selecting it and switch the system to stratum 16 relatively quickly, indicating the system is no longer a valid time source.
However, on systems running *ntpd 4.2.8p18*, this transition appears *delayed by up to one hour*. During this time, ntpd continues to treat tsync as a valid source and *reports stratum 1*, even though synchronization is no longer valid.
In comparison, this behavior does *not occur* in older versions like *4.2.8p15*, where the system transitions to stratum 16 within a few minutes.
We suspect this may be due to internal changes in source selection and trust logic introduced in later versions, possibly making ntpd more conservative about declassifying known sources — even when they become unreliable.
Temporary Workaround
We experimented with the following configuration adjustments, which appear to mitigate the issue by making ntpd more responsive:
tos orphanwait 1
tos mindist 0.05
tinker stepout 10
tinker panic 0
minpoll 3
maxpoll 4
These parameters seem to accelerate response to changes in sync status.
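For reference, here is the same fragment annotated with our understanding of each knob, based on our reading of the ntp.conf documentation (corrections welcome; note that minpoll/maxpoll are normally options on a server/peer/refclock line rather than standalone directives, so your working syntax may differ):

```conf
# Enter orphan operation almost immediately once sources are lost
# (default wait is much longer)
tos orphanwait 1
# Raise the floor on the root distance used by the selection and
# anti-clockhop logic (seconds)
tos mindist 0.05
# Step the clock after 10 s of persistent offset instead of the
# 900 s default
tinker stepout 10
# Disable the panic threshold so ntpd does not exit on a huge offset
tinker panic 0
# Poll every 8 s minimum / 16 s maximum (2^3 .. 2^4 seconds);
# usually given as per-association options
minpoll 3
maxpoll 4
```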
Questions
1. Is this delay in downgrading to stratum 16 in 4.2.8p18 an *intended behavior*, or is it considered a *regression* compared to earlier versions?
2. Are there *recommended configuration settings* or best practices to ensure a timely transition to stratum 16 when a local reference becomes unreliable?
3. Would it be appropriate to submit this as a *bug report*?
We would appreciate any clarification or guidance you can provide on
this matter.
Best regards,
--
Samir MOUHOUNE
Hi Samir,
It looks like you are aware of the complexity here.
As for whether it is a bug or a feature, I think the answer is "it depends", and it really depends on each situation.
In no particular order:
- What other sources of time does ntpd see? Just because a time source goes dark for a while shouldn't matter. PHI marches on, and at some point the root dispersion will grow and knock the time source out of the running.
- It seems like your scenario has an insufficient number of time sources.
- If ntpd will call tsync often enough, what is the actual difference between letting it hit S16 faster or slower, and re-establishing tsync as a source when it is available again?
- How robust is your set of orphan servers?
The list goes on.
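To put a rough number on "PHI marches on": a back-of-the-envelope sketch, assuming the spec's PHI of 15 ppm and the default 1.5 s selection (distance) threshold. The exact thresholds on a given system depend on tos maxdist and related settings.

```python
# How long until a silent source's root distance exceeds the selection
# threshold? Dispersion grows at PHI seconds per second of silence.
PHI = 15e-6          # frequency tolerance, s/s (NTPv4 spec value)
MAXDIST = 1.5        # default selection (distance) threshold, seconds

def seconds_until_rejected(initial_distance: float) -> float:
    """Time for root distance to grow from initial_distance past MAXDIST."""
    return (MAXDIST - initial_distance) / PHI

hours = seconds_until_rejected(0.01) / 3600
print(f"{hours:.1f} hours")  # a source starting near zero takes ~27.6 h
```

On that arithmetic alone, dispersion growth takes more than a day to disqualify a quiet refclock, which suggests the minutes-scale transition seen in 4.2.8p15 presumably came from reachability logic rather than from dispersion growth.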
And from my POV, as long as ntpd has a sufficiently robust "mechanism" to allow you to tune its behavior to implement your "local policy" choices, that's sufficient. As long as the default behavior is sufficient for enough people, that seems OK.
H
On 5/27/2025 5:13 AM, MOUHOUNE Samir wrote:
On 2025-05-28 03:08, Harlan Stenn via questions Mailing List wrote:
And how does ntp know that a maser has failed? The only information it has is the returned NTP packets (or lack thereof). A return packet can fail to arrive because your computer's connection to the net has failed, temporarily or permanently, or because the outside server has been disconnected or is temporarily bad. ntp waits a while and tries again.
An hour is no time at all. If you know something has happened to the source, you need to go in and reconfigure ntp to take that into account.
Make sure you have 5 or 7 independent sources, so the majority can vote out the misbehaving source.
You seem to have missed the point. This is clearly about the behavior of a STRATUM 1 server when the hardware(-ish) time source fails, such as a maser failing. In these cases, ntpd needs to quickly lower its announced stratum so the stratum 2 NTP servers will select a different upstream until the issue is fixed. Orphan mode should rarely apply, except in small sites with no Internet sources and no backup hardware sources. However, the affected stratum 1 server might or might not generally benefit from using NTP time sources as a fallback for disciplining the local clock.
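The downstream effect described here can be sketched: a client ignores a server whose packets advertise stratum 0/16 or leap indicator 3 (unsynchronized), which is why promptly lowering the advertised stratum matters. A toy filter illustrating the idea (not ntpd's actual code; Peer and selectable are hypothetical names):

```python
from dataclasses import dataclass

@dataclass
class Peer:
    name: str
    stratum: int   # 16 means "unsynchronized" in server responses
    leap: int      # 3 is the alarm condition (clock not synchronized)

def selectable(p: Peer) -> bool:
    """Toy version of the sanity checks a client applies to a server packet."""
    return 1 <= p.stratum <= 15 and p.leap != 3

peers = [Peer("s1-maser", 16, 3), Peer("pool-a", 2, 0), Peer("pool-b", 2, 0)]
print([p.name for p in peers if selectable(p)])  # ['pool-a', 'pool-b']
```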
Enjoy
Jakob
On 2025-06-23, Jakob Bohm <egenagwemdimtapsar@jbohm.dk> wrote:
And how does ntp know that a maser has failed? The only information it has is the returned NTP packets (or lack thereof).
In the scenario as I understand the OP, NTPD would know the difference because it connects to the hardware/outside time source via a protocol other than NTP, specifically any of the non-NTP protocols such as PPS, GPS, WWVB, DCF77, modem, etc. The thread seems to be about how an NTPD instance tied exclusively to such a time source via appropriate hardware reacts to said hardware going offline. The OP seems to complain that a recent patch-level NTPD update made it react much more slowly to such a situation.
Make sure you have 5 or 7 independent sources, so the majority can vote
out the misbehaving source.
Typically, the stratum 1 server in question would be one of those 4 to 7 NTP time sources feeding another NTPD instance, and the problem is making sure the NTP data sent to the stratum 2 server stops advertising stratum 1 as soon as the hardware time source goes away.
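The "5 or 7 sources" rule of thumb follows from simple majority voting: the truechimers must outnumber the falsetickers, so n sources can outvote at most (n - 1) // 2 bad ones. This is only the counting argument, not Mills' actual intersection algorithm:

```python
def max_falsetickers(n_sources: int) -> int:
    """Largest minority that a majority of truechimers can still outvote."""
    return (n_sources - 1) // 2

for n in (1, 3, 4, 5, 7):
    print(n, "sources ->", max_falsetickers(n), "falseticker(s) tolerated")
```

With 5 sources you survive 2 misbehaving ones, and with 7 you survive 3, which is why odd counts above 4 are the usual recommendation.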
Enjoy
Jakob
On Tue, May 27, 2025 at 12:13 UTC MOUHOUNE Samir <samir.mouhoune@gmail.com> wrote:
Dear NTP Community,
We have observed a potentially unexpected behavior with ntpd version *4.2.8p18* concerning the delay in transitioning to stratum 16 when a local reference clock (tsync) loses synchronization.
Issue Summary
When our local reference clock (tsync) becomes unsynchronized, we expect ntpd to stop selecting it and switch the system to stratum 16 relatively quickly, indicating the system is no longer a valid time source.
However, on systems running *ntpd 4.2.8p18*, this transition appears *delayed by up to one hour*. During this time, ntpd continues to treat tsync as a valid source and *reports stratum 1*, even though synchronization is no longer valid.
In comparison, this behavior does *not occur* in older versions like *4.2.8p15*, where the system transitions to stratum 16 within a few minutes.
We suspect this may be due to internal changes in source selection and trust logic introduced in later versions, possibly making ntpd more conservative about declassifying known sources — even when they become unreliable.
There have been no changes to ntpd/refclock_tsyncpci.c since 4.2.8p5 in
2016, so the issue might well affect other refclocks. It sounds like something we need to fix, or at a minimum understand and justify as an improvement.
Temporary Workaround
We experimented with the following configuration adjustments, which appear to mitigate the issue by making ntpd more responsive:
tos orphanwait 1
tos mindist 0.05
tinker stepout 10
tinker panic 0
minpoll 3
maxpoll 4
These parameters seem to accelerate response to changes in sync status.
That's a lot of different knobs turned. Did you make all six changes at once and observe improvement, or one at a time, or...?
I'm glad you found something that helps, and those changes might help point to the code change(s) responsible, but with so many knobs turned at once it is not as helpful as I might hope.
Questions
1. Is this delay in downgrading to stratum 16 in 4.2.8p18 an *intended behavior*, or is it considered a *regression* compared to earlier versions?
It's hard to see why it would be intended, but we're far from
understanding the issue well enough to be definitive.
1. Are there *recommended configuration settings* or best practices to ensure timely transition to stratum 16 when a local reference becomes unreliable?
Interesting renumbering of your questions is happening in the GMail web editor using Chrome on Windows.
I think it's fair to say we're pretty weak on documented best practices or recommended configuration settings, but try https://doc.ntp.org/. You could also look at the archives of this list and its onetime evil twin newsgroup, comp.protocols.time.ntp.
1. Would it be appropriate to submit this as a *bug report*?
Yes, please, by all means. That's generally true if you think you've found a misbehavior, regression, or suboptimal behavior, or simply have a request for improvement. The only reports we don't welcome at https://bugs.ntp.org/ are those of a security nature, such as a remote crash of ntpd triggered by a port 123 query, or nontrivial information disclosure or elevation of privileges; in short, anything that might merit a CVE. In that case, please send the report to security@ntp.org so the information is not made public before remediation can be done.
I apologize for taking so long to respond. I've had a lot going on in my non-NTP life, and I choose to have a relative firehose of email. Thanks to Jakob for bubbling this up to my attention again. Bug reports can be ignored too, but that's much less likely than email.
Cheers,
Dave Hart
The NTP algorithms do ongoing evaluations of established associations.
If an association becomes non-responsive, it auto-degrades.
At some point the association will drop.
Dave Mills was, to the best of my recollection, very hesitant to throw
out an established association "too soon".
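The "ongoing evaluation" above is concrete in ntpd: each association keeps an 8-bit reachability shift register (the reach column in ntpq -p, displayed in octal). Each poll shifts it left, and the low bit is set only when a reply arrived, so a silent source decays to zero after eight missed polls. A sketch of that bookkeeping, assuming default 64 s polls (this is an illustration, not ntpd's code):

```python
def update_reach(reach: int, replied: bool) -> int:
    """Shift the 8-bit reachability register for one poll interval."""
    return ((reach << 1) | int(replied)) & 0xFF

reach = 0xFF                  # healthy source: last 8 polls all answered
for _ in range(8):            # source goes silent
    reach = update_reach(reach, False)
print(f"{reach:o}")           # ntpq shows reach in octal; prints 0
# At 2**6 = 64 s polls, full decay takes 8 * 64 s, i.e. about 8.5 minutes.
```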