2018-09-30

ntp versus chrony

Which is better? More accurate, precise or stable? I mean the Linux NTP implementations ntp and chrony.

I found that chrony is the winner (at least for me). Read on if you're interested in my mini research on the topic.

preface

I noticed that Red Hat moved from ntp to chrony, and I read the comparison written from the chrony point of view. I wasn't convinced, so I stayed with ntp on my CentOS servers and chrony on my Fedora workstations.

But two things about ntp didn't satisfy me. First, the offset is always in the ms (millisecond) range, mostly within a +/- 1 ms interval. Second, ntp often selects a different server than the one I personally prefer.

You may think that in the real world millisecond precision should be more than enough. But on a local network a packet travels for only tens or hundreds of us (microseconds), so it can happen that a packet appears to arrive before it was sent (comparing the local clocks on both ends).
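To make the arithmetic concrete, here is a minimal sketch in Python. The numbers (200 us on the wire, a receiver clock running 1 ms behind the sender) are illustrative assumptions, not measurements from my network, and the function name is made up for this example.

```python
def apparent_transit_us(true_delay_us, receiver_offset_us):
    """Transit time as computed from the two local timestamps.

    receiver_offset_us says how far the receiver's clock is ahead of the
    sender's clock (a negative value means the receiver is behind).
    """
    return true_delay_us + receiver_offset_us


# Assumed values: 200 us one-way delay, receiver clock 1000 us (1 ms) behind.
print(apparent_transit_us(200, -1000))  # -800: "arrived" before it was sent
```

With both clocks within tens of us of each other, the same 200 us trip would come out positive again.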

According to the comparison page mentioned above, Red Hat already knows this, but I wanted to prove it myself.

how to compare?

I started with "ntpq -p", "chronyc sources" and "chronyc sourcestats", immersed myself in "ntpq -c rv" and "chronyc tracking", and delved into the graphs produced by the collectd plugins for NTPd and chrony. Then I realized that I was missing a fixed reference point and a common parameter for comparison.

So I conducted a series of experiments in September of this year (2018) to explore the behavior of ntp and chrony.

More concretely, I watched which sources the servers selected and which servers the clients selected. In other words, I let the NTP implementations decide in different scenarios and then tried to draw some conclusions.

testing environment

sources (the pool)

9x servers from the NTP pool, obtained by repeatedly running "dig aaaa 2.cz.pool.ntp.org". I ended up with 8x stratum 2 servers and 1x stratum 3 server.
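The iteration can be sketched roughly like this in Python. This is not the exact procedure I used: the "+short" option, the number of rounds and all function names are assumptions chosen for the sketch. The resolver is passed in as a parameter so that the collecting logic can be tried without touching the network.

```python
import subprocess


def parse_addresses(dig_short_output):
    """Return the IPv6 addresses found in `dig +short` output (one per line)."""
    return {line.strip() for line in dig_short_output.splitlines() if ":" in line}


def collect_pool(name, rounds, resolve):
    """Query `name` `rounds` times and return the sorted union of all answers."""
    seen = set()
    for _ in range(rounds):
        seen |= parse_addresses(resolve(name))
    return sorted(seen)


def dig_aaaa(name):
    """Run the real dig; needs dig installed and working DNS."""
    result = subprocess.run(
        ["dig", "+short", "aaaa", name],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Usage (not executed here): collect_pool("2.cz.pool.ntp.org", 20, dig_aaaa)
```

The pool hands out a rotating subset of addresses per query, so the union grows over several rounds until no new addresses show up.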

tested servers

2x CentOS 7.5 with ntp 4.2.6p5 or chrony 3.2, running on older hardware. One uses the tsc clocksource and the other hpet (grep . /sys/devices/system/clocksource/*/current_clocksource). Both use all nine sources above and ended up at stratum 3.

clients

5x Fedora 27 Workstation with chrony 3.3, running on different hardware purchased between 2006 and 2016. All have the two servers above configured in chrony.conf and ended up at stratum 4.

observations

hardware dependency

I saw no hardware dependency during my tests.

At the beginning I set up the same ntp.conf on both servers (with all 9 sources mentioned above), ran ntpd and found that:

both servers independently selected their sources from a subset of the pool, changing the source over and over
all clients were switching randomly and regularly between the two servers (the ratio was near 1:1)

source selection

As I mentioned above, ntpd kept switching between sources all the time (7 of the 9). To see this I ran "ntpq -p | grep ^*" every minute. Over the whole month some sources were preferred more (thousands of selections), some less (hundreds or tens of selections). It looks like at the end of each poll interval (1024 seconds) there is a possibility that ntpd switches to another source. This switching corresponded with the spikes in the collectd time_offset-loop graph.
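Counting the selections from such per-minute samples could look roughly like the following Python sketch. It assumes every sample is the full text of one "ntpq -p" run and relies on ntpq marking the currently selected peer with a leading "*". The host names in the sample are placeholders, not my real sources, and the function names are made up.

```python
from collections import Counter


def selected_peer(ntpq_output):
    """Return the name of the peer marked with '*' in `ntpq -p` output, or None."""
    for line in ntpq_output.splitlines():
        if line.startswith("*"):
            return line[1:].split()[0]
    return None


def tally_selections(samples):
    """Count how many times each peer was the selected one across all samples."""
    peers = (selected_peer(sample) for sample in samples)
    return Counter(peer for peer in peers if peer is not None)


# Placeholder sample in the shape of the ntpq -p billboard.
SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+a.example       192.0.2.10       2 u  512 1024  377    4.512    0.423   0.145
*b.example       192.0.2.20       2 u  130 1024  377    3.901   -0.112   0.098
"""

print(tally_selections([SAMPLE, SAMPLE]))
```

Sorting the resulting counter by count gives the "preferred more / preferred less" ranking directly.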

So I wrote the 9 sources into chrony.conf, started chronyd instead of ntpd on one of the servers, and a surprise arose:

upon start, chronyd selected the source which ntpd preferred the most
chronyd stayed nailed to this source the vast majority of the time
chronyd stuck to this source even when it became unreachable (it took over 11 hours before chronyd switched to another source - to be precise, to the source ntpd preferred second most)
all the (chronyd) clients selected the chronyd server over the ntpd server in 99 of 100 cases (there were tens of system starts on most workstations)

It may look strange that chronyd stayed on the dead source for more than 11 hours. But during those 11 hours "chronyc sourcestats" showed an 18h sample span for this server. So it makes sense to hold on to a source for some time after it becomes unreachable, if its (saved) clock discipline is known, stable and better than that of the other sources.
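Spotting this situation from "chronyc sources" output can be sketched in Python as follows. The sketch assumes the usual column layout (state marker, name, stratum, poll, reach, ...), where "*" in the second character marks the selected source and Reach is an octal register that reads 0 when the last eight polls went unanswered. The sample lines are made up to resemble what I saw, they are not a saved log, and the function name is hypothetical.

```python
def selected_source(chronyc_sources_output):
    """Return (name, reach) of the source marked '*', or None if there is none.

    reach is the reachability register converted from octal to an int,
    so 0 means no reply to any of the last eight polls.
    """
    for line in chronyc_sources_output.splitlines():
        if line[1:2] == "*":
            fields = line.split()
            return fields[1], int(fields[4], 8)
    return None


# Made-up sample: the selected source has not answered for a long time.
SOURCES = """\
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* a.example                     2  10     0   11h   -12us[  -15us] +/-   20ms
^+ b.example                     2  10   377   512   +40us[  +40us] +/-   25ms
"""

name, reach = selected_source(SOURCES)
if reach == 0:
    print(f"{name} is still selected but currently unreachable")
```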

another time, another network, ntpd clients

I saw exactly the same source selection behavior on a larger network.

There were two CentOS 7.5 ntpd (4.2.6p5) servers running on the same hardware and 88 CentOS 6.10 ntpd (4.2.6p5) clients.

I watched for many days (at hourly intervals) as all the clients steadily moved back and forth between these two servers.

I started chronyd on one of the servers, and after one hour all the ntpd clients had moved to this chronyd server. They have stayed there ever since.

conclusion

I confirmed (for myself) that chrony is more stable in its source selection and thus more accurate too.

Both chronyd and ntpd clients prefer a chronyd server over an ntpd server.

The offset on the LAN clients using the chrony server dropped to tens or hundreds of us (microseconds).

Personally I believe that never-ending source switching is bad. A perfect server should keep switching only until it finds the best source. As for chrony - it finds the best source immediately after start.

contact: lachim (you know what) emer.cz