For more than four days, a server at the very core of the Internet’s domain name system was out of sync with its 12 root server peers due to an unexplained glitch that could have caused stability and security problems worldwide. This server, maintained by Internet carrier Cogent Communications, is one of the 13 root servers that provision the Internet’s root zone, which sits at the top of the hierarchical distributed database known as the domain name system, or DNS.
Here’s a simplified recap of the way the domain name system works and how root servers fit in:
When someone enters wikipedia.org in their browser, the servers handling the request first must translate the human-friendly domain name into an IP address. This is where the domain name system comes in. The first step in the DNS process is the browser queries the local stub resolver in the local operating system. The stub resolver forwards the query to a recursive resolver, which may be provided by the user’s ISP or a service such as 1.1.1.1 or 8.8.8.8 from Cloudflare and Google, respectively.
If it needs to, the recursive resolver contacts the c-root server or one of its 12 peers to determine the authoritative name server for the .org top level domain. The .org name server then refers the request to the Wikipedia name server, which then returns the IP address. In the following diagram, the recursive server is labeled “iterator.”
Given the crucial role a root server provides in ensuring one device can find any other device on the Internet, there are 13 of them geographically dispersed all over the world. Each root sever is, in fact, a cluster of servers that are also geographically dispersed, providing even more redundancy. Normally, the 13 root servers—each operated by a different entity—march in lockstep. When a change is made to the contents they host, it generally occurs on all of them within a few seconds or minutes at most.
Strange events at the C-root name server
This tight synchronization is crucial for ensuring stability. If one root server directs traffic lookups to one intermediate server and another root server sends lookups to a different intermediate server, important parts of the Internet as we know it could collapse. More important still, root servers store the cryptographic keys necessary to authenticate some of intermediate servers under a mechanism known as DNSSEC. If keys aren’t identical across all 13 root servers, there’s an increased risk of attacks such as DNS cache poisoning.
For reasons that remain unclear outside of Cogent—which declined to comment for this post—all 12 instances of the c-root it’s responsible for maintaining suddenly stopped updating on Saturday. Stéphane Bortzmeyer, a French engineer who was among the first to flag the problem in a Tuesday post, noted then that the c-root was three days behind the rest of the root servers.
The lag was further noted on Mastodon.
By mid-day Wednesday, the lag was shortened to about one day.
By late Wednesday, the c-root was finally up to date.