Incident review: Core router outages
Published 2020-02-12
On Thursday, January 2nd we had significant network availability issues that affected the majority of the websites we host. The outage lasted from 10:30 to 13:30 UTC. A similar issue occurred during the night of 2020-01-22, from 02:54 to 03:41 UTC.
First, we’d like to sincerely apologise for the downtime our customers suffered with their websites during these outages. Degraded performance or full downtime for multiple hours is unacceptable. That’s why we’d like to take the time to explain what happened and what we are going to do about it.
On our network architecture
We run our server infrastructure almost exclusively in the data centres of Hetzner Online in Falkenstein, Germany. We use the plural “data centres” here because it is best practice to split a data centre location into multiple server rooms for better access control and containment in the case of a fire. Running a professional data centre takes loads of expert staff. This applies to power supply and cooling, but most of all to network operations.
A data centre network has multiple tiers: Central hubs, so-called core routers, receive internet traffic and distribute it to the server rooms. Each room has its own local router(s), and each server rack has its own network switch(es). Making sure that data gets where it’s needed in minimal time and with maximum reliability is a big task. That’s why we trust Hetzner to take care of the physical side of our managed hosting service.
While we don’t have control over the network architecture and setup, we can influence the distance that packets have to travel between our servers. Since web application servers and data stores (e.g. MySQL, memcached) exchange thousands of queries per second, we try to keep the network latency between them minimal by grouping them in dedicated server racks where they can share a network switch. For higher service availability, we run additional standby database nodes outside of these racks. Services like Varnish that communicate via HTTP are less affected by network latency. That allows us to spread their servers wider across one or multiple data centres.
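To illustrate why chatty database traffic is so much more latency-sensitive than HTTP traffic, here's a back-of-the-envelope calculation. The query count per page is a hypothetical figure for illustration, not a measurement from our clusters:

```python
# Rough illustration: time a single page render spends waiting on the
# network, for a chatty database protocol vs. a single HTTP fetch.
# The query count of 200 is a hypothetical example.

def network_wait_seconds(round_trips, rtt_ms):
    """Total time spent waiting on the network for sequential round trips."""
    return round_trips * rtt_ms / 1000.0

# A page issuing 200 sequential MySQL/memcached queries:
print(network_wait_seconds(200, 0.5))  # same rack, ~0.5 ms RTT -> 0.1 s
print(network_wait_seconds(200, 80))   # degraded route, 80 ms RTT -> 16.0 s

# A single Varnish HTTP fetch barely notices the same degradation:
print(network_wait_seconds(1, 80))     # -> 0.08 s
```

Sub-millisecond round trips are invisible even when multiplied by hundreds of queries; once the round trip grows to tens of milliseconds, the same page spends many seconds just waiting on the network.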
This architecture has been working well for us, and we’ve been running our freistilbox clusters that way for many years.
On Thursday, January 2nd 2020, at 10:30 am, our internal monitoring notified us of a high number of unavailable servers and website downtimes. Outages of this extent are usually caused by network issues. A few minutes later, while we were still investigating the outage, the website downtime alerts started to resolve. All servers became reachable again, and we saw manual sample requests to websites execute successfully.
At 10:46 am, the data centre reported the outage of a core router (one of the network devices distributing internet traffic to server rooms). This explained the extent of the outage. We were informed that a routing change redirecting traffic around the broken core router had restored network connectivity.
This notice led us to believe that the issue was resolved. As is common practice, we still kept monitoring the situation in case any remediation tasks such as rebooting a server that didn’t recover by itself became necessary.
Our internal monitoring system did not report any issues; services were working and their health metrics looked fine. However, our external monitoring kept reporting intermittent website downtimes. This contradiction left us confused. Only when we manually logged in to several servers and ran connection diagnostics did we see drastically increased latency between web application servers and the machines serving them traffic (specifically edge routers and Varnish caches). To put things into perspective: normally, the latency between two servers is less than 1 ms, regardless of their location. What we saw instead were values of 80 ms up to 500 ms, added on top of the normal page rendering time. HTTP response times of several seconds caused timeouts within the content delivery chain and, as the end result, website downtime in the form of “Request Delivery Failed” pages.
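One simple way to run this kind of connection diagnosis from a server is to time a TCP handshake to a peer. This is a simplified sketch of the idea, not the exact tooling we used, and the hostname in the example is a placeholder:

```python
import socket
import time

def tcp_handshake_ms(host, port, timeout=2.0):
    """Time a full TCP three-way handshake to (host, port), in milliseconds.

    A healthy intra-data-centre link should come in well under 1 ms;
    during the outage we saw values between 80 ms and 500 ms.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection established; close it again immediately
    return (time.monotonic() - start) * 1000.0

# Example: check the path from a web application server to a Varnish cache.
# The hostname is a placeholder:
# print(tcp_handshake_ms("varnish-cache.internal", 80))
```

Timing the handshake (rather than an application request) isolates pure network latency from page rendering time, which is exactly the distinction that mattered here.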
We decided to investigate the network issues further and determine in detail which of our servers were affected and how. As a quick countermeasure, we used the redundancy in our server setup and switched some services over to their standby servers. Sadly, this improved the situation only slightly because most of the available network routes still suffered from high latency.
At 13:30 UTC, data centre staff had repaired the affected core router and put it back into operation. Network latency immediately dropped to normal levels, restoring full availability of all websites on freistilbox.
What we’re going to do
As we’ve outlined above, we do not have control over the network layer at the data centres that host our servers. For the physical side of our managed hosting platform, we are relying on the staff at Hetzner Online, just as IT people all over the world are relying on Google or Amazon data centre operations. What we can actively do is make our server architecture as resilient to network issues as possible, and have proper tools and processes in place that help us investigate and resolve an outage more quickly.
During the outage, we discovered a blind spot in our metrics. We had not noticed this lack of information before because — fortunately! — these kinds of issues are rare and of short duration.
Our on-call engineers need to be able to quickly identify network issues as the cause of an outage and to pinpoint what systems are affected by it. That’s why we have started to gather packet latency metrics for the services that are especially sensitive to increased latency.
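As a sketch of what such latency metrics gathering can look like: periodically sample handshake timings to latency-sensitive peers and flag anything above a threshold. The peer names and the alert threshold below are illustrative, not our production values:

```python
import socket
import statistics
import time

ALERT_THRESHOLD_MS = 5.0  # illustrative; normal intra-DC latency is < 1 ms

def sample_latency_ms(host, port, samples=5, timeout=2.0):
    """Take several TCP handshake timings and return their median, in ms."""
    timings = []
    for _ in range(samples):
        start = time.monotonic()
        with socket.create_connection((host, port), timeout=timeout):
            pass
        timings.append((time.monotonic() - start) * 1000.0)
    return statistics.median(timings)

def check_peers(peers):
    """Return the peers whose median handshake latency exceeds the threshold."""
    slow = []
    for host, port in peers:
        latency = sample_latency_ms(host, port)
        if latency > ALERT_THRESHOLD_MS:
            slow.append((host, port, latency))
    return slow

# Peers that are especially latency-sensitive, e.g. database and cache
# nodes. These hostnames are placeholders:
# print(check_peers([("mysql-primary.internal", 3306),
#                    ("memcached-1.internal", 11211)]))
```

Taking the median of several samples smooths over one-off scheduling hiccups, so an alert really does point at a degraded route rather than a single slow handshake.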
In the early morning of 2020-01-22 from 02:54 to 03:41 UTC, another core router had a very similar failure resulting in the same problems. This time, it took data centre staff less than an hour to restore normal operation. During this outage, we were able to confirm the effectiveness of network latency monitoring: Since we already had an early proof-of-concept build available, we could clearly see how the outage affected our infrastructure.
We are working on a new architecture concept for modernising our hosting infrastructure. The experience gained from these incidents will allow us to make the network one of the focal points of our design.
We are very aware that by relying on a data centre service provider, we are vulnerable to issues with physical infrastructure like power supply, cooling or, obviously, networking. Incidents like these core router outages heavily affect the majority of our customers. We’d like to apologise again for the downtime they caused.
These incidents also teach us where we can improve our hosting infrastructure to be more resilient. We’re using the experience we’ve gained to make these improvements. freistilbox already provides our customers with high total website availability, and we’re going to make sure that it’ll be even higher in the future.