Post mortem: Network issues last week

Published 2013-02-14 by Jochen Lillich

On Wednesday, February 6th, and Thursday, February 7th, we had significant outages and we want to take the time to explain what happened. These outages impacted many customer websites and are not at all acceptable to us. I’m very sorry that they happened and our team is working hard to prevent similar incidents in the future.

Background

Our IT infrastructure today consists of more than 180 servers. While we manage the software side of these servers completely, from the OS level to the applications they’re hosting, we decided right from the beginning not to spend time on maintaining hardware and datacenter infrastructure, e.g. network connectivity. That’s why we lease all our servers from our datacenter partners. Almost all our servers are provided by Hetzner AG which operates multiple datacenters in different parts of Germany.

This is an effective arrangement because datacenter services, like almost all IT services, benefit from economies of scale and we are still far from the number of servers that would get us to break even doing them ourselves. By leasing the hardware, we don’t have to pay staff to go on-site to connect new servers to the datacenter infrastructure or to replace broken parts of production machines. Instead, we have access to experienced datacenter staff and 24/7 support from our partners.

To avoid single points of failure, we distribute our servers over the different datacenters. In particular, we make sure that the nodes of a single cluster are located in different datacenters.

The downside of our approach is that we have to accept the fact that we depend on our partners to provide the level of service quality we need. As the recent incidents show, this is unfortunately not always the case.

What went wrong?

On February 6th at about 10:10 UTC, our monitoring system started to alert us to packet loss of 50% to 100% on a number of servers and to many failing service checks, which is usually a symptom of connectivity problems. We quickly recognized that most of the servers with bad connectivity were located in Hetzner datacenter #10. We also received Twitter posts from Hetzner customers whose servers were running in DC #10. This suggested a problem with a central network component, most probably a router or distribution switch.

The problem was not limited to DC #10, though, and we started to get alerts about saturated web server workers from many other datacenters, too. It didn’t take us long to find that one of our storage cluster nodes, “stor02a”, is located in DC #10. Because our web application clusters store their static content files and their logs on shared storage clusters, every web cluster that used this particular storage cluster was affected by the network failure, even if it was located outside DC #10.

Shared storage impact

Our shared storage architecture consists of a number of fileserver clusters which use the Gluster filesystem for redundant file storage and failure handling. With Gluster, files do not get replicated between the server nodes but by the storage clients (in our case, the web application servers). They maintain a connection to every active storage node and use these connections for reading. If a file needs to be written, the client repeats the change for every connected storage node. Metadata stored with the files is used to keep track of each file’s replication status.
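To make the client-side replication idea more concrete, here is a minimal Python sketch. It is a toy model only and not how Gluster is implemented: the classes `StorageNode` and `ReplicatingClient`, the `pending` bookkeeping and the node names are all hypothetical and exist purely for illustration.

```python
class StorageNode:
    """Hypothetical in-memory stand-in for one storage node."""

    def __init__(self, name):
        self.name = name
        self.reachable = True
        self.files = {}    # path -> content
        self.pending = {}  # path -> names of peers that missed the last write


class ReplicatingClient:
    """Toy client that performs the replication itself on every write."""

    def __init__(self, nodes):
        self.nodes = nodes

    def write(self, path, content):
        connected = [n for n in self.nodes if n.reachable]
        missed = {n.name for n in self.nodes if not n.reachable}
        if not connected:
            raise IOError("no storage node reachable")
        # Repeat the change on every node we are connected to.
        for node in connected:
            node.files[path] = content
            if missed:
                # Remember which peers still need this change.
                node.pending[path] = set(missed)
            else:
                node.pending.pop(path, None)

    def read(self, path):
        # Read from the first connected node that holds the file.
        for node in self.nodes:
            if node.reachable and path in node.files:
                return node.files[path]
        raise FileNotFoundError(path)
```

In this toy model, a write made while one node is unreachable leaves a `pending` entry on the other node, which is roughly the role the replication metadata plays. If both nodes ended up with entries blaming each other for the same file, neither copy could be chosen automatically.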

The packet loss between the web servers and “stor02a” led to an increasing number of retries, which slowed down file access significantly. In turn, this kept web server processes busy much longer than normal and eventually saturated the available HTTP connections. In other words, the websites on these clusters became unreachable.
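The relationship between slower file access and saturated workers can be estimated with Little's law: the average number of busy workers equals the request rate multiplied by the average time a request takes. The sketch below uses made-up numbers (40 requests per second, a limit of 150 workers) that are assumptions for illustration, not measurements from this incident.

```python
def busy_workers(requests_per_second, seconds_per_request):
    """Little's law: average concurrency = arrival rate * time in system."""
    return requests_per_second * seconds_per_request


def is_saturated(requests_per_second, seconds_per_request, worker_limit):
    """True if the average demand meets or exceeds the worker limit."""
    return busy_workers(requests_per_second, seconds_per_request) >= worker_limit


# Hypothetical numbers, chosen only to show the effect.
RATE = 40            # requests per second
WORKER_LIMIT = 150   # maximum concurrent web server workers

normal = busy_workers(RATE, 0.25)    # fast file access
degraded = busy_workers(RATE, 5.0)   # file access slowed by retries
```

With these assumed values, the same traffic needs 10 workers on average when a request takes 0.25 seconds, but 200 when it takes 5 seconds, which is more than the assumed limit allows.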

Recovery

If a storage node fails completely (e.g. due to a hardware failure or power outage), the Gluster clients will quickly notice repeated connection failures and stop accessing this node. In this incident, though, the network connection kept going down and up again, so the clients kept trying to access “stor02a”. When we became aware of this problem at about 10:35, we decided to shut down “stor02a” manually to provoke a failure event.
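Why a flapping connection is harder to handle than a clean failure can be shown with a generic failure detector that marks a node as down after a number of consecutive failed probes. This is a simplified sketch and an assumption on our part, not a description of Gluster's actual detection logic; the function name and the threshold of 3 are hypothetical.

```python
def marked_down_at(probe_results, threshold):
    """Return the index of the probe at which the node is marked down.

    probe_results is a sequence of booleans (True = probe succeeded).
    Returns None if the threshold of consecutive failures is never reached.
    """
    consecutive_failures = 0
    for index, succeeded in enumerate(probe_results):
        if succeeded:
            consecutive_failures = 0
        else:
            consecutive_failures += 1
        if consecutive_failures >= threshold:
            return index
    return None


# A node that is completely gone fails every probe.
hard_failure = [False] * 6
# A flapping link recovers just often enough to reset the counter.
flapping = [False, False, True] * 4
```

In this model, `marked_down_at(hard_failure, 3)` marks the node down at the third probe, while `marked_down_at(flapping, 3)` never does, so clients keep retrying. Shutting the node down manually turns the second case into the first.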

Shortly after, at about 10:50, network connectivity in DC #10 became stable again and web server load went down to normal levels.

We had a few additional network issues during the day, but each of them had already subsided by the time our on-call staff got notified. That’s why we decided to close the incident.

Unfortunately, we had to reopen it the next day. On 2013-02-07 from about 09:40 to 10:56 UTC, we experienced the same kind of network problems in DC #10 again. This time, Hetzner published a datacenter status update explaining that the problems were caused by a bug in router firmware.

Unfortunately, the malfunctioning network had caused additional problems which we became aware of in the afternoon when a customer called our support hotline because their website failed to deliver certain image files. We found that this was caused by a split-brain situation on the storage cluster “stor02” where changes made on node “stor02b” weren’t reflected on “stor02a” and the self-heal algorithm built into the Gluster filesystem was not able to resolve this inconsistency between the two data sets.

We were able to resolve this secondary incident by backing up both data sets and then deleting the older one. With the contradictory metadata gone, the self-heal mechanism successfully mirrored the intact data set from “stor02b” to “stor02a”. Unfortunately, this caused another brief overload of the web nodes because of a short surge in network traffic.

Where do we go from here?

We will look for effective changes to our architecture that could lessen the impact of local network malfunctions on our server infrastructure.
We will investigate if we can further optimize our storage configuration to make it more resilient against network malfunctions.
We will add checks to our monitoring system that will immediately inform us of data inconsistencies between the nodes of a storage cluster.
We will define and document a Standard Operating Procedure for dealing with partial or full storage cluster outages.
We will work closely with our datacenter partners to make sure that there are effective communication channels established between our operations teams in the case of datacenter incidents.
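
As an illustration of the consistency check mentioned in the list above, the following Python sketch compares two copies of a data set by checksum. It is only a rough idea under stated assumptions, not the check we will deploy: it assumes both copies are readable as local directories, and the function names `checksums` and `find_inconsistencies` are hypothetical.

```python
import hashlib
from pathlib import Path


def checksums(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            result[path.relative_to(root).as_posix()] = digest
    return result


def find_inconsistencies(root_a, root_b):
    """Return sorted relative paths that are missing or differ on one side."""
    sums_a = checksums(root_a)
    sums_b = checksums(root_b)
    all_paths = sums_a.keys() | sums_b.keys()
    return sorted(p for p in all_paths if sums_a.get(p) != sums_b.get(p))
```

A monitoring check could alert whenever `find_inconsistencies` returns a non-empty list. Hashing every file is expensive on large data sets, so a real check would probably sample files or compare replication metadata instead.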

Summary

I couldn’t be more sorry about the incident and the impact it had on our customers. We always use problems like this as an opportunity to improve our infrastructure and processes, and this will be no exception. Thank you for your continued support of freistil IT. We are working hard and making significant investments to make sure we live up to the trust you’ve placed in us.