Incident review: Datacenter outage on 2014-08-03

Published 2014-08-13 by Jochen Lillich

On Sunday, 2014-08-03, freistilbox operation was severely disrupted due to a power failure at a datacenter.

We apologise for this outage. We take reliability seriously, and an interruption of this magnitude, as well as the impact it has on our customers, is unacceptable.

What happened

On Sunday, 2014-08-03, at 12:34 UTC, our on-call engineer was alerted by the monitoring system that a large number of servers had suddenly gone offline. A failure of this breadth indicated a network outage, so we posted a short notice to our status page and immediately contacted datacenter support. While we didn’t get an immediate answer, the datacenter posted a first public status update at 12:54, explaining that server room RZ19 had suffered an outage.
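As an aside, the heuristic here is simple: a single host failing usually points at that host, while dozens of hosts failing within seconds point at shared infrastructure. Here is a minimal sketch of such a classification; the alert records, time window and threshold are illustrative assumptions, not our actual monitoring configuration:

```python
from collections import defaultdict

# Illustrative alert stream: (unix timestamp, hostname). In reality these
# records would come from the monitoring system's alerting pipeline.
alerts = [
    (1407069240, "db-3"),
    (1407069241, "app-7"),
    (1407069241, "fs-1"),
    (1407069242, "db-4"),
]

WINDOW = 60          # group alerts that arrive within the same minute
HOST_THRESHOLD = 3   # illustrative: this many distinct hosts at once

hosts_per_window = defaultdict(set)
for ts, host in alerts:
    hosts_per_window[ts // WINDOW].add(host)

for window, hosts in sorted(hosts_per_window.items()):
    if len(hosts) >= HOST_THRESHOLD:
        print("probable infrastructure outage: %d hosts down at once" % len(hosts))
    else:
        print("isolated host failure(s): %s" % ", ".join(sorted(hosts)))
```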

Since one of our server racks is located in this server room, the impact of this outage was severe. The affected rack hosts all kinds of servers, including database and file storage nodes. Without these services, even application servers outside of RZ19 were no longer able to deliver content.

Since we run the nodes of our database clusters in different server rooms, we executed a failover to the standby nodes of the affected databases. This restored operation for part of our hosting infrastructure.
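For illustration only: a failover like this boils down to verifying that the standby has caught up and then promoting it to accept writes. Here is a minimal sketch assuming MySQL-style replication; the hostnames are hypothetical, credentials are omitted, and our actual procedure involves more safety checks:

```python
import subprocess

# Hypothetical node names; the real clusters span several server rooms.
FAILED_PRIMARY = "db-rz19-1.example.net"
STANDBY = "db-rz10-1.example.net"

def mysql(host, statement):
    # Run a statement on a node via the mysql CLI (credentials omitted).
    return subprocess.run(
        ["mysql", "-h", host, "-e", statement],
        capture_output=True, text=True, check=True,
    ).stdout

# 1. Verify that the standby has applied everything it received. The real
#    procedure has more checks; with the primary dead, lag can read NULL.
status = mysql(STANDBY, "SHOW SLAVE STATUS\\G")
if "Seconds_Behind_Master: 0" not in status:
    raise RuntimeError("standby not caught up, manual intervention needed")

# 2. Stop replication and make the standby writable.
mysql(STANDBY, "STOP SLAVE; RESET SLAVE ALL;")
mysql(STANDBY, "SET GLOBAL read_only = 0;")

# 3. Repoint the application tier at the new primary (DNS, virtual IP,
#    or configuration management; omitted here).
print("promoted %s to replace %s" % (STANDBY, FAILED_PRIMARY))
```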

At about 13:00, our servers started to come back online. When we checked their uptime, we realised that they must have only just started up, so we suspected a power outage. This was confirmed when the datacenter announced that RZ19 had suffered a “brownout” that caused its servers to reboot. Later, the ISP added that a whole datacenter location had suffered a power outage. The UPS systems of all server rooms had been able to compensate until the power generators started up – with the exception of RZ19.
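Checking uptime is what gave the power outage away: hosts that merely went through a network partition show long uptimes afterwards, while hosts that lost power show uptimes of only a few minutes. A minimal sketch of such a check over SSH; the hostnames and the threshold are illustrative:

```python
import subprocess

HOSTS = ["app-1.example.net", "db-1.example.net"]  # illustrative names
RECENT_BOOT = 15 * 60  # seconds; younger than this likely means a reboot

for host in HOSTS:
    # /proc/uptime holds seconds since boot as its first field.
    out = subprocess.run(
        ["ssh", host, "cat", "/proc/uptime"],
        capture_output=True, text=True, check=True,
    ).stdout
    uptime = float(out.split()[0])
    if uptime < RECENT_BOOT:
        print("%s rebooted %.0f minutes ago: suspect power loss" % (host, uptime / 60))
    else:
        print("%s has been up %.1f hours: network blip only" % (host, uptime / 3600))
```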

At about 14:00, most of our servers were running smoothly again. A few of our database servers had suffered data corruption, but since we had already switched to their standby nodes, we decided to repair them later. At that time, it was more urgent to replace the application boxes that still hadn’t come back. Some of our customers choose to run single-node freistilbox clusters, and the websites running on these boxes were still down. We launched new boxes on servers with spare capacity, and at about 15:00, our infrastructure was fully functional again.

What we’re doing about it

Since we don’t run our own datacenters, we depend on our hosting partners for hardware infrastructure (servers, network, power, cooling etc.). We can’t prevent power outages ourselves; we can only trust that our infrastructure providers take all the necessary measures to prevent them.

What we can do ourselves is make our hosting architecture as resilient as possible in order to minimise the impact of a power outage. We have already built a lot of redundancy into freistilbox. This redundancy is what enabled us, for example, to quickly switch to unaffected database servers at the beginning of this incident. We have identified a few points, though, where an outage can cause bigger parts of our infrastructure to fail.

The most critical of these points is our current storage technology. While it comes with data replication features (which we make use of, of course), it is hard to distribute data over server rooms, let alone distant datacenters, without running into network latency issues. That’s why we’re currently testing alternative solutions that don’t have this weakness. As a beta test, we’re already running our own company freistilbox cluster (the one hosting this website) on one of these alternatives. This means we’ll be able to further improve our storage resiliency very soon.
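The latency issue is easy to reason about: with synchronous replication, every write has to wait for at least one network round trip to the remote replica before it is acknowledged, so sequential write throughput is bounded by the RTT. A rough back-of-the-envelope sketch; the RTT figures are illustrative assumptions, not measurements of our infrastructure:

```python
# Upper bound on sequential synchronous writes per second: each write
# waits for one round trip to the remote replica. RTTs are illustrative.
scenarios = {
    "same rack": 0.0002,          # 0.2 ms
    "other server room": 0.001,   # 1 ms
    "distant datacenter": 0.020,  # 20 ms
}

for placement, rtt in scenarios.items():
    print("%-20s max ~%6.0f sequential writes/s" % (placement, 1.0 / rtt))
```

At 20 ms between distant datacenters, a single client tops out at around 50 synchronous writes per second, which is why naively stretching replication across sites hurts.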

Another point is the private cloud infrastructure on which we run the application boxes of our customers’ freistilbox clusters. By adding more system automation, we’re going to minimise the time it takes us to spin up replacement boxes when that becomes necessary, especially during an outage.
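For illustration, here’s roughly what that automation boils down to: find spare capacity, launch a replacement box, and put it back into rotation. Every function below is a hypothetical stub standing in for our actual tooling, not an existing API:

```python
# Hypothetical stubs standing in for the real provisioning tooling.
def hypervisors_with_spare_capacity():
    return ["kvm-5.example.net"]  # would query the inventory system

def launch_box(hypervisor, cluster):
    print("launching replacement box for %s on %s" % (cluster, hypervisor))
    return "box-new-1"

def add_to_load_balancer(cluster, box):
    print("enabling %s in the %s load balancer pool" % (box, cluster))

def replace_failed_boxes(failed):
    # failed: mapping of cluster name to its dead box
    for cluster, dead_box in failed.items():
        spare = hypervisors_with_spare_capacity()
        if not spare:
            raise RuntimeError("no spare capacity for %s" % cluster)
        new_box = launch_box(spare[0], cluster)
        add_to_load_balancer(cluster, new_box)

replace_failed_boxes({"customer-a": "box-old-3"})
```

The point of automating these steps is that, during an incident, the on-call engineer should only have to trigger the procedure, not improvise it.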

Again, we sincerely apologise to all our customers affected by this outage and thank them for their continued trust.
