On Wednesday, February 6th, and Thursday, February 7th, we had significant outages and we want to take the time to explain what happened. These outages impacted many customer websites and are not at all acceptable to us. I’m very sorry that they happened and our team is working hard to prevent similar incidents in the future.
Our IT infrastructure today consists of more than 180 servers. While we manage the software side of these servers completely, from the OS level to the applications they’re hosting, we decided right from the beginning not to spend time on maintaining hardware and datacenter infrastructure, e.g. network connectivity. That’s why we lease all our servers from our datacenter partners. Almost all our servers are provided by Hetzner AG which operates multiple datacenters in different parts of Germany.
This is an effective arrangement because datacenter services, like almost all IT services, benefit from economies of scale and we are still far from the number of servers that would get us to break even doing them ourselves. By leasing the hardware, we don’t have to pay staff to go on-site to connect new servers to the datacenter infrastructure or to replace broken parts of production machines. Instead, we have access to experienced datacenter staff and 24/7 support from our partners.
To avoid single points of failure, we distribute our servers over the different datacenters. In particular, we make sure that the nodes of a single cluster are located in different datacenters.
The downside of our approach is that we have to accept the fact that we depend on our partners to provide the level of service quality we need. As the recent incidents show, this is unfortunately not always the case.
What went wrong?
On February 6th, at about 10:10 UTC, our monitoring system started to alert us to network packet loss of 50% to 100% on a number of servers and to a lot of failing service checks, which most of the time is a symptom of connectivity problems. We recognized quickly that most of the servers with bad connectivity were located in Hetzner datacenter #10. We also saw Twitter posts from other Hetzner customers whose servers were running in DC #10. This suggested a problem with a central network component, most probably a router or distribution switch.
The problem was not limited to DC #10, though, and we started to get alerts about saturated web server workers from many other datacenters, too. It didn’t take us long to find that one of our storage cluster nodes, “stor02a”, is located in DC #10. Because our web application clusters store their static content files and their logs on shared storage clusters, the ones which were using this particular storage cluster were affected by the network failure, even if they were located outside DC #10.
Shared storage impact
Our shared storage architecture consists of a number of fileserver clusters which use the Gluster filesystem for redundant file storage and failure handling. With Gluster, files do not get replicated between the server nodes but by the storage clients (in our case, the web application servers). They maintain a connection to every active storage node and use these connections for reading. If a file needs to be written, the client repeats the change for every connected storage node. Metadata stored with the files is used to keep track of each file’s replication status.
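To illustrate the principle of client-side replication, here is a minimal Python sketch. It is a simplified in-memory model and not Gluster code; the class names `StorageNode` and `ReplicatingClient` are made up for this example.

```python
class StorageNode:
    """In-memory stand-in for one storage node (hypothetical, not a Gluster API)."""

    def __init__(self, name):
        self.name = name
        self.files = {}
        self.online = True

    def write(self, path, data):
        if not self.online:
            raise ConnectionError(f"{self.name} is unreachable")
        self.files[path] = data

    def read(self, path):
        if not self.online:
            raise ConnectionError(f"{self.name} is unreachable")
        return self.files[path]


class ReplicatingClient:
    """Client that replicates writes itself instead of relying on the servers."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def write(self, path, data):
        # Repeat the change for every node and remember which nodes
        # missed it, so that they can be healed later.
        missed = []
        for node in self.nodes:
            try:
                node.write(path, data)
            except ConnectionError:
                missed.append(node.name)
        if len(missed) == len(self.nodes):
            raise ConnectionError("no storage node reachable")
        return missed

    def read(self, path):
        # Read from the first node that answers.
        for node in self.nodes:
            try:
                return node.read(path)
            except ConnectionError:
                continue
        raise ConnectionError("no storage node reachable")
```

The list of nodes that missed a write plays the role of the replication metadata: it records which copy is stale and has to be repaired once the node is reachable again.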
The packet loss between the web servers and “stor02a” caused an increasing number of retries, which slowed down file access significantly. In turn, this kept web server processes busy much longer than normal and eventually led to a saturation of available HTTP connections. In other words, the websites on these clusters became unreachable.
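A rough back-of-the-envelope calculation shows why slower file access saturates a web server so quickly. The number of workers that are busy at any moment is roughly the request rate multiplied by the time each request takes. The figures below (50 requests per second, a pool of 40 workers, 200 ms versus 2 s per request) are hypothetical and were not measured during the incident.

```python
import math


def workers_needed(requests_per_second, ms_per_request):
    """Approximate number of workers busy at the same time."""
    return math.ceil(requests_per_second * ms_per_request / 1000)


def is_saturated(requests_per_second, ms_per_request, worker_pool_size):
    """True if the request load needs more workers than the pool provides."""
    return workers_needed(requests_per_second, ms_per_request) > worker_pool_size


# Normal operation: 50 requests/s at 200 ms each keep 10 workers busy.
normal = workers_needed(50, 200)

# Storage retries stretch each request to 2 s: 100 workers would be needed.
degraded = workers_needed(50, 2000)
```

With the same traffic, a tenfold increase in request duration needs ten times as many workers, which exceeds any pool sized for normal operation.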
If a storage node fails completely (e.g. due to a hardware failure or power outage), the Gluster clients will quickly notice repeated connection failures and stop accessing this node. In this incident, though, the network connection kept going down and up again, so the clients kept trying to access “stor02a”. When we became aware of this problem at about 10:35, we decided to shut down “stor02a” manually to provoke a failure event.
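The difference between a hard failure and a flapping connection can be shown with a simplified model. We assume here that a client gives up on a node after a fixed number of consecutive failures; this threshold and the names `NodeHealth` and `is_marked_down` are illustrative assumptions, not Gluster's actual detection logic.

```python
class NodeHealth:
    """Tracks consecutive connection failures for one storage node."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold  # assumed value
        self.consecutive_failures = 0
        self.marked_down = False

    def record(self, success):
        if success:
            # Any successful access resets the counter.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.marked_down = True
        return self.marked_down


def is_marked_down(results, failure_threshold=3):
    """Replay a sequence of access results (True = success)."""
    health = NodeHealth(failure_threshold)
    for ok in results:
        health.record(ok)
    return health.marked_down


hard_failure = [False, False, False]
flapping = [False, False, True] * 5
```

In this model, `is_marked_down(hard_failure)` is true, while the flapping sequence never reaches the threshold because every brief recovery resets the counter. Shutting the node down manually turns the second case into the first.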
Shortly after, at about 10:50, network connectivity in DC #10 became stable again and web server load went down to normal levels.
We had a few additional network issues during the day, but they had always subsided by the time our on-call staff got notified. That’s why we decided to close the incident.
Unfortunately, we had to reopen it the next day. On 2013-02-07 from about 09:40 to 10:56 UTC, we experienced the same kind of network problems in DC #10 again. This time, Hetzner published a datacenter status update explaining that they were caused by a bug in a router firmware.
Unfortunately, the malfunctioning network had caused additional problems which we became aware of in the afternoon when a customer called our support hotline because their website failed to deliver certain image files. We found that this was caused by a split-brain situation on the storage cluster “stor02” where changes made on node “stor02b” weren’t reflected on “stor02a” and the self-heal algorithm built into the Gluster filesystem was not able to resolve this inconsistency between the two data sets.
We were able to resolve this secondary incident by doing backups of both data sets and then deleting the older one. Now, the self-heal mechanism didn’t get contradicting metadata any more and successfully mirrored the intact data set from “stor02b” to “stor02a”. Unfortunately, this caused another brief overload of the web nodes because of a short surge in network traffic.
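The logic behind a split-brain can be sketched in a few lines of Python. This is a simplified model: we assume that each node keeps a count of changes it has recorded as still missing on its peer. The function `heal_direction` and its parameters are hypothetical and do not correspond to Gluster's real metadata format.

```python
def heal_direction(pending_on_a, pending_on_b):
    """Decide how two replicas can be brought back in sync.

    pending_on_a: changes node A has recorded as still missing on node B
    pending_on_b: changes node B has recorded as still missing on node A
    """
    if pending_on_a and pending_on_b:
        # Both nodes claim to hold newer data: no automatic decision possible.
        return "split-brain"
    if pending_on_a:
        return "a->b"
    if pending_on_b:
        return "b->a"
    return "in-sync"


# Both copies changed independently while the network was flapping.
before_cleanup = heal_direction(pending_on_a=4, pending_on_b=7)

# After removing the older copy on node A, only node B holds pending changes.
after_cleanup = heal_direction(pending_on_a=0, pending_on_b=7)
```

Once one side no longer claims to be ahead, the healing direction is unambiguous again, which is what deleting the older data set achieved.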
Where do we go from here?
- We will look for effective changes to our architecture that could lessen the impact of local network malfunctions on our server infrastructure.
- We will investigate if we can further optimize our storage configuration to make it more resilient against network malfunctions.
- We will add checks to our monitoring system that will immediately inform us of data inconsistencies between the nodes of a storage cluster.
- We will define and document a Standard Operating Procedure of how to deal with partial or full storage cluster outages.
- We will work closely with our datacenter partners to make sure that there are effective communication channels established between our operations teams in the case of datacenter incidents.
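As an illustration of the consistency check mentioned in the list above, here is a minimal Python sketch that compares two directory trees by checksum. It assumes that both data sets are reachable as local paths; the function names are made up for this example and this is not our production check.

```python
import hashlib
from pathlib import Path


def tree_checksums(root):
    """Map each file's relative path to the SHA-256 digest of its content."""
    root = Path(root)
    sums = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            sums[path.relative_to(root).as_posix()] = digest
    return sums


def find_inconsistencies(root_a, root_b):
    """Return the relative paths that are missing or differ on either side."""
    sums_a = tree_checksums(root_a)
    sums_b = tree_checksums(root_b)
    all_paths = sums_a.keys() | sums_b.keys()
    return sorted(p for p in all_paths if sums_a.get(p) != sums_b.get(p))
```

A monitoring check could run `find_inconsistencies` against a sample of directories and raise an alert whenever the result is not empty. Reading every file is expensive, so a real check would probably sample or compare metadata first.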
I couldn’t be more sorry about the incident and the impact it had on our customers. We always use problems like this as an opportunity to improve our infrastructure and processes, and this will be no exception. Thank you for your continued support of freistil IT. We are working hard and making significant investments to make sure we live up to the trust you’ve placed in us.
14 Feb 2013
We’ve got some important news about freistilbox: The waiting is finally over!
You can read more about that on the freistilbox Blog. And since there will certainly be more news about our next-generation managed hosting platform shortly, don’t forget to subscribe to the blog’s RSS feed!
11 Jan 2013
As we mentioned in our review of 2012, we had to delay the delivery of our new freistilbox infrastructure because we encountered architectural problems. Today, we are happy to announce that after finding some good, long-term solutions, we’ve finally started the rollout of freistilbox clusters.
In this post, we’d like to explain what it was that threw sand between our gears and how we solved the problem.
On DrupalCONCEPT, we had many services sharing the resources of a server; in the case of DrupalCONCEPT POWER, we even had Git, Varnish, Apache, Solr and MySQL running on a single server. With time, we found that this put too many limitations on performance and scalability optimization. So we decided to run almost every freistilbox service on its own servers, resulting in a completely distributed architecture.
From an operations view, freistilbox needs a lot more servers than DrupalCONCEPT: In the backend, there are clusters for Git, MySQL, Solr and file storage. Incoming requests are received by load balancers and SSL offloaders which route them to the customer’s freistilbox cluster. Each of these freistilbox clusters has two servers running Varnish and Memcached, a maintenance server for SSH/SFTP logins and cron jobs, and finally the actual “boxes”, i.e. the application servers running the web applications (Drupal, for example).
First, these servers need to be provisioned. To make this easy, we’ve built a private cloud infrastructure that we operate on bare metal servers leased from our datacenter partners. Thanks to many years of experience with Chef and virtualization, we were able to implement this quite efficiently.
But what caused us a lot of headaches – and the embarrassing delay in delivery – was that these servers needed to be interconnected on the business process level. On a single DrupalCONCEPT server, it was easy for us to synchronize local processes, for example triggering a code deployment after receiving a Git repository update. On freistilbox, however, this synchronization needs to happen between servers. Let’s take the deployment process as an example again:
- The customer pushes an update onto the Git server.
- The Git server then needs to notify the application servers affected by the update.
- Only these application servers finally deploy the changes in parallel which brings the update online.
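The three steps above can be sketched in Python as follows. This is an in-process illustration only: the mapping of servers to repositories, the names `affected_servers`, `deploy` and `on_push`, and the placeholder deployment are assumptions made for this example. In a real setup the notification travels over the network.

```python
from concurrent.futures import ThreadPoolExecutor


def affected_servers(repo, assignments):
    """Return the application servers that host the given repository."""
    return sorted(server for server, repos in assignments.items() if repo in repos)


def deploy(server, repo, revision):
    """Placeholder for the actual deployment on one application server."""
    return f"{server}: deployed {repo}@{revision}"


def on_push(repo, revision, assignments):
    """React to a Git push by deploying on all affected servers in parallel."""
    targets = affected_servers(repo, assignments)
    if not targets:
        return []
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        # map() returns the results in the order of the targets.
        return list(pool.map(lambda server: deploy(server, repo, revision), targets))


assignments = {
    "box1": {"site-a"},
    "box2": {"site-a", "site-b"},
    "box3": {"site-b"},
}
results = on_push("site-a", "abc123", assignments)
```

Only `box1` and `box2` receive the update here; `box3` does not host `site-a` and is left alone.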
At my previous jobs, I had experienced how quickly distributed technologies like CORBA can become complicated and costly, so we tried to find a simpler approach. To make a long story short: Try as we might, it turned out that our simple approaches didn’t work as effectively or as reliably as we needed them to. Finally, we bit the bullet and solved the problem with a full-grown orchestration infrastructure based on MCollective.
We’re sad that this conceptual odyssey has cost us a lot of unplanned effort, time and, worst of all, customer trust. Apparently, we had to be reminded the hard way that it isn’t ideas that count but their execution. We won’t make this mistake again.
On the other hand, we’re very happy that we now have all the components in place that we need to build an awesome managed hosting platform.
To all customers who have been waiting for their freistilboxes: The wait is over. We appreciate your patience more than we can put in words, and we promise to make it worth your while.
10 Jan 2013
Until now, our backup strategy didn’t allow for an easy restore of single MySQL databases. Customers that needed to have their database restored sometimes had to endure one or two hours until their website was reset to its previous state again. The reason was that we had chosen the backup method of making a consistent snapshot of the complete MySQL server. So, in order to restore one or more single databases, we first had to restore this snapshot to a spare server where we then were able to dump the single databases that we actually needed.
We’re happy to announce that this weakness has been resolved! Our database backup now works in two phases:
- Write a database dump for every single database on the server. These dumps are stored locally in generations (daily/weekly/monthly).
- Copy the dump files to our well-proven enterprise backup system.
With this new strategy, we have stored all database content at several ages and in multiple places and are still able to quickly restore a single database right from its most recent dump file.
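To show how dumps can be sorted into generations, here is a minimal Python sketch. The rotation rules (weekly on Sundays, monthly on the first of the month), the base directory and the file naming are assumptions for illustration, not a description of our actual backup scripts.

```python
from datetime import date


def generations_for(day):
    """Return the generations a dump written on the given day belongs to."""
    generations = ["daily"]
    if day.weekday() == 6:  # Sunday (assumed rule)
        generations.append("weekly")
    if day.day == 1:  # first of the month (assumed rule)
        generations.append("monthly")
    return generations


def dump_paths(databases, day, base="/var/backups/mysql"):
    """List one dump file per database and generation (hypothetical layout)."""
    return [
        f"{base}/{generation}/{db}-{day.isoformat()}.sql.gz"
        for db in databases
        for generation in generations_for(day)
    ]
```

Because every database gets its own file, restoring a single database only requires the most recent matching dump instead of a full server snapshot.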
We’re optimistic that this change will significantly shorten the time-to-resolve of our customers’ database restore requests.
03 Jan 2013
2012 has been a great year for freistil IT. We’ve been growing as a company, as a team and as individuals. In this last post of the year, I’d like to share a few of the things that we look back on with gratitude.
When I started working full-time on freistil IT in 2010, I intended to build a team that solves serious IT headaches for its customers and has lots of fun doing it. I have the feeling that we finally reached this goal this year.
We developed the DrupalCONCEPT managed hosting platform to a mature product that lets our customers run big Drupal websites while having all the necessary IT work done by us. We learned a lot in the process, and in many areas, too:
- DevOps: Keeping our growing IT infrastructure reliable and performant requires continuous work on our architecture.
- Entrepreneurship: In order to develop our business strategy, we had to think a lot about what our long-term goals are and with whom we are going to reach them.
- Process management: We learned (more often than not the hard way) that to have satisfied customers, a business needs robust processes and efficient tools.
- Project management: A good example for the previous point is that, when we realized that we were losing sight of who’s doing what, we decided to introduce the Kanban method together with daily standup meetings.
As a young company in a rapidly changing business sector, we need to learn a lot, quickly. And most importantly, we then need to act on these learnings. Therefore, distributing both work load and knowledge effectively is mission-critical, especially in a decentralized team with different skill sets like ours. This year, we stabilized the foundations for mastering this ongoing challenge.
Our communication infrastructure has now crystallized to a fixed set of tools. While we also use phone and email, most of our communication happens in Campfire and Yammer. We even call the act of going online in the “freistil” chat room “coming to the office”. With Trello, we track our bigger projects (Kanban Style!) while we manage single tasks in Asana. We’re making an effort of documenting as much know-how and “standard operating procedures” as possible in our Confluence wiki. We feel that having a clearly defined place for every important information makes room in our brains for the creative problem-solving our customers need us to do.
Thanks to our strategic work this year, we can honestly say that we love our customers as much as what we do for them. We’re growing with our business and we’re proud of what we achieved this year as a team.
At this point, I have to admit that there were also some growing pains. We had to deal with criticism from customers as well as from ourselves. Our hosting infrastructure suffered some serious outages and performance degradations. In consequence, we implemented what John Allspaw calls a “blameless post-mortem”: a thorough review process with the single goal of helping the team prevent the problem in the future, instead of just looking back to decide on whom to put the blame. Judging from the fact that the incident rate has been declining continuously over the recent weeks, we’re on the right track here.
These changes in our culture paved the way for important improvements to our infrastructure, products and services.
We gained much better insight into how and what our servers are actually doing by replacing our previous metrics monitoring system with a new one built on Graphite. Every second, we collect hundreds of values from everywhere in our infrastructure. This makes it much easier for us to analyze the incidents that occur and to prevent others while they’re about to happen.
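For readers who haven’t used Graphite: its Carbon daemon accepts metrics over a plaintext protocol with one line per value in the form `path value timestamp`, by default on TCP port 2003. The following Python sketch shows the idea. The metric name and the host are made-up examples, and this is not the collector we actually run.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host="localhost", port=2003):
    """Send pre-formatted metric lines to a Carbon daemon (host is an assumption)."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Hypothetical metric name; any dotted path works.
line = format_metric("servers.web01.load.shortterm", 0.42, 1356912000)
```

Calling `send_metrics([line])` would deliver the value, provided a Carbon daemon is listening on the given host and port.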
We boosted the performance of our database servers by using Solid State Disks (SSD). These electronic drives may still have far less capacity, but they manage write-intensive databases so much better than their spinning cousins. After extending our backup system, we’re now able to restore single databases within minutes instead of having first to rebuild the whole server on spare hardware.
To use our steadily growing server resources more efficiently, we’ve built our own private cloud infrastructure by combining Linux Containers with our trusted system management software Chef. Unlike with third-party cloud products, we keep total control over how resources are distributed by leasing the bare metal from our datacenter partners.
At freistil IT, “infrastructure is code”, which means that everything we build gets automated by writing Chef cookbooks. This year, we’ve made more than 2400 commits to our Chef repository, double what we did in the first two years combined. We also started to release in-house cookbooks as open source software on our Github account.
Innovation (and failure)
Looking at the great success of DrupalCONCEPT, we thought intensively about how we could improve our customers’ hosting experience even more. Based on their feedback and our experience running the platform day after day, we created a list of things we could do better. Here are some examples of what we found:
- Speed up the deployment of web application updates
- Quickly provision additional cluster nodes
- Improve the performance and reliability of backend services like databases and storage
- Make stage-specific configuration more flexible
- Enable customers to use external repositories (e.g. Github)
- Fully support external tools like Drush
- Open up the platform to other web applications than Drupal
As the result of these deliberations, we decided not to build a new generation of DrupalCONCEPT (which would have been the fourth one in more than 2 years) but a completely new hosting platform. To make it clear that it’s our brainchild, we named it “freistilbox” and launched the product website in October. Many of our customers were immediately excited by the news and couldn’t wait to get their hands on their own freistilbox cluster. We were literally overwhelmed by this success.
Which brings us to our deepest low of 2012: we failed to deliver on our promise. First, I had dramatically underestimated the effort necessary to build the many small non-incremental improvements from scratch that would distinguish freistilbox from DrupalCONCEPT. Then, we were also forced to divert a big part of our time to solve some serious problems with existing installations. In consequence, our launch schedule fell apart completely. While we were able to bridge the gap by providing DrupalCONCEPT servers, we are devastated knowing that we severely disappointed our most loyal customers. We’re going to make a huge effort in the coming year to compensate for that.
In 2013, the ongoing development of freistilbox will be our main focus. What we’re seeing so far makes us very excited but we’ve only just begun.
Because we’ve clearly reached the limits of what we can achieve with such a small team, we’re going to get additional people on board as quickly as possible. (So, if this whole high-end hosting stuff sounds interesting to you, please get in touch!)
And finally, there’s also going to be growth on the private side: both Markus and I expect new arrivals to our families in early 2013. Looks like we will need all the flexibility that our “Results-Only Work Environment” allows…
When I look back at 2012, what I feel the most is gratitude. I’m thankful to our awesome customers for encouraging us to keep improving, especially when we fail. I’m thankful for being able to work as part of a great team that values results over time spent “at work”, and creativity over control. Most of all, I’m thankful to my family for giving me their trust and love.
And finally, thank you for reading this long. ;-)
To a brilliant 2013!
31 Dec 2012
From this Saturday (2012-12-22) on, our team will be off recharging.
During the holiday time, we’ll only do emergency support. That means we’ll only handle outages and other incidents that impact the delivery of existing websites.
We’ll resume working on tasks that aren’t connected to such incidents on Monday, January 7th 2013.
So, if you plan to launch a new website over the holidays or need other kinds of assistance, please let us know immediately!
17 Dec 2012
In private conversations I’ve already been talking about our “new architecture” for some time, and today, we present it to the world. On behalf of everyone here at freistil IT, I’m very happy to announce the release of freistilbox — our next-generation managed hosting platform!
We’ve taken everything we’ve learned over the past three years to build this platform. We’ll have a technical article for those interested soon. Until then, here are the big benefits of our freistilbox hosting platform!
Scalability made simple
freistilbox is built to scale without forcing our customers to deal with the complexity that comes with a growing hosting infrastructure. To get more capacity (and higher availability as a side benefit), you just add more “boxes” to your hosting platform. These boxes are dedicated server instances that host your web application. Everything else — databases, caches etc. — is managed by our operations team behind the scenes.
Only our DrupalCONCEPT ELITE customers had access to a load balancer that distributes incoming requests between two or more application servers.
For freistilbox, we’ve built load distribution right into the core infrastructure, so as soon as you have two or more freistilbox instances, incoming requests automatically get distributed between them. Should one of these freistilbox instances suffer an outage, it will automatically be left out, so your website visitors won’t notice.
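The behaviour described above can be modelled with a simple round-robin balancer that skips instances marked as down. This Python sketch only illustrates the principle; the class `Balancer` is hypothetical and not the load balancing software running in front of freistilbox.

```python
class Balancer:
    """Round-robin request distribution that leaves out failed instances."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.healthy = set(self.instances)
        self._next = 0

    def mark_down(self, instance):
        self.healthy.discard(instance)

    def mark_up(self, instance):
        if instance in self.instances:
            self.healthy.add(instance)

    def pick(self):
        # Try each instance at most once, starting after the last pick.
        for _ in range(len(self.instances)):
            candidate = self.instances[self._next]
            self._next = (self._next + 1) % len(self.instances)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy instance available")


balancer = Balancer(["box1", "box2", "box3"])
all_up = [balancer.pick() for _ in range(4)]

degraded = Balancer(["box1", "box2", "box3"])
degraded.mark_down("box2")
one_down = [degraded.pick() for _ in range(4)]
```

With all instances up, requests rotate through `box1`, `box2` and `box3`. With `box2` marked down, they alternate between the two remaining boxes.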
With DrupalCONCEPT, while plain HTTP requests are answered very fast by a Varnish cache, SSL-encrypted traffic gets delivered directly to the web application servers. There, the requests first have to be decrypted and then answered by Drupal, every single time, because they bypass the Varnish cache. That’s why SSL-heavy sites tend to be slower and need a lot of processing power.
On our new freistilbox infrastructure, SSL requests are decrypted immediately when they arrive by a dedicated SSL offloading layer. This not only takes computing load off the application servers but also allows the HTTP cache to speed up both encrypted and unencrypted traffic.
Integration of external repositories
In DrupalCONCEPT, the Git repository containing the web application code is tightly integrated with the website instance itself. We use Git for freistilbox, too. But if you already have a Git repository that supports webhooks, for example at Github, you can use that one to deploy your code instead of the default repo we provide.
Deployments take less than 10 seconds
With DrupalCONCEPT, if you wanted a site update rolled out, it would take up to 2 minutes for the deployment mechanism to do its thing. Now, thanks to a newly designed deployment process, it takes less than 10 seconds.
Web dashboard and API
Not only does freistilbox offer a web-based dashboard where you can see (and, in later versions, change) your customer data and website configuration. The dashboard application also provides a RESTful API that we’re going to document and expose to the public. That means that you will be able to make changes to your platform parameters either manually in your browser or automatically via software.
There’s no set schedule for these yet, but we know you want them, and we’re testing them. We’re working hard to bring you:
- Separate SSH servers to execute commands like mysqldump or drush manually or via Cron, without using precious application server resources
- Secure external access to your MySQL database
Of course, as we know ourselves and our customers, we’ll add many other things over time that we don’t even think about today.
What about DrupalCONCEPT?
No change on this front. We’ll continue operating all DrupalCONCEPT installations that already exist. We’ll just stop adding new ones from today on.
So, what do you think?
We’d love to know what you think of our new hosting platform, so head on over to www.freistilbox.com and make sure to drop us a note in the comments below!
03 Oct 2012
Hi there and welcome to freistilbox, our next-generation managed webhosting platform! Everyone here at freistil IT is very excited that we’ve finally reached its launch day!
You can read more about what freistilbox is all about on our company blog. And, of course, just check out the navigation menu above to find all the details!
01 Oct 2012