And there’s another month gone past. Unfortunately, my sum-up for April got lost between incidents, conferences and a short vacation (a nice concept, by the way, that last one; I really should take one more often than every three years). So this is going to be a sum-up for both recent months, April and May.
After a cleanup of unused websites in April which decreased the number of active websites to 263, we’re back at the level of March with 281 instances. We’re very happy to report that our overall uptime is steadily approaching the perfection mark: Pingdom tells us we achieved 99.98% availability in both April and May (99.93% in March). Our edge traffic didn’t change much over recent weeks: 11.48 TB in April and 11.72 TB in May (11.96 TB in March).
Continuous improvement is visible in our technical support numbers. In March, we received 150 support requests and solved 126 of them. In April, the numbers were 178/132, and in May 147/114. The backlog of open tickets decreased significantly from 48 to 35 (-27%).
Looking back on April, our Help Center statistics shocked us with an average reaction time of 33.5 hours. We quickly found out that this was caused by two individual tickets that we had put on hold for a few weeks in agreement with the respective customers. We changed our process to always give initial feedback as quickly as possible, and our average reaction time for May returned to a saner value of 9.7 hours.
When we break down these average reaction times, we can see the effect of the new rule. We answered 56% of all tickets in under 1h (March: 42%), 25% within 8h (March: 31%), 9% within 24h (March: 15%) and only 10% after 24h (March: 12%).
With now double the capacity in our operations team (see below), we’re confident that we’ll be able to push both reaction and resolution time down even more in coming months.
Our customers’ satisfaction rating stayed at a perfect 100% and we again got a lot of awesome feedback:
- “Top class support. Very fast and very friendly.”
- “Problem solved flawlessly, although it took more than 24h until the website was available again… Since we’re not live yet, I’ll give you a ‘Good’ anyhow ;-))”
- “Response was very fast. Everything ok ;-)”
- “Thanks again for the support in the middle of the night!”
- “Competent clarification, super every time!”
We finally found the time to do some cleanup work and put a few unneeded machines out to pasture, so although we added some new ones, we’re still counting 284 servers, same as in March. The number of metrics we collect from our servers every 10s also got cleaned up and went down to 90,684 (-17%).
At the end of May, we had an embarrassing incident where it took us much too long to restore servers after a RAID failure, and we even lost the contents of one server completely. This has led to a bunch of remediation tasks, some of which are still in progress. We’ve made sure that such an incident won’t repeat itself. While learning is one of our core values, “learning by suffering” is the worst way to do it and completely unacceptable when it’s our customers who suffer.
Unfortunately, we’ve not yet managed to reduce our number of on-call alerts; actually, it’s even grown by 18% to 1332 alerts in May. This is mainly due to alert multiplication — different alerts that share the same cause. To improve this, some work on our monitoring configuration is required. Good thing that we got reinforcement!
Having Philipp Kaiser join our ops team in May has so far been the highlight of our year! His experience with huge data center installations will help us tackle future projects, and the added team capacity enables us to reduce our “technical debt”, i.e. finally take care of low-priority tasks that we had postponed for lack of capacity. Philipp is currently undergoing the “Ops Bootcamp”, where we’re introducing him to everything he needs to know about our IT infrastructure and processes. In parallel, he’s also started to work on his first production tasks. We’ll talk about his first impressions at his new workplace in another blog post soon.
As active members of our open source communities, we participated in a bunch of events during April and May:
- Markus explained how to build high-performance Drupal websites at the World Hosting Days.
- I did a session on “Dynamic Infrastructure Orchestration” at the Open Source Datacenter Conference.
- At DrupalCamp Frankfurt, Markus talked about “How to automate your Drupal development environment”.
- I first talked about “Doing DevOps with Drupal” at DrupalCamp Scotland in Edinburgh,
- and, since that talk got a lot of great feedback, repeated it at the Drupal Open Days Dublin.
April and May were two months jam-packed with valuable experiences for us as individuals and as a company as a whole. We’re excited to see what June has got in store for us!
12 Jun 2014
The “Changelog” is a new category in our blog where we publish important changes to freistilbox infrastructure and functionality.
Each freistilbox cluster comes with its own “shell node” that customers access via SSH to run maintenance tasks like mysqldump or drush. In order to make it easy to access the right website instance, each one has its own user account.
Until now, the interactive use of these user accounts was severely limited by tight write restrictions on the user’s home directory.
In a change we’ve rolled out this week, we’ve replaced the old instance directories with homes to which the shell user has full write access. This solves the problems that many customers experienced when they tried to store configuration files or to create arbitrary files and subdirectories.
Together with all the symlinks to important website directories, the work subdirectory that we used to create as a workaround for the previous write restrictions has been moved automatically to the new shell user home directory. Apart from the full write permissions, everything should look and function exactly as it used to.
30 May 2014
On Thursday, 15 May, one of our VM hosts, named “vm3”, did not return to operation after a standard maintenance procedure, resulting in an outage of more than 14 hours. While we were able to restore all affected DrupalCONCEPT POWER servers, we only had backups available that were more than 24 hours old. And in the case of a custom-built managed server, we even lost most of its files completely.
We regard reliability and effective IT processes as essential for our business. An outage of this duration and with these results is not acceptable. We are embarrassed and deeply sorry about this incident, and I apologize on behalf of freistil IT to all customers that we disappointed.
In this review, I’d like to give you detailed insight into what happened and what we’re going to do to prevent incidents like this in the future.
On Monday, 12 May, the VM host vm3 reported one of the two disks of its RAID-1 array as failed. It kept running on the second disk without any problems. We scheduled a maintenance window for Thursday, 15 May at 19:00 UTC to have the failed disk replaced, and announced the scheduled maintenance on the freistilbox Status Page.
Data center staff shut down the server at 18:55 UTC (a few minutes early) and replaced the broken disk. After the restart, we found that the server would not boot into a working system. It turned out that there was no bootable operating system left on the remaining disk, which suggested that this disk had failed, too. When we realised that there was nothing we could do about the second failed disk, we decided to go the only viable, albeit laborious, way of rebuilding the server from scratch. After getting the second disk replaced, we started reinstalling the server OS, then the host environment and finally the guest servers.
When we started the restore process, we realised that even the first phase, building a directory tree of the data to restore, would take several hours. We hoped it would finish overnight, but on Friday morning, after 7 hours, the backup database was still collecting data for the restore directory tree. Fortunately, we found out by experimenting that aborting the slow query on the database server forced the backup system to fall back to doing a full restore of all files in the backup set.
After the restore jobs had finished on all affected servers, we started reimporting the database dumps that were included in the backups. That’s when we found that we had timed the creation of these dumps badly: the daily database dump job actually ran later than the file backup that was supposed to pick the dumps up. So restoring the Wednesday night file backup not only meant losing almost a whole day of data; that backup also contained only the database dumps from Tuesday night.
And as if this wasn’t bad enough for our customers already, it turned out that one of the affected servers didn’t have any of its websites backed up at all. The server in question is a custom-built managed server. While with DrupalCONCEPT and freistilbox servers everything (including the backup) is configured automatically, this server would have needed a manual backup configuration, and we had obviously forgotten this step during setup.
Some customers had newer backups available that we were able to copy back to their server but in the end, most of them still suffered a catastrophic loss of data.
On Friday at about 11:30 UTC, all servers were online again. We then spent the rest of the day assisting the affected customers with some minor remaining issues.
What we are going to do about it
In a post mortem meeting on Monday, 19 May, we discussed the incident and decided on remediation measures to prevent it from repeating.
The root cause of the incident, the loss of both disks of a RAID-1 array, is a rare event, but we need to be prepared for it to occur. In particular, we need to minimise the amount of data lost due to such a failure.
While the affected customers had consciously chosen a one-server setup that has many single points of failure (SPOF), neither they nor we had expected that an outage would take this long and would result in such catastrophic data loss. We need to make sure that all our backups have complete coverage and that they can be restored within a reasonable amount of time (a few hours max).
As a result of our post mortem, we decided on the following remedial measures:
- We verified that all customer data, especially on custom-built servers, is now fully backed up.
- We rescheduled our file backup in order to include the latest database dumps.
- Planned maintenance must be done right after a backup run. We will either schedule it after the regular daily backup job or we’ll trigger an extra backup in advance of the maintenance.
- We’ll schedule regular disaster recovery exercises where we take production backups and restore them to a spare server.
- We’ll research how we can speed up the restore process. This could mean improvements to specific components or even switching to a different backup system altogether.
- If customers need shorter backup periods than 24 hours, we’ll support them in setting up custom backup jobs directly from their content management system.
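In crontab terms, the rescheduling from the second and third measures might look like the following sketch. The script names and times are purely illustrative; our actual schedule isn’t part of this post:

```
# Hypothetical schedule -- script paths and times are made up for illustration.
# First create fresh database dumps ...
30 1 * * * /usr/local/sbin/dump-all-databases.sh
# ... then run the file backup, so it picks up tonight's dumps
# instead of yesterday's.
0 3 * * * /usr/local/sbin/run-file-backup.sh
```

The important property is simply that the dump job finishes before the file backup starts, so every file backup contains database dumps from the same night.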
In conclusion, I’d like to state that this incident exposed an embarrassing lack of preparation on our side for the failure of a whole disk array. I apologize to all affected customers that we were not able to restore normal operation more quickly and to the full extent. I assure you that we are working hard to prevent an incident like this from ever happening again.
26 May 2014
We’re very happy to welcome Philipp Kaiser as our first full-time employee! As an experienced system administrator, Philipp will help us shorten our IT operations backlog to a more human size and get new, innovative changes out of the door more quickly.
Did you know this is actually the second time I’ve hired Philipp? Back in 2004, as the new IT team lead for “CRM, Billing and WEB.Cent” at WEB.DE, I got him to join my team as its second member. During his interview, he had impressed me with how he handled the stress at the company where he did his apprenticeship as a Qualified IT Specialist. I was responsible for a bunch of business-critical systems and thought, “If he’s lived through that, he’ll be able to survive here, too.” He did much more than just survive, and I enjoyed working with him for several years.
As sysadmin #1 at freistil IT, Philipp will be responsible for business-critical systems, too — business-critical not only for us but especially for our customers. And I’m 100% sure he’ll be doing a great job.
The day-to-day in our distributed team at freistil IT will be quite different from what he’s used to and we’re well aware that there’s a learning curve waiting for all of us. That’s exactly what makes running freistil IT so exciting for me. Learning is one of our core values, and every new employee will notice this quickly. From his time back at WEB.DE, Philipp is already used to the concept of the Ops Bootcamp which we’ve implemented at freistil IT: In this programme spanning several weeks, we’re going to introduce him to all important aspects of his job from Asana to zsh. At the same time, he’ll already be working on internal projects and support requests.
So, welcome, Philipp, and thanks for joining our team! Cheers — to a lot of fun!
05 May 2014
The Open Source Datacenter Conference was an established DevOps conference even before the term came into wide use. For 2014, the Netways team left their traditional location of Nürnberg behind and moved the event to Berlin.
I’m happy that they kept the “conference hotel” concept; nothing beats being able to walk from bed to breakfast to conference session under the same roof. For me, the Hotel MOA in Berlin-Moabit ticked all the boxes. My room was comfortable and spacious, the catering was excellent and there were more than enough seats in the conference rooms for all the attendees.
There is one essential aspect of tech conferences where OSDC 2014 unfortunately left a lot to be desired: reliable internet connectivity. The hotel WiFi clearly was not built to accommodate a crowd of techies who each brought two or more WiFi clients. Unfortunately, Netways had not prepared a dedicated network in advance, so I had to fall back to my 3G hotspot when my WiFi connection broke down only a few minutes into the introductory talk.
As regular OSDC participants have come to expect, the talks were diverse, interesting and on a professional level. Here is a short excerpt of the session programme:
- Lennart Koopmann & Jordan Sissel: Intro to Log Management
- Mike Adolphs: How we run Support at GitHub
- Martin Gerhard Loschwitz: What’s next for Ceph?
- Thomas Schend: Introduction to the Synnefo open source cloud stack
- Michael Renner: Secure encryption in a wiretapped future
- Christopher Kunz: Software defined networking in an open-source compute cloud
- Andreas Schmidt: Testing server infrastructure with serverspec
Netways generously published their recordings from OSDC 2014 on YouTube, including my own talk, “Dynamic Infrastructure Orchestration”.
Making our configuration management more dynamic is one of our main concerns at freistil IT. Our infrastructure is growing steadily, and we have to make sure that changes are executed in a quick and reliable manner. Conventional configuration management tools like our trusted Chef are limited in terms of change rollout speed because they are based on regular convergence runs. That’s why I took a closer look at serf and etcd, and summarised my findings in my OSDC talk.
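To illustrate the difference, here’s a toy Python sketch of the push-based pattern these tools enable. This is not the real etcd or serf API; the in-process store below only mimics the watch semantics, whereas a convergence-based tool would discover the same change only on its next scheduled run:

```python
import threading
from collections import defaultdict

class ToyKVStore:
    """A tiny in-process stand-in for a central key-value store such as
    etcd. NOT the real etcd API; only the watch semantics are mimicked."""

    def __init__(self):
        self._data = {}
        self._watchers = defaultdict(list)
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            self._data[key] = value
            callbacks = list(self._watchers[key])
        # Push the change to all watchers right away. This is what removes
        # the delay of waiting for the next periodic convergence run.
        for callback in callbacks:
            callback(key, value)

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def watch(self, key, callback):
        with self._lock:
            self._watchers[key].append(callback)

class WebNode:
    """A node that reconfigures itself the moment the central value
    changes, instead of noticing it on the next convergence run."""

    def __init__(self, store):
        self.backends = []
        store.watch("/services/web/backends", self._on_change)

    def _on_change(self, key, value):
        self.backends = value.split(",")

store = ToyKVStore()
node = WebNode(store)
store.put("/services/web/backends", "app1:8080,app2:8080")
print(node.backends)  # -> ['app1:8080', 'app2:8080']
```

The key design point is that the store notifies interested nodes at write time; with Chef-style convergence, the same update would sit unnoticed until the next agent run.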
After watching the recording, I sadly realised that I had missed the mark at least on the aspect of “professional presentation”. Judging from the many interested questions during and after the talk, I managed to convey everything I had intended to. But I had not rehearsed the talk enough, and that showed clearly in far too many errs and uhms. I apologise to all attendees for not delivering the presentation quality that you can expect at such a conference. I’ll push myself harder to always arrive fully prepared when I’m invited as a speaker.
With the move to Berlin, the Open Source Datacenter Conference is on the way from a German IT event to an international DevOps conference. And I’ll be happy to return next year!
04 May 2014
Tomorrow morning, I’m going to jump on the train to the first DrupalCamp Frankfurt. I’m looking forward to meeting my friends from the German Drupal community and also to getting to know some new folks, so say hi!
If you’re interested in how to build a Drupal development environment with Vagrant and automation tools, come by my session that I’ll be doing on Sunday morning!
See you in Frankfurt!
11 Apr 2014
And just so, the first quarter of 2014 has passed. Wow, time flies when you’re busy. And busy we were!
Our conference schedule for this year is so packed that we need to distribute events within the team. In March, while I visited Szeged, Hungary, and enjoyed the community feeling at the Drupal Developer Days, Markus attended DevOpsCamp Nuremberg. DevOps is at the core of our business, so DevOps Days and DevOps Camps are ideal opportunities for us to exchange experiences with other IT specialists and also gain new inspiration.
Our managed hosting platform freistilbox keeps growing and now hosts 282 websites (Feb: 266, +6%). Our total uptime was 99.93% (Feb: 99.97%) and, with 11.96 TB, our website traffic came in at about the same level as in the previous month (Feb: 12.5 TB).
We’re working hard every day to keep our promise, “Work efficiently, sleep peacefully.” And it’s feedback like this customer’s that always makes our day: “Pingdom reports 99.99% availability of our cluster for March. Thank you.”
In March, we achieved a great improvement in tech support quality. While ticket traffic resembled February’s, our reaction times were substantially better.
We got 150 support requests (Feb: 149) and were able to resolve 126 tickets (Feb: 127), leaving a backlog of 48 (Feb: 35). Our average first reply time came in at 16h, a whopping 31% less than last month (Feb: 23h). You can see the cause in the time breakdown: We shifted a significant part of “tickets we didn’t answer within 24h” to the 8h bracket. This month, we tackled 42% of all tickets within an hour (Feb: 43%) and 31% within 8h (Feb: 23%); together, this makes 73% of all incoming requests. The number of tickets answered within 24h stayed the same (Feb, Mar: 15%), and only 12% took us longer than 24h (Feb: 20%).
With these results, it’s not surprising that our customers’ satisfaction feedback stayed at 100% “good”. These are some of the comments we got from solved support requests:
- “VERY fast and great support!”
- “CSI Varnish ^^”
- “Super fast response, thanks!”
- “A kind of magique”
Customer happiness is our primary goal and we’re glad that we’re well on target here.
Since we’re able to put existing capacity to more efficient use, we only added 2 new hosts to our infrastructure (Feb: 282). From all these hosts, we’re collecting 108,880 distinct health and performance metrics (Feb: 107,706), every 10 seconds.
The number of on-call notifications decreased further to 1122 (Feb: 1397, -20%). This number doesn’t mean that we experience an outage every 40 minutes on average; we’re actually quite happy that most of these alerts get resolved automatically within minutes because the issue (for example, a high system load) was only short-lived. As soon as we get to fine-tune our alert thresholds, the number of notifications will go down significantly.
All in all, it’s been a great month and a successful first quarter. We’re looking forward to what’s coming next!
07 Apr 2014
Next week, I’m going to attend the Open Source Datacenter Conference (OSDC). Conveniently for me, the event will take place in Berlin now that Netways, the company behind the conference, has decided to move it there from its traditional location in Nuremberg.
The OSDC takes place on Wednesday and Thursday and is packed with sessions about the newest developments in Open Source technology. Some of my personal highlights:
- Jordan Sissel: “Find Happiness in your Logs”
- Andreas Schmidt: “Testing server infrastructure with serverspec”
- Martin Gerhard Loschwitz: “What’s next for Ceph?”
- Mike Adolphs: “How we run Support at GitHub”
On Thursday, I’ll give a talk myself, titled “Dynamic Infrastructure Orchestration”. My goal is to point out possible next steps after getting configuration management solutions like Puppet or Chef in place. I’ll cover two different approaches:
- using central key-value stores like etcd for service discovery and configuration, and
- decentralising system automation with tools like serf.
OSDC always is a fun event and I’m excited to meet old colleagues and talk to fellow IT professionals.
Hm. I wonder if the city has changed much since I last visited Berlin in 1987…
04 Apr 2014