Monthly sum-up for April and May 2014
Published 2014-06-12 by
And there’s another month gone past. Unfortunately, my sum-up for April got lost between incidents, conferences and a short vacation (nice concept BTW, that latter one, I really should do that more often than every three years). So this is going to be a sum-up for both recent months, April and May.
After a cleanup of unused websites in April which decreased the number of active websites to 263, we’re back at the level of March with 281 instances. We’re very happy to report that our overall uptime is steadily approaching the perfection mark: Pingdom tells us we achieved 99.98% availability in April and May (99.93% in March). Our edge traffic didn’t change much over the recent weeks: it was 11.48 TB in April and 11.72 TB in May (11.96 TB in March).
Continuous improvement is visible in our technical support numbers. In March, we received 150 support requests, of which we solved 126. In April, the ratio was 178/132, and in May 147/114. The backlog of open tickets decreased significantly from 48 to 35 tickets (-27%).
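For anyone who wants to check the backlog figure, here is a minimal sketch of the arithmetic. The helper name `percent_change` is ours for illustration and not part of any tooling we use; the two input values are the ones quoted above.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent (hypothetical helper)."""
    return (new - old) / old * 100


# Open-ticket backlog before and after, as quoted in the text above.
backlog_before = 48
backlog_after = 35

change = percent_change(backlog_before, backlog_after)
print(f"Backlog change: {change:.0f}%")
```

Rounded to a whole number, the result matches the -27% stated above.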
Looking back on April, our Help Center statistics shocked us with an average reaction time of 33.5 hours. We quickly found out that this was caused by just two tickets that we had put on hold for a few weeks in agreement with the respective customer. We changed our process to always give initial feedback as quickly as possible, and our average reaction time for May returned to a saner value of 9.7 hours.
When we break down these average reaction times, we can see the effect of the new rule. We answered 56% of all tickets in under 1h (March: 42%), 25% within 8h (March: 31%), 9% within 24h (March: 15%) and only 12% after 24h (March: 20%).
With now double the capacity in our operations team (see below), we’re confident that we’ll be able to push both reaction and resolution time down even more in coming months.
Our customers’ satisfaction rating stayed at a perfect 100% and we again got a lot of awesome feedback:
- “Top class support. Very fast and very friendly.”
- “Problem solved flawlessly, although it took more than 24h until the website was available again… Since we’re not live yet, I’ll give you a ‘Good’ anyhow ;-))”
- “Response was very fast. Everything ok ;-)”
- “Thanks again for the support in the middle of the night!”
- “Competent clarification, super every time!”
We finally found the time to do some cleanup work and put a few unneeded machines out to pasture, so although we added some new ones, we’re still counting 284 servers, same as in March. The number of metrics we collect from our servers every 10s also got cleaned up and went down to 90,684 (-17%).
At the end of May, we had an embarrassing incident where it took us much too long to restore servers after a RAID failure, and we even lost the contents of one server completely. This has led to a bunch of remediation tasks, some of which are still in progress. We’ve made sure that such an incident doesn’t repeat. While learning is one of our core values, “learning by suffering” is the worst way to do it and completely unacceptable if it’s our customers that suffer.
Unfortunately, we’ve not yet managed to reduce our number of on-call alerts; actually, it’s even grown by 18% to 1,332 alerts in May. This is mainly due to alert multiplication (different alerts that share the same cause). To improve this, some work on our monitoring configuration is required. Good thing that we got reinforcement!
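To illustrate what we mean by alert multiplication, here is a minimal sketch and not our actual monitoring configuration. It assumes each alert already carries a `cause` key, which is a hypothetical field; in practice, working out that shared cause is the hard part. The host names, check names and causes are made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    host: str
    check: str
    cause: str  # hypothetical root-cause key; real alerts rarely come with one


def group_by_cause(alerts):
    """Collapse alerts that share a cause into a single group."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert.cause].append(alert)
    return dict(groups)


# Made-up sample data: one database outage triggers three separate alerts.
alerts = [
    Alert("web01", "http", "db01 down"),
    Alert("web02", "http", "db01 down"),
    Alert("db01", "mysql", "db01 down"),
    Alert("web03", "disk", "disk full on web03"),
]

incidents = group_by_cause(alerts)
print(f"{len(alerts)} alerts, {len(incidents)} distinct causes")
```

In this made-up sample, four alerts collapse into two causes, which is the kind of reduction we hope to get from reworking the monitoring configuration.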
Having Philipp Kaiser join our ops team in May has been the highlight of our year so far! His experience with huge data center installations will help us tackle future projects, and the added team capacity enables us to reduce our “technical debt”, i.e. finally take care of low-priority tasks that we had postponed for lack of capacity. Philipp is currently undergoing the “Ops Bootcamp” where we’re introducing him to everything he needs to know about our IT infrastructure and processes. In parallel, he has also started to work on his first production tasks. We’ll talk about his first impressions at his new workplace in another blog post soon.
As active members of our open source communities, we participated in a bunch of events during April and May:
- Markus explained how to build high-performance Drupal websites at the World Hosting Days.
- I did a session on “Dynamic Infrastructure Orchestration” at the Open Source Datacenter Conference.
- At DrupalCamp Frankfurt, Markus talked about “How to automate your Drupal development environment”.
- I first talked about “Doing DevOps with Drupal” at DrupalCamp Scotland in Edinburgh,
- and, since that talk got a lot of great feedback, repeated it at the Drupal Open Days Dublin.
April and May were two months jam-packed with valuable experiences for us, both as individuals and as a company. We’re excited to see what June has in store for us!