Philipp started in May as the first employee of freistil IT Ltd. He is a long-time system administrator with a lot of experience in operating web-scale infrastructure.
Since May, Philipp has been going through Operations Bootcamp, a training series where new sysadmins learn everything they need about our tools, processes and especially all the software we run on our servers. He also started taking care of day-to-day tasks like support requests or the replacement of old infrastructure.
Here’s what Philipp makes of his first weeks at freistil IT:
Life before freistil IT
“My job before freistil was nine years of corporate work that I don’t want to have missed, but I’m really happy that it’s over now. I’ve learned a lot in different teams and environments over time, and I had the chance to learn from some of the best developers and IT system cracks I’ve met so far. But being a small wheel in a large machine can be very frustrating, and from some point on, there’s only little chance to level up.”
What made freistil IT stand out?
“Let’s say “nine years are enough”! :) I felt the need for something new, and it should be something smaller, more flexible, and should not have anything to do with operating JBoss application servers. ;) Being given the opportunity to work remotely is a completely new experience to me. I wanted to try this way of working and the job offer from freistil was the only one I got featuring this. And last but not least, I knew Jochen before, so I knew the start would be a bit easier than in a completely new working environment.”
What are your first impressions of the team and culture?
“The team (and here, I refer more to Jochen and Markus than to myself) is highly skilled and quite enthusiastic. My learning curve is not a curve at all, it’s more like a vertical line, which can be strenuous sometimes, but at the end of the day is exactly what I was looking for. The culture is new, fresh, and free. Fantastic when you’re coming from a musty corporate environment.”
What do you want to achieve with the operations team?
“Keep it up and running, but also pull it to the latest trends and never let the needs of our customers out of sight. And I’m looking forward to getting more involved in the Open Source community, which is much easier here at freistil IT, I think.”
Thanks, Philipp! We’re happy to have you on our team and are looking forward to a great time!
18 Jul 2014
If you’re familiar with my monthly retrospectives, you know they come with a lot of numbers. Let’s start with these: 7:1 ! That’s what proper teamwork looks like, folks!
At freistil IT, we enjoy our effective collaboration, too. While Philipp is still going through Operations Bootcamp (for an IT infrastructure like ours, there’s a lot of ground to cover), he’s also started to assist in production changes. We’ll soon post a blog entry with his first impressions of our small business. Markus decided to examine the life of a digital nomad for a few weeks: He’s traveling around Germany with his family in a camping van and takes care of our servers and your support requests via mobile broadband and local WiFi hotspots. And I’m enjoying a magnificent Irish summer at home, trying not to get a sunburn!
In June, more than 40 new websites launched on our hosting platform, increasing the total by 15% to 324. With the number of websites, our total traffic also grew: We delivered 12.15 terabytes of content in June, 3.7% more than in the month before.
Another number makes us very happy, too: We managed to keep the overall availability at an excellent level of 99.97% (99.98% in May).
While the number of engineering support requests was exactly the same as in May (150), our resolution rate somehow was 15% lower with 107 tickets compared to 126 in May. We’ll need to have a look at possible causes.
On the other hand, we’ve improved our average reaction time by a whopping 35%, from 9.7 hours down to only 6.3 hours. The reaction time breakdown below shows that we were able to cut the number of customers that had to wait more than 24 hours for a first reply by almost 60%!
We’re also proud that we’ve kept a perfect score of 100% positive customer feedback. Judging from comments like “Very friendly and helpful. My request has been solved very quickly. Thanks for that.”, we’re doing a good job.
With the continuing success of freistilbox, our infrastructure grew by 6% to 300 servers. Our metrics monitoring even increased by 9% to 98,951 metric points that we collect every 10 seconds.
That at the same time the number of on-call alerts went down by 13% isn’t something we’ll complain about. ;-)
June was a bit less conference-heavy and while I manned the stations, the new ops team went on tour:
Markus attended WordCamp Hamburg and gave a talk about automated WordPress development setups. When he sat down with our friends from Palasthotel to demonstrate our communication tools, it turned into an impromptu BoF on remote working!
Just a few days later, both Philipp and Markus went to the Netherlands and brought back a lot of inspiration from DevOps Days Amsterdam.
I’m also proud that I’ve been selected to present among the high-calibre speakers at the Open Source Monitoring Conference in November.
Usually, the coming months will be a bit more quiet because many of our customers take their vacation time (there it is again, that word…). Since our internal backlogs are looong, there won’t be any boredom, though. And of course, we’ll always be there if you need us!
10 Jul 2014
On Tuesday night, I had the opportunity to give a talk at the “Entrepreneurs Anonymous Dublin” meetup about my professional journey from my first VIC-20 home computer to my current business with hundreds of Linux boxes. I also explained our management philosophy and how we foster trust and motivation by giving our employees as much freedom as possible.
I took the DART train back to Bray with meeting host John Muldoon and he expressed his happiness to hear about a company that successfully implemented innovative approaches like ROWE and unlimited time off. He had read about such methods but in many discussions got the feedback that they could not work in practice. I beg to differ.
While we’re certainly a small company that is agile enough to experiment with novel HR methods, our approach actually is based on solid scientific findings. It’s a proven fact that creating a work environment of trust and freedom leads to better business.
There’s a great summary of current study results in the New York Times opinion piece “Why You Hate Work”. The company of author Tony Schwartz conducted a study with more than 20,000 employees and the result was unambiguous:
“Employees are vastly more satisfied and productive, it turns out, when four of their core needs are met: physical, through opportunities to regularly renew and recharge at work; emotional, by feeling valued and appreciated for their contributions; mental, when they have the opportunity to focus in an absorbed way on their most important tasks and define when and where they get their work done; and spiritual, by doing more of what they do best and enjoy most, and by feeling connected to a higher purpose at work.”
So, are we some sort of business hippies when we aim at more satisfied employees? No, we are effective entrepreneurs:
“In a 2012 meta-analysis of 263 research studies across 192 companies, Gallup found that companies in the top quartile for engaged employees, compared with the bottom quartile, had 22 percent higher profitability, 10 percent higher customer ratings, 28 percent less theft and 48 percent fewer safety incidents.”
It makes a lot of business sense for us to ask “What would make our employees feel more valued, more productive and more inspired?” For example, our policy to not track work time lets people take breaks without guilt and recharge their creativity batteries.
“Employees who take a break every 90 minutes report a 30 percent higher level of focus than those who take no breaks or just one during the day. They also report a nearly 50 percent greater capacity to think creatively and a 46 percent higher level of health and well-being.”
By treating our employees as the adults they are, we send a clear signal that we value not only their skills but their whole personality. This strengthens their identification with the company and its goals:
“Employees who say they have more supportive supervisors are 1.3 times as likely to stay with the organization and are 67 percent more engaged.”
Our Results-Only Work Environment lets our team members focus on getting the important things done. Isn’t that one of the most essential things in a business? And as long as, when it comes to delivering results, the “what” and the “when” are okay, we simply don’t need to talk about the “how”.
“Only 20 percent of respondents said they were able to focus on one task at a time at work, but those who could were 50 percent more engaged.”
Another common HR topic is retention. When people actually turn out to be the great contributors we saw in them when we hired them, it’s obvious that we’d like to keep them for as long as possible. Well, that’s not that hard to achieve:
“Employees who derive meaning and significance from their work were more than three times as likely to stay with their organizations — the highest single impact of any variable in our survey.”
We really don’t worry about team members abusing their freedom, for example their unlimited time off. Worrying would actually hurt our chances of building a highly productive team:
“Partly, the challenge for employers is trust. For example, our study found that employees have a deep desire for flexibility about where and when they work — and far higher engagement when they have more choice. But many employers remain fearful that their employees won’t accomplish their work without constant oversight — a belief that ironically feeds the distrust of their employees, and diminishes their engagement.”
Trust is a business investment where the gains are much greater than the risks. For us, granting our employees as much freedom as possible is an obvious choice.
26 Jun 2014
In the “What’s Hot, What’s Not” index in a recent Irish Times Magazine, I found two work-related entries:
What’s Hot: Annual leave approval – Sweet, liberating relief.
What’s Not: Annual ‘Goal Setting’ at work – Taking your ambitions and shaping them into corporate jargon. When dreams get replaced by deliverables.
This brings back memories of my corporate past, and they’re not fond ones.
“Annual leave”. Just the name brings out the cynic in me. “You don’t have to work your ass off all of the year. We have reserved a few days in summer where you may take some PTO. But make sure that you’ll be back refreshed and ready to work for the next 11 months!”
And then there’s the “goal setting”. Done in a regular fashion in order to keep you on target. But only within intervals like quarterly or even yearly to not bother the manager who’s responsible for your performance too much. In between goal appraisals, you’re on your own, of course. Didn’t get the resources you needed approved? “You have to do more with less.” Got your priorities overruled by your own or some other manager again and again? “You need to keep focusing on your responsibilities!” Changes in strategy made your goal senseless? “You should think more big-picture!”
I’ve come to think of these practices as occupational therapy for managers who don’t have the skills to do any better. It’s passive management. And passive management is bad for the business.
That’s why we’re taking a radically different approach at freistil IT. In 2010, just when I was getting started with my company, Cali Ressler and Jody Thompson published their book “Why Work Sucks and How to Fix It: The Results-Only Revolution”. Reading it, I recognised many of the bad management patterns they described because I had experienced them.
A Results-Only Work Environment (ROWE) is based on trust and accountability. If I trust my employees to put in the hours stated in their contract, I don’t need to count butts in chairs. If they deliver appropriate results, I don’t need to ask when or where they worked on achieving them. Treating people as trustworthy adults frees a lot of time to do actual management, e.g. creating individual career development plans or helping employees overcome problems they can’t solve when left to their own devices. The catch: a ROWE requires active management.
Active management is what happens between meetings.
In today’s world, things change far too frequently for annual goals to retain any value. In software engineering, we’re using agile and lean methodologies to continuously resharpen our focus, to keep up with moving targets. Software is developed in weekly sprints instead of annual big-bang releases. Why should these agile approaches not also be applied to relationship and performance development (a.k.a. management)?
At freistil IT, we don’t count work hours and give our employees unlimited time off. All we ask for is to give as many days advance notice as the vacation is going to have. You’ll probably ask: “And what happens if someone abuses this?” Then it’s probably time to talk to the employee about how they’re damaging their professional relationship with the whole company. It’s also necessary to look at the conditions in which this misbehaviour was able to develop. Were there mistakes made during the hiring process, since cultural fit is one of the most important selection criteria? Were there events that turned motivation into frustration and self-indulgence? Giving employees freedom and making sure that it’s not abused is active management.
In order to give our work the right direction, we use the V2MOM model that Marc Benioff created at Salesforce.com and described in his book “Behind the Cloud”. On every level of the company, we define
- the Vision for our business,
- the Values our work is based on,
- the Methods we plan to employ,
- the Obstacles we anticipate on the way, and
- the Metrics we’re going to use to judge the success of our work.
We discuss and define these 5 aspects for the whole company, we break them down to the team level, and finally, every team member defines their own V2MOM based on the company and team model. Since transparency is one of our core values, we publish all V2MOMs on our company wiki. Each V2MOM gets revised regularly, generally every quarter or biannually. Longer intervals are okay here because a vision is not necessarily a concrete goal. V2MOM is just the framework that helps us decide what is going to advance the business in the right direction and what not (the “V2” part). It also provides the basic tools (the “MOM” part) that we’ll need to make daily progress towards these goals. We then communicate our progress via IDoneThis and discuss it in weekly one-on-one talks. Aligning goals to an overarching strategy and continuously adjusting our aim is active management.
These two approaches actually go hand in hand. We are convinced that employees who fit our culture will not abuse their freedom. Because they understand the company’s vision and values and have discovered how their own vision and values fit into this framework, there’s no need for any micro-management. Instead, we let people make their own decisions and trust that it’ll be for the best of the business.
And judging from the results so far, “That’s Hot”.
23 Jun 2014
If you’re the type of customer we love the most, you’re a Drupal or WordPress shop that builds amazing websites. This requires great developers and these developers tend to know a thing or two about web infrastructure. So, why not have them also run the hosting of the websites they know best?
Let me tell you why not. Why I think that that’s a really bad idea that can quickly lead you to lose track of your main business goal, which is — remember — building amazing websites.
The world of web operations
Running a website that serves a lot of users is far from trivial. There are a lot of IT topics that need to be covered in order to build and operate an application that…
- …reliably and quickly delivers the information the user needs (= performance),
- …can cope with a steadily (or even exponentially!) growing user base (= scalability),
- …and is robust enough that smaller incidents (e.g. disk failure, network partitions) will not cause it to be inaccessible (= availability).
I found a detailed overview of all the important issues that an operations engineer needs to address in Mathias Meyer’s blog post “Web Operations 101 For Developers”. It’s a long post and I highly recommend reading it in full (after you’ve finished this article).
Every business relies on some kind of infrastructure. If you were a transport business, you’d rely on infrastructure like highways, gas stations and warehouses. Your business is based on web applications, so you rely on IT infrastructure like networks and server racks, operating systems and software applications.
Getting some kind of hosting infrastructure is easy. It’s just a few clicks over at Amazon Web Services or DigitalOcean. But in his article, Mathias points out the catch:
“Every little piece of it can break at any time, can stall at any time. The more pieces you have in your application puzzle, the more breaking points you have. And everything that can break, will break.”
Someone needs to manage this IT infrastructure. This could be you or someone from your team, it could also be someone you specifically hire for that task. And keeping stuff running requires know-how and experience:
“You don’t need to know everything about every piece of hardware out there, but you should be able to investigate strengths and weaknesses, when an SSD is an appropriate tool to use, and when SAS drives will kick butt. Learn to distinguish the different levels of RAID, why having an additional file system buffer on top of a RAID that doesn’t have a backup battery for its own internal write buffer is a bad idea. That’s a pretty good start, and will make decisions much easier.”
I’d say that’s quite a laundry list of insight that doesn’t come by just reading some manuals. And that’s only the hardware aspect – Mathias also details a separate list for the operating system level.
Is this how you want to spend valuable engineering time?
There will come the time when stuff hits the fan.
“You should be willing to dig into whatever data you have posthumous to find whatever went wrong, whatever caused a strange latency spike in database queries, or caused an unusually high amount of errors in your application.”
Troubleshooting and incident response are a special area of expertise that requires both deep knowledge and experience to find and eliminate the problem’s root causes.
Is this how you want to spend valuable engineering time?
Deploying your application to a single server is easy and it’s actually not that much more demanding to use version control software like Git or even a deployment automation tool like Capistrano. But how about deploying a new app version to 5 or 15 servers? What if that new version alters the database schema, making it incompatible with older versions, so all servers need to be updated at the same time instead of sequentially?
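To make the difference between the two rollout styles concrete, here is a minimal Python sketch. It is an illustration under assumptions, not how any particular deployment tool works: the server names, the batch size and the `deploy_to` callback are all hypothetical placeholders.

```python
from typing import Callable, List


def plan_rollout(servers: List[str], schema_compatible: bool,
                 batch_size: int = 1) -> List[List[str]]:
    """Group servers into deployment batches.

    If old and new app versions can run side by side, roll out in small
    batches so most servers keep serving traffic. If the schema change
    breaks the old version, all servers have to switch in a single batch.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if not servers:
        return []
    if not schema_compatible:
        return [list(servers)]
    return [servers[i:i + batch_size] for i in range(0, len(servers), batch_size)]


def run_rollout(batches: List[List[str]], deploy_to: Callable[[str], None]) -> None:
    """Deploy batch by batch; `deploy_to` is a hypothetical per-server deploy step."""
    for batch in batches:
        for server in batch:
            deploy_to(server)
```

Even this toy version leaves out everything that makes real deployments hard: health checks between batches, rollbacks and the schema migration itself.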
As Mathias points out in his post, you need automation:
“There’s an abundance of tools available to automate infrastructure, hand-written script are only the simplest part of it. Once you go beyond managing just one or two servers, tools like Chef, Puppet and MCollective come in very handy to automate everything from setting up bare servers to pushing out configuration changes from a single point, to deploying code.”
But before you will be able to benefit from the high efficiency these tools offer, you need to learn how they work and how you describe to them the infrastructure you want them to build.
Is this how you want to spend valuable engineering time?
Over its lifetime, your web application will probably become more complex and with it the IT infrastructure required to support it. You’ll add a caching service here, a key-value database there – want a PHP extension with that? All these add-ons need to be installed, configured and fine-tuned.
“Whenever you add a new component, a new feature to an application, you add a new point of failure.”
Complex systems tend to break in very “interesting” ways, so troubleshooting will also become more difficult as your application grows.
Is this how you want to spend valuable engineering time?
Only by monitoring the current status of your hosting components and recording metrics about their performance over time can you make decisions when things start to behave strangely, or, better yet, before they do so.
“I can’t say it enough how important having a proper monitoring and metrics gathering system in place is. It should be by your side from day one of any testing deployment.”
So you’ll soon decide to get some monitoring software and a metrics collection service in place. But that’s just the start:
“You’ll never get alerting and thresholds right the first time, you’ll adapt over time, identifying false negatives and false positives, but if you don’t have a system in place at all, you’ll never know what hit your application or your servers.”
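One common way to cut down on false positives is to alert only when a metric stays above its threshold for several samples in a row. The following Python sketch shows the idea; the function name, the threshold and the sample values are made up for illustration and don't come from any specific monitoring product.

```python
from typing import Iterable, List


def breaches(samples: Iterable[float], threshold: float,
             min_consecutive: int = 3) -> List[int]:
    """Return the sample indexes at which an alert would fire.

    An alert fires once a value has stayed above `threshold` for
    `min_consecutive` samples in a row, and only once per streak.
    """
    alerts = []
    streak = 0
    for index, value in enumerate(samples):
        if value > threshold:
            streak += 1
            if streak == min_consecutive:
                alerts.append(index)
        else:
            streak = 0
    return alerts


# Hypothetical disk usage ratios, one sample per interval.
usage = [0.5, 0.95, 0.6, 0.92, 0.97, 0.99, 0.98, 0.4]
noisy = breaches(usage, threshold=0.9, min_consecutive=1)
calm = breaches(usage, threshold=0.9, min_consecutive=3)
```

With `min_consecutive=1`, the short spike at the second sample already triggers an alert; with `min_consecutive=3`, only the sustained streak does. Picking these numbers is exactly the tuning work the quote describes.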
Is this how you want to spend valuable engineering time?
Probably every service in your hosting infrastructure writes some kind of log where it saves details about the things it does and events that happen. That’s very useful:
“In case of an emergency, a good set of log files will mean the world to you. This doesn’t just include the standard set of log files available on a Unix system. It includes your application and all services involved too.”
But each service will log its own kind of details in its individual format, sometimes as a text file, sometimes in a database. It takes a lot of time to learn how to find and understand the relevant stories buried in thousands of lines of text scattered over different sources.
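As a small taste of that work, here is a Python sketch that merges two differently formatted logs into a single timeline. Both line formats are invented for this example; real services each bring their own format, which is precisely the problem.

```python
import re
from datetime import datetime
from typing import Iterable, List, Optional, Tuple

# Two made-up line formats standing in for "every service logs differently".
APP_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?P<level>\w+)\] (?P<msg>.*)$")
WEB_LINE = re.compile(
    r"^(?P<ts>\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}:\d{2}) (?P<msg>.*)$")

Event = Tuple[datetime, str]


def parse_line(line: str) -> Optional[Event]:
    """Turn one log line into (timestamp, message), or None if unrecognised."""
    match = APP_LINE.match(line)
    if match:
        stamp = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
        return stamp, f'{match["level"]}: {match["msg"]}'
    match = WEB_LINE.match(line)
    if match:
        stamp = datetime.strptime(match["ts"], "%d.%m.%Y %H:%M:%S")
        return stamp, match["msg"]
    return None


def merge_logs(*sources: Iterable[str]) -> List[Event]:
    """Parse all sources and return their events in chronological order."""
    events = [event
              for source in sources
              for event in map(parse_line, source)
              if event is not None]
    return sorted(events, key=lambda event: event[0])
```

Two formats are manageable. A production stack easily has a dozen, plus time zones, multi-line stack traces and logs that live in a database instead of a file.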
Is this how you want to spend valuable engineering time?
Failure will happen. All the time.
“The bottom line of everything is, stuff breaks, everything breaks at different scale. Embrace breakage and failure, it will help you learn and improve your knowledge and skill set over time.”
In our experience, failures will almost every time lead to better insight, improved skills and a more robust hosting infrastructure. But:
Is this how you want to spend valuable engineering time?
Stay on course
The answer is No. No, you most certainly don’t want to spend valuable engineering time on doing all these daily IT operations tasks. They tend to get more and more expensive over time, and, more importantly, they distract you from your core business.
Behind freistilbox, there’s a team of IT experts that know how to manage a growing business-critical infrastructure. We take care of all daily (and nightly) operations tasks, handle incidents and make sure that your website runs with optimal performance.
By fully managing your hosting platform, we enable you to keep a laser-like focus on your mission: building amazing websites.
That’s how you should spend every second of valuable engineering time.
How you can do DevOps without an ops team
Better yet, we’re available to you like an in-house ops team, via phone, email and chat; with our Premium Support, you can even reach us 24/7.
- Got a question about HTTP caching headers? We’ll explain them to you over the phone.
- You need help in optimising a database query? Send us a support request and we’ll work out a solution.
- You’d like us to keep an eye on our servers while you launch your new website? We’ll set up a chat room where you get instant answers and live updates on how your hosting platform is keeping up.
This is much more than just technical support, it’s decades of IT know-how at your fingertips during the whole life cycle of your web application. And it’s included for free in all our hosting packages.
freistilbox is not only high-performance web hosting, it’s DevOps done right.
19 Jun 2014
And there’s another month gone past. Unfortunately, my sum-up for April got lost between incidents, conferences and a short vacation (nice concept BTW, that latter one, I really should do that more often than every three years). So this is going to be a sum-up for both recent months, April and May.
After a cleanup of unused websites in April which decreased the number of active websites to 263, we’re back at the level of March with 281 instances. We’re very happy to report that our overall uptime is steadily approaching the perfection mark: Pingdom tells us we achieved 99.98% availability in April and May (99.93% in March). Our edge traffic didn’t change much over the recent weeks, it’s been 11.48 TB in April and 11.72 TB in May (11.96 in March).
Continuous improvement is visible in our technical support numbers. In March, we received 150 support requests of which we solved 126. In April, the relation was 178/132, and in May 147/114. The backlog of open tickets decreased significantly from 48 to 35 tickets (-27%).
Looking back on April, our Help Center statistics shocked us with an average reaction time of 33.5 hours. We quickly found out that was caused by two single tickets that we had put on hold for a few weeks in agreement with the respective customer. We changed our process to always give initial feedback as quickly as possible and our average reaction time for May returned to a saner value of 9.7 hours.
When we break down these average reaction times, we can see the effect of the new rule. We answered 56% of all tickets in under 1h (March: 42%), 25% within 8h (March: 31%), 9% within 24h (March: 15%) and only 12% after 24h (March: 20%).
With now double the capacity in our operations team (see below), we’re confident that we’ll be able to push both reaction and resolution time down even more in coming months.
Our customers’ satisfaction rating stayed at a perfect 100% and we again got a lot of awesome feedback:
- “Top class support. Very fast and very friendly.”
- “Problem solved flawlessly, although it took more than 24h until the website was available again… Since we’re not live yet, I’ll give you a ‘Good’ anyhow ;-))”
- “Response was very fast. Everything ok ;-)”
- “Thanks again for the support in the middle of the night!”
- “Competent clarification, super every time!”
We finally found the time to do some cleanup work and put a few unneeded machines out to pasture, so although we added some new ones, we’re still counting 284 servers, same as in March. The number of metrics we collect from our servers every 10s also got cleaned up and went down to 90,684 (-17%).
At the end of May, we had an embarrassing incident where it took us much too long to restore servers after a RAID failure, and we even lost the contents of one server completely. This has led to a bunch of remediation tasks, some of which are still in progress. We’ve made sure that such an incident doesn’t repeat. While learning is one of our core values, “learning by suffering” is the worst way to do it and completely unacceptable if it’s our customers that suffer.
Unfortunately, we’ve not yet managed to reduce our number of on-call alerts; actually, it’s even grown by 18% to 1332 alerts in May. This is mainly due to alert multiplication — different alerts that share the same cause. To improve this, some work on our monitoring configuration is required. Good thing that we got reinforcement!
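To illustrate what alert multiplication looks like and one simple way to tame it, here is a Python sketch that groups alerts from the same host when they arrive in quick succession. This is a hypothetical illustration, not our monitoring configuration: the `Alert` fields, the five-minute window and the assumption that "same host, close in time" means "same cause" are all simplifications.

```python
from typing import Dict, List, NamedTuple


class Alert(NamedTuple):
    timestamp: int  # seconds since an arbitrary starting point
    host: str
    check: str


def group_alerts(alerts: List[Alert], window: int = 300) -> List[List[Alert]]:
    """Group alerts per host while they keep arriving within `window` seconds.

    Each group stands for one probable incident, so only the first alert
    of a group would need to page someone.
    """
    groups: List[List[Alert]] = []
    open_group: Dict[str, List[Alert]] = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = open_group.get(alert.host)
        if current and alert.timestamp - current[-1].timestamp <= window:
            current.append(alert)
        else:
            current = [alert]
            open_group[alert.host] = current
            groups.append(current)
    return groups
```

A failing disk that sets off "disk", "mysql" and "http" checks on the same machine within a minute would then count as one incident instead of three alerts.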
Having Philipp Kaiser join our ops team in May so far has been the highlight of our year! His experience with huge data center installations will help us tackle future projects, and the added team capacity enables us to reduce our “technical debt”, i.e. finally take care of low-priority tasks that we had postponed for a lack of capacity. Philipp is currently undergoing the “Ops Bootcamp” where we’re introducing him to everything he needs to know about our IT infrastructure and processes. In parallel, he’s also started to work on his first production tasks. We’ll talk about his first impressions at his new workplace in another blog post soon.
As active members of our open source communities, we participated in a bunch of events during April and May:
- Markus explained how to build high-performance Drupal websites at the World Hosting Days.
- I did a session on “Dynamic Infrastructure Orchestration” at the Open Source Datacenter Conference.
- At DrupalCamp Frankfurt, Markus talked about “How to automate your Drupal development environment”.
- I first talked about “Doing DevOps with Drupal” at DrupalCamp Scotland in Edinburgh,
- and, since that talk got a lot of great feedback, repeated it at the Drupal Open Days Dublin.
April and May were two months jam-packed with valuable experiences for us as individuals and as a company as a whole. We’re excited to see what June has got in store for us!
12 Jun 2014
The “Changelog” is a new category in our blog where we publish important changes to freistilbox infrastructure and functionality.
Each freistilbox cluster comes with its own “shell node” that customers access via SSH to run maintenance tasks like mysqldump or drush. In order to make it easy to access the right website instance, each one has its own user account.
So far, the interactive use of these user accounts was severely limited by tight write restrictions on the user home directory.
In a change we’ve rolled out this week, we’ve replaced the old instance directories with homes to which the shell user has full write access. This solves the problems that many customers experienced when they tried to store configuration files or to create arbitrary files and subdirectories.
Together with all the symlinks to important website directories, the work subdirectory that we used to create as a workaround for the previous write restrictions has been automatically moved to the new shell user home directory. Apart from the full write permissions, everything should look and function exactly as it used to.
30 May 2014
On Thursday, 15 May, one of our VM hosts named “vm3” did not return to operation after a standard maintenance procedure, resulting in an outage of more than 14 hours. While we were able to restore all affected DrupalCONCEPT POWER servers, we only had backups available that were more than 24 hours old. And in the case of a custom-built managed server, we even lost most of its files completely.
We regard reliability and effective IT processes as essential for our business. An outage of this duration and with these results is not acceptable. We are embarrassed and deeply sorry about this incident and I apologize on behalf of freistil IT to all customers that we disappointed.
In this review, I’d like to give you detailed insight into what’s happened and what we’re going to do to prevent incidents like this in the future.
On Monday, 12 May, the VM host vm3 signaled one of two disks of its RAID-1 array as failed. It kept running on the second disk without any problems. We scheduled a maintenance window to have the failed disk replaced for Thursday, 15 May at 19:00 UTC, and announced the scheduled maintenance on the freistilbox Status Page.
Data center staff shut down the server at 18:55 UTC (a few minutes early) and replaced the broken disk. After restarting the server, we found that the server would not boot into a working system again. It turned out that there was no bootable operating system available on the remaining disk any more, which suggested that this disk had failed, too. When we realised that there was nothing we could do about the second failed disk, we decided to go the only viable, albeit laborious, way of rebuilding the server from scratch. After getting the second disk replaced, we started reinstalling the server OS, then the host environment and finally, the guest servers.
When we started the restore process, we realised that already the first phase, building a directory tree of the data to restore, would take several hours. We hoped that it would finish over night, but after 7 hours on Friday morning, the backup database was still working on collecting data for the restore directory tree. Fortunately, we found out by experimenting that by aborting the slow query on the database server, we could force the backup system to fall back to doing a full restore of all files in the backup set.
After the restore jobs were finished on all affected servers, we started reimporting the database dumps that were included in the backups. That’s when we found that we had timed the creation of these dumps badly: The job for doing daily database dumps actually ran later than the file backup that was supposed to pick them up. Restoring from the Wednesday night backup already meant losing almost a whole day of data, and on top of that, this backup only contained the database dumps from Tuesday night.
And as if this wasn’t bad enough news for our customers already, it turned out that one of the affected servers didn’t have any of its websites backed up at all. The respective server is a custom-built managed server. While with DrupalCONCEPT and freistilbox servers, everything (including the backup) is configured automatically, this server would have needed a manual backup configuration and we obviously had forgotten this part during setup.
Some customers had newer backups available that we were able to copy back to their server but in the end, most of them still suffered a catastrophic loss of data.
On Friday at about 11:30 UTC, all servers were online again. We then spent the rest of the day with assisting our affected customers to solve some minor remaining issues.
What we are going to do about it
In a post mortem meeting on Monday, 19 May, we discussed the incident and decided on remediation measures to prevent it from repeating.
The root cause of the incident, the loss of both disks of a RAID-1 array, is a rare event but we need to be prepared for it to occur. We especially need to minimise the amount of data lost due to such a failure.
While the affected customers had consciously chosen a one-server setup that has many single points of failure (SPOF), neither they nor we had expected that an outage would take this long and would result in such catastrophic data loss. We need to make sure that all our backups have complete coverage and that they can be restored within a reasonable amount of time (a few hours max).
As a result of our post mortem, we decided on the following remedial measures:
- We checked to make sure that all customer data, especially on custom-built servers, will be fully backed up from now on.
- We rescheduled our file backup in order to include the latest database dumps.
- Planned maintenance must be done right after a backup run. We will either schedule it after the regular daily backup job or we’ll trigger an extra backup in advance of the maintenance.
- We’ll schedule regular disaster recovery exercises where we take production backups and restore them to a spare server.
- We’ll research how we can speed up the restore process. This could mean improvements to specific components or even switching to a different backup system altogether.
- If customers need shorter backup periods than 24 hours, we’ll support them in setting up custom backup jobs directly from their content management system.
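The scheduling problem behind the second measure can be shown with a short Python sketch. It computes how old the newest finished database dump is at the moment the file backup starts. The job times and the dump duration below are invented for illustration; they are not our actual schedule.

```python
from datetime import datetime, time, timedelta


def dump_age_at_backup(dump_start: time, dump_duration: timedelta,
                       backup_start: time) -> timedelta:
    """Age of the newest finished daily dump when the daily file backup starts."""
    day = datetime(2014, 5, 14)  # arbitrary reference day
    backup_at = datetime.combine(day, backup_start)
    dump_done = datetime.combine(day, dump_start) + dump_duration
    # If today's dump isn't finished when the backup starts, the backup
    # can only pick up the dump from the day before.
    while dump_done > backup_at:
        dump_done -= timedelta(days=1)
    return backup_at - dump_done


# Hypothetical schedules: dump after the backup vs. dump before the backup.
bad = dump_age_at_backup(time(2, 0), timedelta(minutes=30), time(1, 0))
good = dump_age_at_backup(time(0, 0), timedelta(minutes=30), time(1, 0))
```

In the first schedule, every file backup carries a database dump that is already 22.5 hours old; in the second, the dump is only 30 minutes old. A check like this is cheap to run whenever a job gets rescheduled.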
In conclusion, I’d like to state that this incident showed an embarrassing lack of preparation on our side for the failure of a whole disk array. I apologize to all affected customers that we were not able to restore normal operation more quickly and to the full extent. I assure you that we are working hard to prevent an incident like this from ever happening again.
26 May 2014