Just a short message while I’m packing my bag for this year’s Open Source Datacenter Conference: If you’re going to be in Nürnberg on Wednesday or Thursday as well, I’d be happy to meet you! Let’s talk DevOps over some coffee or a drink at the bar!
15 Apr 2013
Our freistilbox hosting platform is built from the ground up with high availability in mind. In order to minimize the impact of failures, every backend service (i.e. each MySQL database, each Apache Solr core etc.) is running on at least two servers. And if you run your website on more than a single freistilbox, you’re in good shape on the web application level, too.
Redundancy alone doesn’t guarantee maximum uptime, though. Recently, we had to deal with various kinds of network problems ranging from minor packet loss to a full loss of external connectivity. While we can’t prevent datacenter staff from mistakenly shutting down our IP addresses on the routing level, we realized that we needed to make our infrastructure more resilient against other, more common, network issues.
We found that even smaller network congestions, oftentimes caused by high traffic from or to a neighboring server of another datacenter customer, could seriously impact requests from our web boxes to backend services. The reason for this is that, on a box doing hundreds or even thousands of database requests per second, increases of only a few milliseconds in network latency add up quickly. This can very well impact operation to the extent that the box becomes incapable of serving new incoming requests because it runs full with web server processes waiting for their data.
This problem would be even more severe if, instead of leasing “bare-metal servers”, we were using cloud-based infrastructure where we can’t even influence with whom we’re sharing a VM host. The Drupal experts at 2bits even make this recommendation to VPS users:>When you encounter variable performance or poor performance, before wasting time on troubleshooting that may not lead anywhere, it is worthwhile to contact your host, and ask for your VPS to be moved to a different physical server. Doing so most likely will solve the issue, since you effectively have a different set of housemates.
With IaaS vendors like Amazon, that would mean replacing your server instances with others on a trial-and-error basis. What a pain.
To minimize the impact of network performance degradation on our hosting infrastructure, we’ve started three improvement projects:
- Optimize request distribution at the loadbalancer level.
- Build our own CDN.
- Move our servers into dedicated racks.
We did already finish project 1. A loadbalancer needs to distribute HTTP requests to those backend boxes that have the necessary resources and are responsive. Boxes that are maxed out or do not respond for other reasons become ineligible. We recently optimized the health checks that our loadbalancers use to determine what boxes are ready to receive requests. Now, a box only gets passed HTTP requests if it proved itself to be stable by successfully responding to a continuous series of health checks.
One cause of boxes to become unresponsive is that their backend requests “get stuck” on the network. And since we don’t control the network layer, we instead chose to minimize our dependency on it. That’s why, in project 2, we’re building our own Content Delivery Network. We’re going to cover this topic in another blog post, so stay tuned!
Where we still need to rely on the communication with backend services (for example, with database clusters), we need to make this communication more robust. That’s the goal of project 3. We are going to move our servers into our own racks where they share a direct network connection only with each other, not with other datacenter customers. This dedicated network connection makes data transfers between our servers faster, more reliable and more secure.
These are only the most prominent ones of all changes that we’re doing day in, day out to improve the performance and availability of our freistilbox hosting platform. And although the quality of our services is growing steadily, our prices don’t. So, if you know someone who’s looking for a hosting service that reduces their IT headaches without breaking the bank, please tell them about us!
And if you’d like to help us improve our next-generation managed hosting, join the team!
15 Apr 2013
DevOps Days London took place a few weeks ago on 15 and 16 of March 2013. I’ve finished my review a bit late, but I had so much fun at the conference — or rather, unconference — that I’d like to post it anyway.
“Bridge the Gap” — Adopting this modification of the well-known London underground warning as the event’s motto was a stroke of genius. After all, that’s what DevOps is all about: crossing the chasm between software development and IT operations. And there were many practical examples of these efforts at DevOps Days London.
I attended all the talks on both days. They were all interesting and some even entertaining as well:
- “DevOps For Dinosaurs - My experience in introducing a DevOps culture in a traditional enterprise”
- “Checking DevOps’ vital signs - how healthy is your culture?”
- “StartOps: Growing an ops team from 1 founder”
- “Adding Business Metrics”
- “DevOps and the traditional enterprise IT - Opposites and the best of two worlds”
With “DevOps in the Hell of a Thousand Different Platforms”, Sam Eaton gave a both highly entertaining and insightful talk that made “fail cake” a trending topic for the #devopsdays hashtag.
In the last keynote titled “Lessons Learned From Manufacturing For Maximizing Flow From Dev To Ops”, Gene Kim laid out some insight from his new book “The Phoenix Project”. It’s uncanny how much the situations described in his “IT novel” match my experience in corporate IT. Over lunch, I took the opportunity to tell Gene that reading “The Phoenix Project” is fun but at the same time nearly causes me PTSD…
After the keynotes, there was opportunity for participants to give IGNITE talks. I’ve especially enjoyed Patrick Dubios’ talk “What if config management was created by game designers?”.
Open Space sessions
DevOps Days, like Barcamps, are organised as an “unconference”. They don’t consist only of a fixed session schedule with speakers designated in advance by the orga team, which would reduce the participants to a mostly passive audience. Instead, DevOps Days leave most of the time open for topics brought in by the participants themselves. Both afternoons were reserved for “OpenSpace”, a very flexible format where everyone can suggest session topics and these may even still change during the sessions.
I was excited to join a session about hiring for DevOps teams and found out that we’re on the right track with our own growth efforts. I also had suggested a session about “Open Source storage solutions” myself. It was a great success, both in number of participants and in the insight I gained from the conversation. As an outcome of this session, we’re going to research object storage systems like MogileFS.
On Sunday, I had to already leave shortly after lunch because I chose an early flight home to my family. I’m sure the OpenSpace sessions were equally inspiring as the Saturday ones.
Location and catering
The event was hosted at the “Mary Ward House Conference & Exhibition Centre”. The house is a bit rambling and we had to climb stairs and turn many corners every time we needed to change rooms or go to the loo. On the other hand, this prevented us from further endangering our health by sitting all the time.
As earthly beings, we still need more than only food for thought. For beverages, we could choose between water, coffee and, we’re in London after all, tea. At lunch time, we were offered tasty options for both carnivores and vegetarians, and there were baskets of cookies for tea time. I think the level of catering was just about right for such a low-price event. I don’t like sitting through talks hungry but I also hate falling into a post-lunch digestion coma, and they hit the sweet spot in between.
After a packed conference day, I’m usually quite exhausted; a tribute I have to pay to my introvert nature. That and my burning interest in learning more about object storage systems led to my decision to forego the social event on Friday evening and instead have a pizza alone before spending the evening in front of my laptop. What I heard, though, is that many people enjoyed having drinks at “The Last”.
Although I missed the Sunday OpenSpace sessions, I found DevOps Days London highly inspiring and the results well worth the trip. The “unconference” character of the event which lets everyone address their own issues, and the very active “hallway track” are what from my perspective make DevOps Days essential community events.
If you’d like to go to one of the next european DevOps Days events, you should consider joining our team! We regularly send team members to community events, all expenses paid. Interested? Get in touch!
I look forward to the upcoming DevOps Days in Berlin! See you there?
09 Apr 2013
On Wednesday night, we experienced a massive loadbalancer outage that affected a huge part of the websites that we are hosting. I’d like to take the time to explain what went wrong, and what consequences this incident will have on how we build our IT infrastructure with our partners.
We use loadbalancers to distribute incoming requests from website visitors to the right web application servers. In our case, these loadbalancers are Linux servers running HTTP proxy software like HAProxy and nginx. Of course, we have redundancy for machines of this importance, so every loadbalancer configuration always runs on a pair of machines. In the case of an outage, caused for example by a hardware failure, we can switch the routing of the loadbalancer’s IP addresses to the spare machine which immediately starts distributing incoming requests. While we can switch these IP addresses between servers, from a billing perspective they are permanently associated with one single server.
Because of our rapidly growing freistilbox infrastructure, we recently decided to replace the oldest loadbalancer pair with much more powerful hardware after three years of operation. This loadbalancer is responsible for routing a big part of the incoming traffic to our DrupalCONCEPT and freistilbox clusters at our datacenter partner Hetzner AG.
In preparation of the hardware upgrade, we first built the first node of the new loadbalancer pair and switched the routing of all of the old loadbalancer’s IP addresses to this new machine a few days in advance. This switch happened over night and there was no service interruption. We were pleased to see that the new server managed all incoming requests with a mere 2% of its CPU power.
Now we had to upgrade the old LB server with which all the loadbalancer IP addresses were associated. For network architecture reasons, the new machine needed to physically replace the old one and on Tuesday, 2013-03-26, at about 14:30 UTC, Hetzner datacenter staff swapped the servers. Since web traffic was already handled by the other new loadbalancer node, the replacement procedure had no impact on website operation.
We only found a seemingly small issue after the upgrade. The IP addresses now associated with the new server were not yet displayed on the datacenter management web interface. Their routing was obviously working and all websites were reachable, so no emergency measures seemed necessary. We sent a support request to the datacenter, though, asking why the address list had vanished.
To make sure that loadbalancer operation was not in danger, we followed up with a call to Hetzner support at 16:07 UTC. The support agent told us that the subnets were still associated with the server and our customer account and that we’d get feedback from backoffice support the following day.
In the night, at 00:16 UTC on 2013-03-27, our monitoring system suddenly started sending “IP Address down” alerts. A lot of alerts, actually. It quickly became clear that all IP addresses associated with the new loadbalancer had gone down. Which meant that many websites had become unreachable. Our on-call engineer immediately sent a support request to the datacenter. He also tried to get direct information from Hetzner support via phone but was asked to wait for an email response. Another inquiry attempt about 15 minutes later was cut short, too.
When we still didn’t have any feedback at 01:30, we called Hetzner again to emphasize the severity of this outage. We were told that their network team did not have a night shift presence at the datacenter and that the network engineer on call had not responded yet. We demanded to have the issue escalated to highest priority and to be kept in the loop about any progress. The support agent confirmed that he’d make sure that we’d get feedback within a few minutes.
Still waiting for feedback at 01:59 UTC, we were relieved to see first recovery notifications from our monitoring system. One of the missing subnets even was displayed again in the datacenter web UI.
But there were a lot of addresses that were still down, so we called Hetzner support again at 02:18. The agent, sounding clearly annoyed, stated that he had already sent an email response that all addresses were active again and that if there were problems remaining, they were probably caused by our system configuration. Not accepting this simplistic explanation, we told the agent that we’d prepare a list of the addresses that were still down so Hetzner could actually check them.
While collecting this information, we realized that only the first quarter of the biggest IP subnet on the loadbalancer was online again. We contacted Hetzner again, indicating that they had probably used a wrong prefix or subnet mask while reconfiguring the routing. A few minutes later, at 02:54, our monitoring sent us recovery notifications for all remaining addresses.
Root cause analysis
First thing In the morning, we contacted our Hetzner sales contact, gave them our timeline of the outage and asked for an explanation for what had happened. It turns out that we were right with our concerns about the vanished address list: When the contract for the old server was terminated after it got replaced, its IP addresses got canceled with it. Then, in the night, an automatic deprovisioning process removed them from the routing tables.
Where we go from here
Our sales contact at Hetzner apologized sincerely for this clerical error and a day later notified us that they added a security step to their cancelation process. Now, the person doing the contract change gets a warning message that asks them to in doubt confirm with sales if an upgraded server’s address list should be canceled with it.
This outage could have been prevented completely if either our support request about the IP addresses missing in the web UI would have been handled earlier or if the support agent that we spoke to on Tuesday afternoon would have realized that the addresses had actually been canceled with the old server.
The loadbalancer downtime would also have been much shorter if the on-call network engineer at Hetzner had acted more quickly and then also had taken more care in reconfiguring the routing and making sure that all IP addresses were reachable again. We especially find it unacceptable that the support agent we spoke to tried to pass the buck to us and that we had to prove that service restoration had indeed not been executed properly.
That’s why we chose to escalate this incident to Hetzner’s CEO. We also asked for a personal meeting with the managers responsible for datacenter and support operations to discuss how we can cooperate more effectively. We haven’t yet heard back from Hetzner on this request and will check back with them in a few days.
Even though we had executed every step of our loadbalancer upgrade with diligence and tried to make sure that there was no impact on website operation at any time, we suffered a significant outage. This shows how dependent we are on our IT partners, their processes and staff and we’re going to put more effort into making sure that the companies with which we partner align with our values and goals towards service quality. Additionally, on a technological level, we’re discussing how we can increase the availability of our customers’ websites further by spreading our infrastructure out over multiple IT infrastructure providers.
In closing, I apologize sincerely for this outage. We were lucky that it happened at a time where its impact on website visitors was low but it was 2,5 hours of downtime nonetheless. This is unacceptable for a company that promises its customers that they won’t have to worry about their hosting in any way. We are making every effort to prevent such an outage from happening ever again.
Jochen Lillich, founder and IT architect, freistil IT
30 Mar 2013
Dear freistil IT customers,
From 29nd March (Good Friday) to Friday, 7th April, we’re going take some time off to recharge.
Of course, emergency support will be available during this time. Outages and other problems with our IT infrastructure that impact the delivery of your websites will be handled in the usual swift manner at any time.
On Monday, 8th April, we’re going to resume working on tasks that are not related to production problems.
If you do plan to launch a new website or if you are going to need other kinds of assistance during this timeframe, please let us know immediately via our Help Center. We’ll be happy to see what we can arrange to support you.
We wish you a happy Easter weekend and some joyful spring days! The team at freistil IT
18 Mar 2013
Today’s Irish Times has an article by Jennifer O’Connell titled “If you want me to work in an office, I demand to commute in a flying car”. This idea, she explains, comes from a conversation with her father who had read Toffler’s “Third Wave” and expected our generation to replace working in offices mostly by teleworking and “then spend all the time we weren’t commuting to the office in our flying cars, pursuing more worthwhile projects”.
Okay, we still don’t have flying cars. (Which fails to make me sad – would you want those imbeciles you encounter all day in the air above you?) And with Yahoo! calling all their telecommuters back into corporate shelter, some seem to regard working from home an impossible dream, too.
The author disagrees and stresses that, while certainly not always easy, “being in control of your environment is surprisingly life-affirming”. It may require new measures of productivity, though. Instead of checking punch cards and counting butts-in-chairs, productivity in today’s companies needs to be judged by actual results. In such an environment (known as “ROWE”, results-only work environment), working from home holds up quite well against offices where, as O’Connell describes, “leaving your desk to locate a stapler can take 45 seconds on a Monday morning, but four and a half hours on a Friday afternoon”. She backs up her perspective by highlighting studies that have shown that “employees who worked from home were 13 per cent more productive, took fewer sick days, had higher job satisfaction, and were half as likely to leave”.
We built freistil IT as a ROWE from the start, with everyone working from any place where they can be the most effective. I’m writing this post in a bakery sipping great coffee before I’m off to get my daughter from Kindergarten. Before, I’ve spent the afternoon at my desk at a shared office space, and last night at 2:30, I did software upgrades on our servers from the living room table. I would not want to go back to having to commute – even in a flying car. We’re very happy working as a distributed team.
And we’re hiring! So, if you’d like to join our both productive and happy team, jump on over to our Jobs Page and get in touch!
06 Mar 2013
This summer will offer a great opportunity for Drupal developers to get together in one of Europe’s greatest cities: Dublin will be the location for the Drupal Dev Days from June 28th to 30th.
Of course, the freistil IT team will be in Dublin, too. Who in their right mind would want to miss the conference sessions during the day and the craic over some Guinness at night?
01 Mar 2013
On Wednesday, February 6th, and Thursday, February 7th, we had significant outages and we want to take the time to explain what happened. These outages impacted many customer websites and are not at all acceptable to us. I’m very sorry that they happened and our team is working hard to prevent similar incidents in the future.
Our IT infrastructure today consists of more than 180 servers. While we manage the software side of these servers completely, from the OS level to the applications they’re hosting, we decided right from the beginning not to spend time on maintaining hardware and datacenter infrastructure, e.g. network connectivity. That’s why we lease all our servers from our datacenter partners. Almost all our servers are provided by Hetzner AG which operates multiple datacenters in different parts of Germany.
This is an effective arrangement because datacenter services, like almost all IT services, benefit from economies of scale and we are still far from the number of servers that would get us to break even doing them ourselves. By leasing the hardware, we don’t have to pay staff to go on-site to connect new servers to the datacenter infrastructure or to replace broken parts of production machines. Instead, we have access to experienced datacenter staff and 24/7 support from our partners.
To avoid single points of failure, we distribute our servers over the different datacenters. Especially, we make sure that the nodes of a single cluster are located in different datacenters.
The downside of our approach is that we have to accept the fact that we depend on our partners to provide the level of service quality we need. As the recent incidents show, this is unfortunately not always the case.
What went wrong?
On February 6th, our monitoring system started at about 10:10 UTC to alert us of network packet loss levels of 50% to 100% with a number of servers and a lot of failing service checks, which most of the times is a symptom of connectivity problems. We recognized quickly that most of the servers with bad connectivity were located in Hetzner datacenter #10. We also received Twitter posts from Hetzner customers whose servers were running in DC #10. This suggested a problem with a central network component, most probably a router or distribution switch.
The problem was not limited to DC #10, though, and we started to get alerts about saturated web server workers from many other datacenters, too. It didn’t take us long to find that one of our storage cluster nodes, “stor02a”, is located in DC #10. Because our web application clusters store their static content files and their logs on shared storage clusters, the ones which were using this particular storage cluster were affected by the network failure, even if they were located outside DC #10.
Shared storage impact
Our shared storage architecture consists of a number of fileserver clusters which use the Gluster filesystem for redundant file storage and failure handling. With Gluster, files do not get replicated between the server nodes but by the storage clients (in our case, the web application servers). They maintain a connection to every active storage node and use these connections for reading. If a file needs to be written, the client repeats the change for every connected storage node. Metadata stored with the files is used to keep track of each file’s replication status.
The packet loss between the web servers and “stor02a” caused to an increasing number of retries which slowed down file access significantly. In turn, this kept web server processes busy much longer than normal and eventually led to a saturation of available HTTP connections. In other words, the websites on these clusters became unreachable.
If a storage node fails completely (e.g. due to a hardware failure or power outage), the Gluster clients will quickly notice repeated connection failures and stop accessing this node. In this incident, though, the network connection kept going down and up again, so the clients kept trying to access “stor02a”. When we became aware of this problem at about 10:35, we decided to shut down “stor02a” manually to provoke a failure event.
Shortly after, at about 10:50, network connectivity in DC #10 became stable again and web server load went down to normal levels.
We had a few additional network issues during the day but they always had already subsided when our on-call staff got notified. That’s why we decided to close the incident.
Unfortunately, we had to reopen it again on the next day. On 2013-02-07 from about 09:40 to 10:56 UTC, we experienced the same kind of network problems in DC #10 again. This time, Hetzner published a datacenter status update explaining that they were caused by a bug in a router firmware.
Unfortunately, the malfunctioning network had caused additional problems which we became aware of in the afternoon when a customer called our support hotline because their website failed to deliver certain image files. We found that this was caused by a split-brain situation on the storage cluster “stor02” where changes made on node “stor02b” weren’t reflected on “stor02a” and the self-heal algorithm built into the Gluster filesystem was not able to resolve this inconsistency between the two data sets.
We were able to resolve this secondary incident by doing backups of both data sets and then deleting the older one. Now, the self-heal mechanism didn’t get contradicting metadata any more and successfully mirrored the intact data set from “stor02b” to “stor02a”. Unfortunately, this caused another brief overload of the web nodes because of a short surge in network traffic.
Where do we go from here?
- We will look for effective changes to our architecture that could lessen the impact of local network malfunctions on our server infrastructure.
- We will investigate if we can further optimize our storage configuration to make it more resilient against network malfunctions.
- We will add checks to our monitoring system that will immediately inform us of data inconsistencies between the nodes of a storage cluster.
- We will define and document a Standard Operating Procedure of how to deal with partial or full storage cluster outages.
- We will work closely with our datacenter partners to make sure that there are effective communication channels established between our operations teams in the case of datacenter incidents.
I couldn’t be more sorry about the incident and the impact it had on our customers. We always use problems like this as an opportunity to improve our infrastructure and processes, and this will be no exception. Thank you for your continued support of freistil IT, we are working hard and making significant investments to make sure we live up to the trust you’ve placed in us.
14 Feb 2013