3 times 52 is 156 – hey, we’ve had our three-year anniversary just recently! Shouldn’t there have been a celebration or something?
From Tuesday to Thursday, Jochen visited Nuremberg for the Open Source Datacenter Conference (OSDC) where he’s been a regular attendee and speaker for some years now. OSDC is a great conference to learn about interesting open source software projects for IT operations and to exchange experiences with other professionals using OSS in the datacenter. Jochen came back with a long list of ideas for how we can improve both our daily work and the services we offer to our customers.
For the rest of the week, Jochen had his focus on our recruiting. We’ve received some interesting new applications for our sysadmin job offer and we’ll make sure not to delay checking if they’re a good fit.
Meanwhile, Markus took care of our existing IT infrastructure and our customers. Behind the scenes, he worked on adding geo-replication to our storage clusters which will come in handy for a number of purposes when it’s ready for production. As he’s getting more involved in our marketing, he also made some preparations for the Drupal Dev Days which we’re supporting as Gold Sponsor.
Next week, we’re both going to meet with a business coach for a one-day sales and marketing workshop. Getting external support on running our company is something we still need to get used to and we’re excited to see what results this workshop will bring.
19 Apr 2013
On many hosting platforms, including our own DrupalCONCEPT, secure traffic that is encrypted via SSL has to be handled directly by the web server. This not only puts additional computing load on those servers, it also prevents HTTP caching, which reduces responsiveness. To speed up the delivery of static page assets, some customers choose to use “mixed mode”, i.e. deliver these assets via HTTP even if the page is requested via SSL. But because this workaround can cause sensitive data to be transferred in an insecure way, it is not a practice we recommend.
For freistilbox, we eliminated this shortcoming! If you want to add SSL encryption to a website hosted on freistilbox, we have a great feature for you: SSL offloading. This means that SSL packets are decrypted the moment they reach our freistilbox infrastructure. The content of these SSL packets is then passed on to the next system layers as plain HTTP requests. This has several advantages.
First, content caching works both for plain HTTP and for SSL traffic. Since the Varnish cache proxy is located between the SSL offloading layer and your freistilboxes, it can store static assets and even pages regardless of encryption. You really don’t need to unsettle your visitors with those “mixed content” browser warnings.
The second benefit of SSL offloading is made obvious by its name: Your web application servers don’t have to use precious computing resources for decrypting requests and encrypting responses. Our hosting platform takes complete care of that. (As usual with freistilbox, I can’t resist adding.)
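In nginx terms, an SSL offloading layer does something like the following. This is a simplified sketch with hypothetical host names, certificate paths and ports, not our actual configuration:

```nginx
# SSL offloading: terminate TLS at this layer, talk plain HTTP to the backends.
server {
    listen 443 ssl;
    server_name www.example.org;

    ssl_certificate     /etc/ssl/example.org.crt;
    ssl_certificate_key /etc/ssl/example.org.key;

    location / {
        # Pass the decrypted request on as plain HTTP, e.g. to a Varnish cache.
        proxy_pass http://varnish-cache:6081;
        # Tell the application that the original request was encrypted,
        # so it can generate https:// links and set secure cookies.
        proxy_set_header X-Forwarded-Proto https;
        proxy_set_header Host $host;
    }
}
```

Because everything behind this layer is plain HTTP, the same cache and the same application configuration serve both encrypted and unencrypted visitors.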
So go on, make your website more secure and enable SSL! You’ll find everything you need to set up SSL in our online documentation.
16 Apr 2013
Just a short message while I’m packing my bag for this year’s Open Source Datacenter Conference: If you’re going to be in Nürnberg on Wednesday or Thursday as well, I’d be happy to meet you! Let’s talk DevOps over some coffee or a drink at the bar!
15 Apr 2013
Our freistilbox hosting platform is built from the ground up with high availability in mind. In order to minimize the impact of failures, every backend service (i.e. each MySQL database, each Apache Solr core etc.) is running on at least two servers. And if you run your website on more than a single freistilbox, you’re in good shape on the web application level, too.
Redundancy alone doesn’t guarantee maximum uptime, though. Recently, we had to deal with various kinds of network problems ranging from minor packet loss to a full loss of external connectivity. While we can’t prevent datacenter staff from mistakenly shutting down our IP addresses on the routing level, we realized that we needed to make our infrastructure more resilient against other, more common, network issues.
We found that even minor network congestion, often caused by high traffic from or to a neighboring server of another datacenter customer, could seriously impact requests from our web boxes to backend services. The reason for this is that, on a box doing hundreds or even thousands of database requests per second, increases of only a few milliseconds in network latency add up quickly. This can impact operation to the extent that the box becomes incapable of serving new incoming requests because it fills up with web server processes waiting for their data.
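A back-of-the-envelope calculation illustrates why a few milliseconds matter at this request rate. The numbers below are hypothetical, chosen for illustration, not measurements from our platform:

```python
# How a small latency increase inflates the number of busy worker processes
# on a single web box. All numbers are hypothetical.

backend_requests_per_sec = 2000   # DB/Solr requests issued by one web box
baseline_latency_s = 0.002        # 2 ms per backend request, normal network
congested_latency_s = 0.007       # 7 ms under mild network congestion

# Little's law: average concurrency = arrival rate * time spent waiting.
busy_before = backend_requests_per_sec * baseline_latency_s
busy_after = backend_requests_per_sec * congested_latency_s

print(busy_before)  # 4.0  -> four workers tied up waiting, on average
print(busy_after)   # 14.0 -> a 5 ms bump more than triples that
```

With a fixed pool of web server processes, that multiplication is exactly how a box "runs full" and stops accepting new requests.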
This problem would be even more severe if, instead of leasing “bare-metal servers”, we were using cloud-based infrastructure where we can’t even influence with whom we’re sharing a VM host. The Drupal experts at 2bits even make this recommendation to VPS users:

> When you encounter variable performance or poor performance, before wasting time on troubleshooting that may not lead anywhere, it is worthwhile to contact your host, and ask for your VPS to be moved to a different physical server. Doing so most likely will solve the issue, since you effectively have a different set of housemates.
With IaaS vendors like Amazon, that would mean replacing your server instances with others on a trial-and-error basis. What a pain.
To minimize the impact of network performance degradation on our hosting infrastructure, we’ve started three improvement projects:
- Optimize request distribution at the loadbalancer level.
- Build our own CDN.
- Move our servers into dedicated racks.
We have already finished project 1. A loadbalancer needs to distribute HTTP requests to those backend boxes that have the necessary resources and are responsive. Boxes that are maxed out or do not respond for other reasons become ineligible. We recently optimized the health checks that our loadbalancers use to determine which boxes are ready to receive requests. Now, a box only receives HTTP requests after it has proved itself stable by successfully responding to a continuous series of health checks.
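In HAProxy, which runs on our loadbalancers, this kind of policy is expressed with health-check options roughly like this. The backend names, addresses and thresholds below are hypothetical, a simplified sketch rather than our production configuration:

```haproxy
backend web_boxes
    # Probe each box with an HTTP health check instead of a bare TCP connect.
    option httpchk GET /health

    # 'rise 5': a box must answer five consecutive checks successfully
    # before it is considered healthy and receives traffic (again).
    # 'fall 2': two failed checks in a row take it out of rotation.
    server box1 10.0.0.11:80 check inter 2s rise 5 fall 2
    server box2 10.0.0.12:80 check inter 2s rise 5 fall 2
```

The asymmetry is deliberate: a box drops out quickly when it struggles, but has to demonstrate sustained stability before it is trusted with requests again.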
One way boxes become unresponsive is when their backend requests “get stuck” on the network. And since we don’t control the network layer, we instead chose to minimize our dependency on it. That’s why, in project 2, we’re building our own Content Delivery Network. We’re going to cover this topic in another blog post, so stay tuned!
Where we still need to rely on the communication with backend services (for example, with database clusters), we need to make this communication more robust. That’s the goal of project 3. We are going to move our servers into our own racks where they share a direct network connection only with each other, not with other datacenter customers. This dedicated network connection makes data transfers between our servers faster, more reliable and more secure.
These are only the most prominent of the changes we make day in, day out to improve the performance and availability of our freistilbox hosting platform. And while the quality of our services keeps growing, our prices don’t. So, if you know someone who’s looking for a hosting service that reduces their IT headaches without breaking the bank, please tell them about us!
And if you’d like to help us improve our next-generation managed hosting, join the team!
15 Apr 2013
DevOps Days London took place a few weeks ago on 15 and 16 March 2013. I’ve finished my review a bit late, but I had so much fun at the conference — or rather, unconference — that I’d like to post it anyway.
“Bridge the Gap” — Adopting this modification of the well-known London underground warning as the event’s motto was a stroke of genius. After all, that’s what DevOps is all about: crossing the chasm between software development and IT operations. And there were many practical examples of these efforts at DevOps Days London.
I attended all the talks on both days. They were all interesting and some even entertaining as well:
- “DevOps For Dinosaurs - My experience in introducing a DevOps culture in a traditional enterprise”
- “Checking DevOps’ vital signs - how healthy is your culture?”
- “StartOps: Growing an ops team from 1 founder”
- “Adding Business Metrics”
- “DevOps and the traditional enterprise IT - Opposites and the best of two worlds”
With “DevOps in the Hell of a Thousand Different Platforms”, Sam Eaton gave a both highly entertaining and insightful talk that made “fail cake” a trending topic for the #devopsdays hashtag.
In the last keynote titled “Lessons Learned From Manufacturing For Maximizing Flow From Dev To Ops”, Gene Kim laid out some insights from his new book “The Phoenix Project”. It’s uncanny how much the situations described in his “IT novel” match my experience in corporate IT. Over lunch, I took the opportunity to tell Gene that reading “The Phoenix Project” is fun but at the same time nearly causes me PTSD…
After the keynotes, there was opportunity for participants to give IGNITE talks. I especially enjoyed Patrick Debois’s talk “What if config management was created by game designers?”.
Open Space sessions
DevOps Days, like Barcamps, are organised as an “unconference”. They don’t consist only of a fixed session schedule with speakers designated in advance by the organizing team, which would reduce the participants to a mostly passive audience. Instead, DevOps Days leave most of the time open for topics brought in by the participants themselves. Both afternoons were reserved for “OpenSpace”, a very flexible format where everyone can suggest session topics, which may even change while the sessions are under way.
I was excited to join a session about hiring for DevOps teams and found out that we’re on the right track with our own growth efforts. I also had suggested a session about “Open Source storage solutions” myself. It was a great success, both in number of participants and in the insight I gained from the conversation. As an outcome of this session, we’re going to research object storage systems like MogileFS.
On Sunday, I had to leave shortly after lunch because I had chosen an early flight home to my family. I’m sure the OpenSpace sessions were just as inspiring as Saturday’s.
Location and catering
The event was hosted at the “Mary Ward House Conference & Exhibition Centre”. The house is a bit rambling and we had to climb stairs and turn many corners every time we needed to change rooms or go to the loo. On the other hand, this prevented us from further endangering our health by sitting all the time.
As earthly beings, we still need more than only food for thought. For beverages, we could choose between water, coffee and, we’re in London after all, tea. At lunch time, we were offered tasty options for both carnivores and vegetarians, and there were baskets of cookies for tea time. I think the level of catering was just about right for such a low-price event. I don’t like sitting through talks hungry but I also hate falling into a post-lunch digestion coma, and they hit the sweet spot in between.
After a packed conference day, I’m usually quite exhausted; a tribute I have to pay to my introverted nature. That and my burning interest in learning more about object storage systems led to my decision to forego the social event on Friday evening and instead have a pizza alone before spending the evening in front of my laptop. What I heard, though, is that many people enjoyed having drinks at “The Last”.
Although I missed the Sunday OpenSpace sessions, I found DevOps Days London highly inspiring and the results well worth the trip. The “unconference” character of the event, which lets everyone address their own issues, and the very active “hallway track” are what, from my perspective, make DevOps Days essential community events.
If you’d like to go to one of the next European DevOps Days events, you should consider joining our team! We regularly send team members to community events, all expenses paid. Interested? Get in touch!
I look forward to the upcoming DevOps Days in Berlin! See you there?
09 Apr 2013
On Wednesday night, we experienced a massive loadbalancer outage that affected a huge part of the websites that we are hosting. I’d like to take the time to explain what went wrong, and what consequences this incident will have on how we build our IT infrastructure with our partners.
We use loadbalancers to distribute incoming requests from website visitors to the right web application servers. In our case, these loadbalancers are Linux servers running HTTP proxy software like HAProxy and nginx. Of course, we have redundancy for machines of this importance, so every loadbalancer configuration always runs on a pair of machines. In the case of an outage, caused for example by a hardware failure, we can switch the routing of the loadbalancer’s IP addresses to the spare machine which immediately starts distributing incoming requests. While we can switch these IP addresses between servers, from a billing perspective they are permanently associated with one single server.
Because of our rapidly growing freistilbox infrastructure, we recently decided to replace the oldest loadbalancer pair with much more powerful hardware after three years of operation. This loadbalancer is responsible for routing a big part of the incoming traffic to our DrupalCONCEPT and freistilbox clusters at our datacenter partner Hetzner AG.
In preparation for the hardware upgrade, we built the first node of the new loadbalancer pair and switched the routing of all of the old loadbalancer’s IP addresses to this new machine a few days in advance. This switch happened overnight and there was no service interruption. We were pleased to see that the new server managed all incoming requests with a mere 2% of its CPU power.
Now we had to upgrade the old LB server with which all the loadbalancer IP addresses were associated. For network architecture reasons, the new machine needed to physically replace the old one and on Tuesday, 2013-03-26, at about 14:30 UTC, Hetzner datacenter staff swapped the servers. Since web traffic was already handled by the other new loadbalancer node, the replacement procedure had no impact on website operation.
We found only one seemingly small issue after the upgrade. The IP addresses now associated with the new server were not yet displayed on the datacenter management web interface. Their routing was obviously working and all websites were reachable, so no emergency measures seemed necessary. We sent a support request to the datacenter, though, asking why the address list had vanished.
To make sure that loadbalancer operation was not in danger, we followed up with a call to Hetzner support at 16:07 UTC. The support agent told us that the subnets were still associated with the server and our customer account and that we’d get feedback from backoffice support the following day.
In the night, at 00:16 UTC on 2013-03-27, our monitoring system suddenly started sending “IP Address down” alerts. A lot of alerts, actually. It quickly became clear that all IP addresses associated with the new loadbalancer had gone down. Which meant that many websites had become unreachable. Our on-call engineer immediately sent a support request to the datacenter. He also tried to get direct information from Hetzner support via phone but was asked to wait for an email response. Another inquiry attempt about 15 minutes later was cut short, too.
When we still didn’t have any feedback at 01:30, we called Hetzner again to emphasize the severity of this outage. We were told that their network team did not have a night shift presence at the datacenter and that the network engineer on call had not responded yet. We demanded to have the issue escalated to highest priority and to be kept in the loop about any progress. The support agent confirmed that he’d make sure that we’d get feedback within a few minutes.
Still waiting for feedback at 01:59 UTC, we were relieved to see the first recovery notifications from our monitoring system. One of the missing subnets was even displayed again in the datacenter web UI.
But there were a lot of addresses that were still down, so we called Hetzner support again at 02:18. The agent, sounding clearly annoyed, stated that he had already sent an email response that all addresses were active again and that if there were problems remaining, they were probably caused by our system configuration. Not accepting this simplistic explanation, we told the agent that we’d prepare a list of the addresses that were still down so Hetzner could actually check them.
While collecting this information, we realized that only the first quarter of the biggest IP subnet on the loadbalancer was online again. We contacted Hetzner again, indicating that they had probably used a wrong prefix or subnet mask while reconfiguring the routing. A few minutes later, at 02:54, our monitoring sent us recovery notifications for all remaining addresses.
Root cause analysis
First thing in the morning, we contacted our Hetzner sales contact, gave them our timeline of the outage and asked for an explanation of what had happened. It turns out that we were right with our concerns about the vanished address list: When the contract for the old server was terminated after it got replaced, its IP addresses got canceled with it. Then, in the night, an automatic deprovisioning process removed them from the routing tables.
Where we go from here
Our sales contact at Hetzner apologized sincerely for this clerical error and a day later notified us that they added a safety step to their cancelation process. Now, the person making the contract change gets a warning message asking them to confirm with sales, in case of doubt, whether an upgraded server’s address list really should be canceled with it.
This outage could have been prevented completely if either our support request about the IP addresses missing in the web UI had been handled earlier or the support agent we spoke to on Tuesday afternoon had realized that the addresses had actually been canceled with the old server.
The loadbalancer downtime would also have been much shorter if the on-call network engineer at Hetzner had acted more quickly and then also had taken more care in reconfiguring the routing and making sure that all IP addresses were reachable again. We especially find it unacceptable that the support agent we spoke to tried to pass the buck to us and that we had to prove that service restoration had indeed not been executed properly.
That’s why we chose to escalate this incident to Hetzner’s CEO. We also asked for a personal meeting with the managers responsible for datacenter and support operations to discuss how we can cooperate more effectively. We haven’t yet heard back from Hetzner on this request and will check back with them in a few days.
Even though we had executed every step of our loadbalancer upgrade with diligence and tried to make sure that there was no impact on website operation at any time, we suffered a significant outage. This shows how dependent we are on our IT partners, their processes and their staff, and we’re going to put more effort into making sure that the companies we partner with share our values and goals regarding service quality. Additionally, on a technological level, we’re discussing how we can further increase the availability of our customers’ websites by spreading our infrastructure out over multiple IT infrastructure providers.
In closing, I apologize sincerely for this outage. We were lucky that it happened at a time when its impact on website visitors was low, but it was 2.5 hours of downtime nonetheless. This is unacceptable for a company that promises its customers that they won’t have to worry about their hosting in any way. We are making every effort to prevent such an outage from happening ever again.
Jochen Lillich, founder and IT architect, freistil IT
30 Mar 2013
Dear freistil IT customers,
From Friday, 29th March (Good Friday) to Sunday, 7th April, we’re going to take some time off to recharge.
Of course, emergency support will be available during this time. Outages and other problems with our IT infrastructure that impact the delivery of your websites will be handled in the usual swift manner at any time.
On Monday, 8th April, we’re going to resume working on tasks that are not related to production problems.
If you do plan to launch a new website or if you are going to need other kinds of assistance during this timeframe, please let us know immediately via our Help Center. We’ll be happy to see what we can arrange to support you.
We wish you a happy Easter weekend and some joyful spring days! The team at freistil IT
18 Mar 2013
Today’s Irish Times has an article by Jennifer O’Connell titled “If you want me to work in an office, I demand to commute in a flying car”. This idea, she explains, comes from a conversation with her father who had read Toffler’s “Third Wave” and expected our generation to replace working in offices mostly by teleworking and “then spend all the time we weren’t commuting to the office in our flying cars, pursuing more worthwhile projects”.
Okay, we still don’t have flying cars. (Which fails to make me sad – would you want those imbeciles you encounter all day in the air above you?) And with Yahoo! calling all their telecommuters back into corporate shelter, some seem to regard working from home an impossible dream, too.
The author disagrees and stresses that, while certainly not always easy, “being in control of your environment is surprisingly life-affirming”. It may require new measures of productivity, though. Instead of checking punch cards and counting butts-in-chairs, productivity in today’s companies needs to be judged by actual results. In such an environment (known as “ROWE”, results-only work environment), working from home holds up quite well against offices where, as O’Connell describes, “leaving your desk to locate a stapler can take 45 seconds on a Monday morning, but four and a half hours on a Friday afternoon”. She backs up her perspective by highlighting studies that have shown that “employees who worked from home were 13 per cent more productive, took fewer sick days, had higher job satisfaction, and were half as likely to leave”.
We built freistil IT as a ROWE from the start, with everyone working from any place where they can be the most effective. I’m writing this post in a bakery sipping great coffee before I’m off to get my daughter from Kindergarten. Before, I’ve spent the afternoon at my desk at a shared office space, and last night at 2:30, I did software upgrades on our servers from the living room table. I would not want to go back to having to commute – even in a flying car. We’re very happy working as a distributed team.
And we’re hiring! So, if you’d like to join our both productive and happy team, jump on over to our Jobs Page and get in touch!
06 Mar 2013