Last month, Netways held their annual Open Source Monitoring Conference. System engineers from Germany and abroad came to Nuremberg to learn more about tools that help us better understand our IT infrastructure.
Netways was gracious enough to invite me as a speaker, and the video of my talk just became available. I gave an introduction to Sensu, a great monitoring framework that we started to use when Nagios could no longer handle our steadily growing number of checks.
I had a lot of fun at OSMC. Netways not only did a good job in selecting interesting talks by competent speakers but also went to great lengths with the catering during the day and an awesome conference dinner. For me, OSDC and OSMC are also welcome opportunities to meet former colleagues again, and OSMC 2014 didn’t disappoint in this regard either.
Next year, OSMC will have its 10th anniversary and I’m looking forward to both the conference and the party!
18 Dec 2014
The end of the year is approaching quickly and as always, we’re going to take it as an opportunity to recharge our batteries. This means that from 24 December 2014 to 4 January 2015, we’ll provide emergency support only.
Of course, should incidents occur that impact the operation of your production websites, our 24/7 on-call will take care of them.
If you need some last-minute engineering support for anything else (launching the new reindeerrental.com, adding SSH access for additional elves etc.), we recommend you send us your wishes as early as possible! Like Santa, we’ll handle each of your requests to the best of our ability, but we’ll close our workshop on Christmas Eve. (We’ll be back in far less than a year, though.)
From the whole freistil IT team, thank you for an awesome 2014, have a happy Christmas and a great start to the new year!
08 Dec 2014
You may be wondering why we don’t have an “Enterprise” column in our pricing table. It looks like everyone has such an offer, after all.
The reason is: With freistilbox, we decided to not make Enterprise-grade hosting the highest(-paid) tier of our product. Instead, we’ve built the whole platform with Enterprise quality, from the smallest configuration upwards.
With other vendors, you get much less powerful Drupal and WordPress hosting if you choose one of their lower tiers:
- no distributed hosting architecture where every service runs on its own redundant and diligently configured server infrastructure,
- no SSL offloading,
- no storage network with multiple file copies,
- no SSD-based database clusters,
- etc. etc.
In other words, you don’t get much more than a managed VPS. If your website needs real performance and availability, your only option is their offering with that impressive “Enterprise” label (and price tag).
With freistilbox, things are different — and simpler! The only decision you have to make is what capacity you’d like to start with. It doesn’t matter if you choose a single freistilbox S or a 4xM power cluster; you’ll always get the full benefits of our hosting platform. Regardless of its size, every customer setup is based on the same high performance hosting architecture with SSL offloading, load balancing, redundancy and bare-metal performance.
freistilbox is Enterprise hosting from the start.
04 Dec 2014
Working as a remote team has many advantages and we’re very happy to have the freedom and flexibility it gives us. We cannot deny, though, that it has a significant downside: our personalities and therefore our interactions have a lot more aspects than we can ever convey sitting in front of a webcam. You can’t really get to know a person fully without experiencing how they think and act in different situations.
That’s why we decided to have regular events where the whole team comes together at a common location for a few days. During these events — we’ve named them “freistil Days” — we share an apartment, we eat, talk, work and play together. The idea of a company meetup is not new and we’ve looked at how other distributed companies like Automattic, Buffer, Sqwiggle and Zapier have done theirs.
And finally, freistil Days are about to happen for the first time: On Monday, we’ll fly to Barcelona where Markus, Philipp and I are going to enjoy working as a co-located team for 8 days. If you’re interested in what’s happening there, be sure to follow our Twitter account!
03 Oct 2014
freistilbox is a fully managed hosting platform. That means that we do everything that’s necessary to run a reliable hosting service.
Last week, a new software security threat with a catchy name reared its ugly head: Shellshock is a security flaw in the widely used command-line shell “bash”. This flaw can be exploited to make a server execute arbitrary commands. Troy Hunt has the technical details.
After this security weakness became widely known on Wednesday and security fixes were made available soon after, we immediately tested and installed them. Since then, we got two follow-up bash updates with additional fixes that we rolled out in the same swift fashion.
If you prefer to sleep peacefully, knowing that we take care of hosting security, why don’t you check out all the other advantages of freistilbox?
29 Sep 2014
August was a bit quieter due to vacations. Unfortunately, my own vacation got in the way of finishing and publishing my sum-up for July. That’s why I’ll compare our numbers for August to those I published in my sum-up for June.
In August, our DevOps support took center stage. We spent a significant amount of time working with customers on launching new websites and optimising existing ones. Performance tuning is one of the main concerns here. freistilbox certainly offers everything a high-traffic website needs to master traffic peaks without hiccups. Achieving reliable performance, though, requires optimising the web application so it can fully take advantage of our hosting platform. That’s where our engineering support shines with deep expertise in Drupal and WordPress tuning. We collaborate with our customers via phone, email or web chat as soon as any question or issue arises until it is solved.
We’re continuously expanding our infrastructure. Over the recent weeks, the number of websites we run on freistilbox increased by 22% to 394. With the number of websites, our web traffic also made a jump of 24% to 15.09 TB. Although a growing infrastructure means more points of failure, our monthly uptime stayed at an excellent value of 99.87% (+0.01%).
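To put the uptime figure into perspective, here is a minimal sketch that converts a monthly uptime percentage into the downtime it implies. It assumes the percentage is measured over the full calendar month (31 days for August); how the figure is actually computed is not stated here, and the function name is purely illustrative.

```python
# Convert a monthly uptime percentage into the downtime it implies.
# Assumption: uptime is measured over the whole calendar month
# (31 days for August); the actual measurement method may differ.

def downtime_minutes(uptime_percent: float, days_in_month: int) -> float:
    """Return the downtime in minutes implied by an uptime percentage."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


august_downtime = downtime_minutes(99.87, 31)
print(f"{august_downtime:.0f} minutes")  # a little under one hour
```

Under that assumption, 99.87% corresponds to slightly less than an hour of downtime across the month.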
As I’ve mentioned above, delivering DevOps support is taking up a growing portion of our time. The August numbers for support requests reflect that. That month, we received 29% more tickets (193) than in June. Nonetheless, we’ve kept our ticket backlog at 39 because we were able to resolve 161 tickets, a whopping 50% improvement!
Unfortunately, our average reaction time went up significantly by 144%. As the chart shows, we slightly improved in the area of quick responses but a much higher percentage of customers had to wait for more than a business day, compared to June. We’ll investigate if that’s due to the nature of the actual support requests or if we need to tweak our Help Center processes. Since satisfaction feedback remained at a perfect 100% “good”, we’re confident that we’re still doing a great job.
With more websites, a growth in IT infrastructure is to be expected, and the number of servers our ops team has to maintain actually increased by 24%. 373 hosts means that our server:sysadmin ratio is 187:1.
The number of metrics we collect even grew by 26%. We’re now collecting 124,642 metrics every 10s. In order to achieve the necessary I/O performance, we built a new metrics storage on SSD drives.
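A quick back-of-the-envelope calculation shows why I/O performance matters here. This sketch only uses the two numbers above (124,642 metrics, one sample every 10 seconds) and assumes every metric produces exactly one data point per interval; the variable names are illustrative.

```python
# Rough write load implied by the metrics figures above.
# Assumption: each metric yields exactly one data point per interval.

METRICS = 124_642
INTERVAL_SECONDS = 10
SECONDS_PER_DAY = 24 * 60 * 60

points_per_second = METRICS / INTERVAL_SECONDS
points_per_day = METRICS * (SECONDS_PER_DAY // INTERVAL_SECONDS)

print(f"{points_per_second:,.0f} data points per second")
print(f"{points_per_day:,} data points per day")
```

That works out to roughly 12,500 data points per second, or a bit more than a billion per day, which the storage backend has to absorb continuously.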
Causing us a bit of concern is the fact that the number of on-call alerts went up by 20% in August (1378 alerts total). So it’s exactly at the right time that PagerDuty published “Let’s talk about Alert Fatigue”. We’ll especially have to dig deeper into the aspect “Cut alerts that aren’t actionable & adjust thresholds”. Another important improvement will be eliminating alerts that only get triggered as a consequence of previous alerts (for example, identical shared storage space warnings from all the boxes of a freistilbox cluster).
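The idea of collapsing identical warnings from all boxes of a cluster can be sketched as grouping alerts by cluster and check name, so that only one notification goes out per group. This is a minimal illustration, not how our monitoring actually works; the `Alert` fields, the check names and the `collapse_alerts` function are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Alert(NamedTuple):
    # Hypothetical alert shape; real monitoring events carry more fields.
    cluster: str
    host: str
    check: str


def collapse_alerts(alerts):
    """Group alerts by (cluster, check) and emit one notification per group."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert.cluster, alert.check)].append(alert.host)
    return [
        {"cluster": cluster, "check": check, "hosts": sorted(hosts)}
        for (cluster, check), hosts in groups.items()
    ]


alerts = [
    Alert("c1", "box1", "shared_storage_space"),
    Alert("c1", "box2", "shared_storage_space"),
    Alert("c1", "box3", "shared_storage_space"),
    Alert("c2", "box1", "cpu_load"),
]
notifications = collapse_alerts(alerts)
print(len(notifications))  # 2 notifications instead of 4 alerts
```

Here the three identical storage warnings from cluster `c1` become a single notification that lists the affected boxes, while the unrelated alert from `c2` stays separate.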
While our web hosting platform only runs PHP-based applications, we use Ruby for a lot of internal applications and tools. That’s why Markus spent the first August weekend attending eurucamp at the Hasso Plattner Institute in Potsdam. It was amazing to see the inspiration he brought back. This can only be good for our latest Ruby-based project, the freistilbox Hosting API. We’ll let you know more on this important undertaking later. So stay tuned!
16 Sep 2014
On Sunday, 2014-08-03, freistilbox operation was severely disrupted due to a power failure at a datacenter.
We apologise for this outage. We take reliability seriously and an interruption of this magnitude, as well as the impact it has on our customers, is unacceptable.
On Sunday, 2014-08-03, at 12:34 UTC, our on-call engineer was alerted by the monitoring system that a number of servers suddenly went offline, and the list was quite long. This indicated a network outage, and we posted a short notice to our status page. We then immediately contacted datacenter support. While we didn’t get a direct answer at first, the datacenter posted a first public status update at 12:54, explaining that server room RZ19 suffered an outage.
Since one of our server racks is located in this server room, the impact of this outage was severe. The affected rack hosts all kinds of servers including database and file storage nodes. Without these services, even application servers outside of RZ19 weren’t able to deliver content any more.
Since we run the nodes of our database clusters in different server rooms, we executed a failover procedure to the standby nodes of the affected databases. This restored operation for a part of our hosting infrastructure.
At about 13:00, our servers started to come back online. When we checked their uptime, we realised that they must have just started up, so we suspected a power outage. This was confirmed when the datacenter announced that RZ19 had suffered a “brownout” that caused its servers to reboot. Later, the ISP added that a whole datacenter location suffered a power outage. The UPS systems of all server rooms had been able to compensate until the power generators had started up – with the exception of RZ19.
At about 14:00, most of our servers were running smoothly again. A few of our database servers had suffered data corruption and since we had already switched to their standby nodes, we decided to repair them later. At that time, it was more urgent to replace application boxes that still had not come back. Some of our customers choose to run single-node freistilbox clusters and the websites running on these boxes were still down. We launched new boxes on servers with spare capacity and at about 15:00, our infrastructure was fully functional again.
What we’re doing about it
Since we don’t run our own datacenters, we depend on our hosting partners when it comes to hardware infrastructure (servers, network, power, cooling etc.). We can’t prevent power outages, only trust that our infrastructure providers take all the necessary measures to prevent them.
What we can do ourselves is make our hosting architecture as resilient as possible in order to minimise the impact of a power outage. We have already built a lot of redundancy into freistilbox. This enabled us, for example, to quickly switch to non-affected database servers as we did at the beginning of this incident. We have identified a few points, though, where an outage can cause bigger parts of our infrastructure to fail.
The most critical one of these points is our current storage technology. While it comes with data replication features (of which we make use, of course), it is hard to distribute data over server rooms or even distant datacenters without running into network latency issues. That’s why we’re currently testing alternative solutions that don’t have this weakness. As a beta test, we’re already running our own company freistilbox cluster (the one that’s hosting this website) on one of these alternatives. This means we’ll be able to further improve our storage resiliency very soon.
Another point is the private cloud infrastructure on which we run the application boxes of our customers’ freistilbox clusters. By adding more system automation, we’re going to minimise the time it takes us to spin up replacement boxes when that becomes necessary, for example and especially during an outage.
Again, we sincerely apologise to all our customers affected by this outage and thank them for their continued trust.
13 Aug 2014
In terms of Drupal events, there is no summer break; the best example being the DrupalCamping going on in Wolfsburg at the moment. I’m so sad that my schedule doesn’t allow me to be there and camp with my German Drupal friends!
Fortunately, I get to attend DrupalCamp North East in Sunderland next weekend. I’m very much looking forward to flying over to the UK for the third time this year because I enjoy the Drupal community there as much as the ones in Germany and Ireland.
Since community is one of our core values at freistil IT, we try to participate in these events as actively as possible. I’m proud to announce that my session proposal about “DevOps with Drupal” has been accepted and I’ll do my very best to explain how embedding development in operations and vice versa can greatly improve working with Drupal.
If you’re also going to be at DrupalCamp NE next weekend, give me a shout via Twitter! I’ll happily arrange sharing a few drinks and great news about our new Partner Programme!
21 Jul 2014