How we're reducing the impact of network issues
Published 2013-04-15 by
Our freistilbox hosting platform is built from the ground up with high availability in mind. To minimize the impact of failures, every backend service (each MySQL database, each Apache Solr core, etc.) runs on at least two servers. And if you run your website on more than a single freistilbox, you’re in good shape on the web application level, too.
Redundancy alone doesn’t guarantee maximum uptime, though. Recently, we had to deal with various kinds of network problems ranging from minor packet loss to a full loss of external connectivity. While we can’t prevent datacenter staff from mistakenly shutting down our IP addresses on the routing level, we realized that we needed to make our infrastructure more resilient against other, more common, network issues.
We found that even minor network congestion, often caused by heavy traffic to or from a neighboring server belonging to another datacenter customer, could seriously impact requests from our web boxes to backend services. The reason is that, on a box doing hundreds or even thousands of database requests per second, increases of only a few milliseconds in network latency add up quickly. This can degrade operation to the point where the box can no longer serve new incoming requests because it has filled up with web server processes waiting for their data.
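To get a feel for how quickly those milliseconds add up, here is a rough back-of-the-envelope sketch. All figures in it (number of worker processes, base page time, queries per page) are hypothetical and not measurements from our platform, and it assumes the simplest possible model: each web server process handles one page at a time and issues its backend queries one after another.

```python
def max_pages_per_second(workers: int, base_ms: float,
                         queries_per_page: int, extra_latency_ms: float) -> float:
    """Upper bound on pages/second when every worker blocks on its backend queries.

    Simplified model: one page per worker at a time, queries issued sequentially,
    so each query pays the extra network latency in full.
    """
    page_ms = base_ms + queries_per_page * extra_latency_ms
    return workers * 1000.0 / page_ms


# Hypothetical figures, chosen only for illustration.
WORKERS = 50       # web server processes on one box
BASE_MS = 200.0    # time to build a page on a healthy network
QUERIES = 100      # backend queries per page

for extra in (0.0, 1.0, 3.0, 5.0):
    capacity = max_pages_per_second(WORKERS, BASE_MS, QUERIES, extra)
    print(f"+{extra:.0f} ms per query -> at most {capacity:.1f} pages/s")
```

With these made-up numbers, only 3 ms of extra latency per query cuts the box's capacity from 250 to 100 pages per second. Any traffic beyond that ends up as processes waiting for their data.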
This problem would be even more severe if, instead of leasing “bare-metal servers”, we were using cloud-based infrastructure where we can’t even influence with whom we’re sharing a VM host. The Drupal experts at 2bits even make this recommendation to VPS users:

> When you encounter variable performance or poor performance, before wasting time on troubleshooting that may not lead anywhere, it is worthwhile to contact your host, and ask for your VPS to be moved to a different physical server. Doing so most likely will solve the issue, since you effectively have a different set of housemates.

With IaaS vendors like Amazon, that would mean replacing your server instances with others on a trial-and-error basis. What a pain.
To minimize the impact of network performance degradation on our hosting infrastructure, we’ve started three improvement projects:
1. Optimize request distribution at the loadbalancer level.
2. Build our own CDN.
3. Move our servers into dedicated racks.
We have already finished project 1. A loadbalancer needs to distribute HTTP requests to those backend boxes that have the necessary resources and are responsive. Boxes that are maxed out or do not respond for other reasons become ineligible. We recently optimized the health checks that our loadbalancers use to determine which boxes are ready to receive requests. Now, a box is only passed HTTP requests once it has proved itself stable by successfully responding to an uninterrupted series of health checks.
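The idea behind this kind of health check can be sketched in a few lines. This is only an illustration of the principle, not our actual loadbalancer configuration: the class name, the threshold of three checks and the rule that a single failure resets the streak are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class BoxHealth:
    """Tracks whether one backend box may receive HTTP requests."""
    required_successes: int = 3  # hypothetical threshold
    streak: int = 0              # consecutive successful checks so far

    def record(self, check_ok: bool) -> None:
        # Any failed check resets the streak, so the box has to prove itself again.
        self.streak = self.streak + 1 if check_ok else 0

    @property
    def eligible(self) -> bool:
        return self.streak >= self.required_successes


def eligible_boxes(pool: dict[str, BoxHealth]) -> list[str]:
    """Names of all boxes that have passed enough checks in a row."""
    return [name for name, health in pool.items() if health.eligible]


# Usage: box2 fails its second check and has to start over.
pool = {"box1": BoxHealth(), "box2": BoxHealth()}
for ok1, ok2 in [(True, True), (True, False), (True, True), (True, True)]:
    pool["box1"].record(ok1)
    pool["box2"].record(ok2)
print(eligible_boxes(pool))
```

After these four rounds, only `box1` is eligible. `box2` has answered two checks in a row since its failure and needs one more before it receives traffic again.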
One reason boxes become unresponsive is that their backend requests “get stuck” on the network. And since we don’t control the network layer, we instead chose to minimize our dependency on it. That’s why, in project 2, we’re building our own Content Delivery Network. We’re going to cover this topic in another blog post, so stay tuned!
Where we still need to rely on the communication with backend services (for example, with database clusters), we need to make this communication more robust. That’s the goal of project 3. We are going to move our servers into our own racks where they share a direct network connection only with each other, not with other datacenter customers. This dedicated network connection makes data transfers between our servers faster, more reliable and more secure.
These are only the most prominent of the many changes we make day in, day out to improve the performance and availability of our freistilbox hosting platform. And although the quality of our services is growing steadily, our prices aren’t. So, if you know someone who’s looking for a hosting service that reduces their IT headaches without breaking the bank, please tell them about us!
And if you’d like to help us improve our next-generation managed hosting, join the team!