06 Mar 2009 09:48
Every time we have an emergency at Wikidot (like yesterday's webserver failure), we stop and think about how well protected we are against different kinds of hardware and software failures, network problems, and even larger disasters (like our datacenter being destroyed). Yesterday, too, we stepped back from our daily tasks and looked at the situation.
First of all, when designing our current server configuration, we concentrated on making each piece of hardware as reliable as possible, which means:
- we use only high-quality servers from SoftLayer (they use SuperMicro machines and pack them with really great hardware)
- our servers have redundant power supplies, so if one PSU dies (one of the most common failures), the second one keeps running and tech support can replace the faulty unit without any interruption
- we use only mirrored, redundant disks (in RAID1 or RAID10) with high-quality components: controllers from Adaptec and high-quality 15k SAS disks for the system and database, plus large SATA disks for file uploads
- all services are monitored externally — we use Pingdom to monitor service availability, plus our own monitoring stack to check hardware health
- we automate as much as possible, including health checks, error logs, etc.
- we have several automatic replication mechanisms for the database and filesystems
- we also keep periodic backups stored remotely on S3.
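To make the health-check automation a bit more concrete, here is a minimal sketch of what an internal check pass might look like. The service names, status codes, and the response-time threshold below are illustrative assumptions, not Wikidot's actual configuration (Pingdom covers the external side; this mimics the internal one):

```python
# Hypothetical internal health-check pass. The threshold and service
# names are made up for illustration.

HEALTH_THRESHOLD_MS = 500  # assumed acceptable response time

def check_services(results):
    """Given {service: (http_status, response_ms)}, return the sorted
    list of services that should trigger an alert."""
    unhealthy = []
    for service, (status, response_ms) in sorted(results.items()):
        if status != 200 or response_ms > HEALTH_THRESHOLD_MS:
            unhealthy.append(service)
    return unhealthy

# Example: the app answers too slowly, the file host is down entirely.
sample = {
    "app": (200, 650),
    "files": (503, 120),
    "static": (200, 80),
}
print(check_services(sample))  # -> ['app', 'files']
```

In a real setup a cron job would gather the status/latency pairs itself and page someone (or restart a service) for anything the function flags.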
So far, we are really, REALLY happy with SoftLayer, our hosting provider. And that covers everything: great hardware options, dedicated support, datacenters with fast internet connections, etc. If someone asks me about a datacenter, I always recommend SoftLayer. And as far as I know, at least a few people have moved there and are really happy with the choice.
OK, so back to the subject. What we discussed yesterday can be summarized as: "we should not scale vertically, we should scale horizontally". This means that instead of relying on just a few servers that we upgrade from time to time, we should build a more self-healing structure by adding more (less powerful) nodes and assume by design that some of them can fail without affecting the availability of the service.
Why is this important? Because we need to avoid single points of failure. So far our setup works perfectly, giving us very good overall uptime (still 99.9%). But the question I keep asking myself is: if something goes very wrong, how much trouble will it cause for us and our users? How much work would it take to bring the service back up after an unrecoverable hardware failure? Or filesystem errors? Sure, we would eventually fix things and all the data would be available again, but how quickly? How fast can our hosting provider provision new servers to replace failed ones?
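To put the 99.9% figure in concrete terms, a quick back-of-the-envelope calculation shows how much downtime that uptime level actually permits:

```python
# Pure arithmetic: downtime budget implied by an uptime percentage.
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200

def allowed_downtime_minutes(uptime_pct, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime allowed in the period at the given uptime %."""
    return period_minutes * (1 - uptime_pct / 100)

print(round(allowed_downtime_minutes(99.9)))                     # -> 526 (about 8.8 hours/year)
print(round(allowed_downtime_minutes(99.9, MINUTES_PER_MONTH)))  # -> 43 (minutes/month)
```

So 99.9% still leaves room for roughly a full working day of outage per year; the whole point of the horizontal-scaling discussion is how much of that budget a single failed server is allowed to consume.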
Although the answers to those questions are still satisfactory, as Wikidot grows and our datasets get larger and larger, the current solutions simply become less and less effective. We would much prefer a different answer: "What do you need to do when one of your servers fails? — NOTHING."
Just to give you a glimpse of what my next blog post will be about and how we could improve our infrastructure: virtualization and cloud computing. Enough said for today ;-)
rating: 1, tags: cloud virtualization wikidot