Wikidot on Cloud


Michal Frackowiak's blog


6 Mar 2009, 09:48 (UTC)

Every time we have an emergency at Wikidot (as we did yesterday with our webserver), we try to stop and think about how well protected we are from different kinds of hardware and software failures, network problems and even larger disasters (like our datacenter being destroyed). Yesterday was no exception: we stepped back from our daily tasks and looked at the situation.

First of all, when designing our current server configuration, we concentrated on making each piece of hardware as reliable as possible, which means:

  • we use only high-quality servers from SoftLayer (they use SuperMicro servers and pack them with really great hardware)
  • our servers have redundant power supplies: if one PSU dies (which is one of the most common problems), the second one keeps working and tech support can replace the faulty one without any interruption
  • we use only mirrored, redundant disks (in RAID1 or RAID10) with high-quality components: controllers from Adaptec, SAS 15k disks for the system and database, and large SATA disks for file uploads
  • all services are monitored externally: we use Pingdom to monitor service availability, plus our own monitoring stack to check the health of the hardware
  • we automate as much as possible, including health checks, error logs etc.
  • we have a few automatic replication mechanisms for the database and filesystems
  • we also have periodic backups stored remotely on S3

So far, we are really, REALLY happy with SoftLayer, our hosting provider. And this includes everything: a great choice of hardware, devoted support, datacenters with fast internet connections etc. If someone asks me about a datacenter, I always recommend SoftLayer. And as far as I know, at least a few people have moved there and are really happy with the choice.

OK, so back to the subject. What we discussed yesterday can be summarized as: "we should not scale vertically, we should scale horizontally". This means that instead of relying on just a few servers that we upgrade from time to time, we should create a more self-healing structure by adding more (less powerful) nodes and assume "by design" that some of them can fail without affecting the availability of the service.
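To put a rough number on the "some of them can fail" idea, here is a minimal sketch. It assumes that nodes fail independently of each other (real servers in one datacenter often do not) and uses a made-up 99% availability per node and a made-up pool of 4 nodes where any 3 can carry the load. None of these figures come from our actual setup.

```python
from math import comb


def pool_availability(node_availability: float, nodes: int, needed: int) -> float:
    """Probability that at least `needed` of `nodes` independent nodes are up.

    Assumes every node has the same availability and fails independently,
    which is a simplification.
    """
    p = node_availability
    return sum(
        comb(nodes, up) * p**up * (1 - p) ** (nodes - up)
        for up in range(needed, nodes + 1)
    )


# Hypothetical figures, for illustration only.
single = pool_availability(0.99, nodes=1, needed=1)  # one big server
pool = pool_availability(0.99, nodes=4, needed=3)    # four nodes, any three suffice

print(f"single server: {single:.4f}")
print(f"4-node pool:   {pool:.4f}")
```

Under these assumptions the pool that tolerates one failed node comes out more available than the single server, even though each node on its own is no more reliable.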

Why is this important? Because we need to avoid single points of failure. So far our setup works perfectly, giving us very good overall uptime (it is still 99.9%). But the question I always ask myself is: if something goes very wrong, how much trouble will it cause for us and our users? How much work will it take to bring the service back up after an unrecoverable hardware failure? Or filesystem errors? Sure, we will eventually fix things and all the data will be available, but how quickly? How fast can our hosting provider provision new servers to replace failed ones?
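For a sense of scale, the 99.9% figure can be turned into a downtime budget with simple arithmetic. This is only a back-of-the-envelope sketch; the 30-day month is an assumption made to keep the numbers round.

```python
def downtime_budget_hours(uptime_percent: float, period_hours: float) -> float:
    """Hours of downtime allowed in a period at the given uptime percentage."""
    return (1.0 - uptime_percent / 100.0) * period_hours


HOURS_PER_YEAR = 365 * 24   # 8760
HOURS_PER_MONTH = 30 * 24   # 720, assuming a 30-day month

per_year = downtime_budget_hours(99.9, HOURS_PER_YEAR)
per_month_minutes = downtime_budget_hours(99.9, HOURS_PER_MONTH) * 60

print(f"99.9% uptime allows about {per_year:.2f} hours of downtime per year")
print(f"or about {per_month_minutes:.1f} minutes per 30-day month")
```

So a single failure that takes most of a working day to recover from would use up roughly a whole year's budget at 99.9%.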

Although the answers to the above questions can still be considered satisfactory, as Wikidot grows and its datasets get larger and larger, our current solutions simply become less and less effective. We would much prefer another scenario: "What do you need to do when one of your servers fails?" "NOTHING."

Just to give you a glimpse of what my next blog post will be about and how we could improve our infrastructure: virtualization and cloud computing. Enough said for today ;-)


rating: 1, tags: cloud virtualization wikidot
