02 Feb 2009 23:13
Very early this morning we had a small problem with one of our Wikidot servers, which manifested as a "service unavailable" message between 2 AM and 5 AM UTC. The problem was easy to fix, but unfortunately it was the middle of the night for us, so we could not respond quickly enough: the SMSes from our monitoring systems did not wake us up.
What really happened?
From what we were able to reconstruct, one of the file operations related to the live backup of file uploads went crazy and started consuming loads of memory. Normally we duplicate all files belonging to each site in real time to another server, which gives us both a hot backup and a hot-swap server in case the main server dies.
The problem started at about 2 AM, when we could see a radical drop in the memory used for disk caching. This is usually a bad sign and we have monitoring set up for it, so when it happens we get warning emails. The problem with tracing memory-related issues in Linux is that the usual utilities do not show memory consumed by the kernel itself, and this includes memory for handling filesystems. (Try doing ls */*/* a couple of times on a really large filesystem and you will see memory vanish quite quickly.)
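One place where part of the kernel's own memory use does show up is /proc/meminfo: its Cached line reports the page cache, and its Slab line covers kernel object caches such as those for directory entries and inodes. The following is only a minimal sketch of the kind of check described above, not our actual monitoring code. The function names and the 10% threshold are assumptions made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field name: value}.

    Most values are in kiB; fields without a leading number are skipped.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def cache_report(info, min_cache_fraction=0.10):
    """Summarize page cache and kernel slab usage relative to total RAM.

    min_cache_fraction is a hypothetical alert threshold, not a
    recommended value.
    """
    total = info["MemTotal"]
    cached = info.get("Cached", 0)
    slab = info.get("Slab", 0)
    return {
        "cached_fraction": cached / total,
        "slab_fraction": slab / total,
        "cache_low": cached / total < min_cache_fraction,
    }


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live memory statistics (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

On a Linux machine, cache_report(read_meminfo()) gives the current fractions. A shrinking cached_fraction together with a growing slab_fraction would point at the kernel rather than at any user process.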
When memory ran low, other services that rely on kernel disk caches (like the PostgreSQL database) started performing a lot of disk operations, increasing system load and decreasing responsiveness. Under high load PostgreSQL is often unable to accept new connections from the PHP application.
Then our PHP backend died, also because of memory issues, producing a growing number of "500 Internal Server Error" responses. From that moment on nothing dramatic happened, apart from the prolonged unavailability of the service.
At this point please allow me to say: this is not the kind of situation that happens often. In fact, this was the first time we have had such a large memory outage. Suddenly about 7 GB of RAM disappeared. For most of the day we were trying to trace the issue, and we are quite sure it was partly an unfortunate mixture of kernel behavior and our method of live filesystem backup.
Until now, however, we had never experienced similar issues.
How we reacted
Wikidot is fully monitored from external servers for web availability, memory usage, and other services. Monitoring services automatically send us emails and SMSes when anything suspicious happens. What failed was the human factor: none of those reporting methods could wake us up in the middle of the night.
The problem was easy to fix once I got my hands on a computer. After stopping filesystem replication and restarting the core services (the web server), things got back to normal. Other services were restarted too, to make sure everything was OK.
For most of the day we were analyzing logs and trying to figure out what really happened. I guess Piotr has a better and more detailed opinion on the internal source of the problem, but we mostly agree on what I have written above.
How can we protect Wikidot for the future?
Random things always happen. What matters more is how we react to them and how we try to avoid them. We really put a lot of effort into eliminating situations like this one. The obvious solution is to make sure our SMSes wake us at night. So far we have had very few monitoring alerts (the servers have been running very smoothly), and I assumed the current setup would work.
What is also important for our users: we have several safety measures to keep your data intact, including data replication and periodic full backups to external servers (to different datacenters).
We also analyze server usage reports every day to help prevent such events from happening.
Until the current issue is fully resolved, we have removed the live file replication from the service and we will probably reimplement it in a different way.
Apart from situations like that, I need to admit that our servers are working very reliably, with very high uptime. According to external measurements (from Pidgin), even including today's outage, we still have over 99.8% uptime for the last couple of months.
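As a rough illustration of what such a percentage means in hours, here is a small sketch of the arithmetic. The function names and the 90-day window are assumptions chosen only for the example. The exact measurement period is not stated here, and an external monitor may count downtime differently.

```python
def uptime_percent(downtime_hours, window_days):
    """Uptime as a percentage of a measurement window given in days."""
    window_hours = window_days * 24
    return 100.0 * (1 - downtime_hours / window_hours)


def downtime_budget_hours(target_percent, window_days):
    """Hours of downtime a window can absorb while still meeting the target."""
    return window_days * 24 * (1 - target_percent / 100.0)


# Hypothetical window of 90 days; the outage length of 3 hours
# corresponds to the 2 AM to 5 AM UTC period.
ASSUMED_WINDOW_DAYS = 90
outage_only = uptime_percent(3, ASSUMED_WINDOW_DAYS)
budget = downtime_budget_hours(99.8, ASSUMED_WINDOW_DAYS)
```

Under these assumptions a single 3-hour outage on its own leaves uptime at roughly 99.86%, and a 99.8% target over 90 days allows a little over 4 hours of downtime in total.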
I really hope today's outage did not cause much inconvenience.