Steve: Developing on the Edge - SPOF: Single Point of Failure
Steve: Developing on the Edge
Thoughts on development, Web-services, technology and mountains.
Sat 16 Feb 2008
SPOF: Single Point of Failure

The SmartFrog Patterns of Deployment site that I'm putting together has one entry on the SPOF, the Single Point of Failure.

This is not something that should be designed into any high-availability distributed system. Sometimes it turns out to be in there anyway.

How do you know you have a SPOF? Oh, you always have one. How do you find it? You don't: it finds you. And on Friday, it found Amazon Web Services.

Having discussed S3 over beers with Al at Gabe's house in Corvallis (hello out there!), I was pretty impressed with the redundancy that Amazon had put in. Multiple routes out of each datacentre, with different providers. Multiple datacentres within the same city, in areas of different geology. It would take more than one earthquake to lose the data. But apparently the one thing they weren't set up to deal with was the CPU load of authentication.
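To put rough numbers on why one unreplicated tier undoes all that redundancy, here is a back-of-the-envelope sketch in Python. Every figure in it is invented for illustration (I have no idea what Amazon's real numbers are), the function names are my own, and it assumes failures are independent, which real outages rarely are.

```python
from math import prod


def parallel(availability: float, replicas: int) -> float:
    """Availability of N independent redundant replicas: up if any one is up."""
    return 1.0 - (1.0 - availability) ** replicas


def series(*availabilities: float) -> float:
    """Availability of a chain in which every stage must be up."""
    return prod(availabilities)


# Hypothetical figures, chosen only to make the arithmetic visible.
network = parallel(0.99, 3)  # three independent routes out
storage = parallel(0.99, 3)  # three datacentres
auth = 0.99                  # a single authentication tier, no redundancy

overall = series(network, storage, auth)
print(f"network={network:.6f} storage={storage:.6f} overall={overall:.6f}")
```

Under these assumptions the replicated tiers each come out far better than any single replica, yet the overall figure can never be better than the authentication tier on its own. The chain is only as available as its least redundant link.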

If you read the thread, a lot of people are upset. What's going on? they demand; fix it! they say. I sympathise with their point of view, but I don't agree with it. AWS can't say what's going on, not until they know. They can't fix it until after that. I've been on the receiving end of these 'fix it now' crises, and having lots of people on the phone doesn't help you find the problem any faster. So well done to the AWS team for (a) fixing it fairly quickly and (b) having so many users that the outage got such publicity!

Also: things like this happen. High Availability systems aren't called Continuous Availability. Merely high. For a reason. Stay mellow out there when it does happen.
