The SmartFrog Patterns of Deployment site that I'm putting together
has one entry on the SPOF, the Single Point of Failure.
This is not something that should be
designed into any high-availability distributed system. Sometimes
it turns out to be in there anyway.
How do you know you have a SPOF? Oh, you always have one. How do
you find it? You don't: it finds you. And on Friday, it
found Amazon Web Services.
Having discussed S3 over beers with Al at Gabe's house in
Corvallis (hello out there!), I was pretty impressed with the
redundancy that Amazon had put in. Multiple routes out of each
datacentre, with different providers. Multiple datacentres within
the same city, in areas of different geology. It would take more
than one earthquake to lose the data. But apparently the one thing
they weren't set up to deal with was
the CPU load of authentication.
If you read the thread, a lot of people are upset. What's going
on? they demand; fix it! they say. I sympathise with their point of
view, but I don't agree with it. AWS can't say what's going on, not
until they know. They can't fix it until after that. I've been on
the receiving end of these 'fix it now' crises, and having lots of
people on the phone doesn't help you find the problem any faster.
So well done to the AWS team for (a) fixing it fairly quickly and
(b) having so many users that the outage got such publicity!
Also: things like this happen. High Availability systems aren't
called continuous availability. Merely high. For a reason. Stay
mellow out there when it does happen.