Handling an Outage Via Good Crisis Management

Over the weekend I finally made a bit of time to read up on what caused the outage of Amazon’s EC2 web service (note: a great summary of what happened can be found here: https://aws.amazon.com/message/65648).

Amazon displayed great transparency by offering a detailed explanation of what happened.  Briefly, their Elastic Block Store (EBS) volumes became unresponsive in a single Eastern US Availability Zone.  Amazon subsequently disabled access to these EBS volumes, causing the applications that depended on them to become unresponsive as well.

As is the case with most calamitous events in this high-tech world we live in, the problem started with, you guessed it, human error. A network change needed to scale a particular EBS cluster effectively made the normal EBS mirroring process impossible, and the affected EBS volumes were left “hung” as a result.  This in turn affected the load routing service, or control plane, that directs traffic to the cluster.  Because the control plane was impacted, the effect was felt across an even larger number of EBS volumes.
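
To make that failure mode a little more concrete, here is a minimal, purely illustrative sketch (this is not Amazon’s code; the request_remirror function and attempt_remirror callback are hypothetical) of the kind of safeguard that keeps a flood of stuck volumes from hammering a shared control plane all at once: a capped, jittered exponential backoff around the re-mirroring request.

```python
import random
import time

def request_remirror(volume_id, attempt_remirror, max_attempts=8,
                     base_delay=1.0, max_delay=60.0):
    """Try to re-mirror a volume, backing off between failed attempts.

    attempt_remirror is a caller-supplied callable standing in for the
    real storage-service request; it returns True on success.
    """
    for attempt in range(max_attempts):
        if attempt_remirror(volume_id):
            return True
        # Exponential backoff with jitter keeps thousands of hung volumes
        # from retrying in lock-step against the same control plane.
        delay = min(max_delay, base_delay * (2 ** attempt))
        time.sleep(random.uniform(0, delay))
    return False  # give up and surface the failure rather than retry forever
```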

To fix the problem Amazon had to make numerous configuration changes related to how the EBS clusters and the control plane interact.  But the most time-consuming part of the process was physically deploying new servers so that the hung volumes could re-mirror their data.

How long did all this take?  The outage began at 12:47 AM PDT on April 21st.  By 6:15 PM PDT on April 23rd, 97.8% of the affected EBS volumes were up and running.  By 3:00 PM PDT on April 24th, everything that could be recovered had been recovered.  Ultimately, only 0.07% of the data was lost.

Doing the right things under pressure while in crisis is never easy.  But it appears that Amazon did, and is still doing, many of the right things to handle the outage:

  • They were transparent.  Amazon made an effort to provide detailed information throughout the course of the outage.  They are putting new capabilities in place to do an even better job of communicating with their customers during outages.
  • They’ve communicated what they’re doing to ensure that a similar outage does not happen again:  increasing the rigor of their change process, improving the auditing of such procedures, and automating some of the change processes to reduce the opportunity for human error.
  • They’re working to improve the recovery process itself, for example by having more hardware staged and ready to go at the data centers.
  • They’re working to address the bugs that made the outage worse (fixing race conditions related to mirroring, etc.).
  • They’ve provided guidance on how to build more fault-tolerant applications across Regions and Availability Zones, and will be hosting webinars on this topic starting May 2nd (a simple failover sketch appears after this list).
  • They automatically applied a 10-day service credit for all affected customers.
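
As a rough illustration of that cross-Availability-Zone guidance, here is a minimal sketch of client-side failover between two zones.  It is only an assumption of how such a check might look; the endpoint URLs and the /health path are hypothetical placeholders, not real AWS services.

```python
import urllib.request

# Hypothetical endpoints for the same application deployed in two
# Availability Zones; the URLs are placeholders, not real services.
ENDPOINTS = [
    "http://app-us-east-1a.example.com",
    "http://app-us-east-1b.example.com",
]

def healthy(base_url, timeout=2.0):
    """Return True if the endpoint answers its health check in time."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def pick_endpoint():
    """Prefer the first healthy zone; fail over to the next one."""
    for url in ENDPOINTS:
        if healthy(url):
            return url
    raise RuntimeError("no healthy Availability Zone endpoint available")
```

The same idea scales up to DNS failover or load-balancer health checks; the point is simply that the application, not the infrastructure alone, decides what to do when one zone stops answering.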

Finally, and appropriately, there was an apology.  This simple act is a critical step toward getting your agitated customers back in your camp.  It’s essential to say you’re sorry when you have screwed something up.

So, at the end of the day, it’s complicated.  And there are no silver bullets.  EC2 is relatively new, and it attempts to hide the complexity of building high-availability systems from subscribing customers.  But behind the curtain, the same or similar techniques are being used to implement high availability: load balancing, managing state, mirroring data, providing failover paths, redundant hardware, and so on.  It’s great that technologies have progressed far enough to offer this as a service, but that does not change the fact that it’s still complicated.
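
To illustrate just one of those behind-the-curtain techniques, here is a toy sketch of synchronous mirroring; it is purely illustrative and says nothing about how EBS actually replicates data.  The key idea is that a write is acknowledged only after both copies have it, which is exactly the step that becomes impossible when replicas lose their network path to each other.

```python
class MirroredStore:
    """Toy synchronous mirror: acknowledge a write only when both copies succeed."""

    def __init__(self, primary, mirror):
        self.primary = primary   # any mapping-like object, e.g. a dict
        self.mirror = mirror

    def write(self, key, value):
        self.primary[key] = value
        try:
            self.mirror[key] = value            # replicate before acknowledging
        except Exception as exc:
            # If the mirror is unreachable the volume cannot stay consistent;
            # a real system would mark it degraded and look for a new replica.
            raise RuntimeError("mirror write failed; volume degraded") from exc
        return True

# Usage: two in-memory "volumes" standing in for replicas on separate servers.
store = MirroredStore(primary={}, mirror={})
store.write("block-0042", b"\x00" * 512)
```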

In the end, this outage will not slow the march toward greater adoption of cloud services and on-demand computing.  Rather, it will drive one of the many refinements that shape the future of computing.

I welcome your comments,
Mike Brannan