A single point of failure triggered the Amazon outage affecting millions

The outage that hit Amazon Web Services and took out vital services worldwide was the result of a single failure that cascaded from system to system within Amazon’s sprawling network, according to a post-mortem from company engineers.

The series of failures lasted for 15 hours and 32 minutes, Amazon said. Network intelligence company Ookla said its Downdetector service received more than 17 million reports of disrupted services offered by 3,500 organizations. The three countries where the most reports originated were the US, the UK, and Germany. Snapchat, AWS, and Roblox were the most reported services affected. Ookla said the event was “among the largest internet outages on record for Downdetector.”

It’s always DNS

Amazon said the root cause of the outage was a bug in the software running the DynamoDB DNS management system. The system monitors the stability of load balancers by, among other things, periodically creating new DNS configurations for endpoints within the AWS network. The bug was a race condition, an error that makes a process dependent on the timing or sequence of events that are variable and outside the developers’ control. The result can be unexpected behavior and potentially harmful failures.
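
To make the idea of a race condition concrete, here is a minimal Python sketch. It is not Amazon's code and does not reproduce the actual failure; the `DnsStore` class, the plan version numbers, the addresses, and the endpoint name `db.example.internal` are all assumptions invented for illustration. It replays one unlucky ordering in which a delayed worker holding an older DNS plan writes after a faster worker holding a newer one, then shows how a version check performed under a lock makes the outcome independent of arrival order.

```python
import threading


class DnsStore:
    """Hypothetical record store: endpoint name -> (plan_version, addresses)."""

    def __init__(self):
        self.records = {}
        self.lock = threading.Lock()

    def apply_unsafe(self, endpoint, version, addresses):
        # Last writer wins, no matter how old its plan is.
        self.records[endpoint] = (version, addresses)
        return True

    def apply_guarded(self, endpoint, version, addresses):
        # Check and write under one lock so the comparison cannot go stale,
        # and accept a plan only if it is newer than the one in place.
        with self.lock:
            current = self.records.get(endpoint)
            if current is not None and current[0] >= version:
                return False
            self.records[endpoint] = (version, addresses)
            return True


def replay(method_name):
    """Replay one unlucky ordering: the worker holding the older plan is
    delayed and writes after the worker holding the newer plan."""
    store = DnsStore()
    apply = getattr(store, method_name)
    apply("db.example.internal", 2, ["10.0.0.2"])  # fast worker, newer plan
    apply("db.example.internal", 1, ["10.0.0.1"])  # slow worker, stale plan
    return store.records["db.example.internal"]


print(replay("apply_unsafe"))   # the stale plan 1 has overwritten plan 2
print(replay("apply_guarded"))  # plan 2 survives the late write
```

The ordering is replayed sequentially here so the result is repeatable. In a real concurrent system the same two writes would arrive in whichever order the scheduler and network happen to produce, which is what makes such bugs hard to reproduce.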

Comments

4 Comments

  1. concepcion57

    This is an interesting overview of the Amazon outage. It’s fascinating how a single point of failure can have such widespread impacts. Thanks for shedding light on this important topic!

  2. lschmitt

    Thanks for sharing your thoughts! It’s indeed remarkable how such a small issue can have such widespread repercussions. This incident also highlights the importance of robust redundancy systems to prevent similar outages in the future.

  3. elvera.rowe

    You’re welcome! It really highlights the importance of robust backup systems and redundancy in cloud services. Even a minor glitch can have major repercussions, reminding us how interconnected our digital infrastructure is.

  4. ledner.fred

    Absolutely, the reliance on a single point of failure really underscores the need for comprehensive risk management. It’s also interesting to see how such outages can affect not just businesses, but everyday users who depend on these services.
