How a Millisecond Timing Bug Cost Hundreds of Billions

fahd.zafar • October 24, 2025

Amazon has revealed the shocking cause behind one of history's most devastating cloud outages: a simple race condition in DynamoDB's DNS management system brought down AWS services globally for an entire day, with damage estimates potentially reaching hundreds of billions of dollars.

The Failure

The incident began at 11:48 PM PDT on 19th October, when customers reported increased DynamoDB API error rates in US-EAST-1. The root cause? A timing bug that left the DNS record for DynamoDB's regional endpoint completely empty, so anything trying to connect simply couldn't find it.


DynamoDB's DNS management comprises two independent components: a DNS Planner, which creates DNS plans, and DNS Enactors, which apply those plans via Route 53. The separation is meant to make the system more resilient; instead, it created the conditions for a catastrophic race condition.
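
AWS has not published this code, but the division of labour can be pictured roughly as follows. Every name below, from DnsPlan to DnsEnactor, is hypothetical and exists purely to illustrate the split.

```python
# Hypothetical sketch of the Planner/Enactor split described above.
# None of these names come from AWS; they only illustrate the division of work.
from dataclasses import dataclass


@dataclass
class DnsPlan:
    version: int        # monotonically increasing, so newer plans supersede older ones
    ips: list[str]      # addresses the regional endpoint should resolve to


class DnsPlanner:
    """Watches load balancer capacity and health, and emits new DNS plans."""

    def __init__(self) -> None:
        self._next_version = 1

    def create_plan(self, healthy_ips: list[str]) -> DnsPlan:
        plan = DnsPlan(self._next_version, healthy_ips)
        self._next_version += 1
        return plan


class DnsEnactor:
    """Independently picks up plans and pushes them to the DNS provider
    (Route 53 in the real system; a plain dict stands in for it here)."""

    def __init__(self, dns_table: dict[str, list[str]]) -> None:
        self.dns_table = dns_table

    def apply(self, endpoint: str, plan: DnsPlan) -> None:
        self.dns_table[endpoint] = plan.ips
```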



The Race Condition

One DNS Enactor experienced unusually high delays whilst processing a DNS plan. Meanwhile, the Planner continued generating newer plans, which a second Enactor began applying. As the second Enactor executed its clean-up process to remove "stale" plans, the first delayed Enactor finally completed its work.


Because the delayed Enactor had just applied that older plan, the clean-up's deletion of it removed every IP address for DynamoDB's regional endpoint in one stroke. The system was left in an inconsistent state that blocked automated recovery; manual intervention was required.
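
In deliberately simplified form, the interleaving looks something like the sketch below. The data model is an assumption made for illustration (plan versions, and the rule that deleting a plan also withdraws the records it installed) rather than AWS's actual implementation; the point is how an ordinary-looking clean-up rule can empty an endpoint once the stale plan is the one actually live.

```python
# Minimal, hypothetical reproduction of the interleaving described above.
# Plan versions, the applied-version marker, and the clean-up rule are all
# assumptions for illustration; AWS has not published its internal data model.

ENDPOINT = "dynamodb.example-region.amazonaws.com"

dns_records: dict[str, list[str]] = {}   # stands in for the Route 53 records
plans: dict[int, list[str]] = {}         # plan version -> the IPs it installs
applied_version: int | None = None       # which plan the endpoint currently reflects


def apply_plan(version: int, ips: list[str]) -> None:
    """An Enactor applying a plan: install the records and remember the plan."""
    global applied_version
    plans[version] = ips
    dns_records[ENDPOINT] = ips
    applied_version = version


def clean_up_stale_plans(newest_seen: int) -> None:
    """An Enactor's clean-up: delete plans older than the newest one it applied.
    Deleting a plan also withdraws the records it installed, which is the step
    that turns fatal when the 'stale' plan is the one currently live."""
    global applied_version
    for version in [v for v in list(plans) if v < newest_seen]:
        del plans[version]
        if applied_version == version:
            dns_records.pop(ENDPOINT, None)   # the endpoint now has no IPs at all
            applied_version = None


apply_plan(2, ["10.0.0.5", "10.0.0.6"])   # the second Enactor applies the newer plan
apply_plan(1, ["10.0.0.1", "10.0.0.2"])   # the delayed Enactor finally applies the old one
clean_up_stale_plans(newest_seen=2)       # the second Enactor's clean-up deletes plan 1

print(dns_records.get(ENDPOINT))          # None: DynamoDB's endpoint resolves to nothing
```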



The Cascade

With DynamoDB's DNS failing, systems attempting to connect experienced failures—including internal AWS services. The DropletWorkflow Manager (DWFM), which maintains leases for EC2 physical servers, depends on DynamoDB. When DWFM state checks failed, EC2 couldn't launch new instances or modify existing ones.
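
DWFM's internals are not public, but the shape of the dependency is straightforward to sketch. The StateStore and renew_lease names below are invented for illustration: once the database behind the lease records cannot be reached, every renewal fails, and a host without a current lease cannot take new work.

```python
# Rough illustration of the dependency chain described above. DWFM is a real AWS
# component, but its internals are not public; StateStore, renew_lease, and the
# lease layout below are assumptions made purely for illustration.
import time


class StateStoreUnavailable(Exception):
    """Raised when the backing database (DynamoDB in the real system) is unreachable."""


class StateStore:
    """Stand-in for the database tables that the lease records live in."""

    def __init__(self) -> None:
        self.available = True
        self.items: dict[str, dict] = {}

    def put(self, key: str, value: dict) -> None:
        if not self.available:
            raise StateStoreUnavailable(key)
        self.items[key] = value


def renew_lease(store: StateStore, host_id: str) -> bool:
    """Refresh the lease held for one physical EC2 host. A host without a
    current lease cannot take new instance launches or modifications."""
    try:
        store.put(f"lease/{host_id}", {"renewed_at": time.time()})
        return True
    except StateStoreUnavailable:
        return False


store = StateStore()
store.available = False                 # DynamoDB's endpoint now resolves to nothing
print(renew_lease(store, "host-0001"))  # False: this host can't take new launches
```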


Engineers restored DynamoDB at 02:25 PDT—but recovery created new problems. DWFM attempted to re-establish leases across the entire EC2 fleet simultaneously. The massive scale meant leases began timing out before completion, causing "congestive collapse" that required manual intervention until 05:28 PDT.
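
A back-of-the-envelope calculation shows why re-establishing every lease at once can collapse: if working through the whole fleet takes longer than the lease timeout, the earliest leases expire and rejoin the queue before the first pass finishes. The numbers below are invented purely to illustrate the dynamic.

```python
# Back-of-the-envelope illustration of the "congestive collapse" dynamic.
# Every number below is invented; AWS has not published fleet sizes or timeouts.

fleet_size = 500_000        # hosts whose lease must be re-established
workers = 2_000             # concurrent re-establishment operations
seconds_per_lease = 5       # time to re-establish one lease
lease_timeout = 600         # seconds before an unrefreshed lease expires again

time_to_drain = fleet_size / workers * seconds_per_lease
print(f"time to work through the whole fleet: {time_to_drain:.0f}s")  # 1250s

if time_to_drain > lease_timeout:
    # Leases established early in the pass expire before the pass completes and
    # rejoin the queue, so the backlog never drains without outside intervention.
    print("congestive collapse: work is re-queued faster than it is completed")
```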


Network Manager then had to propagate a huge backlog of delayed network configurations, causing newly launched EC2 instances to experience configuration delays. This in turn disrupted Network Load Balancer health checks, creating further instability.


With EC2 impaired, its dependants followed: Lambda, ECS, EKS, and Fargate all experienced issues. Virtually every modern AWS compute service was affected.



The Global Impact

Major gaming platforms went offline. UK banks couldn't process transactions. Government services failed. Amazon.com itself was down. Alexa and Ring devices stopped working. Messaging apps experienced problems.

Early estimates suggest damage may reach hundreds of billions of dollars—reflecting lost revenue, missed opportunities, damaged customer relationships, regulatory penalties, and emergency response costs across thousands of organisations globally.



AWS's Response

AWS has disabled the DynamoDB DNS Planner and Enactor automation worldwide until safeguards are in place to prevent a recurrence. Rather than shipping a quick fix, AWS has effectively conceded that the system cannot be trusted until fundamental architectural changes are made.

Amazon stated: "We will look for additional ways to avoid impact from a similar event in the future, and how to further reduce time to recovery."



The Uncomfortable Truths

Complexity creates unexpected failures. The DNS system followed best practices: independent components, clean-up processes, automated recovery. Yet their interaction created a race condition the system could not recover from on its own.

Cascading failures are inevitable. The dependency chain from DynamoDB to DWFM to EC2 to virtually every AWS service meant a single database failure cascaded across the entire platform.

Recovery can be harder than prevention. DynamoDB recovered in under three hours, but full restoration took over five hours because recovery efforts made problems worse.

Manual intervention remains essential. Despite extensive automation, human engineers were required at multiple points. Automated systems couldn't correct the DNS state or prevent congestive collapse.



What This Means for You

Race conditions are notoriously difficult to detect. They only manifest under specific timing conditions that don't occur during normal operations or standard testing. The defect behind this outage may well have sat dormant in production for a long time, waiting for the unusually high delays that finally triggered it.
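
One way to hunt for this class of bug before production does is to make timing itself a test input. The sketch below reuses the hypothetical plan and clean-up model from earlier (not anything AWS has published), injects random delays, and counts how often the endpoint ends up empty.

```python
# Sketch of randomised timing-fault injection. It reuses the hypothetical plan
# and clean-up model from earlier, not AWS's actual code, and simply measures
# how often a randomly timed interleaving leaves the endpoint empty.
import random
import threading
import time

ENDPOINT = "dynamodb.example-region.amazonaws.com"


def run_trial(max_delay: float) -> bool:
    """Run one randomly timed interleaving; return True if the endpoint ends up empty."""
    dns_records = {ENDPOINT: ["10.0.0.1"]}
    plans: dict[int, list[str]] = {}
    applied = {"version": None}

    def apply_plan(version: int, ips: list[str]) -> None:
        time.sleep(random.uniform(0, max_delay))      # the "unusually high delay"
        plans[version] = ips
        dns_records[ENDPOINT] = ips
        applied["version"] = version

    def clean_up(newest_seen: int) -> None:
        time.sleep(random.uniform(0, max_delay))
        for version in [v for v in list(plans) if v < newest_seen]:
            plans.pop(version, None)
            if applied["version"] == version:         # deleting the plan that is live
                dns_records.pop(ENDPOINT, None)

    workers = [
        threading.Thread(target=apply_plan, args=(1, ["10.0.0.1"])),   # slow Enactor
        threading.Thread(target=apply_plan, args=(2, ["10.0.0.5"])),   # fast Enactor
        threading.Thread(target=clean_up, args=(2,)),                  # clean-up pass
    ]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return ENDPOINT not in dns_records


failures = sum(run_trial(max_delay=0.01) for _ in range(200))
print(f"{failures}/200 randomly timed runs left the endpoint with no records")
```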


For organisations running on AWS, critical questions demand answers:


Can you tolerate day-long outages? If not, you need strategies beyond AWS's native redundancy. Multi-cloud architectures may be the only path to the resilience you require.

Do you understand your dependencies? Even applications not using DynamoDB were affected through EC2 dependencies. Do you have visibility into your dependency chains?

Are your disaster recovery plans adequate? When AWS experiences failures affecting fundamental services, what can your team actually do?



The Bottom Line

A timing bug measured in milliseconds brought down one of the world's largest cloud platforms, with potential damage reaching hundreds of billions of dollars. Even the most sophisticated infrastructure, designed by world-class engineers, remains vulnerable to simple bugs in critical systems.


For organisations depending on cloud infrastructure, this is a reminder that no provider is immune to catastrophic failures. Resilience requires architectural strategies that account for provider-level outages, not just the availability zone redundancy that cloud providers offer natively.


The cloud's benefits are too significant to abandon. But the illusion of infallibility has been shattered. True resilience requires honest risk assessment, realistic disaster recovery planning, and potentially the expensive complexity of genuine multi-cloud architectures.


A single race condition brought AWS to its knees. What failure mode will we discover next?



Build Resilient Cloud Infrastructure

At Altiatech, we help organisations design cloud strategies that account for provider-level failures that no amount of native redundancy can prevent.


Our cloud services are built upon Cloud Centre of Excellence (CCoE) values with expertise in Microsoft Azure, Amazon Web Services, and Google Cloud Platform. We help you understand dependencies, design architectures that survive provider failures, and build incident response capabilities for scenarios you can't directly control.


Don't assume cloud provider best practices guarantee the resilience your business requires.


Get in touch:

📧 Email: innovate@altiatech.com
📞 Phone (UK): +44 (0)330 332 5482


Build resilience. Understand risk. Stay operational.
