How a Millisecond Timing Bug Cost Hundreds of Billions
Amazon has revealed the cause of one of the most disruptive cloud outages in history: a simple race condition in DynamoDB's DNS management system degraded AWS services globally for the better part of a day, with early damage estimates running into the hundreds of billions of dollars.

The Failure
The incident began at 11:48 PM PDT on 19th October, when customers started reporting increased DynamoDB API error rates in US-EAST-1. The root cause? A timing bug that left the DNS record for DynamoDB's regional endpoint completely empty, so anything trying to connect simply could not resolve it.
DynamoDB's DNS management is split across two independent components: a DNS Planner, which generates plans describing the desired DNS state, and DNS Enactors, which apply those plans via Route 53. That separation is meant to keep updates flowing even when individual components fail, but here it created the conditions for a catastrophic race condition.
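To make that division of labour concrete, here is a minimal sketch of a planner/enactor split. The names, data shapes, and the dictionary standing in for Route 53 are all invented for illustration; this is not AWS's implementation.

```python
# Minimal sketch of a Planner/Enactor split (illustrative only, not AWS's code).
import time
import uuid

def plan_dns(endpoint, healthy_ips):
    """Planner: describe the desired record set for an endpoint as a versioned plan."""
    return {
        "plan_id": str(uuid.uuid4()),
        "created_at": time.time(),
        "endpoint": endpoint,
        "ips": list(healthy_ips),
    }

def enact(dns_table, plan):
    """Enactor: push the plan's records to the DNS service (a dict stands in for Route 53)."""
    dns_table[plan["endpoint"]] = {"plan_id": plan["plan_id"], "ips": plan["ips"]}

# The Planner produces plans; any of several independently running Enactors may apply them.
route53 = {}
enact(route53, plan_dns("dynamodb.example.internal", ["10.0.0.1", "10.0.0.2"]))
print(route53)
```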
The Race Condition
One DNS Enactor experienced unusually high delays whilst applying a DNS plan. Meanwhile, the Planner kept producing newer plans, which a second Enactor picked up and applied. The second Enactor then ran its clean-up process to remove plans it considered stale, just as the first, delayed Enactor finally finished and overwrote the endpoint's record with its now-outdated plan.
The clean-up then deleted that older plan, and with it every IP address for DynamoDB's regional endpoint. The system was left in an inconsistent state that its own automation could not repair, so manual intervention was required.
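The interleaving is easier to see in code. The following is a deliberately simplified reconstruction based on the published description; the names and the dictionary standing in for Route 53 are invented, and the real system is far more involved.

```python
# Simplified reconstruction of the destructive interleaving (illustrative only, not AWS code).
dns_table = {}  # stands in for the Route 53 record set

def apply_plan(plan):
    """An Enactor applies a plan by overwriting the endpoint's record."""
    dns_table[plan["endpoint"]] = {"plan_id": plan["plan_id"], "ips": plan["ips"]}

def clean_up(latest_plan_id):
    """Clean-up deletes records belonging to any plan other than the latest one."""
    for endpoint, record in list(dns_table.items()):
        if record["plan_id"] != latest_plan_id:
            # If a stale plan was re-applied after it was judged stale, this removes
            # the live record outright and leaves the endpoint with no IPs at all.
            del dns_table[endpoint]

old_plan = {"plan_id": "plan-1", "endpoint": "dynamodb.example.internal",
            "ips": ["10.0.0.1", "10.0.0.2"]}
new_plan = {"plan_id": "plan-2", "endpoint": "dynamodb.example.internal",
            "ips": ["10.0.0.3", "10.0.0.4"]}

apply_plan(new_plan)            # the healthy Enactor applies the newer plan
apply_plan(old_plan)            # the delayed Enactor finally applies its older plan, overwriting it
clean_up(new_plan["plan_id"])   # clean-up removes the "stale" plan's record
print(dns_table)                # {} -> the endpoint now resolves to nothing
```

A version check that refuses to overwrite or delete on behalf of an older plan would break this particular interleaving; it illustrates the sort of safeguard the automation will need before it is re-enabled.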
The Cascade
With DynamoDB's endpoint unresolvable, anything that needed to reach it began failing, including internal AWS services. Among them was the DropletWorkflow Manager (DWFM), which maintains leases for the physical servers underpinning EC2 and depends on DynamoDB. When DWFM's state checks failed, EC2 could no longer launch new instances or modify existing ones.
Engineers restored DynamoDB at 02:25 PDT—but recovery created new problems. DWFM attempted to re-establish leases across the entire EC2 fleet simultaneously. The massive scale meant leases began timing out before completion, causing "congestive collapse" that required manual intervention until 05:28 PDT.
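AWS has not published its exact remediation here, but the failure mode itself, a recovery storm in which every client retries at once, is well understood. As a generic illustration only (the function names, batch sizes, and delays below are all invented), bounding the work into batches and adding jitter to retries is the usual way to stop recovery collapsing under its own load:

```python
# Generic sketch of damping a recovery storm with batching and jittered backoff.
# Illustrative only: renew_lease() is a stand-in and every parameter is invented.
import random
import time

def renew_lease(host):
    """Stand-in for a call to the lease service; simulates occasional transient failures."""
    return random.random() > 0.1

def reestablish_leases(hosts, batch_size=50, base_delay=0.2, max_delay=5.0):
    """Re-establish leases in bounded batches, retrying failures with jittered exponential backoff."""
    for start in range(0, len(hosts), batch_size):
        pending = hosts[start:start + batch_size]
        delay = base_delay
        while pending:
            pending = [h for h in pending if not renew_lease(h)]
            if pending:
                # Jitter spreads retries so failed hosts do not all hit the service at once.
                time.sleep(delay + random.uniform(0, delay))
                delay = min(delay * 2, max_delay)

reestablish_leases([f"host-{i}" for i in range(200)])
print("all leases re-established")
```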
Network Manager then had to propagate a huge backlog of delayed network configurations, so newly launched EC2 instances waited on their network state. This in turn disrupted Network Load Balancer health checks, creating further instability.
With EC2 impaired, the services built on it suffered too: Lambda, ECS, EKS, and Fargate all experienced issues, leaving virtually every modern AWS compute service affected.
The Global Impact
Major gaming platforms went offline. UK banks couldn't process transactions. Government services failed. Amazon.com itself was down. Alexa and Ring devices stopped working. Messaging apps experienced problems.
Early estimates suggest damage may reach hundreds of billions of dollars—reflecting lost revenue, missed opportunities, damaged customer relationships, regulatory penalties, and emergency response costs across thousands of organisations globally.
AWS's Response
AWS has disabled the DynamoDB DNS Planner and DNS Enactor automation worldwide until safeguards are in place to prevent a recurrence. Rather than shipping a quick fix, AWS has effectively acknowledged that the system cannot be trusted until more fundamental changes are made.
Amazon stated: "We will look for additional ways to avoid impact from a similar event in the future, and how to further reduce time to recovery."
The Uncomfortable Truths
Complexity creates unexpected failures. The DNS system followed recognised good practice: independent components, clean-up processes, automated recovery. Yet the interaction between those pieces produced a race condition the system could not recover from on its own.
Cascading failures are inevitable. The dependency chain from DynamoDB to DWFM to EC2 to virtually every AWS service meant a single database failure cascaded across the entire platform.
Recovery can be harder than prevention. DynamoDB itself was restored in under three hours, but the EC2 recovery took more than five, and dependent services took far longer still, partly because the recovery efforts themselves created new problems.
Manual intervention remains essential. Despite extensive automation, human engineers were required at multiple points. Automated systems couldn't correct the DNS state or prevent congestive collapse.
What This Means for You
Race conditions are notoriously difficult to detect. They only manifest under specific timing conditions that don't occur during normal operations or standard testing. The "unusually high delays" that triggered this bug may have existed in production for extended periods, waiting for the right conditions.
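A toy example (entirely invented, nothing to do with AWS's code) shows why: the same check-then-act logic behaves correctly when the two steps run close together, and fails badly once a delay opens a window between them.

```python
# Toy demonstration (invented example, not AWS code): a check-then-act race that
# only misbehaves when a delay widens the window between the check and the act.
import threading
import time

balance = {"value": 100}

def withdraw(amount, delay):
    if balance["value"] >= amount:   # check
        time.sleep(delay)            # an "unusually high delay" between check and act
        balance["value"] -= amount   # act, based on a now-possibly-stale check

def run(delay):
    balance["value"] = 100
    threads = [threading.Thread(target=withdraw, args=(100, delay)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return balance["value"]

print(run(0.0))   # usually 0: the second check sees an empty balance and declines
print(run(0.5))   # -100: both checks pass before either thread acts
```

Ordinary unit tests exercise only the fast path; it takes deliberate delay injection or fault injection to exercise the slow one.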
For organisations running on AWS, critical questions demand answers:
Can you tolerate day-long outages? If not, you need strategies beyond AWS's native redundancy. Multi-cloud architectures may be the only path to the resilience you need.
Do you understand your dependencies? Even applications not using DynamoDB were affected through their EC2 dependencies. Do you have visibility into your dependency chains? A simple first step is sketched after these questions.
Are your disaster recovery plans adequate? When AWS experiences failures affecting fundamental services, what can your team actually do?
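As that first step towards dependency visibility, even a crude probe that checks every external endpoint your application actually calls, confirming it resolves and accepts connections, will surface surprises. The endpoints and thresholds below are placeholders; this is a sketch, not a monitoring product.

```python
# Crude dependency probe (illustrative sketch; endpoints are placeholders, not real services).
import socket
import time

DEPENDENCIES = {
    "payments-api": ("payments.example.com", 443),
    "object-storage": ("storage.example.com", 443),
    "identity": ("login.example.com", 443),
}

def probe(host, port, timeout=3.0):
    """Check that the name resolves and the port accepts a TCP connection."""
    started = time.time()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"{time.time() - started:.2f}s"
    except OSError as exc:
        return False, str(exc)

for name, (host, port) in DEPENDENCIES.items():
    ok, detail = probe(host, port)
    print(f"{name:15s} {'OK' if ok else 'FAIL':4s} {detail}")
```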
The Bottom Line
A timing bug measured in milliseconds brought down one of the world's largest cloud platforms, with potential damage reaching hundreds of billions of dollars. Even the most sophisticated infrastructure, designed by world-class engineers, remains vulnerable to simple bugs in critical systems.
For organisations depending on cloud infrastructure, this is a reminder that no provider is immune to catastrophic failures. Resilience requires architectural strategies that account for provider-level outages, not just the availability zone redundancy that cloud providers offer natively.
The cloud's benefits are too significant to abandon. But the illusion of infallibility has been shattered. True resilience requires honest risk assessment, realistic disaster recovery planning, and potentially the expensive complexity of genuine multi-cloud architectures.
A single race condition brought AWS to its knees. What failure mode will we discover next?
Build Resilient Cloud Infrastructure
At Altiatech, we help organisations design cloud strategies that account for provider-level failures that no amount of native redundancy can prevent.
Our cloud services are built upon Cloud Centre of Excellence (CCoE) values with expertise in Microsoft Azure, Amazon Web Services, and Google Cloud Platform. We help you understand dependencies, design architectures that survive provider failures, and build incident response capabilities for scenarios you can't directly control.
Don't assume cloud provider best practices guarantee the resilience your business requires.
Get in touch:
📧 Email: innovate@altiatech.com
📞 Phone (UK): +44 (0)330 332 5482
Build resilience. Understand risk. Stay operational.