How a Millisecond Timing Bug Cost Hundreds of Billions

fahd.zafar • October 24, 2025

Amazon has revealed the shocking cause behind one of history's most devastating cloud outages: a simple race condition in DynamoDB's DNS management system brought down AWS services globally for an entire day, with damage estimates potentially reaching hundreds of billions of dollars.

The Failure

The incident began at 11:48 PM PDT on 19th October, when customers reported increased DynamoDB API error rates in US-EAST-1. The root cause? A timing bug that left the DNS record for DynamoDB's regional endpoint completely empty, so systems trying to connect couldn't resolve it at all.


DynamoDB's DNS management comprises two independent components: a DNS Planner that creates DNS plans, and DNS Enactors that apply those plans via Route 53. This separation is meant to make the system more resilient, but instead it created the conditions for a catastrophic race condition.



The Race Condition

One DNS Enactor experienced unusually high delays whilst processing a DNS plan. Meanwhile, the Planner continued generating newer plans, which a second Enactor began applying. As the second Enactor executed its clean-up process to remove "stale" plans, the first delayed Enactor finally completed its work.


The delayed Enactor's write landed on top of the newer plan, and the clean-up then deleted that older plan as stale, immediately removing all IP addresses for DynamoDB's regional endpoint. The system entered an inconsistent state that prevented automated recovery; manual intervention was required.
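To make the failure mode concrete, here is a deliberately simplified sketch of the check-then-act race described above. Everything in it is illustrative: the class names, method names, and plan numbers are invented, and the real Planner and Enactor components, their APIs, and their timings are internal to AWS and have not been published.

```python
# A deliberately simplified sketch of the race described above. All names and
# plan numbers are invented; the real Planner/Enactor components and their APIs
# are internal to AWS and not public.

class DnsState:
    """Stands in for the regional DNS record set served to clients."""
    def __init__(self):
        self.active_plan = None   # which plan's records are currently live
        self.records = []         # IP addresses currently returned to clients

class Enactor:
    def __init__(self, name, state):
        self.name, self.state = name, state

    def apply(self, plan_id, ips):
        # No generation check: an older plan can overwrite a newer one.
        self.state.active_plan = plan_id
        self.state.records = list(ips)
        print(f"{self.name}: applied plan {plan_id} -> {ips}")

    def clean_up(self, stale_before):
        # Deletes "stale" plans without re-checking what is actually live.
        if self.state.active_plan is not None and self.state.active_plan < stale_before:
            print(f"{self.name}: plan {self.state.active_plan} looks stale, deleting it")
            self.state.active_plan = None
            self.state.records = []          # the endpoint is now empty

state = DnsState()
delayed = Enactor("enactor-1 (delayed)", state)
current = Enactor("enactor-2", state)

current.apply(plan_id=42, ips=["10.0.0.2", "10.0.0.3"])   # newer plan goes live
delayed.apply(plan_id=17, ips=["10.0.0.1"])               # stale apply finally lands, overwriting it
current.clean_up(stale_before=40)                         # clean-up removes the "stale" plan now live

print("records served to clients:", state.records)        # [] -> empty DNS answer
```

The combination of an apply step with no generation check and a clean-up step that trusts its own staleness test is what leaves the endpoint with no records at all.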



The Cascade

With DynamoDB's regional endpoint unresolvable, anything attempting to connect failed, including internal AWS services. The DropletWorkflow Manager (DWFM), which maintains leases for the physical servers that host EC2 instances, depends on DynamoDB. When DWFM's state checks failed, EC2 couldn't launch new instances or modify existing ones.


Engineers restored DynamoDB at 02:25 PDT, but recovery created new problems. DWFM attempted to re-establish leases across the entire EC2 fleet simultaneously. At that scale, leases began timing out before the work could complete, and the retried work piled up into a "congestive collapse" that required manual intervention until 05:28 PDT.
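AWS has not published how DWFM's recovery was eventually brought under control, but the general defence against this kind of thundering herd is well understood: spread the work out with jitter, cap concurrency, and back off on timeouts rather than retrying immediately. The sketch below illustrates that pattern only; re_establish_lease() is a hypothetical stand-in for the real per-host work.

```python
# A general pattern for avoiding the thundering herd described above: spread the
# work with jitter, cap concurrency, and back off on timeouts instead of retrying
# immediately. This is an illustrative sketch only, not AWS's actual DWFM
# recovery mechanism; re_establish_lease() is a hypothetical stand-in.

import asyncio
import random

async def re_establish_lease(host_id: str) -> None:
    # Hypothetical placeholder for the real lease negotiation with one host.
    await asyncio.sleep(0.05)

async def recover_fleet(host_ids, max_in_flight: int = 50) -> None:
    sem = asyncio.Semaphore(max_in_flight)            # hard cap on concurrent work

    async def recover_one(host_id: str) -> None:
        await asyncio.sleep(random.uniform(0, 1.0))   # jitter: don't all arrive at once
        async with sem:
            for attempt in range(5):
                try:
                    await asyncio.wait_for(re_establish_lease(host_id), timeout=2.0)
                    return
                except asyncio.TimeoutError:
                    # Exponential backoff with jitter, so timed-out work doesn't
                    # pile straight back onto an already overloaded queue.
                    await asyncio.sleep((2 ** attempt) * random.uniform(0.5, 1.5))

    await asyncio.gather(*(recover_one(h) for h in host_ids))

asyncio.run(recover_fleet([f"host-{i}" for i in range(1000)]))
```

The jitter and the concurrency cap are what stop timed-out work from re-queuing all at once, which is the feedback loop behind congestive collapse.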


Network Manager then had to propagate a huge backlog of delayed network configurations, so newly launched EC2 instances came up with lagging network state. That in turn caused Network Load Balancer health checks to fail, creating further instability.


With EC2 impaired, dependent services followed: Lambda, ECS, EKS, and Fargate all experienced issues, meaning virtually every modern AWS compute service was affected.



The Global Impact

Major gaming platforms went offline. UK banks couldn't process transactions. Government services failed. Amazon.com itself was down. Alexa and Ring devices stopped working. Messaging apps experienced problems.

Early estimates suggest damage may reach hundreds of billions of dollars—reflecting lost revenue, missed opportunities, damaged customer relationships, regulatory penalties, and emergency response costs across thousands of organisations globally.



AWS's Response

AWS has disabled the DynamoDB DNS Planner and Enactor automation worldwide until safeguards are in place to prevent a recurrence. Rather than implementing a quick fix, AWS has essentially admitted the system cannot be trusted until fundamental architectural changes are made.

Amazon stated: "We will look for additional ways to avoid impact from a similar event in the future, and how to further reduce time to recovery."



The Uncomfortable Truths

Complexity creates unexpected failures. The DNS system followed best practices: independent components, clean-up processes, automated recovery. Yet their interaction created a race condition the system could not recover from on its own.

Cascading failures are inevitable. The dependency chain from DynamoDB to DWFM to EC2 to virtually every AWS service meant a single database failure cascaded across the entire platform.

Recovery can be harder than prevention. DynamoDB recovered in under three hours, but full restoration took over five hours because recovery efforts made problems worse.

Manual intervention remains essential. Despite extensive automation, human engineers were required at multiple points. Automated systems couldn't correct the DNS state or prevent congestive collapse.



What This Means for You

Race conditions are notoriously difficult to detect. They only manifest under specific timing conditions that don't occur during normal operation or standard testing. This latent bug may have sat in production for an extended period, waiting for the "unusually high delays" that finally triggered it.
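The sketch below shows the detection problem in miniature, using a toy example rather than anything from AWS's codebase: a check-then-act update that passes every sequential test, yet loses to a stale write a measurable fraction of the time once artificial delays are injected between the check and the write.

```python
# Illustrative only: a toy check-then-act race plus a test loop that injects
# random delays to surface it. Nothing here reflects AWS's internal systems.

import random
import threading
import time

def run_once() -> int:
    state = {"active_plan": 0}

    def apply_plan(plan_id: int) -> None:
        if plan_id > state["active_plan"]:            # check
            time.sleep(random.uniform(0, 0.002))      # injected delay widens the race window
            state["active_plan"] = plan_id            # act (not atomic with the check)

    threads = [threading.Thread(target=apply_plan, args=(p,)) for p in (1, 2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["active_plan"]

losses = sum(run_once() != 2 for _ in range(500))
print(f"stale plan won the race in {losses} of 500 runs")   # usually well above zero
```

Remove the injected sleep and the same loop will almost never fail, which is exactly why this class of bug survives standard testing.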


For organisations running on AWS, critical questions demand answers:


Can you tolerate day-long outages? If not, you need strategies beyond AWS's native redundancy. Multi-cloud architectures may be the only path to the resilience you require.

Do you understand your dependencies? Even applications that never touch DynamoDB were affected through their EC2 dependencies. Do you have visibility into your dependency chains? A lightweight probe along the lines of the sketch after these questions is one place to start.

Are your disaster recovery plans adequate? When AWS experiences failures affecting fundamental services, what can your team actually do?
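The probe below is a minimal, illustrative starting point for that visibility: it checks that the endpoints a system depends on still resolve in DNS and accept connections. The endpoint list here is hypothetical; substitute the services your own stack, and the platforms underneath it, actually rely on.

```python
# A minimal, illustrative dependency probe: confirm that critical endpoints still
# resolve in DNS and accept TCP connections. The endpoint list is hypothetical;
# substitute the services your own stack actually depends on.

import socket

CRITICAL_ENDPOINTS = [
    ("dynamodb.us-east-1.amazonaws.com", 443),
    ("ec2.us-east-1.amazonaws.com", 443),
    ("sts.amazonaws.com", 443),
]

def check(host: str, port: int, timeout: float = 3.0) -> str:
    try:
        addrs = {a[4][0] for a in socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)}
    except socket.gaierror:
        return "DNS resolution failed"
    if not addrs:
        return "DNS returned an empty answer"
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return f"OK ({len(addrs)} address(es) resolved)"
    except OSError as exc:
        return f"TCP connect failed: {exc}"

for host, port in CRITICAL_ENDPOINTS:
    print(f"{host}:{port} -> {check(host, port)}")
```

In this incident, the DNS check alone would have flagged the problem, since the regional endpoint's record was left empty.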



The Bottom Line

A timing bug measured in milliseconds brought down one of the world's largest cloud platforms, with potential damage reaching hundreds of billions of dollars. Even the most sophisticated infrastructure, designed by world-class engineers, remains vulnerable to simple bugs in critical systems.


For organisations depending on cloud infrastructure, this is a reminder that no provider is immune to catastrophic failures. Resilience requires architectural strategies that account for provider-level outages, not just the availability zone redundancy that cloud providers offer natively.


The cloud's benefits are too significant to abandon. But the illusion of infallibility has been shattered. True resilience requires honest risk assessment, realistic disaster recovery planning, and potentially the expensive complexity of genuine multi-cloud architectures.


A single race condition brought AWS to its knees. What failure mode will we discover next?



Build Resilient Cloud Infrastructure

At Altiatech, we help organisations design cloud strategies that account for provider-level failures that no amount of native redundancy can prevent.


Our cloud services are built upon Cloud Centre of Excellence (CCoE) values with expertise in Microsoft Azure, Amazon Web Services, and Google Cloud Platform. We help you understand dependencies, design architectures that survive provider failures, and build incident response capabilities for scenarios you can't directly control.


Don't assume cloud provider best practices guarantee the resilience your business requires.


Get in touch:

📧 Email: innovate@altiatech.com
📞 Phone (UK): +44 (0)330 332 5482


Build resilience. Understand risk. Stay operational.
