How a Millisecond Timing Bug Cost Hundreds of Billions

fahd.zafar • October 24, 2025

Amazon has revealed the cause behind one of history's most disruptive cloud outages: a race condition in DynamoDB's DNS management system degraded AWS services globally for the better part of a day, with early damage estimates reaching into the hundreds of billions of dollars.

The Failure

The incident began at 11:48 PM PDT on 19th October, when customers reported elevated DynamoDB API error rates in US-EAST-1. The root cause? A timing bug that left DynamoDB's DNS record completely empty, so systems trying to connect simply couldn't resolve the endpoint.


DynamoDB's DNS management comprises two independent components: a DNS Planner that creates DNS plans, and DNS Enactors that apply those plans via Route 53. This separation was meant to make the system more robust; instead, it created the conditions for a catastrophic race.
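AWS has not published the internals, but the split can be modelled in a few lines. This is a minimal sketch with invented names (`DnsPlanner`, `DnsEnactor`, the plan dictionary), not AWS's code:

```python
import itertools

class DnsPlanner:
    """Generates versioned DNS plans (hypothetical model, not AWS's code)."""
    def __init__(self):
        self._versions = itertools.count(1)

    def make_plan(self, ips):
        # A monotonically increasing version lets enactors order plans.
        return {"version": next(self._versions), "ips": list(ips)}

class DnsEnactor:
    """Applies a plan to a record store standing in for Route 53."""
    def apply(self, plan, records, endpoint):
        records[endpoint] = plan

# Usage: the planner emits plans; an enactor applies them to the records.
records = {}
planner = DnsPlanner()
enactor = DnsEnactor()
enactor.apply(planner.make_plan(["10.0.0.1", "10.0.0.2"]),
              records, "dynamodb.us-east-1.example")
```

The trouble, as the next section shows, is what happens when several enactors apply plans of different ages against the same record.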



The Race Condition

One DNS Enactor experienced unusually high delays whilst processing a DNS plan. Meanwhile, the Planner continued generating newer plans, which a second Enactor began applying. As the second Enactor executed its clean-up process to remove "stale" plans, the first delayed Enactor finally completed its work.


The delayed Enactor's apply overwrote the newer plan with its stale one, and the clean-up then deleted that stale plan as obsolete, instantly removing every IP address for DynamoDB's regional endpoint. The system entered an inconsistent state that blocked automated recovery; manual intervention was required.
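The interleaving can be reproduced deterministically without threads. This toy sequence (invented names, not AWS's code) shows how an unguarded clean-up empties the record:

```python
records = {}
ENDPOINT = "dynamodb.example"

def apply_plan(plan):
    # No version guard: a stale plan can overwrite a newer one.
    records[ENDPOINT] = plan

def cleanup_stale(latest_version):
    # Deletes records older than the newest known plan. Interleaved with
    # a delayed apply, this can delete the only record there is.
    current = records.get(ENDPOINT)
    if current and current["version"] < latest_version:
        del records[ENDPOINT]

apply_plan({"version": 2, "ips": ["10.0.0.3"]})  # fast Enactor applies v2
apply_plan({"version": 1, "ips": ["10.0.0.1"]})  # delayed Enactor lands v1 late
cleanup_stale(latest_version=2)                  # clean-up removes "stale" v1
# The endpoint now has no record at all: nothing resolves.
```

A version guard in `apply_plan` (reject any plan older than the record already in place) would break this interleaving; without one, the clean-up and the delayed apply race to a state neither anticipated.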



The Cascade

With DynamoDB's DNS failing, systems attempting to connect experienced failures—including internal AWS services. The DropletWorkflow Manager (DWFM), which maintains leases for EC2 physical servers, depends on DynamoDB. When DWFM state checks failed, EC2 couldn't launch new instances or modify existing ones.


Engineers restored DynamoDB at 02:25 PDT—but recovery created new problems. DWFM attempted to re-establish leases across the entire EC2 fleet simultaneously. The massive scale meant leases began timing out before completion, causing "congestive collapse" that required manual intervention until 05:28 PDT.
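This "thundering herd" on recovery is a classic hazard, and a common mitigation (not necessarily what AWS did) is to stagger re-establishment across a window rather than letting the whole fleet reconnect at once. A sketch, with invented names:

```python
import random

def staggered_schedule(host_ids, window_s=300.0, seed=None):
    """Assign each host a random start offset within a window, so a fleet
    re-establishing leases does not hammer a recovering dependency at once."""
    rng = random.Random(seed)
    return {host: rng.uniform(0.0, window_s) for host in host_ids}

# Usage: spread 1,000 hosts' lease renewals across five minutes.
schedule = staggered_schedule([f"host-{i}" for i in range(1000)], seed=7)
```

Jitter of this kind trades a slower full recovery for one that actually completes, instead of timing out under its own load.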


Network Manager then had to work through a large backlog of delayed network configurations, so newly launched EC2 instances came up with lagging network state. That in turn disrupted Network Load Balancer health checks, creating further instability.


With EC2 impaired, every dependent service failed: Lambda, ECS, EKS, and Fargate all experienced issues—virtually every modern AWS compute service was affected.



The Global Impact

Major gaming platforms went offline. UK banks couldn't process transactions. Government services failed. Amazon.com itself was down. Alexa and Ring devices stopped working. Messaging apps experienced problems.

Early estimates suggest damage may reach hundreds of billions of dollars—reflecting lost revenue, missed opportunities, damaged customer relationships, regulatory penalties, and emergency response costs across thousands of organisations globally.



AWS's Response

AWS has disabled the DynamoDB DNS Planner and Enactor automation worldwide until safeguards prevent recurrence. Rather than implementing a quick fix, AWS essentially admitted the system cannot be trusted until fundamental architectural changes are made.

Amazon stated: "We will look for additional ways to avoid impact from a similar event in the future, and how to further reduce time to recovery."



The Uncomfortable Truths

Complexity creates unexpected failures. The DNS system followed best practices—independent components, clean-up processes, automated recovery. Yet their interaction created a race condition that left the system unrecoverable.

Cascading failures are inevitable. The dependency chain from DynamoDB to DWFM to EC2 to virtually every AWS service meant a single database failure cascaded across the entire platform.

Recovery can be harder than prevention. DynamoDB recovered in under three hours, but full restoration took over five hours because recovery efforts made problems worse.

Manual intervention remains essential. Despite extensive automation, human engineers were required at multiple points. Automated systems couldn't correct the DNS state or prevent congestive collapse.



What This Means for You

Race conditions are notoriously difficult to detect. They only manifest under specific timing conditions that don't occur during normal operations or standard testing. The "unusually high delays" that triggered this bug may have existed in production for extended periods, waiting for the right conditions.
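One practical counter is to make the bad interleaving reachable on demand by injecting delays at suspect points, rather than hoping it surfaces in normal testing. A sketch (the delays and names are invented for illustration):

```python
import threading
import time

records = {"endpoint": {"version": 2}}
deleted = threading.Event()

def delayed_writer():
    time.sleep(0.05)                      # injected delay: the slow enactor
    records["endpoint"] = {"version": 1}  # stale write lands late

def cleaner():
    time.sleep(0.2)                       # clean-up fires after the late write
    if records.get("endpoint", {}).get("version", 99) < 2:
        del records["endpoint"]           # "stale" record removed: endpoint empty
        deleted.set()

threads = [threading.Thread(target=delayed_writer),
           threading.Thread(target=cleaner)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the delays forced, the test fails every run instead of once in a million, which is the only way a review or CI pipeline will ever see it.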


For organisations running on AWS, critical questions demand answers:


Can you tolerate day-long outages? If not, you need strategies beyond AWS's native redundancy. Multi-cloud architectures may be the only path to required resilience.

Do you understand your dependencies? Even applications not using DynamoDB were affected through EC2 dependencies. Do you have visibility into your dependency chains?

Are your disaster recovery plans adequate? When AWS experiences failures affecting fundamental services, what can your team actually do?
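On the dependency question above: visibility can start with nothing fancier than a transitive walk over a service graph. The topology below is illustrative, not AWS's real one:

```python
from collections import deque

# Illustrative dependency edges: service -> services it calls directly.
DEPS = {
    "my-app": ["ec2", "s3"],
    "ec2": ["dwfm"],
    "dwfm": ["dynamodb"],
    "s3": [],
    "dynamodb": [],
}

def transitive_deps(service, deps):
    """Breadth-first walk returning every service reachable from `service`."""
    seen, queue = set(), deque(deps.get(service, []))
    while queue:
        dep = queue.popleft()
        if dep not in seen:
            seen.add(dep)
            queue.extend(deps.get(dep, []))
    return seen
```

Even though `my-app` never calls DynamoDB directly, the walk surfaces it as a transitive dependency via EC2 and DWFM, which is exactly how applications with no DynamoDB usage were still taken down.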



The Bottom Line

A timing bug measured in milliseconds brought down one of the world's largest cloud platforms, with potential damage reaching hundreds of billions of dollars. Even the most sophisticated infrastructure, designed by world-class engineers, remains vulnerable to simple bugs in critical systems.


For organisations depending on cloud infrastructure, this is a reminder that no provider is immune to catastrophic failures. Resilience requires architectural strategies that account for provider-level outages, not just the availability zone redundancy that cloud providers offer natively.


The cloud's benefits are too significant to abandon. But the illusion of infallibility has been shattered. True resilience requires honest risk assessment, realistic disaster recovery planning, and potentially the expensive complexity of genuine multi-cloud architectures.


A single race condition brought AWS to its knees. What failure mode will we discover next?



Build Resilient Cloud Infrastructure

At Altiatech, we help organisations design cloud strategies that account for provider-level failures that no amount of native redundancy can prevent.


Our cloud services are built upon Cloud Centre of Excellence (CCoE) values with expertise in Microsoft Azure, Amazon Web Services, and Google Cloud Platform. We help you understand dependencies, design architectures that survive provider failures, and build incident response capabilities for scenarios you can't directly control.


Don't assume cloud provider best practices guarantee the resilience your business requires.


Get in touch:

📧 Email: innovate@altiatech.com
📞 Phone (UK): +44 (0)330 332 5482


Build resilience. Understand risk. Stay operational.

