How a Millisecond Timing Bug Cost Hundreds of Billions

fahd.zafar • October 24, 2025

Amazon has revealed the cause behind one of history's most disruptive cloud outages: a race condition in DynamoDB's DNS management system degraded AWS services globally for the better part of a day, with early damage estimates reaching into the hundreds of billions of dollars.

The Failure

The incident began at 11:48 PM PDT on 19th October, when customers reported elevated DynamoDB API error rates in US-EAST-1. The root cause? A timing bug that left DynamoDB's DNS record completely empty, so systems trying to connect simply couldn't resolve the endpoint.


DynamoDB's DNS management comprises two independent components: a DNS Planner that creates DNS plans, and DNS Enactors that apply those plans via Route 53. This separation was meant to make the system more robust; instead, it created the conditions for a catastrophic race.
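AWS has not published the internals, but the split can be modelled in a few lines. This is a minimal sketch with invented names (`DnsPlanner`, `DnsEnactor`, the plan dictionary), not AWS's code:

```python
import itertools

class DnsPlanner:
    """Generates versioned DNS plans (hypothetical model, not AWS's code)."""
    def __init__(self):
        self._versions = itertools.count(1)

    def make_plan(self, ips):
        # A monotonically increasing version lets enactors order plans.
        return {"version": next(self._versions), "ips": list(ips)}

class DnsEnactor:
    """Applies a plan to a record store standing in for Route 53."""
    def apply(self, plan, records, endpoint):
        records[endpoint] = plan

# Usage: the planner emits plans; an enactor applies them to the records.
records = {}
planner = DnsPlanner()
enactor = DnsEnactor()
enactor.apply(planner.make_plan(["10.0.0.1", "10.0.0.2"]),
              records, "dynamodb.us-east-1.example")
```

The trouble, as the next section shows, is what happens when several enactors apply plans of different ages against the same record.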



The Race Condition

One DNS Enactor experienced unusually high delays whilst processing a DNS plan. Meanwhile, the Planner continued generating newer plans, which a second Enactor began applying. As the second Enactor executed its clean-up process to remove "stale" plans, the first delayed Enactor finally completed its work.


The delayed Enactor's apply overwrote the newer plan with its stale one, and the clean-up then deleted that stale plan as obsolete, instantly removing every IP address for DynamoDB's regional endpoint. The system entered an inconsistent state that blocked automated recovery; manual intervention was required.
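The interleaving can be reproduced deterministically without threads. This toy sequence (invented names, not AWS's code) shows how an unguarded clean-up empties the record:

```python
records = {}
ENDPOINT = "dynamodb.example"

def apply_plan(plan):
    # No version guard: a stale plan can overwrite a newer one.
    records[ENDPOINT] = plan

def cleanup_stale(latest_version):
    # Deletes records older than the newest known plan. Interleaved with
    # a delayed apply, this can delete the only record there is.
    current = records.get(ENDPOINT)
    if current and current["version"] < latest_version:
        del records[ENDPOINT]

apply_plan({"version": 2, "ips": ["10.0.0.3"]})  # fast Enactor applies v2
apply_plan({"version": 1, "ips": ["10.0.0.1"]})  # delayed Enactor lands v1 late
cleanup_stale(latest_version=2)                  # clean-up removes "stale" v1
# The endpoint now has no record at all: nothing resolves.
```

A version guard in `apply_plan` (reject any plan older than the record already in place) would break this interleaving; without one, the clean-up and the delayed apply race to a state neither anticipated.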



The Cascade

With DynamoDB's DNS failing, systems attempting to connect experienced failures—including internal AWS services. The DropletWorkflow Manager (DWFM), which maintains leases for EC2 physical servers, depends on DynamoDB. When DWFM state checks failed, EC2 couldn't launch new instances or modify existing ones.


Engineers restored DynamoDB at 02:25 PDT—but recovery created new problems. DWFM attempted to re-establish leases across the entire EC2 fleet simultaneously. The massive scale meant leases began timing out before completion, causing "congestive collapse" that required manual intervention until 05:28 PDT.
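This "thundering herd" on recovery is a classic hazard, and a common mitigation (not necessarily what AWS did) is to stagger re-establishment across a window rather than letting the whole fleet reconnect at once. A sketch, with invented names:

```python
import random

def staggered_schedule(host_ids, window_s=300.0, seed=None):
    """Assign each host a random start offset within a window, so a fleet
    re-establishing leases does not hammer a recovering dependency at once."""
    rng = random.Random(seed)
    return {host: rng.uniform(0.0, window_s) for host in host_ids}

# Usage: spread 1,000 hosts' lease renewals across five minutes.
schedule = staggered_schedule([f"host-{i}" for i in range(1000)], seed=7)
```

Jitter of this kind trades a slower full recovery for one that actually completes, instead of timing out under its own load.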


Network Manager then had to work through a large backlog of delayed network configurations, so newly launched EC2 instances came up with lagging network state. That in turn disrupted Network Load Balancer health checks, creating further instability.


With EC2 impaired, every dependent service failed: Lambda, ECS, EKS, and Fargate all experienced issues—virtually every modern AWS compute service was affected.



The Global Impact

Major gaming platforms went offline. UK banks couldn't process transactions. Government services failed. Amazon.com itself was down. Alexa and Ring devices stopped working. Messaging apps experienced problems.

Early estimates suggest damage may reach hundreds of billions of dollars—reflecting lost revenue, missed opportunities, damaged customer relationships, regulatory penalties, and emergency response costs across thousands of organisations globally.



AWS's Response

AWS has disabled the DynamoDB DNS Planner and Enactor automation worldwide until safeguards prevent recurrence. Rather than implementing a quick fix, AWS essentially admitted the system cannot be trusted until fundamental architectural changes are made.

Amazon stated: "We will look for additional ways to avoid impact from a similar event in the future, and how to further reduce time to recovery."



The Uncomfortable Truths

Complexity creates unexpected failures. The DNS system followed best practices—independent components, clean-up processes, automated recovery. Yet their interaction created a race condition that left the system unrecoverable.

Cascading failures are inevitable. The dependency chain from DynamoDB to DWFM to EC2 to virtually every AWS service meant a single database failure cascaded across the entire platform.

Recovery can be harder than prevention. DynamoDB recovered in under three hours, but full restoration took over five hours because recovery efforts made problems worse.

Manual intervention remains essential. Despite extensive automation, human engineers were required at multiple points. Automated systems couldn't correct the DNS state or prevent congestive collapse.



What This Means for You

Race conditions are notoriously difficult to detect. They only manifest under specific timing conditions that don't occur during normal operations or standard testing. The "unusually high delays" that triggered this bug may have existed in production for extended periods, waiting for the right conditions.
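One practical counter is to make the bad interleaving reachable on demand by injecting delays at suspect points, rather than hoping it surfaces in normal testing. A sketch (the delays and names are invented for illustration):

```python
import threading
import time

records = {"endpoint": {"version": 2}}
deleted = threading.Event()

def delayed_writer():
    time.sleep(0.05)                      # injected delay: the slow enactor
    records["endpoint"] = {"version": 1}  # stale write lands late

def cleaner():
    time.sleep(0.2)                       # clean-up fires after the late write
    if records.get("endpoint", {}).get("version", 99) < 2:
        del records["endpoint"]           # "stale" record removed: endpoint empty
        deleted.set()

threads = [threading.Thread(target=delayed_writer),
           threading.Thread(target=cleaner)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the delays forced, the test fails every run instead of once in a million, which is the only way a review or CI pipeline will ever see it.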


For organisations running on AWS, critical questions demand answers:


Can you tolerate day-long outages? If not, you need strategies beyond AWS's native redundancy. Multi-cloud architectures may be the only path to required resilience.

Do you understand your dependencies? Even applications not using DynamoDB were affected through EC2 dependencies. Do you have visibility into your dependency chains?

Are your disaster recovery plans adequate? When AWS experiences failures affecting fundamental services, what can your team actually do?
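On the dependency question above: visibility can start with nothing fancier than a transitive walk over a service graph. The topology below is illustrative, not AWS's real one:

```python
from collections import deque

# Illustrative dependency edges: service -> services it calls directly.
DEPS = {
    "my-app": ["ec2", "s3"],
    "ec2": ["dwfm"],
    "dwfm": ["dynamodb"],
    "s3": [],
    "dynamodb": [],
}

def transitive_deps(service, deps):
    """Breadth-first walk returning every service reachable from `service`."""
    seen, queue = set(), deque(deps.get(service, []))
    while queue:
        dep = queue.popleft()
        if dep not in seen:
            seen.add(dep)
            queue.extend(deps.get(dep, []))
    return seen
```

Even though `my-app` never calls DynamoDB directly, the walk surfaces it as a transitive dependency via EC2 and DWFM, which is exactly how applications with no DynamoDB usage were still taken down.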



The Bottom Line

A timing bug measured in milliseconds brought down one of the world's largest cloud platforms, with potential damage reaching hundreds of billions of dollars. Even the most sophisticated infrastructure, designed by world-class engineers, remains vulnerable to simple bugs in critical systems.


For organisations depending on cloud infrastructure, this is a reminder that no provider is immune to catastrophic failures. Resilience requires architectural strategies that account for provider-level outages, not just the availability zone redundancy that cloud providers offer natively.


The cloud's benefits are too significant to abandon. But the illusion of infallibility has been shattered. True resilience requires honest risk assessment, realistic disaster recovery planning, and potentially the expensive complexity of genuine multi-cloud architectures.


A single race condition brought AWS to its knees. What failure mode will we discover next?



Build Resilient Cloud Infrastructure

At Altiatech, we help organisations design cloud strategies that account for provider-level failures that no amount of native redundancy can prevent.


Our cloud services are built upon Cloud Centre of Excellence (CCoE) values with expertise in Microsoft Azure, Amazon Web Services, and Google Cloud Platform. We help you understand dependencies, design architectures that survive provider failures, and build incident response capabilities for scenarios you can't directly control.


Don't assume cloud provider best practices guarantee the resilience your business requires.


Get in touch:

📧 Email: innovate@altiatech.com
📞 Phone (UK): +44 (0)330 332 5482


Build resilience. Understand risk. Stay operational.

