An outage at Amazon Web Services (AWS) on Monday took out thousands of websites and mobile phone apps across the globe.
AWS, a cloud computing service that allows companies to rent the retail giant’s internet infrastructure, suffered a glitch during an update.
The hours-long outage knocked government websites, banks, games, streaming services, airlines and crypto platforms offline from about 8am.
Luke Kehoe, an industry analyst at Ookla, which runs the internet outage tracker Downdetector, told Metro: ‘Over 16 million reports were registered on Downdetector during the outage, making it one of the most impactful to date in terms of blast radius, with over 2,000 services across more than 60 countries impacted.
‘There were over 1.5 million outage reports in the UK alone, significantly above the global daily baseline of one million reports typically observed.’
Amazon engineers say they have fixed the issue, and most websites and apps relying on AWS are working normally again.
What caused Amazon’s AWS outage?
Amazon has revealed the cause of the outage was a software bug related to DynamoDB, the service’s database system where customers’ data is stored.
In a lengthy and technical statement, Amazon said the bug was caused by ‘a latent defect within the service’s automated DNS [domain name system] management system’. Here’s what that means:
Every time you send an email, stream a movie or buy something online, you generate data. This data has to go somewhere, so to save space on your device, it gets stored in ‘the cloud’.
But the cloud isn’t above your head; it’s just the word used for the physical disk drives inside big data centres that your emails and posts go through.
Amazon’s cloud is made up of big data centres grouped into dozens of regions, and the problem affected the one in Northern Virginia, called US-EAST-1.
Amazon effectively rents out data space in these centres to companies, which use them to host and run their websites and apps – AWS is home to some 76,000,000 websites.
Just like your phone, however, this technology needs to be updated from time to time. Engineers installed an update to the API – the interface between computers – for DynamoDB, which stores user information.
During the update, an error occurred in the database’s Domain Name System (DNS) – the phone book of the internet – so apps couldn’t find the correct server address to load.
Amazon’s statement explains: ‘When this issue occurred at 11:48 PM PDT, all systems needing to connect to the DynamoDB service in the N. Virginia (us-east-1) Region via the public endpoint immediately began experiencing DNS failures and failed to connect to DynamoDB.
‘This included customer traffic as well as traffic from internal AWS services that rely on DynamoDB.
‘Customers with DynamoDB global tables were able to successfully connect to and issue requests against their replica tables in other Regions, but experienced prolonged replication lag to and from the replica tables in the N. Virginia (us-east-1) Region.’
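For readers who want to see the ‘phone book’ idea in code, here is a minimal Python sketch of what an app does before it can connect to anything: it asks DNS for an address, and if no answer comes back it has nowhere to go. The hostname `db.example.com`, the two stand-in resolvers and the address `192.0.2.10` are illustrative assumptions, not Amazon’s real systems.

```python
import socket


def look_up(hostname, resolver=socket.getaddrinfo):
    """Return the IP addresses DNS gives for hostname, or [] if the lookup fails."""
    try:
        records = resolver(hostname, 443)
    except socket.gaierror:
        # The "phone book" has no usable entry, so there is nowhere to connect.
        return []
    # Each record is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({record[4][0] for record in records})


def broken_resolver(hostname, port):
    """Hypothetical stand-in for a DNS system that has lost its records."""
    raise socket.gaierror("name or service not known")


def working_resolver(hostname, port):
    """Hypothetical stand-in for healthy DNS; the address is a documentation-only IP."""
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("192.0.2.10", port))]


healthy = look_up("db.example.com", working_resolver)
during_outage = look_up("db.example.com", broken_resolver)
```

In this sketch `healthy` holds one address while `during_outage` is an empty list. The servers themselves may be running perfectly well, but an app that cannot look up the address cannot reach them.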
Error pages and snail-pace loading bars began appearing within two minutes of the glitch, said Jamie Beckland, chief product officer at APIContext, which monitors APIs.
‘What happened next was the most interesting,’ he told Metro. ‘Some applications with automated failovers in place were restored within five to 15 minutes, which is commendable.
‘Others, with well-tested manual disaster recovery programs, were back online in 20-90 minutes. Yet others threw their hands up and declared that they were at the mercy of their service provider and were down for up to eight hours.’
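The ‘automated failover’ Beckland describes can be sketched in a few lines of Python: try the usual region first and, if it does not answer, move on to a backup without waiting for a human. This is a simplified illustration under stated assumptions, not how any particular company does it. The function names, the `fake_fetch` stand-in and the choice of `eu-west-1` as the backup region are all hypothetical.

```python
def fetch_with_failover(regions, fetch):
    """Try each region in order and return (region, result) from the first that answers."""
    errors = {}
    for region in regions:
        try:
            return region, fetch(region)
        except ConnectionError as exc:
            errors[region] = str(exc)  # remember why this region was skipped
    raise ConnectionError(f"all regions failed: {errors}")


def fake_fetch(region):
    """Hypothetical stand-in for a real request; pretends us-east-1 is down."""
    if region == "us-east-1":
        raise ConnectionError("DNS lookup failed")
    return f"profile served from {region}"


served_by, response = fetch_with_failover(["us-east-1", "eu-west-1"], fake_fetch)
```

Here the request ends up being served by the backup region. An app with only one region in its list has nothing to fall back on, which is roughly the position of the companies that stayed down for hours.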
Many users said that while websites and apps may load for them, they couldn’t log in. This is because DynamoDB, and other servers running on AWS, store or check login details.
DynamoDB, like other services of its kind, is designed to repair itself automatically when errors are detected.
But this time, the automated process failed due to a ‘latent race condition’, Amazon said – in other words, a dormant bug that only shows itself when an extremely unlikely sequence of events occurs in complex code.
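A race condition is easier to see than to describe. The Python sketch below is a generic textbook example, not Amazon’s actual bug: two workers each read a number and then write back that number plus one. The names `increment` and `run` and the schedules `"aabb"` and `"abab"` are made up for illustration.

```python
def increment(store):
    """A two-step update: read the value, then write back value + 1."""
    seen = store["count"]      # step 1: read
    yield                      # another worker may run here
    store["count"] = seen + 1  # step 2: write, based on a possibly stale read


def run(schedule):
    """Run two workers, 'a' and 'b', advancing them one step at a time in the given order."""
    store = {"count": 0}
    workers = {"a": increment(store), "b": increment(store)}
    for name in schedule:
        next(workers[name], None)  # advance that worker by one step
    return store["count"]


usual = run("aabb")    # a finishes before b starts
unlucky = run("abab")  # both read before either writes
```

In the usual order the count ends at 2. In the unlucky order both workers read 0 before either writes, so the count ends at 1 and one update silently vanishes. The code is identical in both runs; only the timing differs, which is why such bugs can lie dormant for years.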
So, Amazon wasn’t hacked? It wasn’t a cyber attack?
While supermarkets, luxury brands, airports and nursery chains are among the many firms targeted by hackers in recent months, Amazon was not one of them, said Marijus Briedis, chief technology officer at the privacy tech company NordVPN.
‘Now that the storm has passed for Amazon Web Services, their customers will be turning to them and demanding answers for why this chaos was allowed to unfold, and the simple answer is – it was a technical fault,’ Briedis told Metro.
‘Many will let out a sigh of relief that this wasn’t the work of hackers. A single glitch in one of AWS’s main data hubs caused a chain reaction that took thousands of companies and services offline.’
But Briedis cautioned that this isn’t necessarily good news, as every outage shows cyber crooks what a well-timed attack could achieve.
‘It also gives them further proof that even the biggest tech companies in the world can let down their guard at times,’ he added.
What sites went down?
It’s probably slightly easier to say which online services didn’t go down.
Downdetector told Metro that some of the biggest names impacted included:
- Snapchat – 3,000,000+ user reports
- AWS – 2,500,000+ user reports
- Roblox – 716,000+ user reports
- Amazon – 698,000+ user reports
- Reddit – 397,000+ user reports
- Ring – 357,000+ user reports
- Instructure – 265,000+ user reports
- Fortnite – 233,000 user reports
- Venmo – 185,000+ user reports
- Canva – 168,000+ user reports
The countries with the most user reports were:
- US – 6,300,000+ user reports
- UK – 1,500,000+ user reports
- Denmark – 774,000 user reports
- The Netherlands – 737,000+ user reports
- Brazil – 589,000+ user reports
- France – 587,000+ user reports
- Australia – 516,000+ user reports
- Canada – 475,000+ user reports
- India – 428,000+ user reports
- Japan – 368,000+ user reports
Other services that went down included ChatGPT, Sky, Lloyds Bank, Duolingo, PlayStation, GOV.UK websites, Coinbase, Zoom and The New York Times’ Wordle.
What is the current status – is everything back to normal?
For the most part. AWS’ health dashboard says that the technical hiccup has been ‘resolved’, and Downdetector is showing far fewer reports of users struggling to access websites than yesterday.
As of today, 19 websites and apps were still experiencing issues, Ookla, which owns Downdetector, told Metro.
Could it happen again?
US-EAST-1 is one of Amazon’s oldest and most widely used data centre regions, and it has suffered outages before, in 2020, 2021 and 2023.
Suhaib Zaheer, the senior vice president at the website host Cloudways, told Metro that the scale of yesterday’s outage is in no way surprising.
‘The AWS outage shows just how much of the internet depends on a few shared foundations,’ he said.
‘When one of them slows, the effects reach from rail networks and online payments to digital classrooms and hospital systems. It’s a reminder that the cloud isn’t invisible – it’s the infrastructure modern life runs on.’
Stephen Kelly, CEO of Cirata, said: ‘The harsh reality is that while AWS has established practices such as distributed architecture and isolation to minimise the scope of failures, the system will never be totally immune to large-scale outages.’
As well as Amazon, many companies rely on similar cloud services offered by Google and Microsoft to stay online.
Cloud computing technology needs to be diversified, Pieter Arntz, a senior researcher at the anti-virus service Malwarebytes, told Metro.
Only a couple of decades ago, most companies had their own data centres. Governments, he argued, should focus more on building internet infrastructure to be used locally, rather than outsourcing to services based thousands of miles away in Virginia.
‘Digital sovereignty is no longer a distant goal without urgency,’ Arntz said, ‘it is now viewed as a requirement for security, stability, and trust in the modern internet.’