Amazon.com’s cloud service, Amazon Web Services (AWS), returned to normal operations yesterday afternoon, the company said, after an Internet outage that caused global turmoil among thousands of sites, including some of the web’s most popular apps, such as Snapchat and Reddit.
Still, Amazon said some AWS services had a backlog of messages that would take a few hours to process.
AWS hosts applications and computer processes for companies around the world, and the disruption knocked workers from London to Tokyo offline and stopped others from carrying out everyday tasks such as paying hairdressers or changing their airline tickets.
On Monday afternoon, users had complained of lingering difficulties using services such as digital wallet Venmo and video-calling service Zoom.
It was the largest Internet disruption since last year’s CrowdStrike malfunction hobbled technology systems in hospitals, banks and airports, highlighting the vulnerability of the world’s interconnected technologies.
It was at least the third time in five years that AWS’s northern Virginia cluster, known as US-EAST-1, contributed to a major Internet meltdown.
Amazon did not address a request for more clarity about why that particular data centre keeps being impacted.
The problems stemmed from a fault involving what is known as the Domain Name System, or DNS, which translates Internet names into network addresses. The fault prevented applications from finding the correct address for AWS’s DynamoDB API, a cloud database relied upon to store user information and other critical data.
Earlier, AWS said the root cause of the outage was an underlying subsystem that monitors the health of its network load balancers, which distribute traffic across several servers.
The issue, AWS said, originated from within the ‘EC2 internal network’. EC2, short for ‘Elastic Compute Cloud’, is the Amazon service that provides on-demand cloud capacity within AWS.
Ken Birman, a computer science professor at Cornell University, said software developers need to build better fault tolerance.
He said AWS provides tools developers can use to protect themselves in the event of a problem at any one of the data centres in its sprawling network.
“When people cut costs and cut corners to try to get an application up, and then forget that they skipped that last step and didn’t really protect against an outage, those companies are the ones who really ought to be scrutinised later,” he said.