Cascading risk and mitigation lessons stemming from the AWS outage

Organizations are threaded together in a web of digital infrastructure. It sounds pretty until something gets stuck. Our experts weigh in on strategies that can stem the fallout.

When Amazon Web Services recently experienced domain name system (DNS) failures, organizations around the world reported cascading outages that affected their services as portions of the web became unresponsive.

The outage exposed (yet again) our deepening reliance on what we’ve come to call hyperscalers – the major cloud providers that collectively power most of the world’s cloud computing. And it underlined the significant operational, financial, and reputational risks such disruptions carry, even though (by now) they should no longer come as a surprise.

How can these risks be mitigated, who needs to be involved, and how can organizations put themselves on a stronger footing for handling the ripple effects of these events?

We asked a few experts for their ideas.

First, a little primer

The DNS is a simple look-up process that translates the written name of any web service to its IP address. When DNS breaks, there is no way to look up the address, so the requests fail, and the service itself sees no traffic arriving. Put another way, we rely on DNS to translate website names to IP addresses so our browsers and other applications can reach them.
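To make the look-up step concrete, here is a minimal Python sketch using the standard library's socket.getaddrinfo. The function name resolve and the stand-in failing_lookup are hypothetical, introduced only for illustration; this is not how AWS or any particular service performs resolution.

```python
import socket


def resolve(hostname, lookup=socket.getaddrinfo):
    """Return the sorted IP addresses for hostname, or [] if the look-up fails."""
    try:
        results = lookup(hostname, None)
    except socket.gaierror:
        # The name could not be translated, so there is no address to
        # connect to and the request never reaches the service.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the IP address as a string.
    return sorted({entry[4][0] for entry in results})


def failing_lookup(hostname, port):
    """Hypothetical stand-in that simulates a broken DNS service."""
    raise socket.gaierror("simulated DNS failure")
```

Calling resolve("service.example", lookup=failing_lookup) returns an empty list: the service behind the name may be perfectly healthy, but without an address no request can be sent to it.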

For anyone using a computer, this might mean you can’t buy something, bank online, or retrieve data you need that is stored on an external site.

When AWS went haywire last month, it was the “DNS endpoint for DynamoDB in the AWS us-east-1 region” that was not working, as AWS reported on its consistently updated status page.

The problem here was that many internal AWS services use DynamoDB to store their information, so a great number of organizations were affected. They included United and Delta Airlines, Robinhood, Lyft, DoorDash, Coinbase, Venmo, Reddit, Microsoft Teams, Zoom, Slack, T-Mobile, Spotify, Verizon, Snapchat, Hulu, and Netflix; several banks, such as Lloyds and Bank of Scotland; and businesses that rely on AWS for internal operations, such as Amazon.com and its Alexa and Ring products.

Many sites were back online within a few hours, although the internal Amazon ones took longer. The company’s last update at 6:53pm ET noted that “all AWS services returned to normal operations” shortly after 6pm ET.

AWS is the leading provider of cloud infrastructure technology, accounting for about a third of the market.

Before and while the cloud is faltering, how can your organization best manage its internal and external risks?

Multi-cloud environments and disaster recovery

John Benjamin is co-chair of the Duane Morris Technology, Media and Telecom industry group and works from the firm’s London office as a partner in the firm’s Intellectual Property practice.

He emphasized the importance of having a multiple-cloud environment that uses public or private cloud services from more than one cloud provider to support the organization’s applications.

Benjamin said that organizations might want to consider cloud providers of different sizes and avoid relying on the same hyperscaler in every region where they operate.

“This approach allows organizations to get services from different providers and avoid single-vendor outages,” he explained.
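As a rough illustration of the multi-provider idea, the Python sketch below tries an ordered list of providers and falls back to the next one when a call fails. The names call_with_failover, primary_down and backup_ok are hypothetical placeholders, and a real deployment would also have to handle data replication, timeouts and health checks.

```python
def call_with_failover(providers):
    """Try each (name, fetch) pair in order and return the first success.

    providers is an ordered list of (provider_name, zero-argument callable).
    Returns (provider_name, result). Raises RuntimeError if every one fails.
    """
    errors = {}
    for name, fetch in providers:
        try:
            return name, fetch()
        except Exception as exc:
            # Record the failure and move on to the next provider.
            errors[name] = exc
    raise RuntimeError(f"all providers failed: {sorted(errors)}")


def primary_down():
    """Hypothetical primary provider that is currently unreachable."""
    raise ConnectionError("primary provider unavailable")


def backup_ok():
    """Hypothetical backup provider that answers normally."""
    return "ok"
```

With providers set to [("primary", primary_down), ("backup", backup_ok)], the call returns ("backup", "ok"). If the backup were hosted with the same vendor as the primary, both calls could fail together, which is the single-vendor exposure Benjamin describes.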

He also reminded businesses to maintain and regularly update their disaster recovery plans, which an external advisory service provider could help with, to ensure the business follows certain protocols when large-scale disruptions occur.

“This plan should remain flexible, but it needs to anticipate a wide array of scenarios since these disruptions are fairly commonplace now,” he said. Those scenarios should include cloud provider DNS-related issues, other web service interruptions, malware attacks and more, he added.

Benjamin also recommended that organizations look into business interruption insurance, examining closely any potential carve-outs in those policies for these types of interruptions or for events like cyberattacks.

As for who should be involved in crafting the disaster response and managing risk in this area, Benjamin said the security team needs to own it, but the information-sharing and strategy development must include a variety of voices and areas of expertise.

“Your security team needs to have a good information technology framework that understands the data map of services offered by the organization and the various service providers it employs. And then others with specific remits and lines of sight into specific data need to be consulted, such as the Privacy team, top executives who can weigh in on strategy and risk tolerance, plus your internal and external communications and public relations teams that craft the right messaging around these events and your organization’s approaches,” Benjamin said.

He noted that Sales and Customer Support employees have a role here too, as they are also key to helping with messaging to current, valued clients.

Redundancy and the ‘fusion center’

Michael Hussey, Vice President of the Cyber Threat Intelligence Group at the Bank of New York, said redundancy is the key word here.

“Backup providers in these situations are key,” he said. “This depends on whether backup providers are available for a particular service, but they are integral.”

Hussey also said that for these incidents to be rapidly detected, understood and escalated appropriately, a cyber-technology “fusion center” is needed – a framework for breaking down silos and establishing clear responsibilities. This is particularly true in financial services, he said.

“The team in a fusion center monitors service levels and identifies an incident proactively in response to customers or employees reporting issues. Cybersecurity teams are then engaged for proactive monitoring and external collection, particularly if this is a cyberattack issue,” he said.

And at intervals, “war games” that involve all relevant teams in a simulated incident can help prepare for these types of events. “Ideally war games lead to mutually understood standard operating procedure. It’s essential to know the right points of contact within each team involved,” Hussey advised.

I asked Hussey about the role of the service provider whose operations malfunctioned, triggering the cascading effects to other entities. He said these providers not only have the technical job of remediating an incident, “they have a responsibility to communicate and socialize as much information as possible to customers, and to communicate often and be truthful.”

Hussey said that part of this communication must include the cause of the outage as soon as the provider is certain, because this AWS outage would require a different response, obviously, than a malicious event like a cyberattack would.

“You need to be able to access the right response playbook as soon as possible,” Hussey said.

In New York

The New York Department of Financial Services’ (NYDFS) Cybersecurity Regulation, 23 NYCRR Part 500, was first enacted in 2017 and has since been updated. Significant amendments require compliance with enhanced provisions, including multi-factor authentication, by November 1, just this week.

When it comes to appreciating the risk that third-party service providers bring to the table, the regulation has a lot to say, too.

I asked Maria Vullo, who was NYDFS superintendent at the time the Part 500 rules were put into place, about the state’s posture toward business interruptions, like the one caused by the AWS issues.

“When a covered entity is using third-party services that impact business operations, covered entities must have detailed policies and procedures to address, among other things, business interruptions that could result from a cyber event. Here, the applicability of Part 500 would depend on whether AWS’s outages constitute a cyber event,” she said.

“But even if it isn’t a cyber event, under risk management practices in general, covered entities must be prepared to address events that could interrupt services to customers and must have sufficient procedures, such as backup services, to respond promptly to these events,” Vullo said.

Vullo reminded us that risk management includes the concept of concentrated risk. “So, if one vendor is the sole provider, such that the entire system of the covered entity rises and falls with that provider’s outage for any reason, then the covered entity has an inadequate risk-management program,” she said.

“Business continuity and risk management require considerations of such future events,” she added.
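One way to picture the concentrated-risk idea is a simple check over a map of services and the providers able to run them. The Python sketch below is only an illustration under assumed inputs: the function concentration_report and the service and provider names in the usage note are hypothetical, and nothing here reflects a method prescribed by Part 500.

```python
from collections import Counter


def concentration_report(service_map):
    """Flag services that depend on exactly one provider.

    service_map maps each business service to the list of providers
    that can run it. Returns (sole, counts): sole maps each
    single-provider service to that provider, and counts tallies how
    many such services each provider carries.
    """
    sole = {
        service: providers[0]
        for service, providers in service_map.items()
        if len(set(providers)) == 1
    }
    counts = Counter(sole.values())
    return sole, counts
```

For a hypothetical map such as {"payments": ["cloud-a"], "login": ["cloud-a", "cloud-b"], "reports": ["cloud-a"]}, the report flags payments and reports as resting on cloud-a alone, while login has a second provider to fall back on.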

Communicate soon and honestly

Susan Peters, President and Founder of Greybridge PR, a New York-based public relations and communications firm that advises professional services companies and the legal industry, added some pointers about handling the crisis from a communications standpoint.

“Having a crisis communications plan prepared in advance with pre-approved messaging can help,” Peters advised.

“It’s important to communicate, and even over-communicate, frequent updates and timelines to everyone for how a company will be resolving the issues as quickly as possible; to be empathetic and apologize sincerely; and to explain in detail what measures you are taking to prevent a recurrence,” she added.