Saturday, October 25, 2025

How Can CIOs Keep Operations Running During an Outage?


For hours on Monday, millions of users and more than 1,000 companies found themselves unable to use the internet services they depend on. Social media platforms Reddit and Snapchat were hit, as were banks Lloyds Bank and Halifax. Even children were affected, with popular games Fortnite and Roblox knocked offline. Sen. Elizabeth Warren (D-Mass.) took to X, describing the event as one that broke “the entire internet” and calling for a breakup of Big Tech.

“Networking is really a foundational component of AWS services,” said Corey Beck, director of cloud technologies at DataStrike and a former senior solutions architect at AWS. “When it stumbles in a region like US-East-1, the consequences go far beyond the network; they ripple through EC2, S3, DynamoDB, RDS, and just about every service that relies on them.”

But for many others, it was business as usual. That is because the outage affected only AWS customers, and specific ones at that. The source of the outage was a DNS failure in the AWS data center cluster known as US-EAST-1. It is the largest of the provider's clusters and one that powers much of AWS's internet-facing infrastructure, but not all of it. And businesses and individuals running on Microsoft or Google cloud services were not directly affected.

The outage sparked wide-ranging conversations, from the familiar narrative about overdependence on single providers to the need for better testing protocols before rollout. In an ideal world, disruption on this scale would never happen again. But CIOs can't rely on crossed fingers and dream scenarios. They need to determine what responsibility rests on their shoulders when it comes to weathering a future outage, and decide whether the speed and efficiency gains of using a single provider outweigh the concentration risk of relying on that major cloud vendor.


Redundancy vs. Risk

While politicians discussed monopolies and consumers complained about inaccessible websites, IT leaders saw the outage as a call for better redundancy. The argument is fairly clear: By building in backups and failover capacity, companies can avoid depending on any single point in their infrastructure. Not doing so, some experts argued, would be living on the edge.

“Gamblers might choose to risk a core business capability by running it in a risky way,” said Jon Brown, senior analyst for data protection, IT operations and sustainability at Omdia. “Personally, I would advise on the side of safety, as the failure of a poorly protected, high-profile, mission-critical application can lead to a resume-generating event, which most of us try to avoid. There's nothing more important than your customer and transaction data.”

This may seem obvious, but more than 1,000 companies still lost digital functionality on Monday. Why weren't they better prepared? One answer is that while redundancy isn't new, it also isn't very exciting. In a field full of innovation and growth, redundancy is about slowing down, checking your work and taking the safest route. It's not surprising if some companies are more enthusiastic about investing in new AI capabilities than implementing failsafe protocols. Nor is that necessarily wrong.

“Sometimes, the smarter play is to accept limited disruption risk and redirect resources toward innovation, like AI or data modernization,” argued Chris Hutchins, founder and CEO of Hutchins Data Strategy Consulting. “But it must be an informed risk, not an assumed one.”

According to Hutchins, if there are areas of the business that CIOs can afford to pause in the event of a rare outage, the rewards of single-sourcing (cost savings, tighter integration and specialized expertise) may outweigh the operational risk. Tiago Azevedo, CIO at OutSystems, agreed that this should be treated as a financial calculation, made case by case. Rather than a default requirement, he said, he sees redundancy as a targeted resilience investment. CIOs don't need to protect every inch of their business to the same degree, as long as the key areas are substantially reinforced.

“The level should reflect system criticality: Production or customer-facing systems merit multi-region or multi-provider coverage, while development and test environments can tolerate brief downtime,” he said. “The objective isn't to eliminate all risk but to align resilience spending with the potential cost of disruption.”
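Azevedo's rule of thumb can be sketched as a small calculation. The example below is a minimal illustration, not a method any of the quoted experts described: the `System` record, the tier names, the dollar figures, the expected outage hours and the share of downtime that redundancy avoids are all hypothetical assumptions chosen to show the arithmetic. Redundancy is funded only when the disruption cost it is expected to avoid exceeds its annual cost.

```python
from dataclasses import dataclass


@dataclass
class System:
    """A hypothetical inventory entry; all figures are illustrative estimates."""
    name: str
    production: bool
    customer_facing: bool
    outage_cost_per_hour: float  # assumed loss in dollars per hour of downtime


def redundancy_tier(system: System) -> str:
    """Map system criticality to a protection level (illustrative tiers)."""
    if system.production and system.customer_facing:
        return "multi-region or multi-provider"
    if system.production:
        return "multi-availability-zone"
    # Development and test environments can tolerate brief downtime.
    return "single region"


def worth_funding(system: System, expected_outage_hours: float,
                  share_avoided: float, annual_redundancy_cost: float) -> bool:
    """True if the disruption cost redundancy avoids exceeds what it costs."""
    expected_loss = system.outage_cost_per_hour * expected_outage_hours
    avoided_loss = expected_loss * share_avoided
    return avoided_loss > annual_redundancy_cost


checkout = System("checkout", production=True, customer_facing=True,
                  outage_cost_per_hour=50_000)
test_env = System("test-env", production=False, customer_facing=False,
                  outage_cost_per_hour=500)

# Assume 8 outage hours a year, 90% of which redundancy would avoid.
for s, cost in [(checkout, 120_000), (test_env, 20_000)]:
    print(s.name, redundancy_tier(s), worth_funding(s, 8, 0.9, cost))
# Prints:
# checkout multi-region or multi-provider True
# test-env single region False
```

With these made-up numbers, the checkout system avoids an expected $360,000 in losses for $120,000 of redundancy, while the test environment would spend $20,000 to avoid $3,600. Real estimates would need an organization's own outage history and cost data.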

Mapping Out the Mission-Critical

To determine where CIOs should direct redundancy efforts, IT leaders argued, organizations need an honest understanding of which parts of their infrastructure are truly fundamental to business operations. An outage can happen at any time, whether in internal systems or at any third-party provider, which means CIOs can't delay taking strategic action.

Over time, a company may be able to introduce more comprehensive redundancy across all of its infrastructure, but this might not make the most financial sense. As Hutchins put it, “redundancy that isn't tied to a clear recovery objective quickly becomes technical debt.” So it is critical that CIOs audit their business dependencies, identify single points of failure, and rank systems by their impact on operations and trust.

“It's important to invest where failure creates real risk, not just minor inconvenience or noise,” he added.
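The audit Hutchins describes can be sketched as a simple dependency analysis. This is a minimal illustration under stated assumptions, not a tool or method from the article: the inventory, component labels such as `aws:us-east-1:dynamodb`, and the criticality weights are hypothetical. Each system lists groups of interchangeable components; a group with only one member is a single point of failure, and components are ranked by the total criticality of the systems they would take down.

```python
from collections import defaultdict

# Hypothetical inventory: system -> (criticality weight, requirement groups).
# Each group is a set of interchangeable components; a system stays up only if
# at least one component in every one of its groups is available.
inventory = {
    "checkout": (10, [{"aws:us-east-1:dynamodb"},
                      {"aws:us-east-1:dns", "aws:us-west-2:dns"}]),
    "patient-portal": (8, [{"aws:us-east-1:dynamodb"}, {"aws:us-east-1:rds"}]),
    "internal-wiki": (1, [{"aws:us-east-1:rds"}]),
}


def single_points_of_failure(systems):
    """Return (component, impact score, affected systems), highest impact first."""
    affected = defaultdict(set)
    for name, (_weight, groups) in systems.items():
        for group in groups:
            if len(group) == 1:  # no alternative: this component alone can take the system down
                affected[next(iter(group))].add(name)
    scored = [
        (component, sum(systems[s][0] for s in names), sorted(names))
        for component, names in affected.items()
    ]
    # Rank by total criticality of the systems lost; break ties alphabetically.
    return sorted(scored, key=lambda row: (-row[1], row[0]))


for component, impact, systems_down in single_points_of_failure(inventory):
    print(f"{component}: impact {impact}, takes down {', '.join(systems_down)}")
# Prints:
# aws:us-east-1:dynamodb: impact 18, takes down checkout, patient-portal
# aws:us-east-1:rds: impact 9, takes down internal-wiki, patient-portal
```

This sketch only catches direct single points of failure. If two "alternatives" secretly share a dependency, such as the same regional DNS, that shared component has to be modeled explicitly, which is the kind of hidden coupling a regional outage can expose.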

This will look different for companies of different sizes, and especially for companies in different sectors. Some industries, such as healthcare or finance, require a higher level of redundancy across the board simply because the stakes are higher; losing access to patient records or financial information could have severe consequences for safety and public trust, far beyond inconvenience or frustration.

Brown called out organizations that are “born in the cloud” as particularly vulnerable, while Azevedo said he saw more pressure on “always-on” industries such as e-commerce. More heavily regulated industries, finance for example, may also face higher expectations for resilience and redundancy. The EU's Digital Operational Resilience Act (DORA), which began applying in January 2025, aims to ensure that financial entities can “withstand, respond to, and recover” from technology disruptions.

One Supplier, however Diversified Dependencies

In the wake of the AWS outage, critics were quick to call for diversification among internet infrastructure providers, arguing for stronger and more numerous competitors to AWS. And as part of their redundancy strategies, CIOs will need to examine how reliant they are on specific providers, so they can assess their risk in the event of an outage.

But this isn't as simple as tracing third-party contracts, counting how often one name appears, and shifting some operations away from overly dominant providers. If an organization has partnered predominantly with one provider, it's probably for good reason. As Hutchins explained, working with a single provider can accelerate innovation and simplify management, offering visibility, native integrations and unified tooling.

“The benefit is efficiency; the risk is dependency,” he said.

He added that he has no issue with CIOs continuing with single-provider strategies, as long as they govern them “with eyes wide open.” In practice, this may involve building portability into data, establishing exit and failover plans, and testing recovery outside the ecosystem.

Brown argued that the outage isn't really an indictment of relying on a single provider in the first place; if organizations had built redundancy into their single-provider ecosystems, they could have avoided most of this disruption. That's because a single provider doesn't have to mean a single dependency. By using different regions and availability zones, CIOs can spread their risk. After all, the AWS outage affected only US-EAST-1. Brown said he believes this approach delivers 99% of the resilience benefits while being significantly more practical and cost-effective than a multi-provider strategy.

“Cross-provider failover sounds great on paper, but it introduces substantial complexity,” he said. “The key is architecting for failure within your chosen ecosystem.”
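As a rough illustration of what architecting for failure within one ecosystem can mean, the sketch below picks the first healthy region from a preference list. It is a minimal, hypothetical example: `first_healthy_region` and `simulated_check` are invented for illustration, and a real health check would probe an actual regional endpoint rather than a stub. In practice, many teams rely on managed DNS failover with health checks (for example, Route 53 failover routing) instead of client-side logic like this.

```python
import socket
from typing import Callable, Sequence


def first_healthy_region(regions: Sequence[str],
                         is_healthy: Callable[[str], bool]) -> str:
    """Return the first region, in preference order, whose health check passes."""
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except OSError:
            # DNS lookup failures (socket.gaierror) and timeouts are OSError
            # subclasses; treat them as "unhealthy" and move on.
            continue
    raise RuntimeError("no healthy region available")


# Hypothetical check simulating a primary region whose endpoint won't resolve.
def simulated_check(region: str) -> bool:
    if region == "us-east-1":
        raise socket.gaierror("simulated DNS resolution failure")
    return True


print(first_healthy_region(["us-east-1", "us-west-2"], simulated_check))  # us-west-2
```

One design caveat: the health check itself should not depend on the region it is checking, or a regional failure can disable the very mechanism meant to route around it.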


