Overnight, between July 18th and July 19th, 2024, Windows systems running CrowdStrike (NASDAQ: CRWD) software crashed and displayed the blue screen of death (BSOD). As people woke up in the morning, they discovered a far-reaching global outage of the services they rely on for their daily lives, such as healthcare, travel, fast food, and even emergency services.
The ramifications of this event will continue to be felt for weeks as businesses recover from the outage and investors come to terms with just how fragile global businesses are when it comes to technology and business operations.
How Did This Happen?
CrowdStrike (CS) pushed an update to all customers running the CS Falcon agent. The update file was defective (early reports pointed to a file full of null bytes, though CrowdStrike later attributed the crash to a logic error triggered by the file's content), and when the Falcon sensor loaded it, Windows crashed.
Rebooting the impacted systems did not resolve the issue because of the way CS Falcon works. Falcon has access to the inner workings of the operating system (the kernel), including memory, drivers, and registry entries, which is what allows CS to detect malicious software and activity. Because the agent loads early in the boot process, affected machines simply crashed again on every restart. The Falcon agent is also designed to receive updates automatically to keep its detections current. In this case, the update file was not properly tested and somehow made it through quality assurance and quality control before being pushed globally to all CS customers.
Additionally, CrowdStrike customers are clearly running CS Falcon on production systems and do not have processes in place to stage updates to CS Falcon in order to minimize the impact of failed updates.
Ripple Effects
This truly was a global outage, and the list of industries affected is far-reaching, a testament to CS's past success but also to the risk embedded in the software supply chain. Days after the initial outage, Delta Air Lines was still experiencing flight cancellations and delays because of impacts to its crew scheduling system. The list of impacted companies is long, but here is a short sample:
- Travel – United, Delta, American, major airports
- Banking and Trading – VISA, stock exchanges
- Emergency & Security Services – Some 911 services and ADT
- Cloud Providers – AWS, Azure
- Consumer – Starbucks, McDonald's, FedEx
Once the immediate fallout subsides, there will be plenty of finger-pointing at CrowdStrike for failing to properly test an update, but what this event clearly shows is a lack of investment by some major global companies in site reliability engineering (SRE), business continuity planning (BCP), disaster recovery (DR), business impact analysis (BIA), and proper change control.
If companies were truly investing in SRE, BCP, DR, and BIA beyond a simple checkbox exercise, this failed update would have been a non-event. Businesses would have simply executed their BCP/DR plan and failed over, or immediately recovered their critical services to get back up and running (which some did). Or, if they were running proper change control alongside immutable infrastructure, they could have immediately rolled back to the last good version with minimal impact.
More work needs to be done by all of these companies to improve their plans, processes, and execution when a disruptive event occurs.
Are global companies really allowing live updates to mission-critical software in production without proper testing? Production systems should be immutable: nothing changes in production unless the change is made in the CI/CD pipeline and then re-deployed.
Failed updates became an issue more than two decades ago when Microsoft introduced Patch Tuesday. Companies quickly figured out they couldn't trust the quality of the patches and instead tested them in a staging environment that mirrors production. While this may have created a short window of vulnerability, it came with the advantages of stability and uninterrupted business operations.
Modern IT operations teams (under names like Platform Engineering or Site Reliability Engineering) now design production environments to be immutable and somewhat self-healing. All changes must be made in code and then pushed back through dev, test, and staging environments to confirm that proper QA and QC are followed. This minimizes the impact of failed code pushes and will also minimize disruption from failed patches and updates like this one.
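To make that concrete, here is a minimal sketch of a ring-based promotion flow, assuming a simple dev, test, staging, production ordering. The `Update` type and the `deploy`/`health_check` hooks are hypothetical stand-ins for a real deployment system and monitoring stack, not any particular vendor's pipeline; the point is that a bad push fails loudly in an early ring and never reaches production.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Update:
    """A candidate change: a patch, agent update, or content push."""
    name: str
    version: str

# Hypothetical promotion order; a failure in an earlier ring blocks the later ones.
RINGS = ["dev", "test", "staging", "production"]

def promote(update: Update,
            deploy: Callable[[Update, str], None],
            health_check: Callable[[str], bool]) -> str:
    """Push an update through each ring, halting at the first failed health check.

    Returns the last ring the update reached successfully, or "rejected"
    if it never passed dev.
    """
    last_good = "rejected"
    for ring in RINGS:
        deploy(update, ring)
        if not health_check(ring):
            # Stop the rollout here; production is never touched by a bad update.
            print(f"{update.name} {update.version} failed in {ring}; halting rollout")
            return last_good
        last_good = ring
    return last_good

if __name__ == "__main__":
    # Toy stand-ins for a real deployment system and monitoring checks.
    def fake_deploy(u: Update, ring: str) -> None:
        print(f"deploying {u.name} {u.version} to {ring}")

    def fake_health_check(ring: str) -> bool:
        return ring != "staging"  # simulate a defect caught before production

    promote(Update("sensor-content", "2024.07.19"), fake_deploy, fake_health_check)
```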

SRE teams also closely monitor production environments against latency thresholds, availability targets, and other operational metrics. If the environment exceeds a specific threshold, the system raises alerts and attempts to self-heal, either by allocating more resources or by rolling back to the last known-good image.
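Here is a hedged sketch of what that threshold logic might look like in practice: if latency or error rate breaches its target, the controller first tries to add capacity, and if the environment is still unhealthy it rolls back to the last known-good image. The thresholds, metric names, and the `scale_out`/`rollback` hooks are illustrative assumptions, not any specific platform's API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Metrics:
    p99_latency_ms: float   # 99th percentile request latency
    error_rate: float       # fraction of failed requests (0.0 to 1.0)

# Illustrative SLO-style thresholds; real values come from the service's objectives.
LATENCY_THRESHOLD_MS = 500
ERROR_RATE_THRESHOLD = 0.01

def remediate(metrics: Metrics,
              scale_out: Callable[[], None],
              rollback: Callable[[], None],
              already_scaled: bool) -> str:
    """Choose a remediation step based on current operational metrics."""
    healthy = (metrics.p99_latency_ms <= LATENCY_THRESHOLD_MS
               and metrics.error_rate <= ERROR_RATE_THRESHOLD)
    if healthy:
        return "ok"
    if not already_scaled:
        scale_out()   # first attempt: allocate more resources
        return "scaled_out"
    rollback()        # still unhealthy: revert to the last known-good image
    return "rolled_back"

if __name__ == "__main__":
    scale = lambda: print("alert: scaling out")
    revert = lambda: print("alert: rolling back to last known-good image")

    print(remediate(Metrics(250, 0.001), scale, revert, already_scaled=False))  # ok
    print(remediate(Metrics(900, 0.050), scale, revert, already_scaled=False))  # scaled_out
    print(remediate(Metrics(900, 0.050), scale, revert, already_scaled=True))   # rolled_back
```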
Long-Term Effects
Setting aside the maturity of business and IT operations, there are some clear ramifications of this event.
I. Financial Impact and Disclosure
First, this event affected a wide variety of businesses and services on a global level. Some of the biggest impacts were felt by publicly traded companies, and as a result those entities may need to file a Form 8-K with the SEC to report a material event affecting their business.
Even though this wasn’t a cybersecurity attack, it was still an event that disrupted business operations, and companies will need to report the expected impact and loss accordingly.
CrowdStrike in particular will need to make an 8-K filing, not only because of the drop in its stock price, but for expected loss of revenue through lost customers, contractual concessions, and other tangible impacts to its business. On the day of the outage, CS stock was down over 10%, and by the following Monday morning it was down almost 20%. The stock has started to recover, but that is clearly a material event for investors.
II. Need for Enhanced BCP/DR Investment
Recent events, such as this one and the ransomware attack on UnitedHealth Group's Change Healthcare, have clearly shown that some businesses are not investing properly in BCP/DR. They may have plans on paper, but those plans still need to be fully tested, including rapidly identifying service degradation and executing recovery operations as quickly as possible.
The reality is this should have been a non-event and any business that was impacted longer than a few hours needs to consider additional investment in their BCP/DR plan to minimize the impact of future events.
CISOs need to work with the rest of the C-Suite to review existing BCP/DR plans and update them based on the risk tolerance of the business and the desired recovery time objective (RTO) and recovery point objective (RPO).
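For teams that want to make that review measurable, here is a small sketch of how a BCP/DR exercise could be scored against those objectives: compare the measured downtime and data loss from the test with the RTO and RPO the business has agreed to. The four-hour and fifteen-minute targets below are hypothetical examples, not recommendations.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class DrTestResult:
    downtime: timedelta    # time from outage declared to service restored
    data_loss: timedelta   # age of the most recent data that was recovered

@dataclass
class RecoveryObjectives:
    rto: timedelta         # recovery time objective
    rpo: timedelta         # recovery point objective

def meets_objectives(result: DrTestResult, targets: RecoveryObjectives) -> dict:
    """Report whether a DR exercise met the agreed RTO and RPO."""
    return {
        "rto_met": result.downtime <= targets.rto,
        "rpo_met": result.data_loss <= targets.rpo,
    }

if __name__ == "__main__":
    # Hypothetical targets: restore service within 4 hours, lose at most 15 minutes of data.
    targets = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(minutes=15))
    # Hypothetical test outcome: 6 hours of downtime, 10 minutes of data loss.
    result = DrTestResult(downtime=timedelta(hours=6), data_loss=timedelta(minutes=10))
    print(meets_objectives(result, targets))   # {'rto_met': False, 'rpo_met': True}
```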
III. Boards Need To Step Up
During an event like this, boards need to take a step back and remember their primary purpose is to represent and protect investors.
In this case, the sub-committees that govern technology, cybersecurity, and risk should be asking hard questions about how to minimize the impact of future events like this and considering whether the existing investment in BCP/DR technology and processes is sufficient to offset a projected loss of business.
This may include more frequent reporting on when BCP/DR plans were last properly tested and whether those plans account for the full range of scenarios that could impact the business, such as ransomware, supply chain disruption, or global events like this one.
The board may also push the executive staff to accelerate plans to invest in and modernize IT operations to eliminate tech debt and adopt industry best practices such as immutable infrastructure or SRE. The board may also insist on a detailed analysis of the risks of the supply chain, including plans to minimize single points of failure while limiting the blast radius of future events.
IV. Potential Negative Perceptions
Unfortunately, this event is likely to cause a negative perception of cybersecurity in the short term for a few different reasons.
First, people will question the obvious business disruption. How is it that a single update from a global cybersecurity company can disrupt so much? Could that same update mechanism serve as an attack vector?
Reports are already indicating that malicious domains have been set up to look like the fix for this event, but instead push malware. There are also malicious domains that have been created for phishing purposes, and the reality is any company impacted by this event may also be vulnerable to ransomware attacks, social engineering, and other follow-on attacks.
Second, this event may create a negative perception of automatic updates within IT operations groups. I personally believe this is the wrong reaction, but the reality is that some businesses will turn off auto-updates, which will leave them more vulnerable to malware and other attacks.

What CISOs Should Do
With all this in mind, what should CISOs do to help the board, the C-Suite, and the rest of the business navigate this event? Here are my suggestions:
First, review your contracts with third-party providers to understand the defined SLAs, liability, restitution, and other clauses that can help protect your business when an event is caused by a third party. This should also include a risk analysis of your entire supply chain to identify single points of failure and determine how to protect your business appropriately.
Second, insist on increased investment in your BIA, BCP, and DR plans, including designing for site reliability and unexpected events so you can proactively identify and recover from disruption, and revisiting your RTO and RPO. If your BCP/DR capability is not where it needs to be, it may require a multi-year technology transformation plan that resolves legacy systems and tech debt. It may also require modernizing your SDLC to shift to CI/CD with tightly controlled dev, test, staging, and production environments.
The ultimate goal will be to move to immutable infrastructure and IT operations best practices that allow your services to operate and recover without disruption.
Third, resist the temptation to overreact. The C-Suite and investors are going to ask hard questions about your business, and they will suggest a wide range of solutions such as turning off auto-updates, ripping out CS, or even building an in-house alternative. All of these suggestions carry clear tradeoffs in risk and operational investment, and making a poor, reactive decision immediately after this event can do more harm than good.
Finally, for mission-critical services, consider shifting to a heterogeneous environment that statistically minimizes the impact of any one vendor. The concept is simple: if you need security technology to protect your systems, consider purchasing from multiple vendors with similar capabilities. This minimizes the impact to your business operations if one of them has an issue.
This obviously raises the complexity and operational cost of your environment and should only be used for mission-critical or highly sensitive services that need to minimize any risk to operations. However, this event does highlight the risks of consolidating to a single vendor and you should conduct a risk analysis to determine the best course of action for your business and supply chain.
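As a rough back-of-the-envelope illustration of that tradeoff, the sketch below assumes each vendor independently carries some annual probability of a fleet-wide disruptive failure (the 5% figure is made up, not a real vendor statistic). Splitting mission-critical systems across two vendors roughly doubles the chance that something is affected in a given year, but it halves the blast radius of any single event and makes a simultaneous total outage far less likely.

```python
def fleet_risk(p_failure: float, num_vendors: int) -> dict:
    """Risk profile for a fleet split evenly across independent vendors.

    Assumes each vendor has the same annual probability of a fleet-wide
    disruptive failure and that vendor failures are independent.
    """
    return {
        # Chance that at least one vendor has a disruptive failure this year.
        "prob_any_impact": 1 - (1 - p_failure) ** num_vendors,
        # Chance that every vendor fails at once, taking down the entire fleet.
        "prob_total_outage": p_failure ** num_vendors,
        # Worst-case share of the fleet lost to any single vendor's failure.
        "single_event_blast_radius": 1 / num_vendors,
    }

if __name__ == "__main__":
    p = 0.05  # assumed 5% annual chance of a disruptive failure per vendor
    print("one vendor: ", fleet_risk(p, 1))
    print("two vendors:", fleet_risk(p, 2))
```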
The Wrap
For some companies, this was a non-event. Once they realized there was an outage, they simply executed their recovery plans and were back online relatively quickly.
For other companies, this event highlighted a lack of investment in IT operations fundamentals like BCP/DR or supply chain risk management. On the positive side, this wasn't a ransomware or other cybersecurity attack, so recovery is relatively straightforward for most businesses. On the negative side, the event can still cause lasting damage if businesses overreact and make poor decisions.
As a CISO, I highly recommend you take advantage of this event to learn from your weaknesses and make plans to shore up the aspects of your operations that are sub-standard.