A faulty content update to CrowdStrike's Falcon Sensor product caused an IT outage that affected millions of Windows hosts across multiple industries worldwide.
Let’s talk about the cause of the CrowdStrike issue, what unscathed companies did right, and what professionals have to say about preventing this from happening again.
What Caused the Software Issue: Lax Software Testing Processes or More?
Many believe adequate software testing would have prevented this catastrophe. Others have concluded that multiple layers of bugs caused the issue, a kind of failure that is harder to catch in a fully automated testing system.
Even testing for one minute would have discovered these issues … In my mind, that one minute of testing would have been acceptable. – Kyler Middleton, senior principal software engineer at Veradigm
Testing continues to be a significant point of friction [in application development]…Software quality governance requires automation with agile, continuous quality initiatives in the face of constrained QA staff and increasing software complexity…Software testing, both for security and quality, appears to be among the most promising uses for generative AI in other IDC surveys…I am hopeful that the next few years will see improvements in these statistics…However, AI can’t fix the lack of or failure to follow policy and procedures. – IDC analyst Katie Norton
The CrowdStrike flaw was caused by multiple layers of bugs. That includes a content validator software testing tool that should have detected the flaw in the Rapid Release Content configuration template — an indirect method that, in theory, poses less of a risk of causing a system crash than updates to system files themselves …This is a challenge in fully automated systems because they, too, rely on software to progress releases from development through delivery … If there’s a bug in the software somewhere in that CI/CD pipeline … it can lead to a situation like this. So to discover the testing bug in an automated way, you’d have to test the tests. But that’s software, too, so you’d have to test the test that tests the tests and so on. – Gabe Knuth, analyst at TechTarget’s Enterprise Strategy Group.
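One practical way to take a single step along the "test the tests" chain Knuth describes is to run the validator itself against fixtures whose correct verdict is already known. The sketch below is illustrative only: `validate_template`, its field-count rule, and the fixture shapes are hypothetical and are not CrowdStrike's actual validator or template format. It does not end the regress, but it adds one independent check that a validator still rejects what it is supposed to reject.

```python
from typing import Callable, Dict, List

Template = Dict[str, object]  # hypothetical shape of a content template


def validate_template(template: Template) -> bool:
    """Hypothetical rule: the declared field count must match the fields supplied."""
    fields = template.get("fields")
    expected = template.get("expected_field_count")
    return isinstance(fields, list) and isinstance(expected, int) and len(fields) == expected


def validator_self_check(
    validator: Callable[[Template], bool],
    known_good: List[Template],
    known_bad: List[Template],
) -> List[str]:
    """Run the validator on fixtures with known verdicts and report disagreements."""
    failures = []
    for i, template in enumerate(known_good):
        if not validator(template):
            failures.append(f"rejected known-good #{i}")
    for i, template in enumerate(known_bad):
        if validator(template):
            failures.append(f"accepted known-bad #{i}")
    return failures


# Fixtures with known verdicts (made up for illustration).
GOOD = [{"expected_field_count": 2, "fields": ["a", "b"]}]
BAD = [{"expected_field_count": 3, "fields": ["a", "b"]}]

# A healthy validator produces no failures; a validator that silently
# accepts everything is flagged before it can gate a release.
print(validator_self_check(validate_template, GOOD, BAD))
print(validator_self_check(lambda template: True, GOOD, BAD))
```

A pipeline could refuse to ship any content while this self-check reports failures, so a broken validator would block releases instead of waving them through.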
How Some Companies Went Unscathed
Not every company that got the blue screen of death had to shut down. Some had procedures in place that helped them recover relatively quickly.
We’ve really focused on business continuity, redundancies, safety nets, and understanding of the difference between cybersecurity as a task and cybersecurity as a cultural commitment of your organization…It’s a validation of our investments while so many of our peers were languishing…The redundancies are numerous…They’re not necessarily terribly sophisticated, but we have literally gone through and said, ‘What are the critical systems of our organization? What is the interplay between them? And if it comes crashing down, what is the plan?’…The reality for cybersecurity and business continuity is the work [must be] done well ahead of the disaster. It has to be part of the fabric of your company, like compliances, like customer service…It’s hard to celebrate cybersecurity—except for the days when you’re the only ones not sweating it. – Andrew Molosky, president and CEO of Tampa-based Chapters Health System
Professionals’ Input on Preventing a Repeat
Everyone wants to avoid a repeat. Below is some advice from professionals on preventing this from happening again.
Phased Check-ins on Endpoint Health
I’m incredibly surprised, even though they call it ‘Rapid Response,’ that [CrowdStrike] doesn’t have some phased approach that allows them to check in on the health of the endpoints that have been deployed … Even with some logical order of customer criticality, they could have circuit breakers to stop a deployment early that they see causes health issues. For example, don’t [update] airlines until your confidence level is higher from seeing the health of endpoints from other customers. – Andy Domeier, senior director of technology at SPS Commerce
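The phased approach Domeier describes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how CrowdStrike or any vendor actually deploys: the ring names, the `deploy` and `healthy_fraction` callbacks, and the 99% threshold are all hypothetical. Rings are ordered from least to most critical, and a circuit breaker halts the rollout as soon as one ring's endpoints report poor health.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Ring = Tuple[str, Sequence[str]]  # (ring name, host IDs); both hypothetical


@dataclass
class RolloutResult:
    completed_rings: List[str]
    halted_at: Optional[str]  # ring where the breaker tripped, or None


def phased_rollout(
    rings: Sequence[Ring],
    deploy: Callable[[Sequence[str]], None],
    healthy_fraction: Callable[[Sequence[str]], float],
    min_healthy: float = 0.99,  # assumed threshold, tune per environment
) -> RolloutResult:
    """Deploy ring by ring; stop before the next ring if health drops."""
    completed: List[str] = []
    for name, hosts in rings:
        deploy(hosts)
        if healthy_fraction(hosts) < min_healthy:
            # Circuit breaker: later (more critical) rings are never touched.
            return RolloutResult(completed, halted_at=name)
        completed.append(name)
    return RolloutResult(completed, halted_at=None)


# Simulated bad update: 60% of updated hosts stop checking in.
deployed: List[str] = []
rings = [
    ("internal-canary", ["c1", "c2"]),
    ("low-criticality", ["s1", "s2"]),
    ("airlines", ["a1", "a2"]),
]
result = phased_rollout(rings, deployed.extend, lambda hosts: 0.4)
print(result.halted_at)  # internal-canary
print(deployed)          # ['c1', 'c2']
```

In this simulation the breaker trips on the first ring, so the hosts in the "airlines" ring never receive the update. A real system would also need a wait period between rings so that health signals have time to arrive.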
Move Away from Auto-deploying Kernel Module Updates
It is absolutely irresponsible to auto-deploy a kernel module update globally without a health-mediated process or, at least, a recovery path at a lower level of the control plane … Something that remains functional even if the OS deployed on top crashes. – David Strauss, co-founder and CTO at Pantheon
Eliminate Unmanageable Endpoint Complexity
The Windows endpoint environment has reached the point of unmanageable complexity. A steady stream of updates and layering of security features has created a web of complexity that is difficult to manage or fix and therefore promotes risk. Moving Windows to the cloud and replacing the endpoint with a secure-by-design operating system, such as IGEL OS, can simplify management through centralization and aid in recovery should an outage or breach occur, saving millions of dollars in lost productivity. We have grown somewhat numb to the steady stream of data breaches. This latest incident of the shepherd turning on the metaphorical sheep it was protecting highlights that we must consider approaching this problem differently. The move to Windows 11 and the opportunity for cloud transformation, along with the proliferation of SaaS, are proven technologies that can enable a much more secure endpoint strategy. – Jason Mafera, Field CTO at IGEL
Platform, People and Process in Software Testing
It’s not sufficient to just have a great software platform. It’s not sufficient to have highly enabled developers. It’s also not sufficient to just have predefined workflows and governance. All three of those have to come together. – Dan Rogers, CEO at LaunchDarkly
Balance Security With Tight Deadlines
What you don’t want to have happen now is that you’re so worried about making software changes that you have a very long and protracted testing cycle and you end up stifling software innovation. – Dan Rogers, CEO at LaunchDarkly