“Never push an update on a Friday,” a computer scientist told a BBC reporter after information technology outages rippled around the globe, disrupting air travel as well as hospital and emergency service systems, court proceedings, finance and banking services, and restaurants. More than 1,500 flights a day have been canceled over the last three days.
A single flawed security update, released Friday by the cybersecurity firm CrowdStrike, made its way onto an estimated 8.5 million electronic devices, knocking them out of service and causing costly delays, communication troubles, and technological headaches around the globe.
The rolling outages originated with a faulty update to a CrowdStrike product, the Falcon sensor, which is intended to catch potential communication between hackers and malicious software they may have installed on information technology (IT) systems. “That configuration was basically not ready to be put out, and the way it interacted with Microsoft products, in particular, the Windows operating system, caused the error that we’re seeing right now,” said Yameen Huq, a cybersecurity director at the Aspen Institute. Apple and Linux systems running CrowdStrike’s software remained unaffected. But when the update was rolled out to millions of Windows devices, those devices went offline, sparking mass confusion and a complex restoration process.
As technology specialists rush to get the millions of affected devices back online, a question remains: Is the global outage a one-off blip of the digital era to learn from, or could such failures become more common as digitalization advances?
Glitches caused by coding errors are nothing new, so what elevated this mistake from a minor disruption to chaos across continents? “It can come from a mixture of three big buckets,” Huq said. “Process, human capability or talent, and also the underlying technology. … What they’re going to be spending time now is to look at which of those forces played an impact here, and how big.” It’s too early to tell which bucket the mistake originated in. CrowdStrike traced the problem to a type of coding mistake known as a “logic error” that caused Windows systems to crash. “Looking at that particular process, and seeing at what part in the steps could we have caught this and potentially remediated it, is going to be pretty, pretty critical,” said Huq.
In an ideal scenario, CrowdStrike could have fixed its far-reaching mistake simply by rolling out a new update that corrected its predecessor’s logic error. And rebooting after CrowdStrike’s corrective update did bring some Windows users’ devices back online. But only some. “Many of the customers are rebooting the system and it’s coming up and it’ll be operational,” CrowdStrike CEO George Kurtz said in an interview. But for some less fortunate users, “it could be some time for some systems that won’t automatically recover.”
That is the core problem facing the resolution process: If devices don’t pick up the new update automatically, the fix likely must be applied by hand. “That’s the tricky part, right? This is, right now, kind of about as manual as an IT position can be,” Huq told The Dispatch. “If you’re experiencing a blue screen [error], which is a typical outcome of this error, it’s not easy to just go online, for example, right, and remediate that problem.”
CrowdStrike has said its “team is fully mobilized” and is “actively assisting customers,” and Microsoft also announced that hundreds of its specialists are working directly with customers to resolve the issue.
The technological disruption has had, and will continue to have, economic effects. Travel delays, with more than 1,500 flights a day canceled for three straight days, certainly inconvenienced customers who had critical and costly events planned. Who will end up footing the bill? “CrowdStrike will have insurance, Microsoft will have insurance, the airlines will have insurance,” said Betsy Cooper, director of the Aspen Institute’s Tech Policy Hub and the founding executive director of the University of California, Berkeley’s Center for Long-Term Cybersecurity. That being said, “I think it’s going to be extremely complicated to figure out where the legal liability will lie, and it will be many years of litigation down the road.”
There are also macroeconomic implications: A single update caused ripple effects that stretched across geographic borders and a range of production and service industries. The error illustrated how interconnected emerging technologies and the global economy have become. “One mistake by one company working with a major big tech organization can lead to huge ramifications across the globe, and one reason for that is that these systems are getting increasingly interconnected, so that a change in one can effect a change across many different industries and types of software,” Cooper told The Dispatch. “I think this sort of interruption is an inevitability in the future,” she said. “Preparation is the only thing we can really do to get ahead of it.”
If coding errors are inevitable, as Cooper suggests, how does one adequately prepare for them? By compartmentalizing, said Cooper. “You want to try to ensure that not all of your systems are dependent on one particular complex software,” she explained. To limit their exposure to that risk, companies could, for example, use different software for their financial services than for their data storage. “You ensure that, if there’s a problem with one system, the effects of it are cabined, and do not necessarily run throughout your entire organization.”
Some also blame the concentration of the technology industry in a handful of companies, suggesting that if there were more viable alternatives to Microsoft or CrowdStrike, the faulty update’s effects would not have been as far-reaching. “Today’s massive global Microsoft outage is the result of a software monopoly that has become a single point of failure for too much of the global economy,” George Rakis, executive director of NextGen Competition (an organization that opposes market consolidation in the tech industry), said in a statement. “For decades, Microsoft’s pursuit of a vendor lock-in strategy has prevented the public and private sectors from diversifying their IT capabilities.”
Would the tech industry be better off with a more diverse array of technological systems, much as biodiversity helps keep a single disease from wiping out an entire species? “There’s both costs and benefits to that,” said Huq. One benefit of having several large companies dominate the market, instead of a bevy of smaller entities, is simply scale: Larger companies have more resources, which means more time, attention, and investment in their services. “They’re going to be using software that ultimately has more eyes on it,” he said. But if a mistake does get past those eyes, its effects could be more wide-reaching. “Risky practices would obviously scale out more.”