No Need to Panic: The Linux Kernel Panic CrowdStrike Issue
For about a week, I’ve noticed a sub-topic trend in the news articles and discussions regarding CrowdStrike’s infamous Blue Screen of Death (BSOD) issue on Windows devices: CrowdStrike crashing Linux. CrowdStrike, a leading cybersecurity company known for its endpoint protection platform, recently made headlines due to a Channel File update that caused global outages and Windows BSODs. In the wake of this incident, these articles often reference or are solely about the Linux kernel panics that CrowdStrike allegedly caused months before the Windows incident. This has led to some confusion and concern among my colleagues, with many asking if we need to proactively search for problematic Linux hosts to prevent similar kernel panic issues in our production environments.
The Widespread Misconception
It’s not hard to find news articles, blogs, or tweets briefly mentioning kernel panics reported as recently as June supposedly caused by CrowdStrike. These reports even suggest that the issue affected multiple Linux distributions, including Red Hat Enterprise Linux (RHEL) and Debian. Many articles are around a week old, surfacing as the CrowdStrike BSOD news cycle ended and authors were looking to report something new (and juicy).
I continue to receive messages on Teams and Slack from colleagues asking if I’ve seen various articles, whether we need to check Linux logs for specific issues, or if we should initiate another round of emergency production changes to address the situation. This ongoing concern highlights the lingering confusion and anxiety surrounding the issue.
So, I figured I’d do my best to compile details from my various messages and reports here and polish them in a way that is hopefully helpful to some and potentially entertaining to others. Is this another example of poor quality control for production changes to CrowdStrike’s Falcon sensor producing a new crash-causing bug?
The Core Issue
TL;DR: The issue was actually a bug in the Linux kernel, not in the CrowdStrike sensor. Specifically, it was a problem with the Linux kernel’s eBPF verifier, a component designed to protect the kernel from crashing. Ironically, this bug in the verifier itself caused the kernel to panic.
Many people in my life panic when they encounter a bug, so I don’t blame the Linux kernel.
The issue was patched in subsequent kernel releases.
If this is what you came for, you’re welcome. I value your time.
But even in the misconceptions, there is a kernel of truth (pun intended). Some CrowdStrike customers did experience kernel panics while running the sensor on various Linux distributions. For those interested in understanding how a protection mechanism intended to safeguard the kernel from crashing was impacted by this bug, and CrowdStrike’s involvement in identifying and addressing the issue, read on.
Understanding the CrowdStrike Falcon Sensor for Linux
To provide some context, the Falcon sensor for Linux uses extended Berkeley Packet Filter (eBPF) programs loaded from user space, eliminating the need for a kernel module. eBPF is an in-kernel virtual machine that allows the safe execution of programs in kernel space, providing powerful tracing and monitoring capabilities. This approach allows visibility into the system without risking instability on kernels that haven’t been certified to run with CrowdStrike. If User Mode isn’t supported, the sensor runs in Reduced Functionality Mode (RFM).
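For readers who haven’t worked with eBPF directly, here’s a minimal sketch of the general mechanism using the open-source bcc toolkit. This is not CrowdStrike’s sensor code, just an illustration of the same pattern the sensor’s User Mode relies on: a program is compiled in user space, handed to the kernel, checked by the eBPF verifier at load time, and only then allowed to run.

```python
# A minimal, generic illustration of loading an eBPF program from user
# space with the bcc toolkit. NOT CrowdStrike's sensor, just the same
# underlying mechanism that User Mode relies on.
from bcc import BPF

# A trivial eBPF program that logs a line whenever execve() is called.
program = r"""
int trace_exec(void *ctx) {
    bpf_trace_printk("execve observed\n");
    return 0;
}
"""

b = BPF(text=program)  # the in-kernel verifier checks the program at load time
b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="trace_exec")
b.trace_print()        # stream the trace output (Ctrl+C to stop)
```

Run as root, this prints a line for every process execution on the host. The important part for this story is that the verifier inspects the bytecode before the kernel agrees to run any of it.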
The sensor’s User Mode is automatically activated to ensure comprehensive coverage for a host where the kernel-mode sensor might otherwise switch to RFM. This measure is taken to avoid disrupting system stability.
User Mode was a great improvement over jumping straight to RFM when the current CrowdStrike sensor was not certified to run on the active kernel. I recall, more than a year ago, writing custom Python scripts to check sensor-kernel compatibility (a simplified sketch follows below), fighting to deploy the latest CrowdStrike sensor to cloud environments dynamically built with the latest kernel, and even resorting to kernel pinning when the other options did not work.
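To give a flavor of what that scripting looked like, here’s a simplified, hypothetical sketch. The certified-kernel list and function names here are mine, not CrowdStrike’s; the real compatibility data came from their documentation.

```python
# A hypothetical sketch of a sensor-kernel compatibility check. The
# SUPPORTED_KERNELS entries are illustrative placeholders only; in practice
# the certified list would come from CrowdStrike's documentation.
import platform

SUPPORTED_KERNELS = {
    "5.14.0-362.8.1.el9_3.x86_64",    # placeholder entries for illustration
    "5.14.0-427.18.1.el9_4.x86_64",
}

def sensor_certified(kernel: str) -> bool:
    """Return True if the active kernel is on the certified list."""
    return kernel in SUPPORTED_KERNELS

kernel = platform.release()
if sensor_certified(kernel):
    print(f"{kernel}: certified, expect full Kernel Mode")
else:
    print(f"{kernel}: not certified, expect User Mode or RFM")
```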
However, it was precisely this switch to User Mode on certain impacted kernel versions that triggered the kernel bug.
The RHEL Article and Kernel Bug
Many news articles have referred to a RHEL article “warning” customers about CrowdStrike causing these crashes. Unfortunately, the publicly available portion of this article, RHEL Solution 706803, is cut off before discussing the resolution, as that information is exclusive to paying subscribers.
I’ll avoid sharing the entirety of the content behind the paywall, but I will share some details that journalists, bloggers, and you may not have had access to, or at least didn’t include.
The solution was to use a RHEL-patched kernel: kernel-5.14.0-427.18.1. For non-Red Hat customers, information about the patched kernel release is available in RHEL Security Advisory 2024:3306.
The Technical Details
The issue occurred in a customer’s environment running kernel 5.14.0-427.13.1. Red Hat released a patch in kernel-5.14.0-427.18.1. A BPF upstream patch that was expected to be included in the kernel wasn’t, and Red Hat corrected this mistake.
CrowdStrike identified the issue on May 15th and informed Red Hat about the problem, which affected everything from kernel 5.14.0-410 onward for RHEL 9.4.
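If you want to sanity-check a RHEL 9.4 host against that window, something like the following rough sketch works. The release-string parsing assumptions are mine, not from any official tooling.

```python
# A rough sketch for checking whether a RHEL 9.4 host falls in the impacted
# window: builds from 5.14.0-410 up to, but not including, the patched
# kernel-5.14.0-427.18.1. The release-string format assumptions are mine.
import platform
import re

FIRST_AFFECTED = (410,)
PATCHED = (427, 18, 1)

def build_numbers(release: str):
    """Extract build numbers, e.g. '5.14.0-427.13.1.el9_4.x86_64' -> (427, 13, 1)."""
    m = re.match(r"5\.14\.0-(\d+)(?:\.(\d+))?(?:\.(\d+))?", release)
    if m is None:
        return None  # not a 5.14.0-based RHEL kernel at all
    return tuple(int(g) for g in m.groups() if g is not None)

def possibly_impacted(release: str) -> bool:
    build = build_numbers(release)
    return build is not None and FIRST_AFFECTED <= build < PATCHED

release = platform.release()
verdict = "possibly impacted" if possibly_impacted(release) else "outside the affected range"
print(f"{release}: {verdict}")
```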
You can see the commit where the issue was introduced on the Linux Kernel Project’s open-source GitHub. Similarly, you can see when and how the issue was fixed; here’s the fixing commit in the Linux Kernel Project. Unusually for the Linux kernel, this significant issue took time to identify, so some customers experienced it on their systems, prompting CrowdStrike’s investigation.
Not many users experienced this issue, however. Thus, you probably would never have heard about it if the Windows BSOD craze hadn’t popularized the discussion (and ranting) about poor CrowdStrike code control.
The Irony of it All
I’d like to take a moment to really emphasize the following: kernel/bpf/verifier.c, a component designed explicitly to protect systems from harm and instability, became the very source of kernel panics. The verifier’s primary purpose is to act as a gatekeeper, meticulously examining eBPF programs to ensure they won’t compromise system integrity or security. It’s meant to be the safeguard that allows the powerful eBPF technology to be used safely within the kernel, thus enabling options like User Mode. However, in this case, a bug in the verifier itself led to the very issue it was created to prevent.
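To make that gatekeeper role concrete, here’s a toy example (again with bcc, and unrelated to the actual bug) of the verifier doing its job: the program asks a helper to write 64 bytes into a 4-byte stack buffer, so the verifier refuses to load it rather than letting it corrupt kernel memory.

```python
# A toy demonstration of the verifier rejecting an unsafe program. This is
# unrelated to the actual bug; it just shows the gatekeeper behavior that
# kernel/bpf/verifier.c normally provides.
from bcc import BPF

unsafe = r"""
int bad(void *ctx) {
    char buf[4];
    // Asks a helper to write 64 bytes into a 4-byte stack buffer;
    // the verifier flags this at load time instead of letting it run.
    bpf_probe_read_kernel(buf, 64, ctx);
    return 0;
}
"""

try:
    b = BPF(text=unsafe)
    b.load_func("bad", BPF.KPROBE)  # the verifier runs here and rejects the program
except Exception as exc:
    print(f"Verifier rejected the program, as designed: {exc}")
```

When the verifier itself misbehaves, as it did in the buggy kernel builds, that safety net is exactly what fails.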
CrowdStrike’s Response
CrowdStrike later changed their sensor to stop it from running in User Mode on all major versions of the affected kernels, preventing additional users from experiencing this issue.
Again, CrowdStrike proactively notified customers and Red Hat about the kernel issue via a Tech Alert on May 15th. Red Hat responded by releasing its own advisory and patch on May 23rd.
Red Hat and CrowdStrike directed customers to run patched kernel versions. I personally did not experience the impact of this fix firsthand at my day job, but I did report on it during an intra-department meeting discussing vulnerability issues.
Testing and Responsibility: A Balanced Perspective
Now, some readers may wisely be thinking, “Well, Christian. Maybe they didn’t author the commit that introduced the bug, but they still shipped the changes to customers, and it made it past their testing. They made the bug go boom!” This is indeed a valid consideration. Even if the issue stemmed from a problem in the open-source kernel rather than the sensor itself, we would expect testing to identify such incompatibilities before changes are shipped to customers.
While we can’t know the full extent of CrowdStrike’s testing procedures, we can observe some of their actions. The sensor release notes for the impacted version did include a known-issues section detailing the risk of system crashes when the Falcon sensor ran in User Mode on some Oracle 9 kernels. This release modified the Linux sensor to prevent it from loading into User Mode on the affected Oracle 9 kernels.
It’s also worth noting that, regarding the kernels mentioned in the widely cited RHEL article about CrowdStrike triggering the bug, RHEL 9.4 was not a Falcon-supported distribution at the time.
In fact, customers running certified, supported kernels should not require User Mode, as their sensor should operate in the normal, full-featured Kernel Mode. However, given the challenges I mentioned earlier, such as kernel pinning and other compatibility issues, the User Mode (eBPF-based) option has become highly desirable for enterprise environments.
In the end, CrowdStrike didn’t offer a package update to fix the bug because the bug wasn’t in their Linux sensor. Instead, they informed their customers about the affected kernel versions, notified Red Hat of the issue (and potentially others), and modified their Linux sensor to avoid running in User Mode, thus preventing it from triggering the kernel bug.
Lessons Learned
What we can learn from this incident is that CrowdStrike implemented multiple systems and safeguards to prevent kernel instability. Unfortunately, some users experienced a crash from a kernel bug that was challenging to trigger and detect while running the Falcon sensor in a mode designed to prevent system instability when the sensor and kernel are not verified to work together. It’s important to note that the average reader is likely unaffected unless they are running an impacted kernel version, as described in CrowdStrike’s Tech Alert.
This situation serves as a reminder of the complex interactions between security software and operating systems, and the need for continuous collaboration between vendors and Free and Open Source Software (FOSS) contributors. It also highlights the importance of thorough testing and the challenges of identifying rare but significant bugs in complex systems.
Takeaway
Seeing all this news about CrowdStrike, particularly the Linux aspects, has taught me to approach even articles from reputable publications with a critical eye. While I wouldn’t claim to be an expert, I dealt with this issue in my day job and had to do my own digging months ago after reading a technical article from CrowdStrike. This experience underscores the importance of verifying information, especially in the fast-paced world of cybersecurity, where misunderstandings can quickly spread.
Mike and I discussed this a little in our previous article about the Windows BSOD issue, which was caused by a bug in CrowdStrike’s Falcon sensor. You can read our CrowdStroke: The Failed Security Update that Ruined Our Friday here.