Cloudflare, as usual, published a detailed post-mortem on their site explaining the root cause of the November 18 outage, which was related to their Bot Management module. Yes, the same module responsible for those CAPTCHA-style checks we see whenever we visit a Cloudflare-protected site. That’s why many users were seeing bot challenge errors on ChatGPT yesterday.
So, how does it work? Cloudflare runs a query to extract specific features into a feature file, which is then fed into the Bot Management module. The module uses those feature sets with machine learning models to generate the bot score. Based on that score, it decides whether a request is legitimate or from a bot. Since bot behavior changes rapidly, Cloudflare runs this feature extraction query every few minutes to keep the system updated.
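To make the pipeline concrete, here is a minimal sketch of a feature file being parsed and combined into a bot score. All names (`Feature`, `parse_feature`, `bot_score`) and the file format are illustrative assumptions, not Cloudflare's actual implementation:

```rust
// Hypothetical sketch: a periodically refreshed feature file is parsed
// and its features are combined into a bot score.

#[derive(Debug)]
struct Feature {
    name: String,
    weight: f64,
}

/// Parse one "name,weight" line from the feature file (assumed format).
fn parse_feature(line: &str) -> Option<Feature> {
    let (name, weight) = line.split_once(',')?;
    Some(Feature {
        name: name.trim().to_string(),
        weight: weight.trim().parse().ok()?,
    })
}

/// Combine feature weights into a single score; in a real system this
/// would be a machine learning model, not a simple sum.
fn bot_score(features: &[Feature]) -> f64 {
    features.iter().map(|f| f.weight).sum()
}

fn main() {
    let file = "ja4_fingerprint, 0.4\nheader_order, 0.3";
    let features: Vec<Feature> = file.lines().filter_map(parse_feature).collect();
    println!("score = {}", bot_score(&features));
}
```

The request is then classified as legitimate or bot traffic depending on whether the score crosses a threshold.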
Now, what actually went wrong?
A single line of Rust code ended up changing the behavior of the underlying ClickHouse query that generates the feature file. As a result, the file began containing a massive number of duplicate feature rows, effectively doubling its size. This oversized feature file was then fed into the Bot Management system, which had a strict maximum size limit. According to Cloudflare, the limit is set to 200, well above their current use of ~60 features. But the oversized file contained more than 200 features. Since the file exceeded that limit, and Cloudflare hadn’t implemented any graceful handling for this scenario, the software panicked and triggered yesterday’s outage.
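The failure mode can be illustrated with a small sketch. This is a reconstruction of the pattern, not Cloudflare's actual code: a hard limit of 200 features, a loader that returns an error when it's exceeded, and a caller that unwraps that error:

```rust
// Illustrative reconstruction of the failure mode (not Cloudflare's code):
// a hard feature limit, enforced with an error the caller unwraps.

const MAX_FEATURES: usize = 200;

/// Load the feature file, refusing anything over the hard limit.
fn load_features(lines: &[&str]) -> Result<Vec<String>, String> {
    if lines.len() > MAX_FEATURES {
        return Err(format!(
            "feature file has {} entries, exceeding the limit of {}",
            lines.len(),
            MAX_FEATURES
        ));
    }
    Ok(lines.iter().map(|s| s.to_string()).collect())
}

fn main() {
    // Normal case: ~60 features, well under the 200 limit.
    let small = vec!["f1"; 60];
    let features = load_features(&small).unwrap();
    println!("loaded {} features", features.len());

    // With duplicated rows the file exceeds the limit, and the unwrap()
    // below would panic -- the pattern that took the service down:
    // let bad = vec!["f1"; 400];
    // let _ = load_features(&bad).unwrap(); // panics
}
```
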
The interesting part is not just that there was no graceful handling, but also that there was no fallback logic. Ideally, when encountering a malformed or suspicious feature file with duplicated entries, the system could have logged a warning, discarded the bad file, and continued using the previous valid one. But none of that was implemented. It was simply an unwrap() followed by a panic.
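A sketch of what that fallback logic could have looked like, under the same assumed 200-feature limit (the `FeatureStore` type and its `refresh` method are hypothetical): reject the bad file with a warning and keep serving from the last known-good one.

```rust
// Hypothetical fallback handling: validate a refreshed feature file and
// keep the previous valid one if validation fails, instead of panicking.

const MAX_FEATURES: usize = 200;

struct FeatureStore {
    current: Vec<String>, // last known-good feature set
}

impl FeatureStore {
    /// Accept a refreshed file only if it passes validation;
    /// otherwise log a warning and keep the last good version.
    fn refresh(&mut self, candidate: Vec<String>) {
        if candidate.len() > MAX_FEATURES {
            eprintln!(
                "warning: discarding feature file with {} entries (limit {}); keeping previous file",
                candidate.len(),
                MAX_FEATURES
            );
            return;
        }
        self.current = candidate;
    }
}

fn main() {
    let mut store = FeatureStore { current: vec!["f".to_string(); 60] };
    // An oversized (duplicated) file is rejected; traffic keeps flowing
    // on the previous 60-feature file rather than crashing.
    store.refresh(vec!["dup".to_string(); 400]);
    println!("serving with {} features", store.current.len());
}
```

The design choice here is simply to treat the feature file as untrusted input: a stale-but-valid feature set degrades bot detection slightly, while a panic takes down the whole proxy.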
Cloudflare shared the portion of code that was the source of that unhandled error in their post-mortem.
This whole incident shows us how important it is to handle even the smallest edge cases in our code, because their impact can be far greater than we expect.
Thanks for reading.
