
Cloudflare outage on December 5, 2025

516 points - yesterday at 3:35 PM

Source
  • mixedbit

    yesterday at 10:54 PM

    This is an architectural problem. The Lua bug, the longer global outage last week, and a long list of earlier such outages only uncover the problem with the underlying architecture. The original distributed, decentralized web architecture, with heterogeneous endpoints managed by a myriad of organisations, is much more resistant to this kind of global outage. Homogeneous systems like Cloudflare will continue to cause global outages. Rust won't help; people will always make mistakes, in Rust too. A robust architecture addresses this by not allowing a single mistake to bring down a myriad of unrelated services at once.

      • tobyjsullivan

        yesterday at 11:51 PM

        I’m not sure I share this sentiment.

        First, let’s set aside the separate question of whether monopolies are bad. They are not good but that’s not the issue here.

        As to architecture:

        Cloudflare has had some outages recently. However, what’s their uptime over the longer term? If an individual site took on the infra challenges themselves, would they achieve better? I don’t think so.

        But there’s a more interesting argument in favour of the status quo.

        Assuming Cloudflare’s uptime is above average, outages affecting everything at once are actually better for the average internet user.

        It might not be intuitive but think about it.

        How many Internet services does someone depend on to accomplish something such as their work over a given hour? Maybe 10 directly, and another 100 indirectly? (Make up your own answer, but it’s probably quite a few).

        If everything goes offline for one hour per year at the same time, then a person is blocked and unproductive for an hour per year.

        On the other hand, if each service experiences the same hour per year of downtime but at different times, then the person is likely to be blocked for closer to 100 hours per year.

        It’s not really a bad end-user experience that every service uses Cloudflare. It’s more a question of why Cloudflare’s stability seems to be going downhill.

        And that’s a fair question. Because if their reliability is below average, then the value prop evaporates.
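The arithmetic behind this argument can be sketched as follows. The figures (100 services, one hour of downtime per service per year) come from the comment; the model itself is an assumption for illustration: outages are independent, land at random times, and any single outage blocks the user.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def blocked_hours_correlated(downtime_hours: float) -> float:
    """Every service fails at the same time, so the user is blocked once."""
    return downtime_hours


def blocked_hours_independent(n_services: int, downtime_hours: float) -> float:
    """Expected blocked hours when each service fails independently.

    Assumes the user is blocked whenever at least one service is down.
    """
    p_service_up = 1 - downtime_hours / HOURS_PER_YEAR
    p_all_up = p_service_up ** n_services
    return HOURS_PER_YEAR * (1 - p_all_up)


print(blocked_hours_correlated(1))
print(round(blocked_hours_independent(100, 1), 1))
```

Under these assumptions the independent case comes out slightly under 100 hours, because a few outages overlap by chance. The model says nothing about whether a centralized provider really achieves the same per-service downtime, which is the open question the comment ends on.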

        • WD-42

          yesterday at 10:57 PM

          In other words, the consolidation on Cloudflare and AWS makes the web less stable. I agree.

            • amazingman

              yesterday at 11:16 PM

              Usually I am allergic to pithy, vaguely dogmatic summaries like this but you're right. We have traded "some sites are down some of the time" for "most sites are down some of the time". Sure the "some" is eliding an order of magnitude or two, but this framing remains directionally correct.

                • PullJosh

                  yesterday at 11:27 PM

                  Does relying on larger players result in better overall uptime for smaller players? AWS is providing me better uptime than if I assembled something myself because I am less resourced and less talented than that massive team.

                  If so, is it a good or bad trade to have more overall uptime but when things go down it all goes down together?

                    • VorpalWay

                      today at 12:03 AM

                      From a societal view it is worse when everything is down at once. Leads to a less resilient society: It is not great if I can't buy essentials from one store because their payment system is down (this happened to one super market chain in Sweden due to a hacker attack some years ago, took weeks to fully fix everything, and then there was that whole Crowdstrike debacle globally more recently).

                      It is far worse if all of the competitors are down at once. To some extent you can and should have a little bit of stock at home (water, food, medicine, ways to stay warm, etc) but not everything is practical to do so with (gasoline for example, which could have knock on effects on delivery of other goods).

          • ivanjermakov

            yesterday at 11:26 PM

            Robust architecture that is serving 80M requests/second worldwide?

            My answer would be that no one product should get this big.

            • chickensong

              yesterday at 11:26 PM

              You're not wrong, but where's the robust architecture you're referring to? The reality of providing reliable services on the internet is far beyond the capabilities of most organizations.

              • cyanydeez

                yesterday at 11:41 PM

                Bro, but how do we make shareholder value if we don't monopolize and enshittify everything

            • jacobgkau

              yesterday at 8:45 PM

              I noticed this outage last night (Cloudflare 500s on a few unrelated websites). As usual, when I went to Cloudflare's status page, nothing about the outage was present; the only thing there was a notice about the pre-planned maintenance work they were doing for the security issue, reporting that everything was being routed around it successfully.

                • cnnlives265

                  yesterday at 9:28 PM

                  This is the case with just about every status page I’ve ever seen. It takes them a while to realize there’s really a problem and then to update the page. One day these things will be automated, but until then, I wouldn’t expect more of Cloudflare than any other provider.

                  What’s more concerning to me is that now we’ve had AWS, Azure, and Cloudflare (Cloudflare twice) go down recently. My gut says:

                  1. developers and IT are using LLMs in some part of the process, which will not be 100% reliable.

                  2. Current culture of I have (some personal activity or problem) or we don’t have staff, AI will replace me, f-this.

                  3. Pandemic after effects.

                  4. Political climate / war / drugs; all are intermingled.

                    • mikkupikku

                      yesterday at 9:34 PM

                      Management doesn't like when things like this are automated. They want to "manage" the outage/production/etc numbers before letting them out.

                        • kbolino

                          yesterday at 10:38 PM

                          There's no sweet spot I've found. I don't work for Cloudflare but when I did have a status indicator to maintain, you could never please everyone. Users would complain when our system was up but a dependent system was down, saying that our status indicator was a lie. "Fixing" that by marking our system as down or degraded whenever a dependent system was down led to the status indicator being not green regularly, causing us to unfairly develop a reputation as unreliable (most broken dependencies had limited blast radius). The juice no longer seemed worth the squeeze and we gave up on automated status indicators.

                            • jacobgkau

                              yesterday at 11:31 PM

                              > "Fixing" that by marking our system as down or degraded whenever a dependent system was down led to the status indicator being not green regularly, causing us to unfairly develop a reputation as unreliable (most broken dependencies had limited blast radius).

                              This seems like an issue with the design of your status page. If the broken dependencies truly had a limited blast radius, that should've been able to be communicated in your indicators and statistics. If not, then the unreliable reputation was deserved, and all you did by removing the status page was hide it.

                              • naniwaduni

                                yesterday at 11:59 PM

                                The headline status doesn't have to be "worst of all systems". Pick a key indicator, and as long as it doesn't look like it's all green regardless of whether you're up or down, users will imagine that "green headline, red subsystems" means whatever they're observing, even if that makes the status display utterly uninterpretable from an outside perspective.
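The two rollup policies debated in this subthread can be sketched side by side. This is a minimal illustration, not anyone's actual status page: the `Component` type, the `is_key_indicator` flag, and the component names are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Status(IntEnum):
    GREEN = 0
    DEGRADED = 1
    DOWN = 2


@dataclass(frozen=True)
class Component:
    name: str
    status: Status
    is_key_indicator: bool = False  # hypothetical flag for this sketch


def worst_of_headline(components):
    """Headline is the worst status of any component, dependencies included."""
    return max(c.status for c in components)


def key_indicator_headline(components):
    """Headline follows only the components marked as key indicators."""
    key = [c for c in components if c.is_key_indicator]
    if not key:
        raise ValueError("at least one key indicator is required")
    return max(c.status for c in key)


components = [
    Component("api", Status.GREEN, is_key_indicator=True),
    Component("email-notifications", Status.DOWN),  # limited blast radius
]
print(worst_of_headline(components).name)       # DOWN
print(key_indicator_headline(components).name)  # GREEN
```

With "worst of", one broken dependency turns the headline red; with a key indicator, the headline stays green while the subsystem row still shows the failure, which is the "green headline, red subsystems" display described above.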

                            • Yeri

                              yesterday at 9:36 PM

                              100% — will never be automated :)

                                • hnuser123456

                                  yesterday at 10:29 PM

                                  Still room for someone to claim the niche of the Porsche horsepower method in outage reporting - underpromise, overdeliver.

                          • TechniKris

                            yesterday at 10:04 PM

                            Thing is, these things are automated... Internally.

                            Which makes it feel that much more special when a service provides open access to all of the infrastructure diagnostics, like e.g. https://status.ppy.sh/

                              • rezonant

                                yesterday at 10:22 PM

                                Nice! Didn't know you could make a Datadog dashboard public like that!

                            • colechristensen

                              yesterday at 10:43 PM

                              >It takes them a while to realize there’s really a problem and then to update the page.

                              Not really, they're just lying. I mean, yes, of course they aren't oracles who discover complex problems the instant the first failure occurs, but naw, they know well when there are problems and significantly underreport them, to the extent that their status pages are less "smoke alarms" and more "your house has burned down and the ashes are still smoldering" alarms. Incidents are intentionally underreported. It's bad enough that there ought to be legislation and civil penalties for the large providers who fail to report known issues promptly.

                          • mrb

                            yesterday at 11:16 PM

                            The only way to change that is to shame them for it: "Cloudflare is so incompetent at detecting and managing outages that even their simple status page is unable to be accurate"

                            If enough high-ranked customers report this feedback...

                          • w10-1

                            yesterday at 5:47 PM

                            Kudos to Cloudflare for clarity and diligence.

                            When talking of their earlier Lua code:

                            > we have never before applied a killswitch to a rule with an action of “execute”.

                            I was surprised that a rules-based system was not tested completely, perhaps because the Lua code is legacy relative to the newer Rust implementation?

                            It tracks what I've seen elsewhere: quality engineering can't keep up with the production engineering. It's just that I think of CloudFlare as an infrastructure place, where that shouldn't be true.

                            I had a manager who came from defense electronics in the 1980's. He said in that context, the quality engineering team was always in charge, and always more skilled. For him, software is backwards.

                              • braiamp

                                yesterday at 6:20 PM

                                This is funny, considering that someone who worked in the defense industry (on a guided missile system) found a memory leak in one of their products at that time. They told him that they knew about it, but that it was timed just right for the range the system would be used at, so it didn't matter.

                                  • sally_glance

                                    yesterday at 9:41 PM

                                    Having observed an average of two mgmt rotations at most of the clients our company is working for, this comes as absolutely no surprise to me. Engineering is acting perfectly reasonably, optimizing for cost and time within the constraints they were given. Constraints are updated at a (marketing or investor pleasure) whim without consulting engineering, cue disaster. Not even surprising to me anymore...

                                    • Etheryte

                                      yesterday at 6:51 PM

                                      This paraphrased urban legend has nothing to do with quality engineering though? As described, it's designed to the spec and working as intended.

                                        • mikkupikku

                                          yesterday at 9:47 PM

                                          It tracks with my experience in software quality engineering. Asked to find problems with something already working well in the field. Dutifully find bugs/etc. Get told that it's working though so nobody will change anything. In dysfunctional companies, which is probably most of them, quality engineering exists to cover asses, not to actually guide development.

                                            • colechristensen

                                              yesterday at 10:47 PM

                                              It is not dysfunctional to ignore unreachable "bugs". A memory leak on a missile which won't be reached because it will explode long before that amount of time has passed is not a bug.

                                                • wkat4242

                                                  yesterday at 10:58 PM

                                                  It's a debt though. Because people will forget it's there and then at some point someone changes a counter from milliseconds to microseconds and then the issue happens 1000 times sooner.

                                                  It's never right to leave structural issues even if "they don't happen under normal conditions".
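A small worked example of the "1000 times sooner" effect. The 32-bit wrapping tick counter is an assumption chosen for illustration; the comment does not say what kind of counter is involved.

```python
def seconds_until_wrap(bits: int, ticks_per_second: int) -> float:
    """Time until an unsigned counter of the given width wraps around."""
    return 2 ** bits / ticks_per_second


ms_wrap = seconds_until_wrap(32, 1_000)      # counting milliseconds
us_wrap = seconds_until_wrap(32, 1_000_000)  # counting microseconds

print(f"milliseconds: wraps after {ms_wrap / 86_400:.1f} days")    # 49.7 days
print(f"microseconds: wraps after {us_wrap / 60:.1f} minutes")     # 71.6 minutes
```

A limit that sat comfortably outside any realistic runtime at millisecond resolution falls well inside it once the unit changes, even though nobody touched the code that depends on the limit.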

                                                    • Etheryte

                                                      yesterday at 11:20 PM

                                                      I don't think this argument makes sense. You wouldn't provision a 100GB server for a service where 1GB would do, just in case unexpected conditions come up. If the requirements change, then the setup can change; doing it "just because" is wasteful. "What if we forget?" is not a valid argument for over-engineering and over-provisioning.

                                                        • datadrivenangel

                                                          yesterday at 11:51 PM

                                                          If a fix is relatively low cost and improves the software in a way that makes it easier to modify in the future, it makes it easier to change the requirements. In aggregate these pay off.

                                      • runlaszlorun

                                        yesterday at 11:07 PM

                                        My hunch is that we do the same with memory leaks or other bugs in web applications where the time of a request is short.

                                        • mopsi

                                          yesterday at 6:49 PM

                                          ... until the extended-range version is ordered and no one remembers to fix the leak. :]

                                            • hinkley

                                              yesterday at 8:27 PM

                                              Ariane 5 happens.

                                              • wizzwizz4

                                                yesterday at 7:33 PM

                                                They will remember, because it'll have been measured and documented, rigorously.

                                                  • SketchySeaBeast

                                                    yesterday at 7:43 PM

                                                    I've found that the real trick with documentation isn't creation, it's discovery. I wonder how that information is easily found afterwards.

                                                      • lloeki

                                                        yesterday at 7:50 PM

                                                        By reading the documentation thoroughly as a compulsory first step to designing the next system that depends on it.

                                                        I realise this may probably boggle the mind of the modern software developer.

                                                          • SkyPuncher

                                                            yesterday at 9:28 PM

                                                            I used to take this approach when building new integrations. Then I realized (1) most documentation sucks (2) there's far too much to remember (3) much of it is conditional (4) you don't always know what matters until it matters (e.g. using different paths of implementation).

                                                            What works much better is having an intentional review step that you come back to.

                                                            • hinkley

                                                              yesterday at 8:31 PM

                                                              That is not how this usually works.

                                                              Most of the time QA can tell you exactly how the product works, regardless of what the documentation says. But many of us haven’t seen a QA team in five, ten years.

                                                              • lukan

                                                                yesterday at 8:21 PM

                                                                You say this as if trivial mistakes did not happen all the time in classical engineering as well.

                                                                If there is a memory leak, then this is a flaw. It might not matter so much for a specific product, but I can also easily see it being forgotten if it was mentioned somewhere in the documentation but not clearly enough; deadlines and stress to ship are a thing there as well.

                                                                • switchbak

                                                                  yesterday at 8:36 PM

                                                                  Just try harder. And if it still breaks, clearly you weren't trying hard enough!

                                                                  At some point you have to admit that humans are pretty bad at some things. Keeping documentation up to date and coherent is one of those things, especially in the age of TikTok.

                                                                  Better to live in the world we have and do the best you can, than to endlessly argue about how things should be but never will become.

                                                                    • vimwizard

                                                                      yesterday at 8:54 PM

                                                                      > especially in the age of TikTok

                                                                      Shouldn't grey beards, grizzled by years of practicing rigorous engineering, be passing this knowledge on to the next generation? How did they learn it when just starting out? They weren't born with it. Maybe engineering has actually improved so much that we only need to experience outages this frequently, and such feelings of nostalgia are born from never having to deal with systems having such high degrees of complexity and, realistically, 100% availability expectations on a global scale.

                                                                        • switchbak

                                                                          yesterday at 11:33 PM

                                                                          We were talking about making a missile (v2) with an extended range, and ensuring that the developers who work on it understand the assumption of the prior model: that it doesn't use free because it's expected to blow up before that would become an issue (a perfectly valid approach, I might add). And to ensure that this assumption still holds in the v2 extended range model. The analogy to Ariane 5 is very apt.

                                                                          Now, there can be tens of thousands of similar considerations to document. And keeping up that documentation with the actual state of the world is a full time job in itself.

                                                                          You can argue all you want that folks "should" do this or that, but all I've seen in my entire career is that documentation is almost universally: out of date, and not worth relying on because it's actively steering you in the wrong direction. And I actually disagree (as someone with some gray in my beard) with your premise that this is part of "rigorous engineering" as is practiced today. I wish it was, but the reality is you have to read the code, read it again, see what it does on your desk, see what it does in the wild, and still not trust it.

                                                                          We "should" be nice to each other, I "should" make more money, and it "should" be sunny more often. And we "should" have well written, accurate and reliable docs, but I'm too old to be waiting around for that day to come, especially in the age of zero attention and AI generated shite.

                                                                          • spockz

                                                                            yesterday at 9:15 PM

                                                                            They may not have learned it but being thorough in general was more of a thing. These days things are far more rushed. And I say that as a relatively young engineer.

                                                                            The amount of dedication and meticulous, concentrated work I know from older engineers when I started work, and that I remember from my grandfathers, is something I very rarely observe these days. Neither in engineering-specific fields nor in general.

                                                                • hinkley

                                                                  yesterday at 8:30 PM

                                                                  If ownerless code doesn’t result in discoverability efforts then the whole thing goes off the rails.

                                                                  I won’t remember this block of code because five other people have touched it. So I need to be able to see what has changed and what it talks to so I can quickly verify if my old assumptions still hold true

                                                                  • colechristensen

                                                                    yesterday at 10:53 PM

                                                                    >I wonder how that information is easily found afterwards.

                                                                    Military hardware is produced with engineering design practices that look nothing at all like what most of the HN crowd is used to. There is an extraordinary amount of documentation, requirements, and validation done for everything.

                                                                    There is a MIL-SPEC for pop tarts which defines all parts sizes, tolerances, etc.

                                                                    Unlike a lot in the software world military hardware gets DONE with design and then they just manufacture it.

                                                                    • wizzwizz4

                                                                      yesterday at 8:54 PM

                                                                      For the new system to be approved, you need to document the properties of the software component that are deemed relevant. The software system uses dynamic allocation, so "what do the allocation patterns look like? are there leaks, risks of fragmentation, etc, and how do we characterise those?" is on the checklist. The new developer could try to figure this all out from scratch, but if they're copying the old system's code, they're most likely just going to copy the existing paperwork, with a cursory check to verify that their modifications haven't changed the properties.

                                                                      They're going to see "oh, it leaks 3MiB per minute… and this system runs for twice as long as the old system", and then they're going to think for five seconds, copy-paste the appropriate paragraph, double the memory requirements in the new system's paperwork, and call it a day.

                                                                      Checklists work.
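The five-second calculation described above might look like this. The 3 MiB per minute leak rate is the figure from the comment; the runtimes and the baseline memory figure are hypothetical values picked for the example.

```python
LEAK_MIB_PER_MINUTE = 3  # figure quoted in the comment


def leaked_mib(runtime_minutes: float, leak_rate: float = LEAK_MIB_PER_MINUTE) -> float:
    """Memory lost to the leak over one run."""
    return runtime_minutes * leak_rate


def memory_budget_mib(baseline_mib: float, runtime_minutes: float) -> float:
    """Working memory plus headroom for the documented leak."""
    return baseline_mib + leaked_mib(runtime_minutes)


OLD_RUNTIME_MIN = 10                   # hypothetical runtime of the old system
NEW_RUNTIME_MIN = 2 * OLD_RUNTIME_MIN  # "runs for twice as long"
BASELINE_MIB = 64                      # hypothetical non-leak working set

print(memory_budget_mib(BASELINE_MIB, OLD_RUNTIME_MIN))  # 94
print(memory_budget_mib(BASELINE_MIB, NEW_RUNTIME_MIN))  # 124
```

Strictly, it is the leak headroom that doubles (30 MiB to 60 MiB here), not the whole budget, which is the kind of detail the paperwork check is meant to make someone look at.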

                                                                  • hinkley

                                                                    yesterday at 8:28 PM

                                                                    Was this one measured and documented rigorously?

                                                                    Well obviously not, because the front fell off. That’s a dead giveaway.

                                                        • zwnow

                                                          yesterday at 6:31 PM

                                                          "Kudos"? This is like the South Park episode in which the oil company guy just excuses himself while the company just continues to fuck up over and over again. There's nothing to praise, this shouldn't happen twice in a month. It's inexcusable.

                                                            • vpShane

                                                              yesterday at 7:32 PM

                                                              twice in a month _so far_

                                                                • hinkley

                                                                  yesterday at 8:36 PM

                                                                  We still have two holidays and associated vacations and vacation brain to go. And then the January hangover.

                                                                  Every company that has ignored my following advice has experienced a day for day slip in first quarter scheduling. And that advice is: not much work gets done between Dec 15 and Jan 15. You can rely on a week worth, more than that is optimistic. People are taking it easy and they need to verify things with someone who is on vacation so they are blocked. And when that person gets back, it’s two days until their vacation so it’s a crap shoot.

                                                                  NB: there’s work happening on Jan 10, for certain, but it’s not getting finished until the 15th. People are often still cleaning up after bad decisions they made during the holidays and the subsequent hangover.

                                                                  • Bengalilol

                                                                    yesterday at 7:45 PM

                                                                    Those AI agents are coding fast, or am I missing some obvious concept here?

                                                          • Scaevolus

                                                            yesterday at 3:45 PM

                                                            > Disabling this was done using our global configuration system. This system does not use gradual rollouts but rather propagates changes within seconds to the entire network and is under review following the outage we recently experienced on November 18.

                                                            > As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:

                                                            They really need to figure out a way to correlate global configuration changes to the errors they trigger as fast as possible.

                                                            > as part of this rollout, we identified an increase in errors in one of our internal tools which we use to test and improve new WAF rules

                                                            Warning signs like this are how you know that something might be wrong!
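One naive way to correlate an error spike with recent configuration changes is to look for the latest change deployed shortly before the spike. This is a sketch only: the function, the five-minute window, and the change names are hypothetical, and nothing here reflects Cloudflare's actual tooling. The 08:47 and 08:50 times are taken from the timeline in the post-mortem.

```python
from datetime import datetime, timedelta


def suspect_change(changes, spike_at, window=timedelta(minutes=5)):
    """Return the most recent (deployed_at, name) within `window` before the spike.

    Returns None when no change falls inside the window.
    """
    recent = [
        (deployed_at, name)
        for deployed_at, name in changes
        if timedelta(0) <= spike_at - deployed_at <= window
    ]
    return max(recent, default=None)


changes = [
    (datetime(2025, 12, 5, 7, 30), "unrelated-change (hypothetical)"),
    (datetime(2025, 12, 5, 8, 47), "global-config-change (hypothetical name)"),
]
spike_at = datetime(2025, 12, 5, 8, 50)  # when automated alerts fired

print(suspect_change(changes, spike_at))
```

A real system would need to handle many concurrent changes and gradual rollouts, but even this crude "what changed in the last few minutes" lookup points at the right deployment in a case like this one.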

                                                              • testplzignore

                                                                yesterday at 6:00 PM

                                                                > They really need to figure out a way to correlate global configuration changes to the errors they trigger as fast as possible.

                                                                This is what jumped out at me as the biggest problem. A wild west deployment process is a valid (but questionable) business decision, but if you do that then you need smart people in place to troubleshoot and make quick rollback decisions.

                                                                Their timeline:

                                                                > 08:47: Configuration change deployed and propagated to the network

                                                                > 08:48: Change fully propagated

                                                                > 08:50: Automated alerts

                                                                > 09:11: Configuration change reverted and propagation start

                                                                > 09:12: Revert fully propagated, all traffic restored

                                                                2 minutes for their automated alerts to fire is terrible. For a system that is expected to have no downtime, they should have been alerted to the spike in 500 errors within seconds before the changes even fully propagated. Ideally the rollback would have been automated, but even if it is manual, the dude pressing the deploy button should have had realtime metrics on a second display with his finger hovering over the rollback button.

                                                                Ok, so they want to take the approach of roll forward instead of immediate rollback. Again, that's a valid approach, but you need to be prepared. At 08:48, they would have had tens of millions of "init.lua:314: attempt to index field 'execute'" messages being logged per second. Exact line of code. Not a complex issue. They should have had engineers reading that code and piecing this together by 08:49. The change you just deployed was to disable an "execute" rule. Put two and two together. Initiate rollback by 08:50.

                                                                How disconnected are the teams that do deployments vs the teams that understand the code? How many minutes were they scratching their butts wondering "what is init.lua"? Are they deploying while their best engineers are sleeping?
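For reference, the gaps in the quoted timeline can be worked out directly. This is only a sketch over the five timestamps quoted above; the dictionary keys and the helper name are made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the timeline quoted in the comment above.
# The key names are illustrative labels, not Cloudflare's terminology.
TIMELINE = {
    "deployed": "08:47",
    "fully_propagated": "08:48",
    "alerts": "08:50",
    "revert_started": "09:11",
    "restored": "09:12",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from one timeline event to another."""
    fmt = "%H:%M"
    delta = datetime.strptime(TIMELINE[end], fmt) - datetime.strptime(TIMELINE[start], fmt)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    print("propagated -> alerts:", minutes_between("fully_propagated", "alerts"))
    print("alerts -> revert started:", minutes_between("alerts", "revert_started"))
    print("deployed -> restored:", minutes_between("deployed", "restored"))
```

Most of the incident window sits between the alerts firing and the revert starting, which is the gap the comment is asking about.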

                                                                  • bostik

                                                                    yesterday at 9:42 PM

                                                                    > 2 minutes for their automated alerts to fire is terrible

                                                                    I take exception to that, to be honest. It's not desirable or ideal, but calling it "terrible" is a bit ... well, sorry to use the word ... entitled. For context, I have experience running a betting exchange. A system where it's common for a notable fraction of transactions in a medium-volume event to take place within a window of less than 30 seconds.

                                                                    The vast majority of current monitoring systems are built on Prometheus. (Well okay, these days it's more likely something Prom-compatible but more reliable.) That implies collection via recurring scrapes. A supposedly "high" frequency online service monitoring system does a scrape every 30 seconds. Well-known reliability engineering practices state that you need a minimum of two consecutive telemetry points to detect any given event - because we're talking about a distributed system and the network is not a reliable transport. That in turn means that with near-perfect reliability the maximum time window before you can detect something failing is the time it takes to perform three scrapes: thing A might have failed a second after the last scrape, so two consecutive failures will show up only after a delay of just-a-hair-shy-of-three scraping cycle windows.

                                                                    At Cloudflare's scale, I would not be surprised if they require three consecutive events to trigger an alert.

                                                                    As for my history? The betting exchange monitoring was tuned to run scrapes at 10-second intervals. That still meant that the first alert for something failing could fire as late as 30 seconds after the failures manifested.

                                                                    Two minutes for something that does not run primarily financial transactions is a pretty decent alerting window.
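The rule of thumb in this comment can be written down as a small calculation. This is a sketch of the commenter's bound, not a description of Prometheus internals or of Cloudflare's alerting: the function name is hypothetical, and the extra interval of slack is taken from the comment's "three scrapes for two points" figure.

```python
def worst_case_alert_delay(scrape_interval_s: float, consecutive_points: int) -> float:
    """Upper bound, in seconds, between a failure starting and an alert firing.

    Follows the rule of thumb from the comment: a failure that begins just
    after a scrape needs `consecutive_points` bad scrapes, and the bound
    allows one further interval of slack (an assumption carried over from
    the comment's figure, not derived here).
    """
    if scrape_interval_s <= 0:
        raise ValueError("scrape_interval_s must be positive")
    if consecutive_points < 1:
        raise ValueError("consecutive_points must be at least 1")
    return scrape_interval_s * (consecutive_points + 1)


if __name__ == "__main__":
    # 30 s scrapes, two consecutive points: the "three scrapes" case.
    print(worst_case_alert_delay(30, 2))
    # 10 s scrapes, two consecutive points: the betting exchange setup.
    print(worst_case_alert_delay(10, 2))
    # 30 s scrapes, three consecutive points: the guess about Cloudflare.
    print(worst_case_alert_delay(30, 3))
```

Under these assumptions, 30-second scrapes with three required points give a bound of 120 seconds, which lines up with the two-minute alert delay in the timeline.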

                                                                      • dotancohen

                                                                        yesterday at 10:31 PM

                                                                        Prometheus compatible but more reliable? Sell it to me!

                                                                        • yearolinuxdsktp

                                                                          yesterday at 11:22 PM

                                                                          Critical high-level stats such as errors should be scraped more frequently than 30 seconds. It’s important to have multiple time granularity scraping intervals, a small set of most critical stats should be scraped closer to 10s or 15s.

                                                                          Prometheus has an unaddressed flaw [0], where the range used by rate functions must be at least 2x the scrape interval. This means that if you scrape at 30s intervals, your rate charts won’t reflect the change until a minute after.

                                                                          [0] - https://github.com/prometheus/prometheus/issues/3746

                                                                          • parchley

                                                                            yesterday at 9:59 PM

                                                                            > At Cloudflare's scale, I would not be surprised if they require three consecutive events to trigger an alert.

                                                                            Sorry but that’s a method you use if you serve 100 requests per second, not when you are at Cloudflare scale. Cloudflare easily have big enough volume that this problem would trigger an instant change in a monitorable failure rate.

                                                                        • morpheos137

                                                                          yesterday at 8:01 PM

                                                                          I see lots of people complaining about this down time but in actuality is it really that big a deal to have 30 minutes of down time or whatever? It's not like anything behind cloudflare is "mission critical" in the sense that lives are at stake or even a huge amount of money is at stake. In many developed countries the electric power service has local down times on occasion. That's more important than not being able to load a website. I agree if CF is offering a certain standard of reliability and not meeting it then they should offer prorated refunds for the unexpected down time but otherwise I am not seeing what the big deal is here.

                                                                            • ljm

                                                                              yesterday at 9:18 PM

                                                                              > It's not like anything behind cloudflare is "mission critical" in the sense that lives are at stake or even a huge amount of money is at stake.

                                                                              This is far too dismissive of how disruptive the downtime can be and it sets the bar way too low for a company so deeply entangled in global internet infrastructure.

                                                                              I don’t think you can make such an assertion with any degree of credibility.

                                                                              • bombcar

                                                                                yesterday at 8:42 PM

                                                                                30 minutes of downtime is fine for most things, including Amazon.

                                                                                30 minutes of unplanned downtime for infrastructure is unacceptable; but we’re tending to accept it. AWS or Cloudflare have positioned themselves as The Internet so they need to be held to a higher standard.

                                                                                • odie5533

                                                                                  yesterday at 9:19 PM

                                                                                  > It's not like anything behind cloudflare is "mission critical" in the sense that lives are at stake or even a huge amount of money is at stake.

                                                                                  Yes, there are lots of mission critical systems that use cloudflare and lives and huge amounts of money are at stake.

                                                                                  • moritonal

                                                                                    yesterday at 11:25 PM

                                                                                    I am confident there are at least a few hospitals, GP offices or ticketing systems that interact directly or indirectly with Cloudflare. They've sold themselves as a major defence in security.

                                                                                    • therein

                                                                                      yesterday at 8:08 PM

                                                                                      > about this down time but in actuality is it really that big a deal to have 30 minutes of down time or whatever. It's not like anything behind cloudflare is "mission critical" in the sense that lives are at stake or even a huge amount of money is at stake.

                                                                                      This reads like sarcasm. But I guess it is not. Yes, you are a CDN, a major one at that. 30 minutes of downtime or "whatever" is not acceptable. I worked on traffic teams at social networks that looked at themselves as that mission critical. CF is absolutely that critical and lives are definitely at stake.

                                                                              • philipwhiuk

                                                                                yesterday at 4:08 PM

                                                                                > Warning signs like this are how you know that something might be wrong!

                                                                                Yes, as they explain, it's the rollback that was triggered due to seeing these errors that broke stuff.

                                                                                  • Scaevolus

                                                                                    yesterday at 4:50 PM

                                                                                    They saw errors and decided to do a second rollout to disable the component generating errors, causing a major outage.

                                                                                    • 8cvor6j844qw_d6

                                                                                      yesterday at 4:51 PM

                                                                                      Would be nice if the outage dashboards are directly linked to this instead of whatever they have now.

                                                                                  • 8note

                                                                                    yesterday at 11:07 PM

                                                                                    they aren't a panacea though, internal tools like that can be super noisy on errors, and be broken more often than they're working

                                                                                    • bombcar

                                                                                      yesterday at 8:40 PM

                                                                                      “Uh...it's probably not a problem...probably...but I'm showing a small discrepancy in...well, no, it's well within acceptable bounds again. Sustaining sequence. Nothing you need to worry about, Gordon. Go ahead.”

                                                                                      • shadowgovt

                                                                                        yesterday at 9:22 PM

                                                                                        "Hey, this change is making the 'check engine' light turn on all the time. No problem; I just grabbed some pliers and crushed the bulb."

                                                                                    • cpncrunch

                                                                                      yesterday at 5:14 PM

                                                                                      I've noticed that in recent months, even apart from these outages, cloudflare has been contributing to a general degradation and shittification of the internet. I'm seeing a lot more "prove you're human", "checking to make sure you're human", and there is normally at the very least a delay of a few seconds before the site loads.

                                                                                      I don't think this is really helping the site owners. I suspect it's mainly about AI extortion:

                                                                                      https://blog.cloudflare.com/introducing-pay-per-crawl/

                                                                                        • james2doyle

                                                                                          yesterday at 5:26 PM

                                                                                          You call it extortion of the AI companies, but isn’t stealing/crawling/hammering a site to scrape their content to resell just as nefarious? I would say Cloudflare is giving these site owners an option to protect their content and as a byproduct, reduce their own costs of subsidizing their thieves. They can choose to turn off the crawl protection. If they don't, that tells you that they want it, doesn’t it?

                                                                                            • cpncrunch

                                                                                              yesterday at 6:52 PM

                                                                                              >You call it extortion of the AI companies, but isn’t stealing/crawling/hammering a site to scrape their content to resell just as nefarious?

                                                                                              You can easily block ChatGPT and most other AI scrapers if you want:

                                                                                              https://habeasdata.neocities.org/ai-bots

                                                                                                • james2doyle

                                                                                                  yesterday at 7:16 PM

                                                                                                  This is just using robots.txt and asking "pretty please, don’t scrape me".

                                                                                                  Here is an article (from TODAY) about the case where Perplexity is being accused of ignoring robots.txt: https://www.theverge.com/news/839006/new-york-times-perplexi...

                                                                                                  If you think a robots.txt is the answer to stopping the billion-dollar AI machine from scraping you, I don’t know what to say.

                                                                                                  • jacobgkau

                                                                                                    yesterday at 8:01 PM

                                                                                                    I'm guessing you don't manage any production web servers?

                                                                                                    robots.txt isn't even respected by all of the American companies. Chinese ones (which often also use what are essentially botnets in Latin America and the rest of the world to evade detection) certainly don't care about anything short of dropping their packets.

                                                                                                      • dingnuts

                                                                                                        yesterday at 8:22 PM

                                                                                                        [dead]

                                                                                                    • Sohcahtoa82

                                                                                                      yesterday at 11:44 PM

                                                                                                      How are you this naive? Do you really think scrapers give a damn about your robots.txt?

                                                                                                      • chrneu

                                                                                                        yesterday at 10:52 PM

                                                                                                        this is the equivalent of asking people not to speed on your street.

                                                                                                        • literalAardvark

                                                                                                          yesterday at 11:35 PM

                                                                                                          Tell me you don't run a site without telling me you don't run a site

                                                                                                          • mplewis

                                                                                                            yesterday at 10:43 PM

                                                                                                            No you cannot! I blocked all of the user agents on a community wiki I run, and the traffic came back hours later masquerading as Firefox and Chrome. They just fucking lie to you and continue vacuuming your CPU.

                                                                                                    • NooneAtAll3

                                                                                                      yesterday at 5:23 PM

                                                                                                      it can't even spy on us silently, damn

                                                                                                  • lionkor

                                                                                                    yesterday at 4:45 PM

                                                                                                    Cloudflare is now below 99.9% uptime, for anyone keeping track. I reckon my home PC is at least 99.9%.

                                                                                                      • ryandvm

                                                                                                        yesterday at 7:30 PM

                                                                                                        Indeed. AWS too.

                                                                                                        I feel like the cloud hosting companies have lost the plot. "They can provide better uptime than us" is the entire rationale that a lot of small companies have when choosing to run everything in the cloud.

                                                                                                        If they cost more AND they're less reliable, what exactly is the reason to not self host?

                                                                                                          • toomuchtodo

                                                                                                            yesterday at 10:02 PM

                                                                                                            > If they cost more AND they're less reliable, what exactly is the reason to not self host?

                                                                                                            Shifting liability. You're paying someone else for it to be their problem, and if everyone does it, no one will take flak for continuing to do so. What is the average tenure of a CIO or decision maker electing to move to or remain at a cloud provider? This is why you get picked to talk on stage at cloud provider conferences.

                                                                                                            (have been in the meetings where these decisions are made)

                                                                                                            • XCSme

                                                                                                              yesterday at 7:44 PM

                                                                                                              Plus, when you self-host, you can likely fix the issue yourself in a couple of hours max, instead of waiting indefinitely for a fix or support that might never come.

                                                                                                                • bombcar

                                                                                                                  yesterday at 8:44 PM

                                                                                                                  These global cloud outages aren’t the real issue; they affect everyone and get fixed.

                                                                                                                  What is killer is when there is a KNOWN issue that affects YOU but basically only you so why bother fixing it!

                                                                                                                    • XCSme

                                                                                                                      yesterday at 9:07 PM

                                                                                                                      I mean, I still prefer to have the ability to fix it myself, because I know I can probably do it in 1h max. I know this doesn't apply to most people, especially those outside of HN though.

                                                                                                                        • al_borland

                                                                                                                          yesterday at 9:25 PM

                                                                                                                          Even if resolution times are equal, there is some comfort in being able to see the problem and make progress on it to feel like you're actively doing something. I work in a large enterprise and we have a team dedicated to managing critical incidents and getting everyone together for a resolution. When a 3rd party vendor is the reason for the outage, those calls are really awkward. It's a bunch of people sitting around pressing F5, all frantically trying to make it look like they are actively helping, when no one is actually doing anything, because they can't.

                                                                                                                          I equate it to driving. I'd rather be moving at a normal speed on side streets than sitting in traffic on the expressway, even if the expressway is technically faster.

                                                                                                                            • XCSme

                                                                                                                              yesterday at 9:30 PM

                                                                                                                              Today a client is having some issue with Zoom because of some artificial rate limits they impose. Their support is not responding, the account can't be used, courses can not be held and there's not much we can do.

                                                                                                                              We already started looking into moving away from Zoom, and I suggested self-hosting http://jitsi.org. Based on their docs, self-hosting is well supported, and probably a $50-$100 server is more than enough, so a lot cheaper than Zoom.

                                                                                                                                • carl_dr

                                                                                                                                  yesterday at 10:06 PM

                                                                                                                                  Artificial limits because they are on the free plan?

                                                                                                                                    • XCSme

                                                                                                                                      yesterday at 10:17 PM

                                                                                                                                      Artifical limits, because they have 40 paid licenses that they can not use, because of a non-disclosed assignment limit that is NOT mentioned in the pricing page nor in the ToS.

                                                                                                                                      A lot of people are angry about this, and I think it's borderline illegal: https://devforum.zoom.us/t/you-have-exceeded-the-limit-of-li...

                                                                                                                                      You pay for something, and you can't use it.

                                                                                                                                        • bombcar

                                                                                                                                          yesterday at 10:32 PM

                                                                                                                                          This is why we never changed the licenses, we just made long-running identical ID meetings that everyone can join.

                                                                                                                                          But we’re moving away as it’s only going to get worse.

                                                                                                                                            • XCSme

                                                                                                                                              yesterday at 10:41 PM

                                                                                                                                              That's a cool work-around.

                                                                                                                                              What I don't like, is that whenever you contact Zoom, their representatives are taught to say one thing: buy more licenses.

                                                                                                                                              Not only that, but their API/pricing is specifically designed to cover edge-cases that will force you to buy a license.

                                                                                                                                              For example, they don't expose an API to assign a co-host. You can do that via the UI, manually, but not via the API.

                                                                                                                                              Can you share which solution you are moving to?

                                                                                                                                  • al_borland

                                                                                                                                    yesterday at 9:47 PM

                                                                                                                                    It's interesting to see Comcast is using that. I would have expected them to go with the mainstream vendors.

                                                                                                                            • mewpmewp2

                                                                                                                              yesterday at 9:46 PM

                                                                                                                              Are you always available to react within 1h?

                                                                                                              • odie5533

                                                                                                                yesterday at 9:22 PM

                                                                                                                When a piece of hardware goes or a careless backup process fails, downtime of a self-hosted service can be measured in days or weeks.

                                                                                                                • markus_zhang

                                                                                                                  yesterday at 5:59 PM

                                                                                                                  TBF, it depends on the number of outages locally. In my area it is one outage every thunderstorm/snow storm, so unfortunately the uptime of my laptop, even with the help of a large, portable battery charging station (which can charge multiple laptops at the same time), is not great.

                                                                                                                  I sometimes fancy that I could just take cash, go into the woods, build a small solar array, collect & cleanse river water, and buy a Starlink console.

                                                                                                                    • roguecoder

                                                                                                                      yesterday at 7:06 PM

                                                                                                                      Costco had a deal on solid-state UPS & solar panels a while back that I was happy to partake of

                                                                                                                      • SoftTalker

                                                                                                                        yesterday at 9:50 PM

                                                                                                                        Yeah, I'd guess I average a power drop once a month or so at home. Never calculated the nines of uptime average, but it's not that infrequent.

                                                                                                                        I know when I need to reset the clock on my microwave oven.

                                                                                                                          • lionkor

                                                                                                                            yesterday at 10:02 PM

                                                                                                                            99.9 is like 9 hours of downtime a year.

                                                                                                                        • DANmode

                                                                                                                          yesterday at 8:36 PM

                                                                                                                          Far more achievable pricing and logistics than even ten years ago.

                                                                                                                      • hashstring

                                                                                                                        yesterday at 11:12 PM

                                                                                                                        Where/how are you keeping track of this? What is their current uptime percentage?

                                                                                                                          • ivanjermakov

                                                                                                                            yesterday at 11:31 PM

                                                                                                                            1 - downtime/period. I suspect period is 1 year. 99.9% is 8.76 hours of downtime a year.
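The arithmetic in the replies above can be checked with a short sketch. It assumes the period is a 365-day year (8,760 hours), as the comment suspects; the function names are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, assuming a non-leap year as the period


def availability(downtime_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Availability as a fraction: 1 - downtime / period."""
    return 1 - downtime_hours / period_hours


def allowed_downtime_hours(availability_pct: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime budget in hours for a given availability percentage."""
    return (1 - availability_pct / 100) * period_hours


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        print(f"{pct}% -> {allowed_downtime_hours(pct):.2f} hours of downtime per year")
```

For 99.9% this gives roughly 8.76 hours per year, which matches both this figure and the "like 9 hours" estimate given earlier in the thread.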

                                                                                                                        • chickensong

                                                                                                                          yesterday at 9:32 PM

                                                                                                                          That's a pretty silly comparison though.

                                                                                                                          • tripplyons

                                                                                                                            yesterday at 7:48 PM

                                                                                                                            Do they include uptime guarantees in any contracts?

                                                                                                                      • RA_Fisher

                                                                                                                        today at 12:01 AM

                                                                                                                        As a reliability statistician (and web user!), I'd love to see Cloudflare investing in reliability statistics. :)

                                                                                                                        • uyzstvqs

                                                                                                                          yesterday at 5:03 PM

                                                                                                                          What I'm missing here is a test environment. Gradual or not; why are they deploying straight to prod? At Cloudflare's scale, there should be a dedicated room in Cloudflare HQ with a full isolated model-scale deployment of their entire system. All changes should go there first, with tests run for every possible scenario.

                                                                                                                          Only after that do you use gradual deployment, with a big red oopsie button which immediately rolls the changes back. Languages with strong type systems won't save you, good procedure will.
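A minimal sketch of the "gradual deployment with a rollback button" idea, assuming a health signal is available at each stage. The stage fractions, the `error_rate_at` callback, and the 1% threshold are all hypothetical and are not taken from Cloudflare's tooling.

```python
from typing import Callable


def gradual_rollout(
    stages: list[float],
    error_rate_at: Callable[[float], float],
    max_error_rate: float = 0.01,
) -> tuple[str, float]:
    """Advance through traffic fractions; roll back at the first unhealthy stage.

    Returns (status, fraction of traffic left on the new version).
    """
    deployed = 0.0
    for fraction in stages:
        deployed = fraction  # expose the change to this share of traffic
        if error_rate_at(fraction) > max_error_rate:
            # The "big red button": revert everything, not just this stage.
            return ("rolled_back", 0.0)
    return ("complete", deployed)


if __name__ == "__main__":
    healthy = lambda fraction: 0.0
    breaks_at_ten_percent = lambda fraction: 0.5 if fraction >= 0.1 else 0.0
    print(gradual_rollout([0.01, 0.1, 1.0], healthy))
    print(gradual_rollout([0.01, 0.1, 1.0], breaks_at_ten_percent))
```

The point of the sketch is only that the rollback decision is automatic and happens before the change reaches full traffic.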

                                                                                                                            • tetha

                                                                                                                              yesterday at 9:58 PM

                                                                                                                              This is kinda what I'm thinking. We're absolutely not at the scale Cloudflare is at.

                                                                                                                              But we run software and configuration changes through three tiers - first stage for the dev-team only, second stage with internal customers and other teams depending on it for integration and internal usage -- and finally production. Some teams have also split production into different rings depending on the criticality of the customers and the number of customers.

                                                                                                                              This has led to a bunch of discussions early on, because teams with simpler software and very good testing usually push through dev and testing with no or little problem. And that's fine. If you have a track record of good changes, there is little reason to artificially prolong deployment in dev and test just because. If you want to, just go through it in minutes.

                                                                                                                              But after a few spicy production incidents, even the better and faster teams understood and accepted that once technical velocity exists, actual velocity is a choice, or a throttle if you want an analogy.

                                                                                                                              If you do good, by all means, promote from test to prod within minutes. If you fuck up production several times in a row and start threatening SLAs, slow down, spend more resources on manual testing and improving automated testing, give changes time to simmer in the internally productive environment, spend more time between promotions from production ring to production ring.

                                                                                                                              And this is on top of considerations of e.g. change risk. Some frontend-only application can move much faster than the PostgreSQL team, because one rollback is a container restart, and the other could be a multi-hour recovery from backups.
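The "velocity is a throttle" idea above could be sketched as follows. This is one possible model, not the commenter's actual system: the stage names, the incident counter, and the scaling factors are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical stage names: dev-only, internal customers, then production rings.
STAGES = ["dev", "internal", "prod-ring-1", "prod-ring-2"]


@dataclass
class Team:
    name: str
    recent_prod_incidents: int  # incidents in some recent window (assumed metric)
    rollback_minutes: int       # rough cost of undoing a bad change


def soak_minutes(team: Team, base: int = 5) -> int:
    """Minimum wait between promotions; grows with incidents and rollback cost."""
    incident_factor = 2 ** team.recent_prod_incidents  # each incident doubles the wait
    risk_factor = max(1, team.rollback_minutes // 10)  # expensive rollbacks go slower
    return base * incident_factor * risk_factor


def promotion_plan(team: Team) -> list[tuple[str, str, int]]:
    """List of (from_stage, to_stage, minutes_to_wait) for this team."""
    wait = soak_minutes(team)
    return [(src, dst, wait) for src, dst in zip(STAGES, STAGES[1:])]


if __name__ == "__main__":
    frontend = Team("frontend", recent_prod_incidents=0, rollback_minutes=1)
    postgres = Team("postgres", recent_prod_incidents=0, rollback_minutes=240)
    print(promotion_plan(frontend))
    print(promotion_plan(postgres))
```

A team with a clean record and a cheap rollback moves in minutes, while a team whose rollback is a multi-hour restore is held at each stage for longer.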

                                                                                                                              • bombcar

                                                                                                                                yesterday at 8:45 PM

                                                                                                                                They have millions of “free” subscribers; said subscribers should be the test pigs for rollouts; paying (read: big) subscribers can get the breaking changes later.

                                                                                                                                  • beardedetim

                                                                                                                                    yesterday at 9:02 PM

                                                                                                                                    This feels like such a valid solution and is how past $dayjobs released things: send to the free users, rollout to Paying Users once that's proven to not blow up.

                                                                                                                                      • sznio

                                                                                                                                        yesterday at 10:54 PM

                                                                                                                                        If your target is availability, that's correct.

                                                                                                                                        If your target is security, then _assuming your patch is actually valid_ you're giving better security coverage for free customers than to your paying ones.

                                                                                                                                        Cloudflare is both, and their tradeoffs seem to be set on maximizing security at cost of availability. And it makes sense. A fully unavailable system is perfectly secure.

                                                                                                                                    • ectospheno

                                                                                                                                      yesterday at 9:28 PM

                                                                                                                                      Free tier doesn’t get WAF. We kept working.

                                                                                                                                        • bsdpqwz

                                                                                                                                          yesterday at 9:34 PM

                                                                                                                                          Their December 3rd blog about React states:

                                                                                                                                          "These new protections are included in both the Cloudflare Free Managed Ruleset (available to all Free customers) ..... "

                                                                                                                                          having some burn in time in free tier before it hits the whole network would have been good?!

                                                                                                                                  • yesterday at 5:52 PM

                                                                                                                                    • vouwfietsman

                                                                                                                                      yesterday at 7:43 PM

                                                                                                                                      > Languages with strong type systems won't save you

                                                                                                                                      Neither will seatbelts if you drive into the ocean, or helmets if you drink poison. I'm not sure what your point is.

                                                                                                                                        • djmips

                                                                                                                                          yesterday at 9:16 PM

                                                                                                                                          I think you strengthened their point.

                                                                                                                                  • miyuru

                                                                                                                                    yesterday at 3:50 PM

                                                                                                                                    What's going on with Cloudflare's software team?

                                                                                                                                    I have seen similar bugs in the Cloudflare API recently as well.

                                                                                                                                    There is an endpoint for a feature that is available only to enterprise users, but the check for whether the user is on an enterprise plan is done at the last step.

                                                                                                                                      • archon810

                                                                                                                                        yesterday at 5:35 PM

                                                                                                                                        I recently ran into an issue with a Cloudflare API feature where rolling back requires contacting the support team, because there's no way to roll it back with the API or GUI. Even after the exact issue was pointed out, it took multiple days to change the setting, and to my knowledge there's still no API fix available.

                                                                                                                                        https://www.answeroverflow.com/m/1234405297787764816

                                                                                                                                        • 65

                                                                                                                                          yesterday at 5:49 PM

                                                                                                                                          My guess? Code written by AI

                                                                                                                                            • markus_zhang

                                                                                                                                              yesterday at 7:46 PM

                                                                                                                                              TBF they are still hiring a lot of eng people from US/UK/EU:

                                                                                                                                              https://www.cloudflare.com/careers/jobs/?department=Engineer...

                                                                                                                                              • system2

                                                                                                                                                yesterday at 6:01 PM

                                                                                                                                                100%. Upper management tries to cut costs and hires remote bullshitters.

                                                                                                                                                  • venturecruelty

                                                                                                                                                    yesterday at 9:14 PM

                                                                                                                                                    Agreed in re cost cutting, but there's no need to disparage those of us who don't want to be in traffic for two hours every day.

                                                                                                                                            • LelouBil

                                                                                                                                              yesterday at 5:49 PM

                                                                                                                                              Can you elaborate? I'm not sure what you mean by "at the last step"

                                                                                                                                                • miyuru

                                                                                                                                                  yesterday at 9:16 PM

                                                                                                                                                  The API endpoint I am talking about needs an external verification. They allow you to do the external verification before checking whether the user is on the enterprise plan.

                                                                                                                                                  The feature is only available on enterprise plans, so it should not even allow the external verification.

                                                                                                                                                  • Etheryte

                                                                                                                                                    yesterday at 6:54 PM

                                                                                                                                                    I'm not sure which endpoint gp meant, but as I understood it, as an example, imagine a three-way handshake that's only available to enterprise users. Instead of failing a regular user on the first step, they allow steps one and two, but then do the check on step three and fail there.
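A minimal sketch of the ordering issue described above: check entitlement before doing any other work. The endpoint, the plan names, and `verify_externally` are hypothetical stand-ins, not Cloudflare's API.

```python
class PlanError(Exception):
    """Raised when the caller's plan does not include the feature."""


def verify_externally(domain: str, calls: list[str]) -> bool:
    """Stand-in for an expensive external verification step (hypothetical)."""
    calls.append(f"verify:{domain}")
    return True


def enable_feature(user_plan: str, domain: str, calls: list[str]) -> str:
    # Check entitlement first, so non-enterprise users fail at step one
    # instead of after the external verification has already run.
    if user_plan != "enterprise":
        raise PlanError("feature requires an enterprise plan")
    if not verify_externally(domain, calls):
        raise ValueError("external verification failed")
    calls.append(f"enable:{domain}")
    return "enabled"


if __name__ == "__main__":
    calls: list[str] = []
    try:
        enable_feature("free", "example.com", calls)
    except PlanError as exc:
        print("rejected early:", exc, "| side effects:", calls)
    print(enable_feature("enterprise", "example.com", calls), calls)
```

With the plan check moved to the last step instead, the free user would trigger the verification call before being rejected, which is the behaviour being complained about.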

                                                                                                                                            • flaminHotSpeedo

                                                                                                                                              yesterday at 3:49 PM

                                                                                                                                              What's the culture like at Cloudflare re: ops/deployment safety?

                                                                                                                                              They saw errors related to a deployment, and because it was related to a security issue instead of rolling it back they decided to make another deployment with global blast radius instead?

                                                                                                                                              Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.

                                                                                                                                              Pure speculation, but to me that sounds like there's more to the story, this sounds like the sort of cowboy decision a team makes when they've either already broken all the rules or weren't following them in the first place

                                                                                                                                                • dkyc

                                                                                                                                                  yesterday at 4:16 PM

                                                                                                                                                  One thing to keep in mind when judging what's 'appropriate' is that Cloudflare was effectively responding to an ongoing security incident outside of their control (the React Server RCE vulnerability). Part of Cloudflare's value proposition is being quick to react to such threats. That changes the equation a bit: any hour you wait longer to deploy, your customers are actively getting hacked through a known high-severity vulnerability.

                                                                                                                                                  In this case it's not just a matter of 'hold back for another day to make sure it's done right', like when adding a new feature to a normal SaaS application. In Cloudflare's case moving slower also comes with a real cost.

                                                                                                                                                  That isn't to say it didn't work out badly this time, just that the calculation is a bit different.

                                                                                                                                                    • flaminHotSpeedo

                                                                                                                                                      yesterday at 4:44 PM

                                                                                                                                                      To clarify, I'm not trying to imply that I definitely wouldn't have made the same decision, or that cowboy decisions aren't ever the right call.

                                                                                                                                                      However, this preliminary report doesn't really justify the decision to use the same deployment system responsible for the 11/18 outage. Deployment safety should have been the focus of this report, not the technical details. The question I want answered isn't "are there bugs in Cloudflare's systems", it's "has Cloudflare learned from its recent mistakes to respond appropriately to events".

                                                                                                                                                        • vlovich123

                                                                                                                                                          yesterday at 5:45 PM

                                                                                                                                                          > doesn't really justify the decision to use the same deployment system responsible for the 11/18 outage

                                                                                                                                                          There’s no other deployment system available. There’s a single system for config deployment and it’s all that was available as they haven’t yet done the progressive roll out implementation yet.

                                                                                                                                                            • locknitpicker

                                                                                                                                                              yesterday at 7:13 PM

                                                                                                                                                              > There’s no other deployment system available.

                                                                                                                                                              Hindsight is always 20/20, but I don't know how that sort of oversight could happen in an organization whose business model rides on reliability. Small shops understand the importance of safeguards such as progressive deployments or one-box-style deployments with a baking period, so why not the likes of Cloudflare? Don't they have anyone on their payroll who warns about the risks of global deployments without safeguards?

                                                                                                                                                              • edoceo

                                                                                                                                                                yesterday at 6:16 PM

                                                                                                                                                                OK, sure. But shouldn't they have some beta/staging/test area they could deploy to, run tests for an hour, then do the global blast?

                                                                                                                                                                  • vlovich123

                                                                                                                                                                    yesterday at 6:59 PM

                                                                                                                                                                    Config changes are distinctly more difficult to have that set up for and, as the blog says, they’re working on it. They just don’t have it ready yet and are pausing any more config changes until it’s set up. They only did this one in an attempt to mitigate an ongoing security vulnerability and missed the mark.

                                                                                                                                                                    I’m happy to see they’re changing their systems to fail open which is one of the things I mentioned in the conversation about their last outage.
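To make "fail open" concrete, here is a minimal sketch contrasting it with failing closed. It assumes a rule engine in which a faulty rule raises an exception; the rule functions and status codes are illustrative and not taken from Cloudflare's implementation.

```python
from typing import Callable

Rule = Callable[[str], bool]  # returns True when the request should be blocked


def broken_rule(request: str) -> bool:
    raise RuntimeError("rule engine bug")  # simulates a faulty rule or config


def blocks_evil(request: str) -> bool:
    return "evil" in request


def handle(request: str, rules: list[Rule], fail_open: bool) -> int:
    """Return an HTTP-like status: 200 pass, 403 blocked, 500 internal error."""
    for rule in rules:
        try:
            if rule(request):
                return 403
        except Exception:
            if fail_open:
                continue  # skip the faulty rule and keep serving traffic
            return 500    # fail closed: the error takes the request down with it
    return 200


if __name__ == "__main__":
    print(handle("GET /", [broken_rule], fail_open=False))
    print(handle("GET /", [broken_rule], fail_open=True))
    print(handle("GET /evil", [broken_rule, blocks_evil], fail_open=True))
```

The trade-off raised elsewhere in the thread still applies: failing open keeps sites reachable, but requests are served without the protection the faulty rule was meant to provide.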

                                                                                                                                                              • dkyc

                                                                                                                                                                yesterday at 8:33 PM

                                                                                                                                                                The 11/18 outage was 2.5 weeks ago. Any learnings and changes they made as a result probably haven't made their way to production yet.

                                                                                                                                                                Particularly if we're asking them to be careful & deliberate about deployments, it's hard to ask them to fast-track this.

                                                                                                                                                            • Already__Taken

                                                                                                                                                              yesterday at 4:41 PM

                                                                                                                                                              The CVE isn't a zero-day, though. How come Cloudflare wasn't at the table for early disclosure?

                                                                                                                                                                • flaminHotSpeedo

                                                                                                                                                                  yesterday at 4:58 PM

                                                                                                                                                                  Do you have a public source about an embargo period for this one? I wasn't able to find one

                                                                                                                                                                    • charcircuit

                                                                                                                                                                      yesterday at 5:49 PM

                                                                                                                                                                      Considering there were patched libraries at the time of disclosure, those libraries' authors must have been informed ahead of time.

                                                                                                                                                                      • Pharaoh2

                                                                                                                                                                        yesterday at 5:33 PM

                                                                                                                                                                        https://react.dev/blog/2025/12/03/critical-security-vulnerab...

                                                                                                                                                                        Privately Disclosed: Nov 29 Fix pushed: Dec 1 Publicly disclosed: Dec 3

                                                                                                                                                                          • drysart

                                                                                                                                                                            yesterday at 5:38 PM

                                                                                                                                                                            Then even in the worst-case scenario, they were addressing this issue two days after it was publicly disclosed. So this wasn't a "rush to fix the zero day ASAP" scenario, which makes it harder to justify ignoring errors that started occurring in a small-scale rollout.

                                                                                                                                                                • cowsandmilk

                                                                                                                                                                  yesterday at 6:52 PM

                                                                                                                                                                  Cloudflare had already decided this was a rule that could be rolled out using their gradual deployment system. They did not view it as being so urgent that it required immediate global roll out.

                                                                                                                                                                  • udev4096

                                                                                                                                                                    yesterday at 4:53 PM

                                                                                                                                                                    Clownflare did what it does best, mess up and break everything. It will keep happening again and again

                                                                                                                                                                      • toomuchtodo

                                                                                                                                                                        yesterday at 5:06 PM

                                                                                                                                                                        Indeed, but it is what it is. Cloudflare comes out of my budget, and even with downtime, it's better than not paying them. Do I want to deal with what Cloudflare offers? I do not, I have higher value work to focus on. I want to pay someone else to deal with this, and just like when cloud providers are down, it'll be back up eventually. Grab a coffee or beer and hang; we aren't saving lives, we're just building websites. This is not laziness or nihilism, but simply being rational and pragmatic.

                                                                                                                                                                            • locknitpicker

                                                                                                                                                                              yesterday at 7:28 PM

                                                                                                                                                                              > Do I want to deal with what Cloudflare offers? I do not, I have higher value work to focus on. I want to pay someone else to deal with this, and just like when cloud providers are down, it'll be back up eventually.

                                                                                                                                                                              This is specious reasoning. How come I had to endure a total outage due to the rollout of a mitigation of a Nextjs vulnerability when my organization doesn't even own any React app, let alone a Nextjs one?

                                                                                                                                                                              Also specious reasoning #2, not wanting to maintain a service does not justify blindly rolling out config changes globally without any safeguards.

                                                                                                                                                                                • toomuchtodo

                                                                                                                                                                                  yesterday at 7:57 PM

                                                                                                                                                                                  If you are a customer of Cloudflare, and not happy, I encourage you to evaluate other providers more to your liking. Perhaps you'll find someone more fitting to your use case and operational preferences, but perhaps not. My day job org pays Cloudflare hundreds of thousands of dollars a year, and I am satisfied with how they operate. Everyone has a choice, exercise it if you choose. I'm sure your account exec would be happy to take the feedback. Feedback, including yours, is valuable and important to attempt to improve the product and customer experience (imho; I of course do not speak for Cloudflare, only myself).

                                                                                                                                                                                  As a recovering devops/infra person from a lifetime ago (who has, much to my heartbreak, broken prod more than once), perhaps that is where my grace in this regard comes from. Systems and their components break, systems and processes are imperfect, and urgency can lead to unexpected failure. Sometimes it's Cloudflare, other times it's Azure, GCP, Github, etc. You can always use something else, but most of us continue to pick the happy path of "it works most of the time, and sometimes it does not." Hopefully the post mortem has action items to improve the safeguards you mention. If there are no process and technical improvements from the outage, certainly, that is where the failure lies (imho).

                                                                                                                                                                                  China-nexus cyber threat groups rapidly exploit React2Shell vulnerability (CVE-2025-55182) - https://aws.amazon.com/blogs/security/china-nexus-cyber-thre... - December 4th, 2025

                                                                                                                                                                                  https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

                                                                                                                                                                                  https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

                                                                                                                                                                  • liampulles

                                                                                                                                                                    yesterday at 4:17 PM

                                                                                                                                                                    Rollback is a reliable strategy when the rollback process is well understood. If a rollback process is not well known and well experienced, then it is a risk in itself.

                                                                                                                                                                    I'm not sure of the nature of the rollback process in this case, but leaning on ill-founded assumptions is a bad practice. I do agree that a global rollout is a problem.

                                                                                                                                                                      • newsoftheday

                                                                                                                                                                        yesterday at 5:56 PM

                                                                                                                                                                        Rollback carries with it the contextual understanding of complete atomicity; otherwise it's slightly better than a yeet. It's similar to backups that are untested.

                                                                                                                                                                          • marcosdumay

                                                                                                                                                                            yesterday at 6:45 PM

                                                                                                                                                                            Complete atomicity carries with it the idea that the world is frozen, and any data only needs to change when you allow it to.

                                                                                                                                                                            That's to say, it's an incredibly good idea when you can physically implement it. It's not something that everybody can do.

                                                                                                                                                                              • newsoftheday

                                                                                                                                                                                yesterday at 7:17 PM

                                                                                                                                                                                No, complete atomicity doesn't require a frozen state, it requires common sense and fail-proof, fool-proof guarantees derived from assurances gained from testing.

                                                                                                                                                                                There is another name for rolling forward, it's called tripping up.

                                                                                                                                                                        • programd

                                                                                                                                                                          yesterday at 6:51 PM

                                                                                                                                                                          Global rollout of security code on a timeframe of seconds is part of Cloudflare's value proposition.

                                                                                                                                                                          In this case they got unlucky with an incident before they finished work on planned changes from the last incident.

                                                                                                                                                                      • crote

                                                                                                                                                                        yesterday at 5:45 PM

                                                                                                                                                                        > They saw errors related to a deployment, and because it was related to a security issue instead of rolling it back they decided to make another deployment with global blast radius instead?

                                                                                                                                                                        Note that the two deployments were of different components.

                                                                                                                                                                        Basically, imagine the following scenario: A patch for a critical vulnerability gets released, during rollout you get a few reports of it causing the screensaver to show a corrupt video buffer instead, you roll out a GPO to use a blank screensaver instead of the intended corporate branding, a crash in a script parsing the GPOs on this new value prevents users from logging in.

                                                                                                                                                                        There's no direct technical link between the two issues. A mitigation of the first one merely exposed a latent bug in the second one. In hindsight it is easy to say that the right approach is obviously to roll back, but in practice a roll forward is often the better choice - both from an ops perspective and from a safety perspective.

                                                                                                                                                                        Given the above scenario, how many people are genuinely willing to do a full rollback, file a ticket with Microsoft, and hope they'll get around to fixing it some time soon? I think in practice the vast majority of us will just look for a suitable temporary workaround instead.

                                                                                                                                                                        • lukeasrodgers

                                                                                                                                                                          yesterday at 4:09 PM

                                                                                                                                                                          Roll back is not always the right answer. I can’t speak to its appropriateness in this particular situation of course, but sometimes “roll forward” is the better solution.

                                                                                                                                                                            • flaminHotSpeedo

                                                                                                                                                                              yesterday at 4:55 PM

                                                                                                                                                                              Like the other poster said, roll back should be the right answer the vast majority of the time. But it's also important to recognize that roll forward should be a replacement for the deployment you decided not to roll back, not a parallel deployment through another system.

                                                                                                                                                                              I won't say never, but a situation where the right answer to avoid a rollback (which, it sounds like, was technically fine to do, just undesirable from a security/business perspective) is a parallel deployment through a radioactive, global blast radius, near instantaneous deployment system that is under intense scrutiny after another recent outage should be about as probable as a bowl of petunias in orbit.

                                                                                                                                                                                • crote

                                                                                                                                                                                  yesterday at 5:55 PM

                                                                                                                                                                                  Is a roll back even possible at Cloudflare's size?

                                                                                                                                                                                  With small deployments it usually isn't too difficult to re-deploy a previous commit. But once you get big enough you've got enough developers that half a dozen PRs will have been merged between the start of the incident and now. How viable is it to stop the world, undo everything, and start from scratch any time a deployment causes the tiniest issues?

                                                                                                                                                                                  Realistically the best you're going to get is merging a revert of the problematic changeset - but with the intervening merges that's still going to bring the system into a novel state. You're rolling forwards, not backwards.
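The point about intervening merges can be made concrete by modelling a deployed state as the ordered set of changesets it contains. This is a toy model under assumed names (`revert`, changesets `A` to `D`), not a description of any real deployment pipeline.

```python
def revert(state: tuple, bad: str) -> tuple:
    """Return the state produced by merging a revert of one changeset."""
    return tuple(change for change in state if change != bad)


# Hypothetical history: B is the problematic changeset, C and D merged after it.
previously_deployed = [("A",), ("A", "B"), ("A", "B", "C"), ("A", "B", "C", "D")]
current = previously_deployed[-1]

after_revert = revert(current, "B")
print(after_revert)                         # ('A', 'C', 'D')
print(after_revert in previously_deployed)  # False: this combination never ran before
```

In this model a true rollback would return to `("A",)`, which has run before, while the revert lands on a combination that has never been exercised in production.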

                                                                                                                                                                                    • jamesog

                                                                                                                                                                                      yesterday at 7:48 PM

                                                                                                                                                                                      Disclosure: Former Cloudflare SRE.

                                                                                                                                                                                      The short answer is "yes" due to the way the configuration management works. Other infrastructure changes or service upgrades might get undone, but it's possible. Or otherwise revert the commit that introduced the package bump with the new code and force that to roll out everywhere rather than waiting for progressive rollout.

                                                                                                                                                                                      There shouldn't be much chance of bringing the system to a novel state because configuration management will largely put things into the correct state. (Where that doesn't work is if CM previously created files, it won't delete them unless explicitly told to do so.)
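The caveat in that parenthetical can be modelled in a few lines. The sketch assumes a simplified configuration-management model rather than any particular tool: `converge` writes every declared file but deletes only paths explicitly declared absent. The function name and the file paths are hypothetical.

```python
def converge(desired: dict, absent: set, host: dict) -> dict:
    """Return the files on a host after one configuration-management run.

    desired: path -> content that must exist
    absent:  paths that must be removed
    host:    path -> content currently on the host
    """
    result = dict(host)
    result.update(desired)   # create or overwrite everything that is declared
    for path in absent:      # delete only what is explicitly declared absent
        result.pop(path, None)
    return result


v1 = {"/etc/app/rules.conf": "v1"}
v2 = {"/etc/app/rules.conf": "v2", "/etc/app/extra.conf": "new"}

host = converge(v2, set(), {})        # roll out v2, which creates extra.conf
reverted = converge(v1, set(), host)  # revert the declaration to v1
print(reverted)  # {'/etc/app/rules.conf': 'v1', '/etc/app/extra.conf': 'new'}
```

Reverting the declaration restores `rules.conf`, but `extra.conf` lingers until a run passes it in `absent`, which is the leftover state the comment warns about.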

                                                                                                                                                                                        • mewpmewp2

                                                                                                                                                                                          yesterday at 9:53 PM

                                                                                                                                                                                          > service upgrades might get undone, but it's possible.

                                                                                                                                                                                          But who knows what issues reverting other teams' stuff might bring?

                                                                                                                                                                                      • newsoftheday

                                                                                                                                                                                        yesterday at 5:58 PM

                                                                                                                                                                                        If companies like Cloudflare haven't figured out how to do reliable rollbacks, there seems little hope for any of us.

                                                                                                                                                                                        • gabrielhidasy

                                                                                                                                                                                          yesterday at 7:35 PM

                                                                                                                                                                                          That will depend on how you structure your deployments. At some large tech companies, thousands of little changes are made every hour, while deployments are made in n-day cycles. A cut-off point in time is chosen, and the first 'green' commit after it is picked for the current deployment. If that fails in an unexpected way you just deploy the last binary back, fix (and test) whatever broke, and either try again or abandon the release if the next cut is already close.
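A rough sketch of that release-cut flow, assuming a simplified commit record. `Commit`, `pick_release_commit`, and `deploy_or_roll_back` are hypothetical names for illustration, and "green" is modelled as a single boolean for whether CI passed.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Commit:
    sha: str
    timestamp: int  # seconds since the epoch
    green: bool     # True if CI passed for this commit


def pick_release_commit(history: Sequence[Commit], cutoff: int) -> Optional[Commit]:
    """Return the earliest green commit at or after the cut-off, if any."""
    for commit in sorted(history, key=lambda c: c.timestamp):
        if commit.timestamp >= cutoff and commit.green:
            return commit
    return None


def deploy_or_roll_back(
    candidate: str, last_good: str, healthy: Callable[[str], bool]
) -> str:
    """Return the artifact left running: the candidate if healthy, else the last good one."""
    return candidate if healthy(candidate) else last_good
```

Because the previous binary is kept as a built artifact, going back does not require reverting any of the commits merged since the cut.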

                                                                                                                                                                                          • yuliyp

                                                                                                                                                                                            yesterday at 6:05 PM

                                                                                                                                                                                            I'd presume they have the ability to deploy a previous artifact vs only tip-of-master.

                                                                                                                                                                                    • echelon

                                                                                                                                                                                      yesterday at 4:40 PM

                                                                                                                                                                                      You want to build a world where roll back is 95% the right thing to do. So that it almost always works and you don't even have to think about it.

                                                                                                                                                                                      During an incident, the incident lead should be able to say to your team's on call: "can you roll back? If so, roll back" and the oncall engineer should know if it's okay. By default it should be if you're writing code mindfully.

                                                                                                                                                                                      Certain well-understood migrations are the only cases where roll back might not be acceptable.

                                                                                                                                                                                      Always keep your services in "roll back able", "graceful fail", "fail open" state.

                                                                                                                                                                                      This requires tremendous engineering consciousness across the entire org. Every team must be a diligent custodian of this. And even then, it will sometimes break down.

                                                                                                                                                                                      Never make code changes you can't roll back from without reason and without informing the team. Service calls, data write formats, etc.

                                                                                                                                                                                      I've been in the line of billion dollar transaction value services for most of my career. And unfortunately I've been in billion dollar outages.

                                                                                                                                                                                        • drysart

                                                                                                                                                                                          yesterday at 5:41 PM

                                                                                                                                                                                          "Fail open" state would have been improper here, as the system being impacted was a security-critical system: firewall rules.

                                                                                                                                                                                          It is absolutely the wrong approach to "fail open" when you can't run security-critical operations.

                                                                                                                                                                                            • echelon

                                                                                                                                                                                              yesterday at 8:16 PM

                                                                                                                                                                                              Cloudflare is supposed to protect me from occasional ddos, not take my business offline entirely.

                                                                                                                                                                                              This can be architected in such a way that if one rules engine crashes, other systems are not impacted and other rules, cached rules, heuristics, global policies, etc. continue to function and provide shielding.

                                                                                                                                                                                              You can't ask for Cloudflare to turn on a dime and implement this in this manner. Their infra is probably very sensibly architected by great engineers. But there are always holes, especially when moving fast, migrating systems, etc. And there's probably room for more resiliency.
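One way to picture the isolation described here is to evaluate each rule inside its own error boundary. The sketch is an assumption-laden illustration, not Cloudflare's rules engine: `Rule`, `evaluate`, and `block_on_rule_error` are hypothetical. Whether a crashing rule is skipped or treated as a block is left as an explicit switch, since that is the fail-open versus fail-closed disagreement in this subthread.

```python
from typing import Callable, Iterable

# A rule returns True when the request should be blocked (assumed convention).
Rule = Callable[[dict], bool]


def evaluate(
    request: dict, rules: Iterable[Rule], block_on_rule_error: bool = False
) -> str:
    """Evaluate rules one by one so a crashing rule cannot take down the request path."""
    for rule in rules:
        try:
            if rule(request):
                return "block"
        except Exception:
            # The broken rule is contained here. Skipping it (fail open) or
            # blocking (fail closed) is a policy decision made by the caller.
            if block_on_rule_error:
                return "block"
    return "allow"
```

With the default setting, a rule that raises is skipped and the remaining rules still run; with `block_on_rule_error=True` the same failure blocks that one request instead of crashing the whole evaluation.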

                                                                                                                                                                                  • this_user

                                                                                                                                                                                    yesterday at 4:04 PM

                                                                                                                                                                                    The question is perhaps what the shape and status of their tech stack is. Obviously, they are running at massive scale, and they have grown extremely aggressively over the years. What's more, especially over the last few years, they have been adding new product after new product. How much tech debt have they accumulated with that "move fast" approach that is now starting to rear its head?

                                                                                                                                                                                      • sandeepkd

                                                                                                                                                                                        yesterday at 5:19 PM

                                                                                                                                                                                        I think this is probably a bigger root cause and is going to show up in different ways in the future. The mere act of adding new products to an existing architecture/system is bound to create knowledge silos around operations and tech debt. There is a good reason why big companies keep smart people on their payroll to just change a couple of lines after a week of debate.

                                                                                                                                                                                    • otterley

                                                                                                                                                                                      yesterday at 4:23 PM

                                                                                                                                                                                      From the post:

                                                                                                                                                                                      “We have spoken directly with hundreds of customers following that incident and shared our plans to make changes to prevent single updates from causing widespread impact like this. We believe these changes would have helped prevent the impact of today’s incident but, unfortunately, we have not finished deploying them yet.

                                                                                                                                                                                      “We know it is disappointing that this work has not been completed yet. It remains our first priority across the organization.”

                                                                                                                                                                                        • NicoJuicy

                                                                                                                                                                                          yesterday at 5:33 PM

                                                                                                                                                                                          Where I work, all teams were notified about the React CVE.

                                                                                                                                                                                          Cloudflare made it less of an expedite.

                                                                                                                                                                                          • ignoramous

                                                                                                                                                                                            yesterday at 5:29 PM

                                                                                                                                                                                            > this sounds like the sort of cowboy decision

                                                                                                                                                                                            Ouch. Harsh, given that Cloudflare is being over-honest (down to admitting it disabled the internal tool) and that the outage's impact was relatively limited (both in duration and in number of customers). It was just an unfortunate latent bug: Nov 18 was Rust's unwrap, Dec 5 was Lua's turn with its dynamic typing.

                                                                                                                                                                                            Now, the real cowboy decision I want to see is Cloudflare [0] running a company-wide Rust/Lua code-review with Codex / Claude...

                                                                                                                                                                                            cf TFA:

                                                                                                                                                                                              if rule_result.action == "execute" then
                                                                                                                                                                                                rule_result.execute.results = ruleset_results[tonumber(rule_result.execute.results_index)]
                                                                                                                                                                                              end
                                                                                                                                                                                            
                                                                                                                                                                                              This code expects that, if the ruleset has action="execute", the "rule_result.execute" object will exist ... error in the [Lua] code, which had existed undetected for many years ... prevented by languages with strong type systems. In our replacement [FL2 proxy] ... code written in Rust ... the error did not occur.
                                                                                                                                                                                            
                                                                                                                                                                                            [0] https://news.ycombinator.com/item?id=44159166
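
For anyone who wants to see the quoted failure mode in isolation, here is a minimal sketch. It is an illustrative Python translation, not Cloudflare's code: the field names (`action`, `execute`, `results_index`) come from the snippet quoted above, while the function names and sample data are hypothetical. The unguarded version assumes the `execute` object always exists; the guarded one checks first and skips the rule if it is missing.

```python
from typing import Any


def attach_results_unguarded(rule_result: dict[str, Any], ruleset_results: list[Any]) -> None:
    # Mirrors the quoted snippet: assumes "execute" exists whenever action == "execute".
    if rule_result["action"] == "execute":
        execute = rule_result.get("execute")  # None when absent, similar to Lua's nil
        # Raises TypeError if execute is None (the analogue of indexing a nil value).
        execute["results"] = ruleset_results[int(execute["results_index"])]


def attach_results_guarded(rule_result: dict[str, Any], ruleset_results: list[Any]) -> bool:
    """Return True if results were attached, False if the rule was skipped."""
    if rule_result.get("action") != "execute":
        return False
    execute = rule_result.get("execute")
    if execute is None:
        # The case the original code never expected: action is "execute", object is missing.
        return False
    # Note: Python lists are 0-based, unlike Lua tables; the index here is illustrative.
    execute["results"] = ruleset_results[int(execute["results_index"])]
    return True
```

Whether skipping the rule is the right fallback is a design decision this sketch does not settle; the point is only that the missing-object case has to be handled somewhere, by a check or by the type system.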

                                                                                                                                                                                            • deadbabe

                                                                                                                                                                                              yesterday at 3:51 PM

                                                                                                                                                                                              As usual, Cloudflare is the man in the arena.

                                                                                                                                                                                                • samrus

                                                                                                                                                                                                  yesterday at 3:59 PM

                                                                                                                                                                                                  There are other men in the arena who aren't tripping on their own feet.

                                                                                                                                                                                                    • usrnm

                                                                                                                                                                                                      yesterday at 4:05 PM

                                                                                                                                                                                                      Like who? Which large tech company doesn't have outages?

                                                                                                                                                                                                        • k8sToGo

                                                                                                                                                                                                          yesterday at 4:10 PM

                                                                                                                                                                                                          It's not about outages. It's about the why. Hardware can fail. Bugs can happen. But to continue a roll out despite warning sings and without understanding the cause and impact is on another level. Especially if it is related to the same problem as last time.

                                                                                                                                                                                                            • udev4096

                                                                                                                                                                                                              yesterday at 4:56 PM

                                                                                                                                                                                                              And yet, it's always clownflare breaking everything. Failures are inevitable, which is widely known, therefore we build resilience systems to overcome the inevitable

                                                                                                                                                                                                                • deadbabe

                                                                                                                                                                                                                  yesterday at 5:23 PM

                                                                                                                                                                                                                  It is healthy for tech companies to have outages, as they will build experience in resolving them. Success breeds complacency.

                                                                                                                                                                                                                    • wizzwizz4

                                                                                                                                                                                                                      yesterday at 8:48 PM

                                                                                                                                                                                                                      You don't need outages to build experience in resolving them, if you identify conditions that increase the risk of outages. Airlines can develop a lot of experience resolving issues that would lead to plane crashes, without actually crashing any planes.

                                                                                                                                                                                                          • nish__

                                                                                                                                                                                                            yesterday at 4:58 PM

                                                                                                                                                                                                            Google does pretty good.

                                                                                                                                                                                                              • hansonkd

                                                                                                                                                                                                                yesterday at 6:59 PM

                                                                                                                                                                                                                Google Docs was down for almost the whole day just a couple of weeks ago.

                                                                                                                                                                                                            • k__

                                                                                                                                                                                                              yesterday at 4:11 PM

                                                                                                                                                                                                              "tripping on their own feet" == "not rolling back"

                                                                                                                                                                                                  • rvz

                                                                                                                                                                                                    yesterday at 4:14 PM

                                                                                                                                                                                                    > Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.

                                                                                                                                                                                                    Also, there seems to have been insufficient testing before deployment, with very junior-level mistakes.

                                                                                                                                                                                                    > As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following Lua exception:

                                                                                                                                                                                                    Where was the testing for this one? If ANY exception happened during the rules checking, the deployment should fail and roll back. Instead, they didn't assess that as a likely risk and pressed on with the deployment "fix".

                                                                                                                                                                                                    I guess those at Cloudflare are not learning anything from the previous disaster.

                                                                                                                                                                                                    • nine_k

                                                                                                                                                                                                      yesterday at 4:06 PM

                                                                                                                                                                                                      > more to the story

                                                                                                                                                                                                      From a more tinfoil-wearing angle, it may not even be a regular deployment, given the idea of Cloudflare being "the largest MitM attack in history". ("Maybe not even by Cloudflare but by NSA", would say some conspiracy theorists, which is, of course, completely bonkers: NSA is supposed to employ engineers who never let such blunders blow their cover.)

                                                                                                                                                                                                      • NoSalt

                                                                                                                                                                                                        yesterday at 4:34 PM

                                                                                                                                                                                                        Ooh ... I want to be on a cowboy decision making team!!!

                                                                                                                                                                                                    • liampulles

                                                                                                                                                                                                      yesterday at 4:35 PM

                                                                                                                                                                                                      The lesson presented by the last few big outages is that entropy is, in fact, inescapable. The comprehensibility of a system cannot keep up with its growing and aging complexity forever. The rate of unknown unknowns will increase.

                                                                                                                                                                                                      The good news is that a more decentralized internet with human brain scoped components is better for innovation, progress, and freedom anyway.

                                                                                                                                                                                                        • agentifysh

                                                                                                                                                                                                          yesterday at 7:14 PM

                                                                                                                                                                                                          Yet my dedicated server has been up since 2015 with zero downtime.

                                                                                                                                                                                                          I don't think this is an entropy issue; it's human error bubbling up, and Cloudflare charges a premium for it.

                                                                                                                                                                                                          My faith in Cloudflare is shook for sure: two major outages weeks apart, and this won't be the last.

                                                                                                                                                                                                            • ectospheno

                                                                                                                                                                                                              yesterday at 9:30 PM

                                                                                                                                                                                                              Which 2015 kernel are you running?

                                                                                                                                                                                                              • PKop

                                                                                                                                                                                                                yesterday at 9:25 PM

                                                                                                                                                                                                                Why is the stability of your dedicated server a counterpoint that cloud behemoths can't keep up with their increasing entropy? Seems more like a supporting argument of OP at best, a non sequitur at worst.

                                                                                                                                                                                                                • samdoesnothing

                                                                                                                                                                                                                  yesterday at 8:11 PM

                                                                                                                                                                                                                  With all due respect, your dedicated server is not quite as complex as Cloudflare...

                                                                                                                                                                                                                    • venturecruelty

                                                                                                                                                                                                                      yesterday at 9:15 PM

                                                                                                                                                                                                                      Eppur si muove. A random server serving things is exactly what the internet was supposed to be: a decentralized network of nodes.

                                                                                                                                                                                                              • hnthrowaway0328

                                                                                                                                                                                                                yesterday at 5:16 PM

                                                                                                                                                                                                                I'm not sure how decentralization helps though. People in a bazzar are going to care even less about sharing shadow knowledge. Linux IMO succeeds not because of the bazaar but because of Linus.

                                                                                                                                                                                                                  • venturecruelty

                                                                                                                                                                                                                    yesterday at 9:15 PM

                                                                                                                                                                                                                    Decentralization is resilience; that's why the internet even works at all. That was the entire point of it, in fact.

                                                                                                                                                                                                                    • marcosdumay

                                                                                                                                                                                                                      yesterday at 7:00 PM

                                                                                                                                                                                                                      You don't keep a bazaar running with shadow knowledge. Either the important things are published or it doesn't run.

                                                                                                                                                                                                                      • liampulles

                                                                                                                                                                                                                        yesterday at 6:44 PM

                                                                                                                                                                                                                        What is the shadow knowledge in this case?

                                                                                                                                                                                                                • paradite

                                                                                                                                                                                                                  yesterday at 3:56 PM

                                                                                                                                                                                                                  The deployment pattern from Cloudflare looks insane to me.

                                                                                                                                                                                                                  I've worked at one of the top fintech firms. Whenever we did a config change or deployment, we were supposed to have a rollback plan ready and monitor key dashboards for 15-30 minutes.

                                                                                                                                                                                                                  The dashboards had to be prepared beforehand, covering the systems and key business metrics that the deployment would affect, and reviewed by teammates.

                                                                                                                                                                                                                  I never saw downtime longer than 1 minute while I was there, because you get a spike on the dashboard immediately when something goes wrong.

                                                                                                                                                                                                                  For the entire system to be down for 10+ minutes due to a bad config change or deployment is just beyond me.
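
To make the pattern concrete, here is a rough sketch of a staged rollout with automatic rollback. It is a simplified illustration under stated assumptions, not how my former employer or Cloudflare actually deploys: `staged_rollout`, the stage names, and the 1% error threshold are all hypothetical, and a real system would watch the dashboards over a monitoring window rather than sample the error rate once.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[str],
    apply_change: Callable[[str], object],
    rollback: Callable[[str], object],
    error_rate: Callable[[str], float],
    max_error_rate: float = 0.01,  # hypothetical threshold
) -> bool:
    """Apply a change one stage at a time; undo every stage if one looks unhealthy."""
    applied: list[str] = []
    for stage in stages:
        apply_change(stage)
        applied.append(stage)
        # Stand-in for watching the dashboards before moving on to the next stage.
        if error_rate(stage) > max_error_rate:
            # When in doubt, roll back: newest stage first, including the failing one.
            for done in reversed(applied):
                rollback(done)
            return False
    return True
```

Called with something like `staged_rollout(["canary", "region-a", "global"], deploy, undeploy, measure)`, where the three callables are whatever your tooling provides, the change never reaches "global" if "region-a" shows a spike.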

                                                                                                                                                                                                                    • vlovich123

                                                                                                                                                                                                                      yesterday at 5:50 PM

                                                                                                                                                                                                                      That is also true at Cloudflare for what it’s worth. However, the company is so big that there’s so many different products all shipping at the same time it can be hard to correlate it to your release, especially since there’s a 5 min lag (if I recall correctly) in the monitoring dashboards to get all the telemetry from thousands of servers worldwide.

                                                                                                                                                                                                                      Comparing the difficulty of running the world’s internet traffic with hundreds of customer products with your fintech experience is like saying “I can lift 10 pounds. I don’t know why these guys are struggling to lift 500 pounds”.

                                                                                                                                                                                                                        • autoexec

                                                                                                                                                                                                                          yesterday at 6:52 PM

                                                                                                                                                                                                                          > However, the company is so big that there’s so many different products all shipping at the same time it can be hard to correlate it to your release

                                                                                                                                                                                                                          This kind of thing would be more understandable for a company without hundreds of billions of dollars, and for one that hasn't centralized so much of the internet. If a company has grown too large and complex to be well managed and effective, and it's starting to look like a liability for large numbers of people, there are obvious solutions for that.

                                                                                                                                                                                                                            • evanelias

                                                                                                                                                                                                                              yesterday at 8:04 PM

                                                                                                                                                                                                                              What "hundreds of billions of dollars"? Cloudflare's annual revenue is around $2 billion, and they are not yet profitable.

                                                                                                                                                                                                                                • froober

                                                                                                                                                                                                                                  yesterday at 11:15 PM

                                                                                                                                                                                                                                  Given how well-established cloudflare is, I would've figured they'd be profitable by now. That raises the question: why does so much of the web rely on a company which does not have the means to sustain itself?

                                                                                                                                                                                                                                    • bdangubic

                                                                                                                                                                                                                                      yesterday at 11:17 PM

                                                                                                                                                                                                                                      given how much of the population relies on Uber for their transportation… ;)

                                                                                                                                                                                                                                  • autoexec

                                                                                                                                                                                                                                    yesterday at 10:09 PM

                                                                                                                                                                                                                                    That was admittedly hyperbole, but since we're talking about a company with assets and revenue in the billions I'm not sure it matters. The fact remains that a lack of money/resources is not their problem.

                                                                                                                                                                                                                                      • evanelias

                                                                                                                                                                                                                                        yesterday at 10:32 PM

                                                                                                                                                                                                                                        They don't have unlimited resources. They have ~5000 employees. That's not small but it's not huge either. For sake of comparison, Google hit that headcount level literally 20 years ago.

                                                                                                                                                                                                                                          • autoexec

                                                                                                                                                                                                                                            yesterday at 11:14 PM

                                                                                                                                                                                                                                            They have enough money to buy anything they need. The CEO alone has billions. He could pay for as many employees as he wants out of his own pocket and not notice. In fact he's good at buying people, even senators.

                                                                                                                                                                                                                                              • evanelias

                                                                                                                                                                                                                                                yesterday at 11:58 PM

                                                                                                                                                                                                                                                That doesn't make sense. It would be like saying Twitter, SpaceX, and Tesla all should be incapable of engineering mistakes because their owner is rich. The world doesn't work that way.

                                                                                                                                                                                                                                • pulkitsh1234

                                                                                                                                                                                                                                  yesterday at 7:10 PM

                                                                                                                                                                                                                                  Genuinely curious, how to actually implement detection systems for a large scale global infra which that works with < 1 minute SLO ? Given cost is no constraint.

                                                                                                                                                                                                                                    • autoexec

                                                                                                                                                                                                                                      yesterday at 7:34 PM

                                                                                                                                                                                                                                      Right now I'd say maybe don't push changes to your entire global infra all at once, and certainly not without testing your change first to make sure it doesn't break anything. But it's really not about a specific failure/fix as much as it is about a single company getting too big to do the job well, or just plain doing more than it should in the first place.

                                                                                                                                                                                                                                      Honestly we shouldn't have created a system where any single company's failure is able to impact such a huge percentage of the network. The internet was designed for resilience and we abandoned that ideal to put our trust in a single company that maybe isn't up for the job. Maybe no one company ever could do it well enough, but I suspect that no single company should carry that responsibility in the first place.

                                                                                                                                                                                                                                        • mewpmewp2

                                                                                                                                                                                                                                          yesterday at 10:50 PM

                                                                                                                                                                                                                                          But then would a customer have to use 10 different vendors to get the same things that Cloudflare currently provides? E.g. protection against various threats online?

                                                                                                                                                                                                                                  • vlovich123

                                                                                                                                                                                                                                    yesterday at 6:58 PM

                                                                                                                                                                                                                                    Can you name a major cloud provider that doesn’t have major outages?

                                                                                                                                                                                                                                    If this were purely a money problem it would have been solved ages ago. It’s a difficult problem to solve. Also, they’re the youngest of the major cloud providers and have a fraction of the resources that Google, Amazon, and Microsoft have.

                                                                                                                                                                                                                                      • autoexec

                                                                                                                                                                                                                                        yesterday at 7:08 PM

                                                                                                                                                                                                                                        > Can you name a major cloud provider that doesn’t have major outages?

                                                                                                                                                                                                                                        The fact that no major cloud provider is actually good is not an argument that Cloudflare isn't bad, or even that they couldn't/shouldn't do better than they are. They have fewer resources than Google or Microsoft but they're also in a unique position that makes us differently vulnerable when they fuck up. It's not all their fault, since it was a mistake to centralize the internet to the extent that we have in the first place, but now that they are responsible for so much they have to expect that people will be upset when they fail.

                                                                                                                                                                                                                                • theplatman

                                                                                                                                                                                                                                  yesterday at 9:29 PM

                                                                                                                                                                                                                                  With all due respect, engineers in finance can’t allow for outages like this because then you are losing massive amounts of money and potentially going out of business.

                                                                                                                                                                                                                              • dehrmann

                                                                                                                                                                                                                                yesterday at 5:46 PM

                                                                                                                                                                                                                                Cloudflare is orders of magnitude larger than any fintech. Rollouts likely take much longer, and having a human monitoring a dashboard doesn't scale.

                                                                                                                                                                                                                                  • notepad0x90

                                                                                                                                                                                                                                    yesterday at 6:08 PM

                                                                                                                                                                                                                                    That means they engineered their systems incorrectly then? Precisely because they are much bigger, they should be more resilient. You know who's bigger than Cloudflare? tier-1 ISPs, if they had an outage the whole internet would know about it, and they do have outages except they don't cascade into a global mess like this.

                                                                                                                                                                                                                                    Just speculating based on my experience: It's more likely than not that they likely refused to invest in fail-safe architectures for cost reasons. Control-plane and data-plane should be separate, a react patch shouldn't affect traffic forwarding.

                                                                                                                                                                                                                                    Forget manual rollbacks, there should be automated reversion to a known working state.
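
As an illustration of the "automated reversion to a known working state" idea above, here is a minimal Python sketch of a staged rollout that reverts by itself. Everything in it is hypothetical: the `Fleet` class, the stage names, the `error_rate` callback and the 1% threshold are assumptions made for the example, not a description of how Cloudflare or any other provider deploys configuration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Fleet:
    """Hypothetical fleet: maps each rollout stage to the config version it serves."""
    stages: List[str]
    active: Dict[str, str] = field(default_factory=dict)


def staged_rollout(
    fleet: Fleet,
    new_version: str,
    known_good: str,
    error_rate: Callable[[str, str], float],
    max_error_rate: float = 0.01,  # assumed threshold, tune per service
) -> bool:
    """Apply new_version one stage at a time.

    After each stage, error_rate(stage, version) is consulted. If it exceeds
    max_error_rate, every stage touched so far is put back on known_good and
    the rollout stops. Returns True if all stages were updated, False if the
    change was reverted.
    """
    touched: List[str] = []
    for stage in fleet.stages:
        fleet.active[stage] = new_version
        touched.append(stage)
        if error_rate(stage, new_version) > max_error_rate:
            # Automatic reversion: no human in the loop.
            for reverted in touched:
                fleet.active[reverted] = known_good
            return False
    return True


if __name__ == "__main__":
    fleet = Fleet(stages=["canary", "region", "global"])
    for name in fleet.stages:
        fleet.active[name] = "v1"

    # Simulated health signal: the new config breaks the "region" stage.
    def fake_error_rate(stage: str, version: str) -> float:
        return 0.5 if stage == "region" and version == "v2" else 0.0

    completed = staged_rollout(fleet, "v2", "v1", fake_error_rate)
    print(completed, fleet.active)
```

In this sketch the "global" stage never receives the bad version, because the failure is caught at "region" and everything already touched goes back to `v1`. A real system would also need a trustworthy health signal and a wait period per stage, both of which are left out here.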

                                                                                                                                                                                                                                      • vlovich123

                                                                                                                                                                                                                                        yesterday at 7:04 PM

                                                                                                                                                                                                                                        > Control-plane and data-plane should be separate

                                                                                                                                                                                                                                        They are separate.

                                                                                                                                                                                                                                        > a react patch shouldn't affect traffic forwarding.

                                                                                                                                                                                                                                        If you can’t even bother to read the blog post maybe you shouldn’t be so confident in your own analysis of what should and shouldn’t have happened?

                                                                                                                                                                                                                                        This was a configuration change to change the buffered size of a body from 256kb to 1mib.

                                                                                                                                                                                                                                        The ability to be so wrong in so few words with such confidence is impressive but you may want to take more of a curiosity first approach rather than reaction first.

                                                                                                                                                                                                                                          • notepad0x90

                                                                                                                                                                                                                                            yesterday at 7:58 PM

                                                                                                                                                                                                                                            You really should take some of your pill.

                                                                                                                                                                                                                                            > Instead, it was triggered by changes being made to our body parsing logic while attempting to detect and mitigate an industry-wide vulnerability disclosed this week in React Server Components.

                                                                                                                                                                                                                                            > Unfortunately, in our FL1 version of our proxy, under certain circumstances, the second change of turning off our WAF rule testing tool caused an error state that resulted in 500 HTTP error codes to be served from our network.

                                                                                                                                                                                                                                            The body parsing logic is in react or nextjs, that's my takeaway, is it that incorrect? and the WAF rule testing tool (control plane) was interdependent with the WAF's body parsing logic, is that also incorrect?

                                                                                                                                                                                                                                            > This was a configuration change to change the buffered size of a body from 256kb to 1mib.

                                                                                                                                                                                                                                            Yes, and if it were resilient, the body parsing would be done on a discrete forwarding plane. Any config changes should be auto-tested for forwarding failures by the separate control plane and auto-reverted when there are errors. If the WAF rule testing tool was part of that test, then it being down shouldn't have affected the data plane, because it would be a separate system.

                                                                                                                                                                                                                                            Data/control plane separation means the runtimes of the two, and any dependencies they have, are separate. It isn't cheap to do this right; that's why I speculated (I made clear I was speculating) that it was because they wanted to save costs.

                                                                                                                                                                                                                                            > The ability to be so wrong in so few words with such confidence is impressive but you may want to take more of a curiosity first approach rather than reaction first.

                                                                                                                                                                                                                                            Please tone down the rage a bit and leave room for some discussion. You should take your own pill and be curious about what I meant instead of taking a rage-first approach.

                                                                                                                                                                                                                                              • mewpmewp2

                                                                                                                                                                                                                                                yesterday at 10:18 PM

                                                                                                                                                                                                                                                To be clear:

                                                                                                                                                                                                                                                1. There is an active vulnerability, unrelated to Cloudflare, where React/Next.js can be abused via a malicious payload. The payload could be up to 1MB.

                                                                                                                                                                                                                                                2. Cloudflare had a buffer size that wasn't large enough to prevent that payload from being sent to Cloudflare's customers.

                                                                                                                                                                                                                                                3. To protect their customers, Cloudflare wanted to increase the buffer size to 1MB.

                                                                                                                                                                                                                                                4. The internal testing tool wasn't able to handle the change to 1MB and started failing.

                                                                                                                                                                                                                                                5. They wanted to stop the internal testing tool from failing, but doing so required disabling a ruleset that an existing system depended on (due to a long-standing bug). This caused the wider incident.

                                                                                                                                                                                                                                                It does seem like a mess, in the sense that in order to stop an internal testing tool from failing they had to endanger things globally in production, yes. It looks like a legacy, tech-debt mess.

                                                                                                                                                                                                                                                It seems like bad decisions made in the past, though.

                                                                                                                                                                                                                                                • jadamson

                                                                                                                                                                                                                                                  yesterday at 9:44 PM

                                                                                                                                                                                                                                                  > The body parsing logic is in react or nextjs, that's my takeaway, is it that incorrect?

                                                                                                                                                                                                                                                  The exploit they were trying to protect against is in React services run by their customers.

                                                                                                                                                                                                                                      • cowsandmilk

                                                                                                                                                                                                                                        yesterday at 6:57 PM

                                                                                                                                                                                                                                        > Rollouts likely take much longer

                                                                                                                                                                                                                                        Cloudflare’s own post says the configuration change that resulted in the outage rolled out in seconds.

                                                                                                                                                                                                                                    • markus_zhang

                                                                                                                                                                                                                                      yesterday at 4:37 PM

                                                                                                                                                                                                                                      My guess is that CF has so many external customers that they need to move fast and try not to break things. My hunch is that their culture always favors moving fast. As long as they are not breaking too many things, customers won't leave them.

                                                                                                                                                                                                                                        • paradite

                                                                                                                                                                                                                                          yesterday at 4:39 PM

                                                                                                                                                                                                                                          There is nothing wrong with moving fast and deploying fast.

                                                                                                                                                                                                                                          I'm more talking about how slow it was to detect the issue caused by the config change, and perform the rollback of the config change. It took 20 minutes.

                                                                                                                                                                                                                                          • linhns

                                                                                                                                                                                                                                            yesterday at 6:50 PM

                                                                                                                                                                                                                                            I think everyone favors moving fast. We humans want to see results of our action early.

                                                                                                                                                                                                                                        • nova22033

                                                                                                                                                                                                                                          yesterday at 6:40 PM

                                                                                                                                                                                                                                          Speaking of fintech

                                                                                                                                                                                                                                          https://www.henricodolfing.ch/case-study-4-the-440-million-s...

                                                                                                                                                                                                                                          • theideaofcoffee

                                                                                                                                                                                                                                            yesterday at 4:23 PM

                                                                                                                                                                                                                                            Same, my time at an F100 ecommerce retailer showed me the same thing. Every change control board justification needed an explicit back-out/restoration plan with exact steps to be taken, what was being monitored to confirm the plan was being held to, contacts for the prominent groups anticipated to be affected, and emergency numbers/rooms for quick conferences if something did in fact happen.

                                                                                                                                                                                                                                            The process was pretty tight, almost no revenue-affecting outages from what I can remember because it was such a collaborative effort (even though the board presentation seemed a bit spiky and confrontational at the time, everyone was working together).

                                                                                                                                                                                                                                              • prdonahue

                                                                                                                                                                                                                                                yesterday at 4:35 PM

                                                                                                                                                                                                                                                And you moved at a glacial pace compared to Cloudflare. There are tradeoffs.

                                                                                                                                                                                                                                                  • theideaofcoffee

                                                                                                                                                                                                                                                    yesterday at 5:40 PM

                                                                                                                                                                                                                                                    Yes, of course, I want the organization that inserted itself into handling 20% of the world's internet traffic to move fast and break things. Like breaking the internet on a bi-weekly basis. Yep, great tradeoff there.

                                                                                                                                                                                                                                                    Give me a break.

                                                                                                                                                                                                                                                      • jimmydorry

                                                                                                                                                                                                                                                        yesterday at 6:29 PM

                                                                                                                                                                                                                                                        While you're taking your break, exploits gain traction in the wild and one of the value propositions for using a service provider like CloudFlare is catching and mitigating theses exploits as fast as possible. From the OP, this outage was in relation to handling a nasty RCE.

                                                                                                                                                                                                                                                        • wvenable

                                                                                                                                                                                                                                                          yesterday at 5:45 PM

                                                                                                                                                                                                                                                          But if your job is mitigate attacks/issues then things can very broken while you're being slow to mitigate it.

                                                                                                                                                                                                                                                          • JeremyNT

                                                                                                                                                                                                                                                            yesterday at 6:46 PM

                                                                                                                                                                                                                                                            Lest we forget, they initially rose to prominence by being cheaper than the existing solutions, not better, and I suppose this is a tradeoff a lot of their customers are willing to make.

                                                                                                                                                                                                                                                    • lljk_kennedy

                                                                                                                                                                                                                                                      yesterday at 6:07 PM

                                                                                                                                                                                                                                                      This sounds just as bad as yolo-merges, just on the other end of the spectrum.

                                                                                                                                                                                                                                                • rachr

                                                                                                                                                                                                                                                  yesterday at 4:24 PM

                                                                                                                                                                                                                                                  Time for Cloudflare to start using the BOFH excuse generator. https://bofh.d00t.org/

                                                                                                                                                                                                                                                  • seanparsons

                                                                                                                                                                                                                                                    yesterday at 10:27 PM

                                                                                                                                                                                                                                                    "This type of code error is prevented by languages with strong type systems. In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur." It's starting to sound like a broken record at this point, languages are still seen as equal and as a result, interchangeable.

                                                                                                                                                                                                                                                    • ferat

                                                                                                                                                                                                                                                      yesterday at 8:02 PM

                                                                                                                                                                                                                                                      Today, after the Cloudflare outage, I noticed that almost all upload routes for my applications were being blocked.

                                                                                                                                                                                                                                                      After some investigation, I realized that none of these routes passed through Cloudflare OWASP. The reported anomalies total 50, exceeding the pre-configured maximum of 40 (Medium).

                                                                                                                                                                                                                                                      Despite being simple image or video uploads, the WAF is generating anomalies that make no sense, such as the following:

                                                                                                                                                                                                                                                      Cloudflare OWASP Core Ruleset Score (+5)

                                                                                                                                                                                                                                                      933100: PHP Injection Attack: PHP Open Tag Found

                                                                                                                                                                                                                                                      Cloudflare OWASP Core Ruleset Score (+5)

                                                                                                                                                                                                                                                      933180: PHP Injection Attack: Variable Function Call Found

                                                                                                                                                                                                                                                      For now, I’ve had to raise the OWASP Anomaly Score Threshold to 60 and enable the JS Challenge, but I believe something is wrong with the WAF after today’s outage.

                                                                                                                                                                                                                                                      This issue is still unresolved as of this moment.
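For context, the numbers above fit the usual anomaly-scoring model: each matched rule adds points, and the request is blocked once the total reaches the configured threshold. Here is a minimal Python sketch of that idea, assuming a plain sum and a `>=` comparison. `RuleMatch`, `anomaly_score`, `is_blocked`, and the eight padding matches are hypothetical and are not Cloudflare's API; only the two rule IDs, the +5 scores, and the 40/50/60 figures come from the comment.

```python
from typing import List, NamedTuple


class RuleMatch(NamedTuple):
    rule_id: str
    description: str
    score: int


def anomaly_score(matches: List[RuleMatch]) -> int:
    # Assumption: the total is a plain sum of per-rule scores.
    return sum(m.score for m in matches)


def is_blocked(matches: List[RuleMatch], threshold: int) -> bool:
    # Assumption: a request is blocked once the total reaches the threshold.
    return anomaly_score(matches) >= threshold


# The two matches quoted in the comment.
matches = [
    RuleMatch("933100", "PHP Injection Attack: PHP Open Tag Found", 5),
    RuleMatch("933180", "PHP Injection Attack: Variable Function Call Found", 5),
]
# Hypothetical padding: eight more +5 matches to reach the reported total of 50.
matches += [RuleMatch(f"hypothetical-{i}", "placeholder match", 5) for i in range(8)]

total = anomaly_score(matches)
blocked_at_40 = is_blocked(matches, 40)  # the "Medium" threshold from the comment
blocked_at_60 = is_blocked(matches, 60)  # the raised threshold from the comment
```

Under these assumptions, raising the threshold to 60 lets the same upload through without changing which rules fire, which is why it works as a stopgap but does not explain the spurious matches.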

                                                                                                                                                                                                                                                      • jakub_g

                                                                                                                                                                                                                                                        yesterday at 5:00 PM

                                                                                                                                                                                                                                                        The interesting part:

                                                                                                                                                                                                                                                        After rolling out a bad ruleset update, they tried a killswitch (rolled out immediately to 100%) which was a code path never executed before:

                                                                                                                                                                                                                                                        > However, we have never before applied a killswitch to a rule with an action of “execute”. When the killswitch was applied, the code correctly skipped the evaluation of the execute action, and didn’t evaluate the sub-ruleset pointed to by it. However, an error was then encountered while processing the overall results of evaluating the ruleset

                                                                                                                                                                                                                                                        > a straightforward error in the code, which had existed undetected for many years
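The failure mode quoted above can be sketched in a few lines. This is a hypothetical Python analogue, not Cloudflare's code: `RuleResult`, `ExecuteInfo`, `collect_buggy`, and `collect_guarded` are invented names, and the sketch assumes a killswitched rule keeps its "execute" action but carries no execute data because evaluation was skipped.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExecuteInfo:
    results_index: int


@dataclass
class RuleResult:
    action: str
    # Assumption: None when a killswitch skipped evaluation of the rule.
    execute: Optional[ExecuteInfo] = None


def collect_buggy(rule_result: RuleResult, ruleset_results: List[str]) -> Optional[str]:
    # Implicitly expects every "execute" rule to carry an execute object.
    if rule_result.action == "execute":
        return ruleset_results[rule_result.execute.results_index]
    return None


def collect_guarded(rule_result: RuleResult, ruleset_results: List[str]) -> Optional[str]:
    # Handles the skipped case explicitly instead of assuming it cannot happen.
    if rule_result.action == "execute" and rule_result.execute is not None:
        return ruleset_results[rule_result.execute.results_index]
    return None


ruleset_results = ["sub-ruleset result"]
normal = RuleResult("execute", ExecuteInfo(0))
killswitched = RuleResult("execute")  # evaluation skipped, so no execute info

try:
    collect_buggy(killswitched, ruleset_results)
    buggy_raised = False
except AttributeError:
    buggy_raised = True
```

The path only fails for the combination "killswitch plus execute action", so a test suite that never exercises that pair would not catch it. A static type checker would flag the unguarded access to an `Optional` field, which is roughly the point the post-mortem makes about type systems.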

                                                                                                                                                                                                                                                          • 8cvor6j844qw_d6

                                                                                                                                                                                                                                                            yesterday at 5:03 PM

                                                                                                                                                                                                                                                            > have never before applied a killswitch to a rule with an action of “execute”

                                                                                                                                                                                                                                                            One might think a company on the scale of Cloudflare would have a suite of comprehensive tests to cover various scenarios.

                                                                                                                                                                                                                                                                • hnthrowaway0328

                                                                                                                                                                                                                                                                  yesterday at 5:12 PM

                                                                                                                                                                                                                                                                  I kinda think most companies out there are like that. Moving fast is the motto I heard the most.

                                                                                                                                                                                                                                                                  They are probably OK with occasional breaks as long as customers don't mind.

                                                                                                                                                                                                                                                          • egorfine

                                                                                                                                                                                                                                                            yesterday at 4:36 PM

                                                                                                                                                                                                                                                            > provides customers with protection against malicious payloads, allowing them to be detected and blocked. To do this, Cloudflare’s proxy buffers HTTP request body content in memory for analysis.

                                                                                                                                                                                                                                                            I have mixed feelings about this.

                                                                                                                                                                                                                                                            On the one hand, I absolutely don't want a CDN to look inside my payloads and decide what's good for me or not. Today it's protection, tomorrow it's censorship.

                                                                                                                                                                                                                                                            At the same time, this is exactly what Cloudflare is good for: protecting sites from malicious requests.

                                                                                                                                                                                                                                                              • udev4096

                                                                                                                                                                                                                                                                yesterday at 4:59 PM

                                                                                                                                                                                                                                                                We need a decentralized ddos mitigation network based on incentives. Donate X amount of bandwidth, get Y amount of protection from other peers. Yes, we gotta do TLS inspection on every end for effective L7 mitigation but at least filtering can be done without decrypting any packets

                                                                                                                                                                                                                                                                  • mewpmewp2

                                                                                                                                                                                                                                                                    yesterday at 10:55 PM

                                                                                                                                                                                                                                                                    How would that work for latency / reliability of the requests?

                                                                                                                                                                                                                                                            • gkoz

                                                                                                                                                                                                                                                              yesterday at 3:54 PM

                                                                                                                                                                                                                                                              I sometimes feel we'd be better off without all the paternalistic kitchensink features. The solid, properly engineered features used intentionally aren't causing these outages.

                                                                                                                                                                                                                                                                • ilkkao

                                                                                                                                                                                                                                                                  yesterday at 4:03 PM

                                                                                                                                                                                                                                                                  Agreed, I don't really like Cloudflare trying to magically fix every web exploit there is in frameworks my site has never used.

                                                                                                                                                                                                                                                                    • nish__

                                                                                                                                                                                                                                                                      yesterday at 5:05 PM

                                                                                                                                                                                                                                                                      Honestly. This feels outside of their domain.

                                                                                                                                                                                                                                                                  • venturecruelty

                                                                                                                                                                                                                                                                    yesterday at 9:17 PM

                                                                                                                                                                                                                                                                    The good news is that you can have that right now. Just don't use Cloudflare.

                                                                                                                                                                                                                                                                • 8cvor6j844qw_d6

                                                                                                                                                                                                                                                                  yesterday at 4:47 PM

                                                                                                                                                                                                                                                                  Is there some underlying factors that resulted in the recent outages (e.g., new processes, layoffs, etc.) or just a series of pure coincidences?

                                                                                                                                                                                                                                                                    • Elucalidavah

                                                                                                                                                                                                                                                                      yesterday at 5:02 PM

                                                                                                                                                                                                                                                                      Sounds like their "FL1 -> FL2" transition is involved in both.

                                                                                                                                                                                                                                                                        • Someone1234

                                                                                                                                                                                                                                                                          yesterday at 5:23 PM

                                                                                                                                                                                                                                                                          It was involved in the previous one, but not in this latest one. All FL2 did was prevent the outage being even wider spread than it was. None of this had anything to do with migration.

                                                                                                                                                                                                                                                                            • tetha

                                                                                                                                                                                                                                                                              yesterday at 5:47 PM

                                                                                                                                                                                                                                                                              If FL2 didn't have the outage, and FL1 did, the pace of the migration did have an impact.

                                                                                                                                                                                                                                                                              Though this is showing the problem with these things: Migrating faster could have reduced the impact of this outage, while increasing the impact of the last outage. Migrating slower could have reduced the impact of the last outage, while increasing the impact of this outage.

                                                                                                                                                                                                                                                                              This is a hard problem: How fast do you rip old working infrastructure out and risk finding new problems in the new stack, yet, how long do you tolerate shortcomings of the old stack that caused you to build the new stack?

                                                                                                                                                                                                                                                                      • venturecruelty

                                                                                                                                                                                                                                                                        yesterday at 9:16 PM

                                                                                                                                                                                                                                                                        I'm sure everything slowly falling apart all at the same time is due to some strange coincidence, and not the regular and steady firing of thousands of people.

                                                                                                                                                                                                                                                                      • xnorswap

                                                                                                                                                                                                                                                                        yesterday at 3:47 PM

                                                                                                                                                                                                                                                                        My understanding, paraphrased: "In order to gradually roll out one change, we had to globally push a different configuration change, which broke everything at once".

                                                                                                                                                                                                                                                                        But a more important takeaway:

                                                                                                                                                                                                                                                                        > This type of code error is prevented by languages with strong type systems

                                                                                                                                                                                                                                                                          • jsnell

                                                                                                                                                                                                                                                                            yesterday at 3:53 PM

                                                                                                                                                                                                                                                                            That's a bizarre takeaway for them to suggest, when they had exactly the same kind of bug with Rust like three weeks ago. (In both cases they had code implicitly expecting results to be available. When the results weren't available, they terminated processing of the request with an exception-like mechanism. And then they had the upstream services fail closed, despite the failing requests being to optional sidecars rather than on the critical query path.)
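
To make the "optional sidecar, fail closed" point concrete, here is a minimal sketch of the alternative: treating a missing or failed result as an explicit case instead of letting it abort the request. The names `handle_request` and `score_request` are hypothetical and are not taken from Cloudflare's code. Whether failing open is acceptable for a security component is a policy decision this sketch does not settle.

```python
from typing import Callable, Optional


def handle_request(
    request: dict,
    score_request: Callable[[dict], Optional[dict]],
) -> dict:
    """Serve a request while treating the scoring sidecar as optional.

    `score_request` is a hypothetical stand-in for a call to an optional
    sidecar; it may raise, or return None when no result is available.
    """
    try:
        result = score_request(request)
    except Exception:
        # Fail open: the sidecar is assumed not to be on the critical path,
        # so its failure should not turn into an error for the client.
        result = None

    if result is None:
        # Handle "no result" explicitly instead of indexing into it.
        return {"status": 200, "scored": False}
    if result.get("action") == "block":
        return {"status": 403, "scored": True}
    return {"status": 200, "scored": True}
```

The point of the sketch is only that the "result not available" path is written down and testable, whichever policy it ends up implementing.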

                                                                                                                                                                                                                                                                              • littlestymaar

                                                                                                                                                                                                                                                                                yesterday at 4:04 PM

                                                                                                                                                                                                                                                                                In fairness, the previous bug (with the Rust unwrap) should never have happened: someone explicitly called the panicking function, the review didn't catch it and the CI didn't catch it.

                                                                                                                                                                                                                                                                                It required a significant organizational failure to happen. These happen but they ought to be rarer than your average bug (unless your organization is fundamentally malfunctioning, that is)

                                                                                                                                                                                                                                                                                  • greatgib

                                                                                                                                                                                                                                                                                    yesterday at 4:13 PM

                                                                                                                                                                                                                                                                                    The issue would also not have happened, if someone did the right code, tests, and the review or CI caught it...

                                                                                                                                                                                                                                                                                      • marcosdumay

                                                                                                                                                                                                                                                                                        yesterday at 7:05 PM

                                                                                                                                                                                                                                                                                        It's different to expect somebody to write the correct program every time than to expect somebody not to call the "break_my_system" procedure that was warnings all over it telling people it's there for quick learning-to-use examples or other things you'll never run.

                                                                                                                                                                                                                                                                                • Hamuko

                                                                                                                                                                                                                                                                                  yesterday at 6:47 PM

                                                                                                                                                                                                                                                                                  Yeah, my first thought was that had they used Rust, maybe we would've seen them point out a rule_result.unwrap() as the issue.

                                                                                                                                                                                                                                                                                  • pdimitar

                                                                                                                                                                                                                                                                                    yesterday at 5:42 PM

                                                                                                                                                                                                                                                                                    To be precise, the previous problem with Rust was because somebody copped out and used a temporary escape hatch function that absolutely has no place in production code.

                                                                                                                                                                                                                                                                                    It was mostly an amateur mistake. Not Rust's fault. Rust could never gain adoption if it didn't have a few escape hatches.

                                                                                                                                                                                                                                                                                    "Damned if they do, damned if they don't" kind of situation.

                                                                                                                                                                                                                                                                                    There are even lints for the usage of the `unwrap` and `expect` functions.

                                                                                                                                                                                                                                                                                    As the other sibling comment points out, the previous Cloudflare problem was an acute and extensive organizational failure.

                                                                                                                                                                                                                                                                                      • zozbot234

                                                                                                                                                                                                                                                                                        yesterday at 7:55 PM

                                                                                                                                                                                                                                                                                        You can make an argument that .unwrap() should have no place in production code, but .expect("invariant violated: etc. etc.") very much has its place. When the system is in an unpredicted and not-designed-for state it is supposed to shut down promptly, because this makes it easier to troubleshoot the root cause failure whereas not doing so may have even worse consequences.

                                                                                                                                                                                                                                                                                          • pdimitar

                                                                                                                                                                                                                                                                                            yesterday at 8:31 PM

                                                                                                                                                                                                                                                                                            I don't disagree but you might as well also manually send an error to f.ex. Sentry and just halt processing of the request.

                                                                                                                                                                                                                                                                                            Though that really depends. In companies where k8s is used the app will be brought back up immediately anyway.

                                                                                                                                                                                                                                                                                • debugnik

                                                                                                                                                                                                                                                                                  yesterday at 3:51 PM

                                                                                                                                                                                                                                                                                  Prevented unless they assert the wrong invariant at runtime like they did last time.

                                                                                                                                                                                                                                                                                  • skywhopper

                                                                                                                                                                                                                                                                                    yesterday at 3:56 PM

                                                                                                                                                                                                                                                                                    This is the exact same type of error that happened in their Rust code last time. Strong type systems don’t protect you from lazy programming.

                                                                                                                                                                                                                                                                                      • inejge

                                                                                                                                                                                                                                                                                        yesterday at 7:09 PM

                                                                                                                                                                                                                                                                                        It's not remotely the same type of error -- error non-handling is very visible in the Rust code, while the Lua code shows the happy path, with no indication that it could explode at runtime.

                                                                                                                                                                                                                                                                                        Perhaps it's the similar way of not testing the possible error path, which is an organizational problem.

                                                                                                                                                                                                                                                                                • jokoon

                                                                                                                                                                                                                                                                                  yesterday at 10:02 PM

                                                                                                                                                                                                                                                                                  I still don't understand what Cloudflare's business model is, yet they manage to make the news.

                                                                                                                                                                                                                                                                                  Their main product is supposedly DDoS protection, yet Cloudflare itself goes down for some reason.

                                                                                                                                                                                                                                                                                  This company makes zero sense to me.

                                                                                                                                                                                                                                                                                    • OneDeuxTriSeiGo

                                                                                                                                                                                                                                                                                      yesterday at 10:11 PM

                                                                                                                                                                                                                                                                                      Cloudflare protects against DDOS but also various forms of malicious traffic (bots, low reputation IP users, etc) and often with a DDOS or similar attacks, it's better to have the site go down from time to time than for the attackers to hammer the servers behind cloudflare and waste mass amounts of resources.

                                                                                                                                                                                                                                                                                      i.e. it's the difference between "site goes down for a few hours every few months" and "an attacker slammed your site, and through in on-demand scaling or serverless component cloud fees blew your entire infrastructure budget for the year.

                                                                                                                                                                                                                                                                                      Doubly so when your service is part of a larger platform and attacks on your service risk harming your reputation for the larger platform.

                                                                                                                                                                                                                                                                                  • hrimfaxi

                                                                                                                                                                                                                                                                                    yesterday at 3:57 PM

                                                                                                                                                                                                                                                                                    Having their changes fully propagate within 1 minute is pretty fantastic.

                                                                                                                                                                                                                                                                                      • denysvitali

                                                                                                                                                                                                                                                                                        yesterday at 4:10 PM

                                                                                                                                                                                                                                                                                        This is most likely a hard requirement for such a large-scale deployment of DDoS protection and detection, which explains their architectural choices (ClickHouse & co) and the need for super-low-latency config changes.

                                                                                                                                                                                                                                                                                        Since attackers might rotate IPs more frequently than once per minute, the whole fleet of servers has to be able to react quickly to decisions made centrally.

                                                                                                                                                                                                                                                                                        • reassess_blind

                                                                                                                                                                                                                                                                                          yesterday at 10:16 PM

                                                                                                                                                                                                                                                                                          Why wasn’t the rollback fixed within the second minute after they saw the 500s?

                                                                                                                                                                                                                                                                                          • chatmasta

                                                                                                                                                                                                                                                                                            yesterday at 4:07 PM

                                                                                                                                                                                                                                                                                            The coolest part of Cloudflare’s architecture is that every server is the same… which presumably makes deployment a straightforward task.

                                                                                                                                                                                                                                                                                      • rany_

                                                                                                                                                                                                                                                                                        yesterday at 4:07 PM

                                                                                                                                                                                                                                                                                        > As part of our ongoing work to protect customers using React against a critical vulnerability, CVE-2025-55182, we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications.

                                                                                                                                                                                                                                                                                        Why would increasing the buffer size help with that security vulnerability? Is it just a performance optimization?

                                                                                                                                                                                                                                                                                          • redslazer

                                                                                                                                                                                                                                                                                            yesterday at 4:12 PM

                                                                                                                                                                                                                                                                                            If the request data is larger than the limit it doesn’t get processed by the Cloudflare system. By increasing buffer size they process (and therefore protect) more requests.

                                                                                                                                                                                                                                                                                            • boxed

                                                                                                                                                                                                                                                                                              yesterday at 4:11 PM

                                                                                                                                                                                                                                                                                              I think the buffer size is the limit on what they check for malicious data, so the old 128k would mean it would be trivial to circumvent by just having 128k ok data and then put the exploit after.
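
A toy illustration of that reasoning, assuming the scanner only looks at the first N bytes of a request body. The 128 KB and 1 MB figures come from the thread and the quoted post; `MARKER` and `is_flagged` are made-up names for the example and do not reflect how Cloudflare's WAF actually matches payloads.

```python
INSPECTION_LIMIT_OLD = 128 * 1024   # 128 KB, the old limit mentioned above
INSPECTION_LIMIT_NEW = 1024 * 1024  # 1 MB, the new limit from the quoted post

MARKER = b"EXPLOIT"  # hypothetical signature the scanner looks for


def is_flagged(body: bytes, limit: int) -> bool:
    """Return True if the marker appears within the first `limit` bytes."""
    return MARKER in body[:limit]


# Pad the body with harmless bytes so the marker starts just past the old limit.
payload = b"A" * INSPECTION_LIMIT_OLD + MARKER

missed_by_old_limit = not is_flagged(payload, INSPECTION_LIMIT_OLD)
caught_by_new_limit = is_flagged(payload, INSPECTION_LIMIT_NEW)
```

With the smaller limit the marker sits entirely outside the inspected prefix, so the check never sees it; with the larger limit the same body is flagged.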

                                                                                                                                                                                                                                                                                                • whs

                                                                                                                                                                                                                                                                                                  yesterday at 6:19 PM

                                                                                                                                                                                                                                                                                                  I got curious and I checked AWS WAF. Apparently AWS WAF default limit for CloudFront is 16KB and max is 64KB.

                                                                                                                                                                                                                                                                                          • aeyes

                                                                                                                                                                                                                                                                                            yesterday at 6:22 PM

                                                                                                                                                                                                                                                                                            How hard can it be for a company with 1000 engineers to create a canary region before blasting their centralized changes out to everyone.

                                                                                                                                                                                                                                                                                            Every change is a deployment, even if its config. Treat it as such.

                                                                                                                                                                                                                                                                                            Also you should know that a strongly typed language won't save you from every type of problem. And especially not if you allow things like unwrap().

                                                                                                                                                                                                                                                                                            It is just mind boggling that they very obviously have completely untested code which proxies requests for all their customers. If you don't want to write the tests then at least fuzz it.
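
A rough sketch of what "treat config as a deployment" could look like: push the change stage by stage, check health after each stage, and roll everything back on the first failure. The function and the stage names in the usage note are hypothetical; this is not a description of Cloudflare's tooling.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[str],
    apply_config: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Apply a change one stage at a time (canary first).

    Returns True if every stage stayed healthy. On the first unhealthy
    stage, rolls back all touched stages in reverse order and returns
    False, so later stages never receive the change.
    """
    touched: List[str] = []
    for stage in stages:
        apply_config(stage)
        touched.append(stage)
        if not is_healthy(stage):
            for done in reversed(touched):
                rollback(done)
            return False
    return True
```

Called with stages such as `["canary", "region-a", "global"]`, a failure detected in `region-a` means `global` is never touched. The trade-off is the one raised elsewhere in the thread: every health-check pause adds latency to changes that are sometimes meant to propagate within a minute.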

                                                                                                                                                                                                                                                                                            • rubatuga

                                                                                                                                                                                                                                                                                              yesterday at 10:52 PM

                                                                                                                                                                                                                                                                                              Honestly a lot of these problems are because they don't test a staging environment, like isn't this software engineering basics?

                                                                                                                                                                                                                                                                                              • Bender

                                                                                                                                                                                                                                                                                                yesterday at 5:05 PM

                                                                                                                                                                                                                                                                                                Suggestion for Cloudflare: Create an early adopter option for free accounts.

                                                                                                                                                                                                                                                                                                Benefit: Earliest uptake of new features and security patches.

                                                                                                                                                                                                                                                                                                Drawback: Higher risk of outages.

                                                                                                                                                                                                                                                                                                I think this should be possible since they already differentiate between free, pro and enterprise accounts. I do not know how the routing for that works but I bet they could do this. Think crowd-sourced beta testers. Also a perk for anything PCI audit or FEDRAMP security prioritized over uptime.
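
A minimal sketch of how such an opt-in ring could order a rollout. Everything here is hypothetical: the `Account` type, the `early_adopter` flag, and the ring order are illustrative assumptions, not Cloudflare's actual routing or release tooling.

```python
from dataclasses import dataclass

# Hypothetical ring order: opted-in free accounts first, enterprise last.
ROLLOUT_ORDER = ["free_early_adopter", "free", "pro", "enterprise"]


@dataclass
class Account:
    name: str
    plan: str                    # "free", "pro" or "enterprise"
    early_adopter: bool = False  # hypothetical opt-in flag


def rollout_ring(account: Account) -> int:
    """Return the index of the ring in which this account receives a change."""
    if account.plan == "free" and account.early_adopter:
        return 0
    return ROLLOUT_ORDER.index(account.plan)


def receives_change(account: Account, current_stage: int) -> bool:
    """True once the rollout has reached the account's ring."""
    return rollout_ring(account) <= current_stage


hobby = Account("hobby-blog", "free", early_adopter=True)
bank = Account("big-bank", "enterprise")

# At stage 0 only the opted-in free account has the change.
stage0 = (receives_change(hobby, 0), receives_change(bank, 0))
print(stage0)
```

The point of the ordering is that a bad change would surface in the opted-in ring before it reaches accounts that prioritize uptime.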

                                                                                                                                                                                                                                                                                                  • rfmoz

                                                                                                                                                                                                                                                                                                    yesterday at 10:30 PM

                                                                                                                                                                                                                                                                                                    They do in some way, because the LaLiga blocking problems in Spain don’t affect the paid accounts (i.e. the large websites).

                                                                                                                                                                                                                                                                                                    Another suggestion is to deploy during the night hours of each country; right now they only take the U.S. night into account.

                                                                                                                                                                                                                                                                                                    • ectospheno

                                                                                                                                                                                                                                                                                                      yesterday at 9:33 PM

                                                                                                                                                                                                                                                                                                      If that meant free tier had WAF then sure, I’d enable that.

                                                                                                                                                                                                                                                                                                      • LelouBil

                                                                                                                                                                                                                                                                                                        yesterday at 5:55 PM

                                                                                                                                                                                                                                                                                                        I would for sure enable this, my personal server can handle being unreachable for a few hours in exchange for (potentially) interesting features.

                                                                                                                                                                                                                                                                                                    • roguecoder

                                                                                                                                                                                                                                                                                                      yesterday at 7:13 PM

                                                                                                                                                                                                                                                                                                      I notice that this is the kind of thing that solid sociable tests ought to have caught. I am very curious how testable that code is (random procedural if-statements don't inspire high confidence.)

                                                                                                                                                                                                                                                                                                      • markus_zhang

                                                                                                                                                                                                                                                                                                        yesterday at 5:39 PM

                                                                                                                                                                                                                                                                                                        I wonder anyone from internal could share the culture a bit. I'm mostly interested in the following part:

                                                                                                                                                                                                                                                                                                        If someone messes up royally, is there someone who says "if you break the build/whatever super critical, then your ass is the grass and I'm the lawn mower"?

                                                                                                                                                                                                                                                                                                        • yesterday at 3:46 PM

                                                                                                                                                                                                                                                                                                          • _pdp_

                                                                                                                                                                                                                                                                                                            yesterday at 4:07 PM

                                                                                                                                                                                                                                                                                                            So no static compiler checks and apparently no fuzzers used to ensure these rules work as intended?

                                                                                                                                                                                                                                                                                                              • perching_aix

                                                                                                                                                                                                                                                                                                                yesterday at 5:04 PM

                                                                                                                                                                                                                                                                                                                Such tooling exists for Lua? Didn't know.

                                                                                                                                                                                                                                                                                                            • away0x01ct

                                                                                                                                                                                                                                                                                                              yesterday at 9:03 PM

                                                                                                                                                                                                                                                                                                              1.1.1.1 domain test server, whether a relay or endpoints including /cdn-cgi/trace is WAF testing error, for 500 HTTP network & Cloudflare managed R-W-X permissions

                                                                                                                                                                                                                                                                                                              • dznodes

                                                                                                                                                                                                                                                                                                                yesterday at 6:55 PM

                                                                                                                                                                                                                                                                                                                When should we just give up on Cloudflare? Seems like this just keeps happening. Like some kind of backdoor triggered willy nilly, Hmmm?

                                                                                                                                                                                                                                                                                                                  • venturecruelty

                                                                                                                                                                                                                                                                                                                    yesterday at 9:18 PM

                                                                                                                                                                                                                                                                                                                    Now. Right now. Seriously, stop using this terrible service. We also need to change the narrative that step 1 in every tutorial is "sign up for Cloudflare". This is partly a culture problem.

                                                                                                                                                                                                                                                                                                                • yesterday at 6:47 PM

                                                                                                                                                                                                                                                                                                                  • stego-tech

                                                                                                                                                                                                                                                                                                                    yesterday at 8:01 PM

                                                                                                                                                                                                                                                                                                                    The problem that irks me isn’t that Cloudflare is having outages (everyone does and will at some point, no matter how many 9’s your SLA states), it’s that the internet is so damn centralized that a Cloudflare issue can take out a continent-sized chunk of the internet. Kudos to them on their success story, but oh my god that’s way too many eggs in one basket in general.

                                                                                                                                                                                                                                                                                                                    • arjie

                                                                                                                                                                                                                                                                                                                      yesterday at 10:36 PM

                                                                                                                                                                                                                                                                                                                      Classic. Things always get worse before they get better. I remember when Netflix was going through their annus horribilis, and AWS before that, and Twitter before that, and so on. Everyone goes through this. Good luck to you guys getting to FL2 quickly enough that this class of error reduces.

                                                                                                                                                                                                                                                                                                                      • nish__

                                                                                                                                                                                                                                                                                                                        yesterday at 5:01 PM

                                                                                                                                                                                                                                                                                                                        Is it crazy to anyone else that they deploy every 5 minutes? And that it's not just config updates, but actual code changes with this "execute" action.

                                                                                                                                                                                                                                                                                                                          • kccqzy

                                                                                                                                                                                                                                                                                                                            yesterday at 6:50 PM

                                                                                                                                                                                                                                                                                                                            Config updates are not so clear cut from code changes.

                                                                                                                                                                                                                                                                                                                            Once I worked with a team in the anti-abuse space where the policy is that code deployments must happen over 5 days and config updates can take a few minutes. Then an engineer on the team argued that deploying new Python code doesn’t count as a code change because the CPython interpreter did not change; it didn’t even restart. And indeed given how dynamic Python is, it is totally possible to import new Python modules that did not exist when the interpreter process is launched.
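
To illustrate the last point, here is a minimal sketch of a running interpreter importing a module that did not exist when the process started. The module name `hypothetical_rule_v2` and its `verdict` function are made up for the example; only standard-library calls are used.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def load_new_module(directory: str, name: str, source: str):
    """Write a brand-new module to disk and import it into the running process."""
    Path(directory, f"{name}.py").write_text(source)
    if directory not in sys.path:
        sys.path.insert(0, directory)
    # Make the import system notice files created after startup.
    importlib.invalidate_caches()
    return importlib.import_module(name)


# Hypothetical "config" that is really new code.
NEW_RULE_SOURCE = (
    "def verdict(score):\n"
    "    return 'block' if score > 0.9 else 'allow'\n"
)

with tempfile.TemporaryDirectory() as tmp:
    rule = load_new_module(tmp, "hypothetical_rule_v2", NEW_RULE_SOURCE)
    result = rule.verdict(0.95)
    print(result)
```

No restart and no new interpreter binary are involved, yet the process is now executing logic that was never reviewed as a "code deployment".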

                                                                                                                                                                                                                                                                                                                        • yesterday at 4:17 PM

                                                                                                                                                                                                                                                                                                                          • bradly

                                                                                                                                                                                                                                                                                                                            yesterday at 7:09 PM

                                                                                                                                                                                                                                                                                                                            Dang… I don’t even use React and it still brings down my sites. Good beats I guess.

                                                                                                                                                                                                                                                                                                                            • iLoveOncall

                                                                                                                                                                                                                                                                                                                              yesterday at 4:29 PM

                                                                                                                                                                                                                                                                                                                              The most surprising from this article is that CloudFlare handles only around 85M TPS.

                                                                                                                                                                                                                                                                                                                                • blibble

                                                                                                                                                                                                                                                                                                                                  yesterday at 4:58 PM

                                                                                                                                                                                                                                                                                                                                  it can't really be that small, can it?

                                                                                                                                                                                                                                                                                                                                  that's maybe half a rack of load

                                                                                                                                                                                                                                                                                                                                    • nish__

                                                                                                                                                                                                                                                                                                                                      yesterday at 5:09 PM

                                                                                                                                                                                                                                                                                                                                      Given the number of lua scripts they seem to be running, it has to take more than half a rack.

                                                                                                                                                                                                                                                                                                                              • antiloper

                                                                                                                                                                                                                                                                                                                                yesterday at 3:52 PM

                                                                                                                                                                                                                                                                                                                                Make faster websites:

                                                                                                                                                                                                                                                                                                                                > we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications.

                                                                                                                                                                                                                                                                                                                                Why is the Next.js limit 1 MB? It's not enough for uploading user generated content (photographs, scanned invoices), but a 1 MB request body for even multiple JSON API calls is ridiculous. There frameworks need to at least provide some pushback to unoptimized development, even if it's just a lower default request body limit. Otherwise all web applications will become as slow as the MS office suite or reddit.

                                                                                                                                                                                                                                                                                                                                  • ramon156

                                                                                                                                                                                                                                                                                                                                    yesterday at 4:08 PM

                                                                                                                                                                                                                                                                                                                                    The update was to update it to 3MB (paid 10MB)

                                                                                                                                                                                                                                                                                                                                    • AmazingTurtle

                                                                                                                                                                                                                                                                                                                                      yesterday at 4:10 PM

                                                                                                                                                                                                                                                                                                                                      a) They serialize tons of data into requests b) Headers. Mostly cookies. They are a thing. They are being abused all over the world by newbies.

                                                                                                                                                                                                                                                                                                                                  • dwa3592

                                                                                                                                                                                                                                                                                                                                    yesterday at 9:50 PM

                                                                                                                                                                                                                                                                                                                                    I am not sure if it's just me or there have been too many outages this year to count. Is it the AI slop making into production?

                                                                                                                                                                                                                                                                                                                                    • MagicMoonlight

                                                                                                                                                                                                                                                                                                                                      yesterday at 6:01 PM

                                                                                                                                                                                                                                                                                                                                      If you had a 99.99% availability requirement they will have already cost you a fortune
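
For scale, a quick sketch of the downtime budget that an availability target implies. The function name and the 365-day year are assumptions for illustration; the arithmetic is simply the unavailable fraction times the period.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Minutes of downtime allowed over the period at the given availability."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * period_days * 24 * 60


budget = downtime_budget_minutes(99.99)
print(round(budget, 2))  # about 52.56 minutes per year
```

At 99.99%, the whole year's allowance is under an hour, so a single long outage at a provider can consume it on its own.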

                                                                                                                                                                                                                                                                                                                                      • mmmlinux

                                                                                                                                                                                                                                                                                                                                        yesterday at 5:15 PM

                                                                                                                                                                                                                                                                                                                                        Messing around on a Friday? Brave.

                                                                                                                                                                                                                                                                                                                                          • chickensong

                                                                                                                                                                                                                                                                                                                                            yesterday at 10:00 PM

                                                                                                                                                                                                                                                                                                                                            You don't really want security updates waiting around on a luxury schedule.

                                                                                                                                                                                                                                                                                                                                            • roguecoder

                                                                                                                                                                                                                                                                                                                                              yesterday at 7:29 PM

                                                                                                                                                                                                                                                                                                                                              Or overworked.

                                                                                                                                                                                                                                                                                                                                              We can deploy on Fridays. We don't, because we aren't donating our time to the shareholders.

                                                                                                                                                                                                                                                                                                                                              • orphea

                                                                                                                                                                                                                                                                                                                                                yesterday at 6:04 PM

                                                                                                                                                                                                                                                                                                                                                If you're afraid of deploying on Friday, you're doing it wrong.

                                                                                                                                                                                                                                                                                                                                            • yesterday at 3:54 PM

                                                                                                                                                                                                                                                                                                                                              • j45

                                                                                                                                                                                                                                                                                                                                                yesterday at 10:15 PM

                                                                                                                                                                                                                                                                                                                                                Curious if there isn't a way to ingest the incoming traffic at scale, but route it to a secondary infrastructure to make sure it's resolving correctly, before pushing it to production?

                                                                                                                                                                                                                                                                                                                                                • denysvitali

                                                                                                                                                                                                                                                                                                                                                  yesterday at 3:53 PM

                                                                                                                                                                                                                                                                                                                                                  Ironically, this time around the issue was in the proxy they're going to phase out (and replace with the Rust one).

                                                                                                                                                                                                                                                                                                                                                  I truly believe they're really going to make resilience their #1 priority now, and acknowledging the release process errors that they didn't acknowledge for a while (according to other HN comments) is the first step towards this.

                                                                                                                                                                                                                                                                                                                                                  HugOps. Although bad for reputation, I think these incidents will help them shape (and prioritize!) resilience efforts more than ever.

                                                                                                                                                                                                                                                                                                                                                  At the same time, I can't think of a company more transparent than CloudFlare when it comes to these kind of things. I also understand the urgency behind this change: CloudFlare acted (too) fast to mitigate the React vulnerability and this is the result.

                                                                                                                                                                                                                                                                                                                                                  Say what you want, but I'd prefer to trust CloudFlare who admits and act upon their fuckups, rather than trying to cover them up or downplaying them like some other major cloud providers.

                                                                                                                                                                                                                                                                                                                                                  @eastdakota: ignore the negative comments here, transparency is a very good strategy and this article shows a good plan to avoid further problems

                                                                                                                                                                                                                                                                                                                                                    • yesterday at 6:08 PM

                                                                                                                                                                                                                                                                                                                                                      • yesterday at 6:22 PM

                                                                                                                                                                                                                                                                                                                                                        • iLoveOncall

                                                                                                                                                                                                                                                                                                                                                          yesterday at 4:49 PM

                                                                                                                                                                                                                                                                                                                                                          > I truly believe they're really going to make resilience their #1 priority now

                                                                                                                                                                                                                                                                                                                                                          I hope that was their #1 priority from the very start given the services they sell...

                                                                                                                                                                                                                                                                                                                                                          Anyway, people always tend to overthink about those black-swan events. Yes, 2 happened in a quick succession, but what is the average frequency overall? Insignificant.

                                                                                                                                                                                                                                                                                                                                                            • roguecoder

                                                                                                                                                                                                                                                                                                                                                              yesterday at 7:20 PM

                                                                                                                                                                                                                                                                                                                                                              This is Cloudflare. They've repeatedly broken DNS for years.

                                                                                                                                                                                                                                                                                                                                                              Looking across the errors, it points to some underlying practices: a lack of systems metaphors, modularity, testability, and an reliance on super-generic configuration instead of software with enforced semantics.

                                                                                                                                                                                                                                                                                                                                                              • denysvitali

                                                                                                                                                                                                                                                                                                                                                                yesterday at 5:09 PM

                                                                                                                                                                                                                                                                                                                                                                I think they have to strike a balance between being extremely fast (reacting to vulnerabilities and DDOS attacks) while still being resilient. I don't think it's an easy situation

                                                                                                                                                                                                                                                                                                                                                                • yesterday at 5:22 PM

                                                                                                                                                                                                                                                                                                                                                              • trashburger

                                                                                                                                                                                                                                                                                                                                                                yesterday at 4:01 PM

                                                                                                                                                                                                                                                                                                                                                                I would very much like for him not to ignore the negativity, given that, you know, they are breaking the entire fucking Internet every time something like this happens.

                                                                                                                                                                                                                                                                                                                                                                  • denysvitali

                                                                                                                                                                                                                                                                                                                                                                    yesterday at 4:05 PM

                                                                                                                                                                                                                                                                                                                                                                    This is the kind of comment I wish he would ignore.

                                                                                                                                                                                                                                                                                                                                                                    You can be angry - but that doesn't help anyone. They fucked up, yes, they admitted it and they provided plans on how to address that.

                                                                                                                                                                                                                                                                                                                                                                    I don't think they do these things on purpose. Of course given their good market penetration they end up disrupting a lot of customers - and they should focus on slow rollouts - but I also believe that in a DDOS protection system (or WAF) you don't want or have the luxury to wait for days until your rule is applied.

                                                                                                                                                                                                                                                                                                                                                                      • beanjuiceII

                                                                                                                                                                                                                                                                                                                                                                        yesterday at 4:26 PM

                                                                                                                                                                                                                                                                                                                                                                        I hope he doesn't ignore it, the internet has been forgiving enough toward cloudflares string of failures..its getting pretty old, and creates a ton of choas. I work with life saving devices, being impacted in any way in data monitoring has a huge impact in many ways. "sorry ma'am we can't give your child t1d readings on your follow app because our provider decided to break everything in the pursuit of some react bug." has a great ring to it

                                                                                                                                                                                                                                                                                                                                                                          • Anon1096

                                                                                                                                                                                                                                                                                                                                                                            yesterday at 5:46 PM

                                                                                                                                                                                                                                                                                                                                                                            Cloudflare and other cloud infra providers are only providing primitives to use, in this case WAF. They have target uptimes and it's never 100%. It's up to the people actually making end user services (like your medical devices) to judge whether that is enough and if not to design your service around it.

                                                                                                                                                                                                                                                                                                                                                                            (and also, rolling your own version of WAF is probably not the right answer if you need better uptime. It's exceedingly unlikely a medical devices company will beat CF at this game.)

                                                                                                                                                                                                                                                                                                                                                                            • esseph

                                                                                                                                                                                                                                                                                                                                                                              yesterday at 5:00 PM

                                                                                                                                                                                                                                                                                                                                                                              Half your medical devices are probably opening up data leakage to China.

                                                                                                                                                                                                                                                                                                                                                                              https://www.csoonline.com/article/3814810/backdoor-in-chines...

                                                                                                                                                                                                                                                                                                                                                                              Most hospital and healthcare IT teams are extremely under funded, undertrained, overworked, and the software, configurations and platforms are normally not the most resilient things.

                                                                                                                                                                                                                                                                                                                                                                              I have a friend at one in the North East right now going through a hell of a security breach for multiple months now and I'm flabbergasted no one is dead yet.

                                                                                                                                                                                                                                                                                                                                                                              When it comes to tech, I get the impression most organizations are not very "healthy" in the durability of systems.

                                                                                                                                                                                                                                                                                                                                                                          • nish__

                                                                                                                                                                                                                                                                                                                                                                            yesterday at 5:05 PM

                                                                                                                                                                                                                                                                                                                                                                            Maybe not on purpose but there's such a thing as negligence.

                                                                                                                                                                                                                                                                                                                                                                    • fidotron

                                                                                                                                                                                                                                                                                                                                                                      yesterday at 4:06 PM

                                                                                                                                                                                                                                                                                                                                                                      > HugOps

                                                                                                                                                                                                                                                                                                                                                                      This childish nonsense needs to end.

                                                                                                                                                                                                                                                                                                                                                                      Ops are heavily rewarded because they're supposed to be responsible. If they're not then the associated rewards for it need to stop as well.

                                                                                                                                                                                                                                                                                                                                                                        • denysvitali

                                                                                                                                                                                                                                                                                                                                                                          yesterday at 4:12 PM

                                                                                                                                                                                                                                                                                                                                                                          I have never seen an Ops team being rewarded for avoiding incidents (focusing in tech debt reduction), but instead they get the opposite - blamed when things go wrong.

                                                                                                                                                                                                                                                                                                                                                                          I think it's human nature (it's hard to realize something is going well until it breaks), but still has a very negative psychological effect. I can barely imagine the stress the team is going through right now.

                                                                                                                                                                                                                                                                                                                                                                            • fidotron

                                                                                                                                                                                                                                                                                                                                                                              yesterday at 4:16 PM

                                                                                                                                                                                                                                                                                                                                                                              > I have never seen an Ops team being rewarded for avoiding incidents

                                                                                                                                                                                                                                                                                                                                                                              That's why their salaries are so high.

                                                                                                                                                                                                                                                                                                                                                                                • denysvitali

                                                                                                                                                                                                                                                                                                                                                                                  yesterday at 4:23 PM

                                                                                                                                                                                                                                                                                                                                                                                  Depending on the tech debt, the ops team might just be in "survival mode" and not have the time to fix every single issue.

                                                                                                                                                                                                                                                                                                                                                                                  In this particular case, they seem to be doing two things: - Phasing out the old proxy (Lua based) which is replaced by FL2 (Rust based, the one that caused the previous incident) - Reacting to an actively exploited vulnerability in React by deploying WAF rules - and they're doing them in a relatively careful way (test rules) to avoid fuckups, which caused this unknown state, which triggered the issue

                                                                                                                                                                                                                                                                                                                                                                                    • fidotron

                                                                                                                                                                                                                                                                                                                                                                                      yesterday at 4:27 PM

                                                                                                                                                                                                                                                                                                                                                                                      They deliberately ignored an internal tool that started erroring out at the given deployment and rolled it out anyway without further investigation.

                                                                                                                                                                                                                                                                                                                                                                                      That's not deserving of sympathy.

                                                                                                                                                                                                                                                                                                                                                                                  • esseph

                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 4:54 PM

                                                                                                                                                                                                                                                                                                                                                                                    Ops salaries are high??? Where?!?!

                                                                                                                                                                                                                                                                                                                                                                                      • hnthrowaway0328

                                                                                                                                                                                                                                                                                                                                                                                        yesterday at 5:19 PM

                                                                                                                                                                                                                                                                                                                                                                                        Definitely commands better salaries than us pitty DEs.

                                                                                                                                                                                                                                                                                                                                                                                    • agoodusername63

                                                                                                                                                                                                                                                                                                                                                                                      yesterday at 6:22 PM

                                                                                                                                                                                                                                                                                                                                                                                      news to me.

                                                                                                                                                                                                                                                                                                                                                                              • esseph

                                                                                                                                                                                                                                                                                                                                                                                yesterday at 4:53 PM

                                                                                                                                                                                                                                                                                                                                                                                Ops has never been "rewarded" at any org I've ever been at or heard about, including physical infra companies.

                                                                                                                                                                                                                                                                                                                                                                            • da_grift_shift

                                                                                                                                                                                                                                                                                                                                                                              yesterday at 4:07 PM

                                                                                                                                                                                                                                                                                                                                                                              [ Removed by Reddit ]

                                                                                                                                                                                                                                                                                                                                                                                • denysvitali

                                                                                                                                                                                                                                                                                                                                                                                  yesterday at 4:13 PM

                                                                                                                                                                                                                                                                                                                                                                                  Wow. The three comments below parent really show how toxic HN has become.

                                                                                                                                                                                                                                                                                                                                                                                    • beanjuiceII

                                                                                                                                                                                                                                                                                                                                                                                      yesterday at 4:28 PM

                                                                                                                                                                                                                                                                                                                                                                                      being angry about something doesn't make it toxic, people have a right to be upset

                                                                                                                                                                                                                                                                                                                                                                                        • denysvitali

                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 4:35 PM

                                                                                                                                                                                                                                                                                                                                                                                          The comment, before the edit, was what I would consider toxic. No wonder it has been edited.

                                                                                                                                                                                                                                                                                                                                                                                          It's fine to be upset, and especially rightfully so after the second outage in less than 30 days, but this doesn't justify toxicity.

                                                                                                                                                                                                                                                                                                                                                                          • snafeau

                                                                                                                                                                                                                                                                                                                                                                            yesterday at 3:53 PM

                                                                                                                                                                                                                                                                                                                                                                            A lot of these kind of bugs feel like they could be caught be a simple review bot like Greptile... I wonder if Cloudlare uses an equivalent tool internally?

                                                                                                                                                                                                                                                                                                                                                                              • nkmnz

                                                                                                                                                                                                                                                                                                                                                                                yesterday at 4:15 PM

                                                                                                                                                                                                                                                                                                                                                                                What makes greptile a better choice compared to claude code or codex, in your opinion?

                                                                                                                                                                                                                                                                                                                                                                                • roguecoder

                                                                                                                                                                                                                                                                                                                                                                                  yesterday at 7:30 PM

                                                                                                                                                                                                                                                                                                                                                                                  That has not been my experience with those tools.

                                                                                                                                                                                                                                                                                                                                                                                  Super-procedural code in particular is too complex for humans to follow, much less AI.

                                                                                                                                                                                                                                                                                                                                                                                  • nish__

                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 5:07 PM

                                                                                                                                                                                                                                                                                                                                                                                    Any bot that runs an AI model should not be called "simple".

                                                                                                                                                                                                                                                                                                                                                                                • dreamcompiler

                                                                                                                                                                                                                                                                                                                                                                                  yesterday at 4:03 PM

                                                                                                                                                                                                                                                                                                                                                                                  "Honey we can't go on that vacation after all. In fact we can't ever take a vacation period."

                                                                                                                                                                                                                                                                                                                                                                                  "Why?"

                                                                                                                                                                                                                                                                                                                                                                                  "I've just been transferred to the Cloudflare outage explanation department."

                                                                                                                                                                                                                                                                                                                                                                                  • AtNightWeCode

                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 9:30 PM

                                                                                                                                                                                                                                                                                                                                                                                    Not missing working with LUA in proxies. I think this is no big thing. They rolled back the change fairly quickly. Still bad but that outage mid November was worse since it was many bad decisions stacking up and it took too long time to resolve.

                                                                                                                                                                                                                                                                                                                                                                                    • kachapopopow

                                                                                                                                                                                                                                                                                                                                                                                      yesterday at 3:45 PM

                                                                                                                                                                                                                                                                                                                                                                                      why does this seem oddly familiar (fail-closed logic)

                                                                                                                                                                                                                                                                                                                                                                                      • nish__

                                                                                                                                                                                                                                                                                                                                                                                        yesterday at 4:57 PM

                                                                                                                                                                                                                                                                                                                                                                                        No love lost, no love found.

                                                                                                                                                                                                                                                                                                                                                                                        • yesterday at 5:13 PM

                                                                                                                                                                                                                                                                                                                                                                                          • lapcat

                                                                                                                                                                                                                                                                                                                                                                                            yesterday at 3:55 PM

                                                                                                                                                                                                                                                                                                                                                                                            > This is a straightforward error in the code, which had existed undetected for many years. This type of code error is prevented by languages with strong type systems. In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur.

                                                                                                                                                                                                                                                                                                                                                                                            Cloudflare deployed code that was literally never tested, not even once, neither manually nor by unit test, otherwise the straightforward error would have been detected immediately, and their implied solution seems to be not testing their code when written, or even adding 100% code coverage after the fact, but rather relying on a programming language to bail them out and cover up their failure to test.
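To make the testing point concrete, here is a minimal sketch in Python of how a single unit test on a rarely taken branch can surface an unguarded access to a missing value. It is not Cloudflare's Lua code: the function names, the `action` and `execute` fields, and the shape of the rule result are all hypothetical and chosen only for illustration.

```python
# Hypothetical sketch: names and data shapes are invented for illustration
# and are not taken from Cloudflare's codebase.

def summarize_rule(rule_result: dict) -> int:
    """Buggy version: assumes every "execute" rule carries a payload."""
    if rule_result["action"] == "execute":
        # Raises TypeError when the payload is None (e.g. the rule was skipped).
        return len(rule_result["execute"]["results"])
    return 0


def summarize_rule_fixed(rule_result: dict) -> int:
    """Fixed version: treats a missing payload as having no sub-results."""
    payload = rule_result.get("execute")
    if rule_result["action"] == "execute" and payload is not None:
        return len(payload["results"])
    return 0


# One test input for the rarely taken branch is enough to expose the bug.
skipped_rule = {"action": "execute", "execute": None}

try:
    summarize_rule(skipped_rule)
    bug_detected = False
except TypeError:
    bug_detected = True
```

A type checker or a stricter language would flag the same unguarded access at build time, but the test above catches it in any language, provided someone writes a case for that branch.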

                                                                                                                                                                                                                                                                                                                                                                                              • JohnMakin

                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 5:09 PM

                                                                                                                                                                                                                                                                                                                                                                                                Large scale infrastructure changes are often by nature completely untestable. The system is too large, there are too many moving parts to replicate with any kind of sane testing, so often, you do find out in prod, which is why robust and fast rollback procedures are usually desirable and implemented.

                                                                                                                                                                                                                                                                                                                                                                                                  • roguecoder

                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 7:23 PM

                                                                                                                                                                                                                                                                                                                                                                                                    Akamai manages it.

                                                                                                                                                                                                                                                                                                                                                                                                      • winddude

                                                                                                                                                                                                                                                                                                                                                                                                        yesterday at 9:34 PM

                                                                                                                                                                                                                                                                                                                                                                                                        They don't, akamai has had several outages as well jsut no one notices. Akamai is way way smaller than cloudflare, 20% of internet traffic passes through CF networks, not sure it's even measurable on Akamai.

                                                                                                                                                                                                                                                                                                                                                                                                          • andrewf

                                                                                                                                                                                                                                                                                                                                                                                                            yesterday at 11:22 PM

                                                                                                                                                                                                                                                                                                                                                                                                            Quickly Googling about, a commonly repeated figure is that Akamai served 15% - 30% of Internet traffic in the late 2010's. They probably have less of the market today due to others growing, but they're not a minnow.

                                                                                                                                                                                                                                                                                                                                                                                                            2024 revenue figures were $1.669 billion for Cloudflare, and $3.99 billion for Akamai, per Wikipedia.

                                                                                                                                                                                                                                                                                                                                                                                                    • lapcat

                                                                                                                                                                                                                                                                                                                                                                                                      yesterday at 5:24 PM

                                                                                                                                                                                                                                                                                                                                                                                                      > Large scale infrastructure changes are often by nature completely untestable.

                                                                                                                                                                                                                                                                                                                                                                                                      You're changing the subject here and shifting focus from the specific to the vague. The two postmortems after the recent major Cloudflare outages both listed straightforward errors in source code that could have been tested and detected.

                                                                                                                                                                                                                                                                                                                                                                                                      Theoretical outages could theoretically have other causes, but these two specific outages had specific causes that we know.

                                                                                                                                                                                                                                                                                                                                                                                                      > which is why robust and fast rollback procedures are usually desirable and implemented.

                                                                                                                                                                                                                                                                                                                                                                                                      Yes, nobody is arguing against that. It's a red herring with regard to my point about source code testing.

                                                                                                                                                                                                                                                                                                                                                                                                        • JohnMakin

                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 6:30 PM

                                                                                                                                                                                                                                                                                                                                                                                                          I am not changing any subject. These are glue logic scripts connecting massive pieces of infra together, spanning what is likely several teams and orgs over the course of many years. It is impossible to blurt something out like "well, source code testing" for something like this, when the source code inputs are not possibly testable outside the scale of the larger system. They're often completely unknowable as well.

                                                                                                                                                                                                                                                                                                                                                                                                          With all due respect, it sounds like you have not worked on these types of systems, but out of curiosity - what type of test do you think would have prevented this?

                                                                                                                                                                                                                                                                                                                                                                                                            • lapcat

                                                                                                                                                                                                                                                                                                                                                                                                              yesterday at 6:45 PM

                                                                                                                                                                                                                                                                                                                                                                                                              With all due respect, it sounds like you have never heard of unit tests.

                                                                                                                                                                                                                                                                                                                                                                                                              Cloudflare states that the compiler would prevent the bug in certain programming languages. So it seems ridiculous to suggest that the bug can't be detected outside the scale of a larger system.

                                                                                                                                                                                                                                                                                                                                                                                              • borplk

                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 7:17 PM

                                                                                                                                                                                                                                                                                                                                                                                                Every time they screw up they write an elaborate postmortem and pat themselves on the back. Don't get me wrong, better have the postmortem than not. But at this point it seems like the only thing they are good at is writing incident postmortem blog posts.

                                                                                                                                                                                                                                                                                                                                                                                                • yesterday at 4:40 PM

                                                                                                                                                                                                                                                                                                                                                                                                  • rudedogg

                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 6:04 PM

                                                                                                                                                                                                                                                                                                                                                                                                    I’m really sick of constantly seeing cloudflare, and their bullshit captchas. Please, look at how much grief they’re causing trying to be the gateway to the internet. Don’t give them this power

                                                                                                                                                                                                                                                                                                                                                                                                    • system2

                                                                                                                                                                                                                                                                                                                                                                                                      yesterday at 6:00 PM

                                                                                                                                                                                                                                                                                                                                                                                                      Is that me, or did CloudFlare outages increase since LLM "engineers" were hired remotely? Do you think there is a correlation?

                                                                                                                                                                                                                                                                                                                                                                                                        • roguecoder

                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 7:24 PM

                                                                                                                                                                                                                                                                                                                                                                                                          They've always been flakey. At least these only impacted their own customers instead of taking down the internet.

                                                                                                                                                                                                                                                                                                                                                                                                      • guluarte

                                                                                                                                                                                                                                                                                                                                                                                                        yesterday at 5:45 PM

                                                                                                                                                                                                                                                                                                                                                                                                        is it me or critical software bugs are more and more common?

                                                                                                                                                                                                                                                                                                                                                                                                        • jgalt212

                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 4:38 PM

                                                                                                                                                                                                                                                                                                                                                                                                          I do kind of like who they are blaming React for this.

                                                                                                                                                                                                                                                                                                                                                                                                          • blibble

                                                                                                                                                                                                                                                                                                                                                                                                            yesterday at 5:00 PM

                                                                                                                                                                                                                                                                                                                                                                                                            amateur level stuff again

                                                                                                                                                                                                                                                                                                                                                                                                            • yesterday at 5:22 PM

                                                                                                                                                                                                                                                                                                                                                                                                              • rvz

                                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 4:04 PM

                                                                                                                                                                                                                                                                                                                                                                                                                > Instead, it was triggered by changes being made to our body parsing logic while attempting to detect and mitigate an industry-wide vulnerability disclosed this week in React Server Components.

                                                                                                                                                                                                                                                                                                                                                                                                                Doesn't Cloudflare rigorously test their changes before deployment to make sure that this does not happen again? This better not have been used to cover for the fact that they are using AI to fix issues like this one.

                                                                                                                                                                                                                                                                                                                                                                                                                Better not be any presence of vibe coders or AI agents being used to be touching such critical pieces of infrastructure at all and I expected Cloudflare to learn from the previous outage very quickly.

                                                                                                                                                                                                                                                                                                                                                                                                                But this is quite a pattern but might need to consider putting the unreliability next to GitHub (which goes down every week).

                                                                                                                                                                                                                                                                                                                                                                                                                • fidotron

                                                                                                                                                                                                                                                                                                                                                                                                                  yesterday at 3:50 PM

                                                                                                                                                                                                                                                                                                                                                                                                                  > This change was being rolled out using our gradual deployment system, and, as part of this rollout, we identified an increase in errors in one of our internal tools which we use to test and improve new WAF rules. As this was an internal tool, and the fix being rolled out was a security improvement, we decided to disable the tool for the time being as it was not required to serve or protect customer traffic.

                                                                                                                                                                                                                                                                                                                                                                                                                  Come on.

                                                                                                                                                                                                                                                                                                                                                                                                                  This PM raises more questions than it answers, such as why exactly China would have been immune.

                                                                                                                                                                                                                                                                                                                                                                                                                    • skywhopper

                                                                                                                                                                                                                                                                                                                                                                                                                      yesterday at 3:58 PM

                                                                                                                                                                                                                                                                                                                                                                                                                      China is probably a completely separate partition of their network.

                                                                                                                                                                                                                                                                                                                                                                                                                        • fidotron

                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 3:59 PM

                                                                                                                                                                                                                                                                                                                                                                                                                          One that doesn't get proactive security rollouts, it would seem.

                                                                                                                                                                                                                                                                                                                                                                                                                            • roguecoder

                                                                                                                                                                                                                                                                                                                                                                                                                              yesterday at 7:22 PM

                                                                                                                                                                                                                                                                                                                                                                                                                              The deploys are very unlikely to be managed from the same system.

                                                                                                                                                                                                                                                                                                                                                                                                                              • skywhopper

                                                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 4:09 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                I assume it was next on the checklist, or assigned to a different ops team.

                                                                                                                                                                                                                                                                                                                                                                                                                    • yesterday at 3:49 PM

                                                                                                                                                                                                                                                                                                                                                                                                                      • theoldgreybeard

                                                                                                                                                                                                                                                                                                                                                                                                                        yesterday at 5:23 PM

                                                                                                                                                                                                                                                                                                                                                                                                                        This is total amateur shit. Completely unacceptable for something as critical as Cloudflare.

                                                                                                                                                                                                                                                                                                                                                                                                                        • Uptrenda

                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 6:30 PM

                                                                                                                                                                                                                                                                                                                                                                                                                          Can't believe one shitty website can take down most of the mainstream web.

                                                                                                                                                                                                                                                                                                                                                                                                                          • da_grift_shift

                                                                                                                                                                                                                                                                                                                                                                                                                            yesterday at 3:54 PM

                                                                                                                                                                                                                                                                                                                                                                                                                            It's not an outage, it's an Availability Incident™.

                                                                                                                                                                                                                                                                                                                                                                                                                            https://blog.cloudflare.com/5-december-2025-outage/#what-abo...

                                                                                                                                                                                                                                                                                                                                                                                                                              • aw1621107

                                                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 9:24 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                From earlier in the very same blog post (emphasis added):

                                                                                                                                                                                                                                                                                                                                                                                                                                > This system does not perform gradual rollouts, but rather propagates changes within seconds to the entire fleet of servers in our network and is under review following the outage we experienced on November 18.

                                                                                                                                                                                                                                                                                                                                                                                                                                • perching_aix

                                                                                                                                                                                                                                                                                                                                                                                                                                  yesterday at 5:07 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                  You jest, but recently I also felt compelled to stop using the word (planned) outage where I work, because it legitimately creates confusion around the (expected) character of impact.

                                                                                                                                                                                                                                                                                                                                                                                                                                  Outage is the nuclear wasteland situation, which given modern architectural choices, is rather challenging to manifest. To avoid it is face-saving, but also more correct.

                                                                                                                                                                                                                                                                                                                                                                                                                              • jchip303

                                                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 7:28 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                [dead]

                                                                                                                                                                                                                                                                                                                                                                                                                                • alwaysroot

                                                                                                                                                                                                                                                                                                                                                                                                                                  yesterday at 4:02 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                  [flagged]

                                                                                                                                                                                                                                                                                                                                                                                                                                  • kosolam

                                                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 5:12 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                    Some nonsense again. The level of negligence there is astounding. This is frightening because this entity is daily exposed to a large portion of our personal data which goes over the wire. As well as business data. It’s just a matter of time before a disaster is going to occur. Some regulatory body must take control in their hands right now.

                                                                                                                                                                                                                                                                                                                                                                                                                                    • websiteapi

                                                                                                                                                                                                                                                                                                                                                                                                                                      yesterday at 3:53 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                      i wonder why they cannot partially rollout. like the other outage they have to do a global rollout.

                                                                                                                                                                                                                                                                                                                                                                                                                                        • usrnm

                                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 3:56 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                          I really don't see how it would've helped. In go or Rust you'd just get a panic, which is in no way different.

                                                                                                                                                                                                                                                                                                                                                                                                                                          • denysvitali

                                                                                                                                                                                                                                                                                                                                                                                                                                            yesterday at 3:58 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                            The article mentions that this Lua-based proxy is the old generation one, which is going to be replaced by the Rust based one (FL2) and that didn't fail on this scenario.

                                                                                                                                                                                                                                                                                                                                                                                                                                            So, if anything, their efforts towards a typed language were justified. They just didn't manage to migrate everything in time before this incident - which is ironically a good thing since this incident was cause mostly by a rushed change in response to an actively exploited vulnerability.

                                                                                                                                                                                                                                                                                                                                                                                                                                              • websiteapi

                                                                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 4:09 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                yes, but as the article states why are they doing global fast rollouts?

                                                                                                                                                                                                                                                                                                                                                                                                                                                  • denysvitali

                                                                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 4:19 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                    I think (would love to be corrected) that this is the nature of their service. They probably push multiple config changes per minute to mitigate DDOS attacks. For sure the proxies have a local list of IPs that, for a period of time, are blacklisted.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    For DDOS protection you can't really rely on multiple-hours rollouts.
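The trade-off discussed above can be sketched as a staged rollout gated by a health check. This is only an illustration under stated assumptions: `Rollout`, `staged_rollout`, the stage shares and the `error_rate` probe are all hypothetical names, not anything from Cloudflare's systems. The stages could be seconds apart rather than hours; the point is only that a bad change is stopped at the first stage it reaches.

```rust
/// Outcome of a staged rollout over hypothetical stages.
#[derive(Debug, PartialEq)]
enum Rollout {
    Completed,
    HaltedAt { stage: usize, fleet_share: f64 },
}

/// Pushes a change stage by stage. `stages` holds cumulative fleet shares
/// (for example 0.01, 0.1, 1.0). `error_rate` is a hypothetical health probe
/// that returns the error rate observed once the change reaches that share.
/// The rollout halts at the first stage whose error rate exceeds the limit.
fn staged_rollout(
    stages: &[f64],
    max_error_rate: f64,
    error_rate: impl Fn(f64) -> f64,
) -> Rollout {
    for (stage, &share) in stages.iter().enumerate() {
        if error_rate(share) > max_error_rate {
            return Rollout::HaltedAt { stage, fleet_share: share };
        }
    }
    Rollout::Completed
}
```

With stages `[0.01, 0.1, 1.0]` a faulty change is halted while it covers 1% of the fleet; with the single stage `[1.0]` (a global push) the same check only fires after the whole fleet has it.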

                                                                                                                                                                                                                                                                                                                                                                                                                                          • barbazoo

                                                                                                                                                                                                                                                                                                                                                                                                                                            yesterday at 3:41 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                            > Customers that did not have the configuration above applied were not impacted. Customer traffic served by our China network was also not impacted.

                                                                                                                                                                                                                                                                                                                                                                                                                                            Interesting.

                                                                                                                                                                                                                                                                                                                                                                                                                                              • flaminHotSpeedo

                                                                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 3:53 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                They kinda buried the lede there, 28% failure rate for 100% of customers isn't the same as 100% failure rate for 28% of customers
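To make the distinction concrete, here is a small sketch with made-up numbers: 100 hypothetical customers with equal traffic, where only the 28% figure comes from the comment. Both scenarios yield the same aggregate failure rate, but the number of customers who notice differs sharply.

```rust
/// Aggregate failure rate across customers, assuming equal traffic per customer.
fn aggregate_failure(rates: &[f64]) -> f64 {
    rates.iter().sum::<f64>() / rates.len() as f64
}

/// Number of customers that saw at least some failed requests.
fn customers_affected(rates: &[f64]) -> usize {
    rates.iter().filter(|r| **r > 0.0).count()
}

/// Scenario A: every one of `n` customers fails `rate` of its requests.
fn spread_evenly(n: usize, rate: f64) -> Vec<f64> {
    vec![rate; n]
}

/// Scenario B: `hit` customers fail all requests, the remaining `n - hit` fail none.
fn concentrated(n: usize, hit: usize) -> Vec<f64> {
    let mut rates = vec![1.0; hit];
    rates.resize(n, 0.0);
    rates
}
```

Under these assumptions, `spread_evenly(100, 0.28)` and `concentrated(100, 28)` both aggregate to about 0.28, yet the first affects 100 customers and the second 28.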

                                                                                                                                                                                                                                                                                                                                                                                                                                            • jpeter

                                                                                                                                                                                                                                                                                                                                                                                                                                              yesterday at 3:39 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                              Unwrap() strikes again

                                                                                                                                                                                                                                                                                                                                                                                                                                                • dap

                                                                                                                                                                                                                                                                                                                                                                                                                                                  yesterday at 3:50 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                  I guess you’re being facetious but for those who didn’t click through:

                                                                                                                                                                                                                                                                                                                                                                                                                                                  > This type of code error is prevented by languages with strong type systems. In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • skywhopper

                                                                                                                                                                                                                                                                                                                                                                                                                                                      yesterday at 3:59 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                      That bit may be true, but the underlying error of a null reference that caused a panic was exactly the same in both incidents.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • roguecoder

                                                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 7:15 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                          Yep: it is wild for them to claim that a strongly-typed language would have saved them when it didn't.

                                                                                                                                                                                                                                                                                                                                                                                                                                                          Relying on language features instead of writing code well will always eventually backfire.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            • dap

                                                                                                                                                                                                                                                                                                                                                                                                                                                              yesterday at 7:29 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                              You're right that you have to "write code well" to prevent this sort of thing. It's also true that Rust's language features, if you use them, can make this sort of mistake a compile-time error rather than something that only blows up at runtime under the wrong conditions. The problem with their last outage was that somebody explicitly opted out of the tool provided by the language. As you say, that's "not writing code well". But I think you're dismissing the value of the language feature in helping you write code well.
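A minimal sketch of that point, using hypothetical names (`RuleResult`, `execute`) rather than anything from Cloudflare's code: an optional field modelled as `Option` forces the missing case to be handled, unless the author opts out with `unwrap()`.

```rust
/// Hypothetical rule result; the field is optional, so it may be absent.
#[derive(Debug)]
struct RuleResult {
    execute: Option<String>,
}

/// Handles the missing case explicitly. The compiler rejects a `match`
/// on an `Option` that leaves out the `None` arm.
fn action_checked(r: &RuleResult) -> Result<&str, String> {
    match &r.execute {
        Some(action) => Ok(action.as_str()),
        None => Err("rule has no execute field".to_string()),
    }
}

/// Opts out of the check: this compiles, but panics at runtime
/// whenever the field is `None`.
fn action_unchecked(r: &RuleResult) -> &str {
    r.execute.as_deref().unwrap()
}
```

Both functions compile; only the second can bring the process down when the field is missing, which is the "explicitly opted out" case described above.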

                                                                                                                                                                                                                                                                                                                                                                                                                                                  • throwawaymaths

                                                                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 3:41 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                    this time in lua. cloudflare can't catch a break

                                                                                                                                                                                                                                                                                                                                                                                                                                                      • RoyTyrell

                                                                                                                                                                                                                                                                                                                                                                                                                                                        yesterday at 3:45 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Or they're not thoroughly testing changes before pushing them out. As I've seen some others say, CloudFlare at this point should be considered critical infrastructure. Maybe not like power but dang close.

                                                                                                                                                                                                                                                                                                                                                                                                                                                          • esseph

                                                                                                                                                                                                                                                                                                                                                                                                                                                            yesterday at 5:04 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                            My power goes out every Wednesday around noon and normally if the weather is bad. In a major US metro.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            I hope cloudflare is far more resilient than local power.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • gcau

                                                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 3:45 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                          The 'rewrite it in lua' crowd are oddly silent now.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            • barbazoo

                                                                                                                                                                                                                                                                                                                                                                                                                                                              yesterday at 3:55 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                              How do you know?

                                                                                                                                                                                                                                                                                                                                                                                                                                                              • infrcg

                                                                                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 4:03 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                [flagged]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • jcmfernandes

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 4:06 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Did you really go through the trouble of creating an account just to spit trash? Damn!

                                                                                                                                                                                                                                                                                                                                                                                                                                                            • lexoj

                                                                                                                                                                                                                                                                                                                                                                                                                                                              yesterday at 8:00 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                              Anyone knows why lua? Or is it perhaps as a redis script in lua?

                                                                                                                                                                                                                                                                                                                                                                                                                                                              • rvz

                                                                                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 4:09 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                Time to use boring languages such as Java and Go.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            • yesterday at 3:44 PM