
Cloudflare crawl endpoint

317 points - yesterday at 10:27 PM

Source
  • RamblingCTO

    today at 8:43 AM

    Doesn't work for pages protected by cloudflare in my experience. What a shame, they could've produced the problem and sold the solution.

      • chvid

        today at 9:03 AM

        As long as it gets past Azure's bot protection ...

    • jasongill

      yesterday at 11:08 PM

      I'm surprised that Cloudflare hasn't started hosting a pre-scraped version of websites that use Cloudflare's proxy - something like https://www.example.com/cdn-cgi/cached-contents.json They already have the website content in their cache, so why not just cut out the middle man of scraping services and API's like this and publish it?

      Obviously there's good reasons NOT to, but I am surprised they haven't started offering it (as an "on-by-default" option, naturally) yet.

        • cortesoft

          today at 4:35 AM

          Well, the conversion process into the JSON representation is going to take CPU, and then you have to store the result, in essence doubling your cache footprint.

          Doing it on demand still utilizes their cached version, so it saves a trip to the origin, but doesn’t require doubling the cache size. They can still cache the results if the same site is scraped multiple times, but this saves having to cache things that are never going to be requested.

          Cache footprint management is a huge factor in the cost and performance for a CDN, you want to get the most out of your storage and you want to serve as many pages from cache as possible.

          I know in my experience working for a CDN, we were doing all sorts of things to try to maximize the hit rate for our cache. In fact, one of the easiest and most effective techniques for increasing cache hit rate is to do the OPPOSITE of what you are suggesting; instead of pre-caching content, you do ‘second hit caching’, where you only store a copy in the cache if a piece of content is requested a second time.

          The idea is that a lot of content is requested only once by one user, and then never again, so it is a waste to store it in the cache. If you wait until it is requested a second time before you cache it, you avoid those single use pages going into your cache, and don’t hurt overall performance that much, because the content that is most useful to cache is requested a lot, and you only have to make one extra origin request.
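          A minimal sketch of the second-hit caching idea (the class and names here are illustrative, not any particular CDN's implementation):

```python
# Second-hit caching: only store an object once it has been requested
# a second time, so single-use pages never take up cache space.

class SecondHitCache:
    def __init__(self):
        self.cache = {}         # url -> content actually stored
        self.seen_once = set()  # urls requested exactly once so far

    def fetch(self, url, fetch_origin):
        if url in self.cache:
            return self.cache[url]       # cache hit: no origin trip
        content = fetch_origin(url)      # cache miss: go to origin
        if url in self.seen_once:
            self.cache[url] = content    # second miss: now worth caching
            self.seen_once.discard(url)
        else:
            self.seen_once.add(url)      # first miss: remember, don't store
        return content
```

          Single-use pages never enter the cache; popular content pays exactly one extra origin request before it starts being served from cache.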

          • selcuka

            today at 12:24 AM

            Not the same thing, but they have something close (it's not on-by-default, yet) [1]:

            > Cloudflare's network now supports real-time content conversion at the source, for enabled zones using content negotiation headers. Now when AI systems request pages from any website that uses Cloudflare and has Markdown for Agents enabled, they can express the preference for text/markdown in the request. Our network will automatically and efficiently convert the HTML to markdown, when possible, on the fly.

            [1] https://blog.cloudflare.com/markdown-for-agents/
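            The content-negotiation pattern the post describes can be sketched server-side; this is a simplified illustration of the Accept-header mechanism, not Cloudflare's actual implementation (the `negotiate` helper and the converter are made up):

```python
# Content negotiation sketch: the client states a preference for
# text/markdown in the Accept header; the edge converts HTML when it
# can, and falls back to HTML otherwise.

def negotiate(accept_header, html_body, to_markdown):
    """Pick a representation based on the client's Accept header."""
    # Strip quality parameters like ";q=0.9" and surrounding whitespace
    preferences = [part.split(";")[0].strip()
                   for part in accept_header.split(",")]
    if "text/markdown" in preferences:
        return "text/markdown", to_markdown(html_body)
    return "text/html", html_body
```

            With `Accept: text/markdown, text/html;q=0.9` the caller gets markdown; a plain browser Accept header falls through to HTML unchanged.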

            • michaelmior

              yesterday at 11:50 PM

              > I'm surprised that Cloudflare hasn't started hosting a pre-scraped version of websites that use Cloudflare's proxy

              It's entirely possible that they're doing this under the hood for cases where they can clearly identify the content they have cached is public.

                • janalsncm

                  today at 12:27 AM

                  How would they know the content hasn’t changed without hitting the website?

                    • coreq

                      today at 2:36 AM

                      They wouldn't. Well, there's ETag and the like, but that's still a layer-7 round trip to the origin. The general pattern, though, is to state in the response headers how long the content is good for, and cache for that duration. For example, a bitcoin pricing aggregator might say it's good for 60 seconds (with disclaimers on the page that this isn't market data), while My Little Town news might say an article is good for an hour (to allow updates) and the homepage is good for 5 minutes, so a breaking news article doesn't appear too far behind.
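                      That freshness pattern, sketched as per-route Cache-Control values (the routes and durations mirror the examples above and are made up for illustration):

```python
# Per-route freshness: each route declares how long its content is
# "good for" via Cache-Control max-age; unknown routes always revalidate.

FRESHNESS_SECONDS = {
    "/prices": 60,     # price ticker: good for 60 seconds
    "/article": 3600,  # news article: good for an hour, allows updates
    "/": 300,          # homepage: 5 minutes, so breaking news surfaces
}

def cache_control(path):
    max_age = FRESHNESS_SECONDS.get(path)
    if max_age is None:
        return "no-cache"  # unknown route: always revalidate (e.g. via ETag)
    return f"public, max-age={max_age}"
```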

                      • OptionOfT

                        today at 3:16 AM

                        Caching headers?

                        (Which, on Akamai, are by default ignored!)

                        • cortesoft

                          today at 4:36 AM

                          Keeping track of when content changes is literally the primary function of a CDN.

                      • binarymax

                        today at 12:11 AM

                        Based on the post, it seems likely that they'd just delay per the robots.txt policy no matter what, and do a full browser render of the cached page to get the content. Probably overkill for lots and lots of sites. An HTML fetch + readability is really cheap.

                    • hrmtst93837

                      today at 8:54 AM

                      Offering wholesale cache dumps blows up every assumption about origin privacy and copyright. Suddenly you are one toggle away from someone else automatically harvesting and reselling your work with Cloudflare as the unwitting middle tier.

                      You could try to gate this behind access controls but at that point you have reinvented a clunky bespoke CDN API that no site owner asked for, plus a fresh legal mess. Static file caches work because they only ever respond to the original request, not because they claim to own or index your content.

                      It is a short path from "helpful pre-scraped JSON" to handing an entire site to an AI scraper-for-hire with zero friction. The incentives do not line up unless you think every domain on Cloudflare wants their content wholesale exported by default.

                    • cmsparks

                      yesterday at 11:33 PM

                      That would prolly work for simple sites, but you still need the dedicated scraping service with a browser to render sites that are more complex (e.g. SPAs)

                      • csomar

                        today at 12:01 AM

                        It’s a bit more complicated than that. This is their product Browser Rendering, which runs a real browser that loads the page and executes JavaScript. It’s a bit more involved than a simple curl scraping.

                          • randomtools

                            today at 8:46 AM

                            So does that mean it can replace serpapi or similar?

                    • ljm

                      yesterday at 11:35 PM

                      Is cloudflare becoming a mob outfit? Because they are selling scraping countermeasures but are now selling scraping too.

                      And they can pull it off because of their reach over the internet with the free DNS.

                        • rendaw

                          today at 6:17 AM

                          I think the simple explanation is that they weren't selling scraping countermeasures, they were selling web-based denial of service protection (which may be caused by scrapers).

                            • PeterStuer

                              today at 7:47 AM

                              Ask yourself, why would a scraper ddos? Why would a ddos-protection vendor ddos?

                                • c0balt

                                  today at 10:40 AM

                                  The number of git forges behind Anubis et al and the numerous public announcements should be enough.

                                  Scrapers seem to be exceedingly careless in using public resources. The problem is often not even DDoS (as in overwhelming bandwidth usage) but rather DoS through excessive hits on expensive routes.

                                  • wongarsu

                                    today at 10:18 AM

                                    Because the scraper is either impatient, careless or indifferent; and if they scrape for training data they don't plan to come back. If they don't plan to come back they don't care if you tighten up crawling protections after they have moved on. In fact they are probably happy that they got their data and their competition won't

                            • andrepd

                              today at 10:36 AM

                              Well this scraper honours robots.txt so I'm sure most AI crawlers will find it useless.

                              • iso-logi

                                yesterday at 11:57 PM

                                Their free DNS is only a small piece of the pie.

                                The fact that 30%+ of the web relies on their caching, routing, and DDoS protection services is the main pull.

                                Their DNS is only really for data collection and to front as "good will"

                                  • jen729w

                                    today at 4:09 AM

                                    > The fact that 30%+ of the web relies on their caching services

                                    30% of the web might use their caching services. 'Relies on' implies that it wouldn't work without them, which I doubt is the case.

                                    It might be the case for the biggest 1% of that 30%. But not the whole lot.

                                      • reddalo

                                        today at 8:21 AM

                                        >'Relies on' implies that it wouldn't work without them

                                        Last time Cloudflare went down, their dashboard was also unavailable, so you couldn't turn off their proxy service anyway.

                                • shadowfiend

                                  today at 12:20 AM

                                  No: https://developers.cloudflare.com/browser-rendering/rest-api...

                                    • oefrha

                                      today at 3:28 AM

                                      That's not the perfect defense you think it is. Plenty of robots.txts[1] technically allow scraping their main content pages as long as your user-agent isn't explicitly disallowed, but in practice they're behind Cloudflare, so they still throw up the Cloudflare bot check if you actually attempt to crawl.

                                      And forget about crawling. If you have a less reputable IP (basically every IP in third world countries is less reputable, for instance), you can be CAPTCHA'ed to no end by Cloudflare even as a human user, on the default setting, so plenty of site owners with more reputable home/office IPs don't even know what they subject a subset of their users to.

                                      [1] E.g. https://www.wired.com/robots.txt to pick an example high up on HN front page.

                                  • its-kostya

                                    today at 12:00 AM

                                     Cloudflare has been trying to mediate between publishers and AI companies. If publishers are behind Cloudflare and Cloudflare's bot detection stops scrapers at the publishers' request, the publishers can allow their data to be scraped (via this endpoint) for a price. It creates market scarcity. I don't believe the target audience is you and me, unless you own a very popular blog that AI companies would pay you for.

                                      • PeterStuer

                                        today at 7:34 AM

                                        Next step will be their default "free" anti-bot denying all but their own bot. They know full well nearly nobody changes the default.

                                    • theamk

                                      yesterday at 11:56 PM

                                      no? it takes 10 seconds to check:

                                      > The /crawl endpoint respects the directives of robots.txt files, including crawl-delay. All URLs that /crawl is directed not to crawl are listed in the response with "status": "disallowed".

                                      You don't need any scraping countermeasures for crawlers like those.
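                                      For reference, checking directives like those takes a few lines with Python's standard library (the robots.txt rules below are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt and check what a polite crawler may fetch,
# and how long it must wait between requests.
rules = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

allowed = rp.can_fetch("*", "https://example.com/blog/post")     # allowed
blocked = rp.can_fetch("*", "https://example.com/private/page")  # disallowed
delay = rp.crawl_delay("*")                                      # seconds to wait
```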

                                        • Macha

                                          today at 1:10 AM

                                          So what’s the user agent for their bot? They don’t seem to specify the default in the docs, and it looks like it’s user-configurable. So it’s yet another opt-out bot that you need your web server to match with special behaviour to block

                                            • flanksteak20

                                              today at 5:58 AM

                                              Isn't this covered here? https://developers.cloudflare.com/browser-rendering/referenc...

                                                • Macha

                                                  today at 9:25 AM

                                                  No, hence all their examples using User-Agent: *

                                              • gruez

                                                today at 1:14 AM

                                                >So yet another opt out bot which you need your web server to match on special behaviour to block

                                                Given that malicious bots are allegedly spoofing real user agents, "another user agent you have to add to your list" seems like the least of your problems.

                                                  • AdamN

                                                    today at 9:45 AM

                                                    Not 'allegedly' - it's just a fact. Even if you're not malicious however it's still sometimes necessary because the server may have different sites for different browsers and check user agents for the experience they deliver. So then even for legitimate purposes you need to at least use the prefix of the user agent that the server expects.

                                            • PeterStuer

                                              today at 7:50 AM

                                              Like they explain in the docs, their crawler will respect robots.txt's disallowed user-agents, right after the section that explains how to change your user-agent.

                                          • isodev

                                            today at 4:26 AM

                                            They always have been.

                                            They also use their dominant position to apply political pressure when they don’t like how a country chooses to run things.

                                            So yeah, we’ve created another mega corp monster that will hurt for years to come.

                                            • subscribed

                                              today at 12:58 AM

                                              I think there's some space between the countless bots of everyone, ignoring everything and pulling from residential proxies, and this supposedly slower, well-behaved, smarter bot.

                                              Like there's a difference between dozens of drunk teenagers thrashing the city streets in an illegal street race vs a taxi driver.

                                              • pocksuppet

                                                today at 3:31 AM

                                                Was it ever not one? They protect a lot of DDoS-for-hire sites from DDoS by their competitors. In return they increase the quantity of DDoS on the internet. They offer you a service for $150, then months later suddenly demand $150k in 24 hours or they shut down your business. If you use them as a domain registrar they will hold your domain hostage.

                                              • rrr_oh_man

                                                yesterday at 11:45 PM

                                                [flagged]

                                                  • stri8ted

                                                    yesterday at 11:58 PM

                                                    Do you have any evidence to support this view?

                                                      • pocksuppet

                                                        today at 3:32 AM

                                                        Who else would MITM 30% of the internet?

                                                        • rolymath

                                                          today at 12:55 AM

                                                          Read who and how it was founded. It's not a secret at all.

                                                      • mtmail

                                                        yesterday at 11:58 PM

                                                        Any kind of source for the claim?

                                                    • Retr0id

                                                      yesterday at 11:48 PM

                                                      For a long time cloudflare has proudly protected DDoS-as-a-service sites (but of course, they claim they don't "host" them)

                                                        • Dylan16807

                                                          today at 3:31 AM

                                                          Are you using the word "claim" to call them wrong or for a more confusing reason?

                                                          Because I'm pretty sure they are not in fact wrong.

                                                            • Retr0id

                                                              today at 3:46 AM

                                                              The distinction between a caching proxy and an origin server is pretty meaningless when you're serving static content, if you ask me.

                                                                • Dylan16807

                                                                  today at 4:36 AM

                                                                  There's a blurry line there, true.

                                                                  On the other hand when a page is small and static enough that it's basically just a flyer, I also care a lot less about who hosts it.

                                                      • giancarlostoro

                                                        today at 12:27 AM

                                                        If they ever sell or the CEO shifts, yes. For the meantime, they have not given any strong indication that they're trying to bully anybody. I could see things changing drastically if the people in charge are swapped out.

                                                    • allixsenos

                                                      today at 10:06 AM

                                                      "Selling the wall and the ladder."

                                                      "Biggest betrayal in tech."

                                                      "Protection racket."

                                                      These hot takes sound smart but they're not.

                                                      The web was built to be open and available to everyone. Serving static HTML from disk back in the day, nobody could hurt you because there was nothing to hurt.

                                                      We need bot protection now because everything is dynamic, straight from the database with some light caching for hot content. When Facebook decides to recrawl your one million pages in the same instant, you're very much up shit creek without a paddle. A bot that crawls the full site doesn't steal anything, but it does take down the origin server. My clients never call me upset that a bot read their blog posts. They call because the bot knocked the site offline for paying customers.

                                                      Bot protection protects availability, not secrecy.

                                                      And the real bot problem isn't even crawling. It's automated signups. Fake accounts messaging your users. Bots buying out limited drops before a human can load the page. Like-farming. Credential stuffing. That's what bot protection is actually for: preventing fraud, not preventing someone from reading your public website.

                                                      Cloudflare's `/crawl` respects robots.txt. Don't want your content crawled, opt out. But if you want it indexed and can't handle the traffic spike, this gets your content out without hammering production.

                                                      As for the folks saying Cloudflare should keep blocking all crawlers forever: AI agents already drive real browsers. They click, scroll, render JavaScript. Go look at what browser automation frameworks can do today and then explain to me how you tell a bot from a person. That distinction is already gone. The hot takes are about a version of the internet that doesn't exist anymore.

                                                      • Lasang

                                                        today at 1:44 AM

                                                        The idea of exposing a structured crawl endpoint feels like a natural evolution of robots.txt and sitemaps.

                                                        If more sites provided explicit machine-readable entry points for crawlers, indexing could become a lot less wasteful. Right now crawlers spend a lot of effort rediscovering the same structure over and over.

                                                        It also raises interesting questions about whether sites will eventually provide different views for humans vs. automated agents in a more formalized way.

                                                          • _heimdall

                                                            today at 1:53 AM

                                                            I expect that if we still used REST, indexing would be even less wasteful.

                                                            I've found myself falling pretty hard on the side of making APIs work for humans and expecting LLM providers to optimize around that. I don't need an MCP for a CLI tool, for example, I just need a good man page or `--help` documentation.

                                                            • berkes

                                                              today at 7:51 AM

                                                              I know in practice it no longer is the case, if it ever was.

                                                              But semantic HTML is exactly that explicit machine-readable entrypoint. I am firmly entrenched in the opinion that HTML, and the DOM, is only for machines to read; it just happens to be also somewhat understandable to some humans. Take an average webpage and have a look at all the characters (bytes) in there: often two thirds won't ever be shown to humans.

                                                              Point being: we don't need to invent something new. We just need to realize we already have it and use it correctly. Other than this requiring better understanding of web tech, it has no downsides. The low hanging fruit being the frameworks out there that should really do a better job of leveraging semantics in their output.

                                                              • PeterStuer

                                                                today at 7:38 AM

                                                                The only ones benefitting from 'wasteful' crawling are the anti-bot solution vendors. Everyone else is incentivized to crawl as efficiently as possible.

                                                                Makes you think, right?

                                                                • pocksuppet

                                                                  today at 3:32 AM

                                                                  Apart from the obvious problem: presenting something different to crawlers and humans.

                                                                  • catlifeonmars

                                                                    today at 1:48 AM

                                                                    > It also raises interesting questions about whether sites will eventually provide different views for humans vs. automated agents in a more formalized way.

                                                                    This raises a further interesting question: would it exacerbate supply chain injection attacks? Show the innocuous page to the human, another to the bot.

                                                                    • rglover

                                                                      today at 2:14 AM

                                                                      I just do a query param to toggle to markdown/text if ?llm=true on a route. Easy pattern that's opt-in.
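                                                                      That opt-in toggle could look something like this (a sketch; the handler shape and response bodies are made up for illustration):

```python
from urllib.parse import urlparse, parse_qs

# Serve a markdown rendering when the request carries ?llm=true,
# and the normal HTML page otherwise.

def render(url, html_body, markdown_body):
    query = parse_qs(urlparse(url).query)
    if query.get("llm") == ["true"]:
        return "text/markdown", markdown_body
    return "text/html", html_body
```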

                                                                      • pdntspa

                                                                        today at 2:43 AM

                                                                        They already do...

                                                                        A lot of known crawlers will get a crawler-optimized version of the page

                                                                          • rafram

                                                                            today at 3:04 AM

                                                                            Do they? AFAIK Google forbids that, and they’ll occasionally test that you aren’t doing it.

                                                                              • pdntspa

                                                                                today at 3:05 AM

                                                                                I haven't checked in a while but I know for a fact that Amazon does or did it

                                                                                • 6510

                                                                                  today at 3:35 AM

                                                                                  With Google covering only 3%, I wonder how much people still care and if they should. Funny: I own and know sites that are by far the best resource on the topic, but Google says they shouldn't have so many links. It's like I ask you for a page about cuban chains and you say you don't have it because they had too many links. Or your greengrocer suddenly doesn't have apples because his supplier now offers more than 5 different kinds, so he will never buy there again.

                                                                      • ramblurr

                                                                        today at 7:41 AM

                                                                        It seems like there's a missed use case: web archiving. I don't see any mention of WARC as an output format. This could be useful to journalists and academics if they had it.

                                                                        • everfrustrated

                                                                          today at 12:01 AM

                                                                          Will this crawler be run behind or in front of their bot blocker logic?

                                                                        • arjie

                                                                          today at 1:03 AM

                                                                          Oh man, I was hoping I could offer a nicely-crawled version of my site. It would be cool if they offered that for site admins. Then everyone who wanted to crawl would just get a thing they could get for pure transfer cost. I suppose I could build one by submitting a crawl job against myself and then offering a `static.` subdomain on each thing that people could access. Then it's pure HTML instant-load.

                                                                            • echoangle

                                                                              today at 1:17 AM

                                                                              I don’t really get the usecase. Is your site static? Then you should just render it to html files and host the static files. And if it’s not static, how would a snapshot of the pages help if they change later? And also why not just add some caching to the site then?

                                                                                • arjie

                                                                                  today at 3:46 AM

                                                                                  Ah the use-case is archive.org but fast. But it's okay. Before I die I will make the static copy of my site myself.

                                                                          • devnotes77

                                                                            today at 12:14 AM

                                                                            Worth noting: origin owners can still detect and block CF Browser Rendering requests if needed.

                                                                            Workers-originated requests include a CF-Worker header identifying the workers subdomain, which distinguishes them from regular CDN proxying. You can match on this in a WAF rule or origin middleware.

                                                                            The trickier issue: rendered requests originate from Cloudflare ASN 13335 with a low bot score, so if you rely on CF bot scores for content protection, requests through their own crawl product will bypass that check. The practical defense is application-layer rate limiting and behavioral analysis rather than network-level scores -- which is better practice regardless.
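                                                                            A sketch of that header check at the application layer (simplified; a real deployment would put this in a WAF rule or origin middleware, with rate limiting behind it):

```python
# Flag requests carrying a CF-Worker header before they hit the app.
# The handler and responses here are illustrative only.

def is_worker_request(headers):
    # HTTP header names are case-insensitive
    return "cf-worker" in {k.lower() for k in headers}

def handle(headers):
    if is_worker_request(headers):
        return 403, "Workers-originated request blocked"
    return 200, "ok"
```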

                                                                            The structural conflict is real but similar to search engines offering webmaster tools while running the index. The incentives are misaligned, but the individual products have independent utility. The harder question is whether the combination makes it meaningfully harder to build effective bot protection on top of their platform.

                                                                              • efilife

                                                                                today at 8:29 AM

                                                                                LLM generated comment

                                                                                • azinman2

                                                                                  today at 4:41 AM

                                                                                  They say they obey robots.txt - isn’t that the easier way?

                                                                              • pupppet

                                                                                yesterday at 11:43 PM

                                                                                Cloudflare getting all the cool toys. AWS, anyone awake over there?

                                                                                • jppope

                                                                                  yesterday at 11:58 PM

                                                                                  This is actually really amazing. Cloudflare is just skating to where the puck is going to be on this one.

                                                                                  • patchnull

                                                                                    today at 12:19 AM

                                                                                    The main win here is abstracting away browser context lifecycle management. Anyone who has run Puppeteer on Workers knows the pain of handling cold starts, context reuse, and timeout cascading across navigation steps. Having crawl() bundle render-then-extract into one call covers maybe 80% of scraping use cases. The remaining 20% where you need request interception or pre-render script injection still needs the full Browser Rendering API, but for pulling structured data from public pages this is a big simplification over managing session state yourself.

                                                                                    • binarymax

                                                                                      today at 12:08 AM

                                                                                      Really hard to understand costs here. What is a reasonable pages per second? Should I assume with politeness that I'm basically at 1 page per second == 3600 pages/hour? Seems painfully slow.

                                                                                      • skybrian

                                                                                        today at 12:55 AM

                                                                                        If two customers crawl the same website and it uses crawl-delay, how does it handle that? Are they independent, or does each one run half as fast?

                                                                                          • PeterStuer

                                                                                            today at 7:56 AM

                                                                                            You put a governor on the domain, and you return from the cache instead.

                                                                                        • ed_mercer

                                                                                          today at 1:37 AM

                                                                                          > Honors robots.txt directives, including crawl-delay

                                                                                          Sounds pretty useless for any serious AI company

                                                                                            • PeterStuer

                                                                                              today at 8:02 AM

                                                                                              What % of sites have a content update volume that exceeds what you can get respecting crawl delay?

                                                                                              If your delay is 1s and you publish less than 60 updates a minute on average I can still get 100%. Most crawls are not that latency sensitive, certainly not the ai ones.

                                                                                              HFT bots, now that is an entirely different ballgame.
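                                                                                              The back-of-envelope math above, as a sketch (assuming a uniform publish rate):

```python
def max_fetches_per_minute(crawl_delay_s: float) -> float:
    """A polite crawler waiting crawl_delay_s between requests."""
    return 60.0 / crawl_delay_s

def can_mirror_fully(crawl_delay_s: float, updates_per_minute: float) -> bool:
    """True if the site publishes no faster than the crawler may fetch."""
    return updates_per_minute <= max_fetches_per_minute(crawl_delay_s)
```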

                                                                                                • mrweasel

                                                                                                  today at 10:00 AM

                                                                                                  > Most crawls are not that latency sensitive, certainly not the ai ones.

                                                                                                  They certainly behave like they are. We constantly see crawlers trying to do cache busting for pages that haven't changed in days, if not weeks. It's hard to tell where the bots are coming from these days, as most have taken to just lying and saying they are Chrome.

                                                                                                  I'd agree that respecting robots.txt makes this a non-starter for the problematic scrapers. These are bots that will hammer a site into the ground; they don't respect robots.txt, especially if it tells them to go away.

                                                                                                  All of this would be much less of a problem if the authors of the scrapers actually knew how to code, understood how the Internet works, and had just the slightest bit of respect for others. But they don't, so now all scrapers are labeled as hostile, meaning that only the very largest companies, like Google, get special access.

                                                                                          • triwats

                                                                                            yesterday at 11:03 PM

                                                                                            This could be cool: using Cloudflare's edge to do synthetic monitoring of endpoints' actual content.

                                                                                            • fbrncci

                                                                                              today at 2:06 AM

                                                                                              Awesome, so I no longer have to use Firecrawl or my own crawler to scrape entire websites for an agent? Especially when needing residential proxies to do so on Cloudflare protected sites? Why though?

                                                                                                • freakynit

                                                                                                  today at 2:15 AM

                                                                                                  I have tried theirs... they are NOT proxies... which means the majority of popular sites actually block scraping, even if they are protected by Cloudflare itself.

                                                                                              • arjunchint

                                                                                                today at 1:25 AM

                                                                                                RIP @FireCrawl or at the very least they were the inspiration for this?

                                                                                                • radium3d

                                                                                                  today at 12:24 AM

                                                                                                  Instead of "should have been an email" this is "should have been a prompt" and can be run locally instead. There are a number of ways to do this from a linux terminal.

                                                                                                  ``` write a custom crawler that will crawl every page on a site (internal links to the original domain only), scroll down to mimic a human, and save the output as a WebP screenshot, HTML, Markdown, and structured JSON. Make it designed to run locally in a terminal on a Linux machine using headless Google Chrome, and take advantage of multiple cores to run multiple pages simultaneously, while keeping in mind that it might have to throttle if the server gets hit too fast from the same IP. ```

                                                                                                  Might use available open-source software such as Python, Playwright, BeautifulSoup4, Pillow, aiofiles, or Trafilatura.
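                                                                                                  A stdlib-only sketch of the politeness layer such a crawler needs, honoring robots.txt (including crawl-delay) and staying on the original domain (hypothetical agent name; rendering and extraction would still come from Playwright/Trafilatura):

```python
import urllib.robotparser
from urllib.parse import urlparse

AGENT = "example-local-crawler"  # hypothetical user agent string

def make_policy(robots_lines):
    """Parse robots.txt lines into a parser plus the crawl delay to honor."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_lines)
    delay = rp.crawl_delay(AGENT) or 0
    return rp, delay

def crawlable(seed, links, rp):
    """Keep only same-domain links that robots.txt permits."""
    host = urlparse(seed).netloc
    return [u for u in links
            if urlparse(u).netloc == host and rp.can_fetch(AGENT, u)]
```

                                                                                                  The frontier loop would sleep `delay` seconds between fetches and feed new links back through `crawlable`.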

                                                                                                    • Normal_gaussian

                                                                                                      today at 12:55 AM

                                                                                                      This presumably is going to be cheap and effective. It's much easier to wrap a prompt around this and know it works than to mess around with crawling it all yourself.

                                                                                                      You'll still be hand-rolling it if you want to disrespect crawling requirements though.

                                                                                                      • supermdguy

                                                                                                        today at 1:09 AM

                                                                                                        I’ve actually written a crawler like that before, and still ended up going with Firecrawl for a more recent project. There’s just so many headaches at scale: OOMs from heavy pages, proxies for sites that block cloud IPs, handling nested iframes, etc.

                                                                                                        • Keyframe

                                                                                                          today at 5:45 AM

                                                                                                          That'd be more like that draw an owl meme. Devil's in the details. Holy shit, there's so many details...

                                                                                                      • Normal_gaussian

                                                                                                        today at 1:06 AM

                                                                                                        "Well-behaved bot - Honors robots.txt directives, including crawl-delay"

                                                                                                        From the behaviour of our peers, this seems to be the real headline news.

                                                                                                        • today at 12:04 AM

                                                                                                          • coreq

                                                                                                            today at 2:32 AM

                                                                                                            The big question here: is this a verified bot on the Cloudflare WAF? Didn't Google get into trouble for using their search engine user agent and IPs to feed Gemini in Europe?

                                                                                                            • babelfish

                                                                                                              today at 12:00 AM

                                                                                                              Didn't they just throw a (very public) fit over Perplexity doing the exact same thing?

                                                                                                                • fleebee

                                                                                                                  today at 1:50 AM

                                                                                                                  The most egregious thing Perplexity did was to straight up ignore robots.txt. Cloudflare promises not to do that, so if we take their word for it, it's quite a different setup.

                                                                                                                  That said, I'm no fan of letting users forge whatever user agents they please. Instead, AIUI, to opt out of getting crawled I have to look for the existence of certain request headers[1].

                                                                                                                  [1]: https://developers.cloudflare.com/browser-rendering/referenc...

                                                                                                              • 8cvor6j844qw_d6

                                                                                                                yesterday at 11:13 PM

                                                                                                                Does this bypass their own anti-AI crawl measures?

                                                                                                                I'll need to test it out, especially with the labyrinth.

                                                                                                                  • jsheard

                                                                                                                    yesterday at 11:55 PM

                                                                                                                    They say it doesn't: https://developers.cloudflare.com/browser-rendering/faq/#wil...

                                                                                                                    Further down they also mention that the requests come from CF's ASN and are branded with identifying headers, so third-party filters could easily block them too if they're so inclined. Seems reasonable enough.

                                                                                                                    • xhcuvuvyc

                                                                                                                      yesterday at 11:26 PM

                                                                                                                      Yeah, that'd be huge, like 90% of my search engine results are just cloudflare bot checks if I don't filter it out.

                                                                                                                      • mdasen

                                                                                                                        yesterday at 11:48 PM

                                                                                                                        If this does bypass their own (and others') anti-AI crawl measures, it'd basically mean that the only people who can't crawl are those without money.

                                                                                                                        We're creating an internet that is becoming self-reinforcing for those who already have power and harder for anyone else. As crawling becomes difficult and expensive, only those with previously collected datasets get to play. I certainly understand individual sites wanting to limit access, but it seems unlikely that they're limiting access to the big players - and maybe even helping them since others won't be able to compete as well.

                                                                                                                          • adi_kurian

                                                                                                                            yesterday at 11:58 PM

                                                                                                                            Common Crawl has free egress

                                                                                                                        • canpan

                                                                                                                          yesterday at 11:28 PM

                                                                                                                          I feel there is a conflict of interest here..

                                                                                                                          I'm split between: Yes! At last something to get CF protected sites! And: Uh! Now the internet is successfully centralized.

                                                                                                                      • 1vuio0pswjnm7

                                                                                                                        today at 4:10 AM

                                                                                                                        Can a CDN be a "walled garden"?

                                                                                                                        • devnotes77

                                                                                                                          today at 12:15 AM

                                                                                                                          To clarify the two questions raised:

                                                                                                                          First, the Cloudflare Crawl endpoint does not require the target site to use Cloudflare. It spins up a headless Chrome instance (via the Browser Rendering API) that fetches and renders any publicly accessible URL. You could crawl a site hosted on Hetzner or a bare VPS with the same call.

                                                                                                                          Second on pricing: Browser Rendering is only available on the Workers Paid plan ($5/month). It is not part of the free tier. Usage is billed per invocation beyond the included quota - the exact limits are in the Cloudflare docs under Browser Rendering pricing, but for archival use cases with moderate crawl rates you are very unlikely to run into meaningful costs.

                                                                                                                          The practical gotcha for forum archival is pagination and authentication-gated content. If the forum requires a login to see older posts, a headless browser session with saved cookies would help, but that is more complex to orchestrate than a single-shot fetch.

                                                                                                                            • gingerlime

                                                                                                                              today at 8:20 AM

                                                                                                                              [0] seems to suggest even paid plans are effectively limited to 500 web pages per day, right?

                                                                                                                                  Crawl jobs per day 5 per day
                                                                                                                                  Maximum pages per crawl 100 pages
                                                                                                                              
                                                                                                                              [0] https://developers.cloudflare.com/browser-rendering/limits/#...

                                                                                                                              • zyz

                                                                                                                                today at 6:02 AM

                                                                                                                                > Browser Rendering is only available on the Workers Paid plan ($5/month). It is not part of the free tier.

                                                                                                                                The post says it's available for both free and paid plans. According to the pricing page of the Browser Rendering, the free plan will have 10 minutes/day browsing time.

                                                                                                                            • charcircuit

                                                                                                                              today at 3:45 AM

                                                                                                                              >Honors robots.txt

                                                                                                                              Is it possible to ignore robot.txt in the case the crawl was triggered by a human?

                                                                                                                              • greatgib

                                                                                                                                today at 12:08 AM

                                                                                                                                All as expected: first they run a huge campaign calling out evil scrapers. Use our service, they say, to ensure your website blocks LLMs and bots from coming to scrape it. Look how bad it is.

                                                                                                                                And once that is well set up and they have their walled garden, they can present their own API to scrape websites, all nicely packaged for your LLM. But as you know, they are the gatekeeper, so the Mafia boss decides what "intermediary" fee is proper for letting you do what you were doing without an intermediary before.

                                                                                                                                  • shadowfiend

                                                                                                                                    today at 12:19 AM

                                                                                                                                    No: https://developers.cloudflare.com/browser-rendering/rest-api...

                                                                                                                                      • greatgib

                                                                                                                                        today at 1:32 AM

                                                                                                                                        That is funny because on this page there is a warning block with the following text:

                                                                                                                                           Refer to Will Browser Rendering bypass Cloudflare's Bot Protection? for instructions on creating a WAF skip rule.
                                                                                                                                        
                                                                                                                                        And "Will Browser Rendering bypass Cloudflare's Bot Protection? " is a hash link to the FAQ page, that surprisingly doesn't anything available for this link entry.

                                                                                                                                        Is it because it was removed (/hidden), or because it is not yet available until everyone forgets the "we are not evil, we are here to protect the internet"?

                                                                                                                                        • x0x0

                                                                                                                                          today at 1:00 AM

                                                                                                                                          most websites, particularly those behind cloudflare, are very restrictive even to crawlers that obey robots. Proof: a ton of my time over the last year, and my crawlers very carefully obey robots.

                                                                                                                                          It's hard to see how this isn't extorting folks by offering a working solution that, oh, cloudflare doesn't block. As long as you pay Cloudflare.

                                                                                                                                          Perhaps I'm overly cynical, but I'd be quite surprised if cloudflare subjected their own headless browsing to the same rules the rest of the internet gets.

                                                                                                                                            • gruez

                                                                                                                                              today at 1:11 AM

                                                                                                                                              >most websites, particularly those behind cloudflare, are very restrictive even to crawlers that obey robots. Proof: a ton of my time over the last year, and my crawlers very carefully obey robots.

                                                                                                                                              The docs are pretty equivocal though:

                                                                                                                                              >If you use Cloudflare products that control or restrict bot traffic such as Bot Management, Web Application Firewall (WAF), or Turnstile, the same rules will apply to the Browser Rendering crawler.

                                                                                                                                              It's not just robots.txt. Most (all?) restrictions that apply to outside bots apply to cloudflare's bot as well, at least that's what they're claiming. If they're being this explicit about it, I'm willing to give them the benefit of the doubt until there's evidence to the contrary, rather than being a cynic and assuming the worst.

                                                                                                                                  • memothon

                                                                                                                                    yesterday at 11:18 PM

                                                                                                                                    I've used browser rendering at work and it's quite nice. Most solutions in the crawling space are kind of scummy and designed for side-stepping robots.txt and not being a good citizen. A crawl endpoint is a very necessary addition!

                                                                                                                                    • tjpnz

                                                                                                                                      today at 1:53 AM

                                                                                                                                      Do I have the option to fill it with junk for LLMs?

                                                                                                                                      • rvz

                                                                                                                                        yesterday at 11:58 PM

                                                                                                                                        Selling the cure (DDoS protection) and creating the poison (Authorized AI crawling) against their customers.

                                                                                                                                        • Imustaskforhelp

                                                                                                                                          yesterday at 11:32 PM

                                                                                                                                          This might be really great!

                                                                                                                                          I had the idea, after recently buying https://mirror.forum (which I talked about in Discord and the ArchiveTeam IRC servers), that I wanted to preserve/mirror forums, especially tech ones [think TinyCoreLinux], since Archive.org is really, really great but I would prefer some other efforts as well within this space.

                                                                                                                                          I didn't want to scrape/crawl it myself because I felt like it would feel like yet another scraping effort for AI and strain resources of developers.

                                                                                                                                          And even when you want to crawl, the issue is that you can't crawl Cloudflare-protected sites, sometimes for good reason.

                                                                                                                                          So, in my understanding, can I use Cloudflare Crawl to essentially crawl the whole website of a forum? And does this only work for forums which use Cloudflare?

                                                                                                                                          Also, what is the pricing of this? Is it just a standard Cloudflare Worker, so I would get the free 100k requests and the 1-million-requests-for-a-few-cents (IIRC) offer for crawling? Considering how scalable Cloudflare is, it might even make more sense than buying a group of cheap VPSes.

Another point: I previously thought the best approach would be for the maintainers of these forums to give me a periodic backup archive, since my heart believes that to be the cleanest way. But after discussing it on Linux Discord servers and with archivers in that community (and in general), I couldn't find any maintainer of such tech forums who would subscribe to the idea of sharing the forum's public data as a quick backup for preservation purposes. So if anyone here knows of, or maintains, any such forums, feel free to message in this thread about that too.

                                                                                                                                            • ipaddr

                                                                                                                                              yesterday at 11:58 PM

                                                                                                                                              "I didn't want to scrape/crawl it myself because I felt like it would feel like yet another scraping effort for AI and strain resources of developers"

You feel better paying someone to do the same thing?

                                                                                                                                                • Imustaskforhelp

                                                                                                                                                  today at 12:25 AM

I actually don't, but it seems that Cloudflare caches responses, so if anything, instead of straining developers' resources, it would mostly strain Cloudflare's, and Cloudflare could handle that more efficiently with their own crawl product.

Also, I am genuinely open to feedback (like, a lot), so just let me know if you know of any other alternative for the particular thing I want to create; I would love to discuss that too! I genuinely wish there were other ways, and part of the reason I wrote that comment was the hope that someone who manages forums, or knows people who do, would comment back so we could have a meaningful discussion.

I'm also happy to hear suggestions for good uses of the domain in general, if anything useful can be made with it. In fact, I'm happy to transfer the domain to you or anyone here if it would be useful to you (just donate some money, preferably $50-100, to a good charity on a date after this comment is made and mail me the details, and I am absolutely willing to transfer the domain; likewise if you currently work at a charity and it could help in some meaningful way!)

I had actually asked Archive Team whether I could donate the domain to them if it would help archive.org in any meaningful way, and they politely declined.

I just bought this domain because someone on HN mentioned mirror.org when they wanted to show someone a mirror, and I saw the price of the .org domain was extremely high ($150k or similar). I have a habit of hunting for nice domains on random TLDs, so when I found mirror.forum I bought it.

And I was just thinking about what a decent use for it might be, now that I'd bought it, and that was the idea I came up with. Obviously I have my flaws (many, actually), but I genuinely don't wish harm on anybody, especially people who are passionate about running independent forums on this centralized web. I'd rather let the domain expire than have its use cause harm to anybody.

Looking forward to discussing it with ya.

                                                                                                                                                    • weird-eye-issue

                                                                                                                                                      today at 4:14 AM

This is used to scrape third-party sites that aren't necessarily behind Cloudflare, so it has nothing to do with whether Cloudflare caches them or not. Besides, when using their browser rendering it doesn't even fetch cached responses anyway...

                                                                                                                                                        • Imustaskforhelp

                                                                                                                                                          today at 7:45 AM

I didn't know that it doesn't fetch cached responses; my apologies. I had only skimmed it, and it felt like something Cloudflare might have done. Is there a particular reason they don't use the cached responses? It feels like a missed opportunity, but maybe I'm missing something?

                                                                                                                                          • david_iqlabs

                                                                                                                                            today at 1:04 AM

                                                                                                                                            [flagged]

                                                                                                                                            • sourcecodeplz

                                                                                                                                              today at 5:29 AM

                                                                                                                                              I love this from CloudFlare!

                                                                                                                                              • pqdbr

                                                                                                                                                today at 2:00 AM

                                                                                                                                                Off-topic, but I'm having a terrible experience with Cloudflare and would love to know if someone could offer some help.

All of a sudden, about a third of all traffic to our website is being routed via EWR (New York), myself included, even though all our users and our origin servers are in Brazil.

We pay for the Pro plan, but support has been of no help: after 20 days of "debugging" and asking for MTRs and traceroutes, they told us to contact Claro (which is like telling me to contact Verizon) because "it's their fault."
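For anyone debugging the same thing: any Cloudflare-proxied site exposes a plain-text /cdn-cgi/trace endpoint, and its colo= field tells you which Cloudflare datacenter served your request (EWR, GRU, etc.). A minimal sketch for checking it, where the hostname is just a placeholder for your own Cloudflare-proxied site:

```python
import urllib.request


def parse_trace(text: str) -> dict:
    """Parse the simple key=value lines returned by Cloudflare's /cdn-cgi/trace."""
    return dict(
        line.split("=", 1)
        for line in text.strip().splitlines()
        if "=" in line
    )


def which_colo(host: str) -> str:
    """Ask a Cloudflare-proxied host which datacenter (colo) handled the request."""
    with urllib.request.urlopen(f"https://{host}/cdn-cgi/trace", timeout=10) as resp:
        return parse_trace(resp.read().decode())["colo"]


# The response format looks roughly like this (values are illustrative):
sample = "fl=123abc\nip=203.0.113.7\ncolo=EWR\nhttp=http/2\n"
assert parse_trace(sample)["colo"] == "EWR"
```

Polling that field from a few networks (home connection, mobile, a VPS in-country) makes it easy to show support exactly which ISPs are landing in the wrong colo.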

                                                                                                                                                  • weird-eye-issue

                                                                                                                                                    today at 4:36 AM

                                                                                                                                                    Do you think cloudflare is responsible for all of the network traffic routing in the entire world and can simply fix any problem even if it's on somebody else's network?

                                                                                                                                                    • tgrowazay

                                                                                                                                                      today at 2:09 AM

                                                                                                                                                      It is possible that Claro has a bad route that sends all traffic destined for Cloudflare through New York.

                                                                                                                                                        • tempest_

                                                                                                                                                          today at 4:07 AM

Every once in a while we've had Bell Canada route a request that should be going about 6 blocks away across the continent and back.

They are not super helpful in fixing it, either.