Nathan2055
today at 9:53 AM
The funniest part about WordPress is that you can usually get a speed boost of 50% or more just by adding a plugin that minifies and caches the ridiculous number of dynamic CSS and JS files that most themes and plugins add to every page. Set those up with HTTP 103 Early Hints preload headers (so the browser can start sending subresource requests in the background before the HTML is even sent out, exactly the kind of thing HTTP/2 and /3 were designed to make possible), then throw Cloudflare or another decent CDN on top, and you're suddenly getting TTFBs much closer to those of a more "modern" stack.
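As a rough illustration of the preload step, here is a minimal sketch of building the `Link` header that a 103 Early Hints response carries. The function names and asset paths are hypothetical, and a real deployment would normally let the web server or CDN emit the interim response rather than hand-rolling bytes like this.

```python
def preload_link_header(assets):
    """Build a Link header value with one preload entry per (url, as_type) pair."""
    parts = []
    for url, as_type in assets:
        # Each entry tells the browser to start fetching the asset early.
        parts.append(f"<{url}>; rel=preload; as={as_type}")
    return ", ".join(parts)


def early_hints_response(assets):
    """Render a 103 Early Hints interim response using HTTP/1.1 framing."""
    header = preload_link_header(assets)
    return f"HTTP/1.1 103 Early Hints\r\nLink: {header}\r\n\r\n".encode("ascii")


if __name__ == "__main__":
    # Hypothetical minified bundles produced by a caching plugin.
    bundles = [("/static/site.min.css", "style"), ("/static/site.min.js", "script")]
    print(early_hints_response(bundles).decode("ascii"))
```

The final 200 response would still follow afterwards; the interim response only gives the browser a head start on the subresources.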
The bizarre thing is that pretty much no CMS, even the "new" ones, seems to automate all of that by default. None of those steps is that difficult to implement, and together they provide a serious speed boost to everything from WordPress to MediaWiki in my experience, and yet the only service that seems to get close to offering it is Cloudflare.
Even then, Cloudflare's tooling only works at its best if you're already emitting minified, compressed files and custom-written preload headers on the origin side. The hit from decompressing all the origin traffic to make those adjustments and analyses is way worse for performance than just forwarding your compressed responses directly. That's why they removed Auto Minify[1] and encourage sending pre-compressed Brotli level 11 responses from the origin[2], so people on recent browsers get pass-through compression without extra cycles being spent on Cloudflare's servers.
The solution seems pretty clear: aim to get as much stuff served statically, preferably pre-compressed, as you can. But it's weird that actually implementing that is still a manual process on most CMSes, when it shouldn't be that hard to make it a standard feature.
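To make "served statically, preferably pre-compressed" concrete, here is a minimal sketch of compressing an asset once at build time and then picking the stored variant per request. It uses gzip from the Python standard library purely as a stand-in (Brotli needs a third-party package), and the helper names are hypothetical.

```python
import gzip
from pathlib import Path
from typing import Optional, Tuple


def precompress(path: Path) -> Path:
    """Write a maximally compressed sibling file (e.g. site.min.css.gz) once, ahead of time."""
    out = path.with_name(path.name + ".gz")
    out.write_bytes(gzip.compress(path.read_bytes(), compresslevel=9))
    return out


def pick_variant(path: Path, accept_encoding: str) -> Tuple[Path, Optional[str]]:
    """Return (file to send, Content-Encoding value) for a request.

    Simplification: q-values are ignored, so "gzip;q=0" is treated as accepted.
    """
    accepted = {tok.split(";")[0].strip().lower() for tok in accept_encoding.split(",")}
    gz = path.with_name(path.name + ".gz")
    if "gzip" in accepted and gz.exists():
        # No per-request compression work: just stream the stored bytes.
        return gz, "gzip"
    return path, None
```

The point of the split is that the expensive compression level is paid once per file change instead of once per request.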
And as for Git web interfaces, the correct solution is to require logins to view complete history. Nobody likes saying it, nobody likes hearing it. But Git is not efficient enough on its own to handle the constant bombardment of random history paginations and diffs that AI crawlers seem to love. It wasn't an issue before, because old crawlers for things like search engines were smart enough to ignore those types of pages, or at least to accept it when the sysadmin said they should ignore those types of pages. AI crawlers have no limits, ignore signals from site operators, make no attempts to skip redundant content, and in general are very dumb about how they send requests. This is a large part of why Anubis works so well: it's not a particularly complex or hard to bypass proof of work system[3], but AI bots genuinely don't care about anything but consuming as many HTTP 200s as a server can return, and give up at the slightest hint of pushback (but do at least try randomizing IPs and User-Agents, since those are effectively zero-cost to attempt).
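For reference, this is roughly what "accepting when the sysadmin says to ignore those pages" looks like on the crawler side: a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt rules, paths, and user-agent name are hypothetical and not taken from any real Git host.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules a Git web interface might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /repo/commits/
Disallow: /repo/diff/
"""


def build_parser(text):
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def polite_filter(parser, user_agent, urls):
    """Keep only the URLs the site operator allows this crawler to fetch."""
    return [u for u in urls if parser.can_fetch(user_agent, u)]


if __name__ == "__main__":
    candidates = ["/repo/README.md", "/repo/commits/?page=412", "/repo/diff/abc..def"]
    print(polite_filter(build_parser(ROBOTS_TXT), "ExampleBot", candidates))
```

A crawler that runs its queue through a filter like this never touches the expensive history and diff pages in the first place.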
[1]: https://community.cloudflare.com/t/deprecating-auto-minify/6...
[2]: https://blog.cloudflare.com/this-is-brotli-from-origin/
[3]: https://lock.cmpxchg8b.com/anubis.html but see also https://news.ycombinator.com/item?id=45787775 and then https://news.ycombinator.com/item?id=43668433 and https://news.ycombinator.com/item?id=43864108 for how it's working in the real world. Clearly Anubis actually does work, given testimonials from admins and wide deployment numbers, but that can only mean that AI scrapers aren't actually implementing effective bypass measures. That seems pretty in line with what I've heard about AI scrapers, summarized well in https://news.ycombinator.com/item?id=43397361: they are basically making no attempt to optimize how they're crawling. The general consensus seems to be that if they were going to crawl optimally, they'd just pull down a copy of Common Crawl like every other major data analysis project has done for the last two decades, but all the AI companies are so desperate to get just slightly more training data than their competitors that they're repeatedly crawling near-identical Git diffs just on the off-chance they reveal some slightly different permutation of text to use. This is also why open source models have been able to almost keep pace with the state of the art models coming out of the big firms: they're just designing way more efficient training processes, while the big guys are throwing hardware and crawlers at the problem in the desperate hope that they can will it into an Amazon model instead of a Ben and Jerry’s model[4].
[4]: https://www.joelonsoftware.com/2000/05/12/strategy-letter-i-... - still probably the single greatest blog post ever written, 26 years later.
> And as for Git web interfaces, the correct solution is to require logins to view complete history.
Why logins, exactly? Who would have such logins: developers only, or anyone who signs up? I'm not sure if this is an effective long-term mitigation, or simply a “wall of minimal height”, as you point out Anubis is.