Tiled Hacker news on React Router

Messing with scraper bots

245 points - 11/15/2025

Source

simondotau
11/15/2025
The more things change, the more they stay the same.
About 10-15 years ago, the scourge I was fighting was social media monitoring services, companies paid by big brands to watch sentiment across forums and other online communities. I was running a very popular and completely free (and ad-free) discussion forum in my spare time, and their scraping was irritating for two reasons. First, they were monetising my community when I wasn’t. Second, their crawlers would hit the servers as hard as they could, creating real load issues. I kept having to beg our hosting sponsor for more capacity.
Once I figured out what was happening, I blocked their user agent. Within a week they were scraping with a generic one. I blocked their IP range; a week later they were back on a different range. So I built a filter that would pseudo-randomly[0] inject company names[1] into forum posts. Then any time I re-identified[2] their bot, I enabled that filter for their requests.
The scraping stopped within two days and never came back.
--
[0] Random but deterministic based on post ID, so the injected text stayed consistent.
[1] I collated a list of around 100 major consumer brands, plus every company name the monitoring services proudly listed as clients on their own websites.
[2] This was back around 2009 or so, so things weren't nearly as sophisticated as they are today, both in terms of bots and anti-bot strategies. One of the most effective tools I remember deploying back then was analysis of all HTTP headers. Bots would spoof a browser UA, but almost none would get the full header set right, things like Accept-Encoding or Accept-Language were either absent, or static strings that didn't exactly match what the real browser would ever send.
VladVladikoff
11/15/2025
This is a fundamental misunderstanding of what those bots are requesting. They aren’t parsing those PHP files, they are using their existence for fingerprinting — they are trying to determine the existence of known vulnerabilities. They probably immediately stop reading after receiving a http response code and discard the remainder of the request packets.
iam-TJ
11/15/2025
This reminds me of a recent discussion about using a tarpit for A.I. and other scrapers. I've kept a tab alive with a reference to a neat tool and approach called Nepenthes that VERY SLOWLY drip feeds endless generated data into the connection. I've not had an opportunity to experiment with it as yet:
https://zadzmo.org/code/nepenthes/
jcynix
11/15/2025
If you control your own Apache server and just want to shortcut to "go away" instead of feeding scrapers, the RewriteEngine is your friend, for example:
```
      RewriteEngine On

      # Block requests that reference .php anywhere (path, query, or encoded)
      RewriteCond %{REQUEST_URI} (\.php|%2ephp|%2e%70%68%70) [NC,OR]
      RewriteCond %{QUERY_STRING} \.php [NC,OR]
      RewriteCond %{THE_REQUEST} \.php [NC]
      RewriteRule .* - [F,L]
```
Notes: there's no PHP on my servers, so if someone asks for it, they are one of the "bad boys" IMHO. Your mileage may differ.
Kiro
11/15/2025
I remember when you used to get scolded on HN for preventing scrapers or bots. "How I access your site is irrelevant".
ArcHound
11/15/2025
Neat! Most of the offensive scrapers I met try and exploit WordPress sites (hence the focus on PHP). They don't want to see php files, but their outputs.
What you have here is quite close to a honeypot, sadly I don't see an easy way to counter-abuse such bots. If the attack is not following their script, they move on.
firefoxd
11/15/2025
I had to revisit my strategy after posting about my zipbombs on HN [0]. My server traffic went from tens of thousands to ~100k daily, hosted on a $6 vps. It was not sustainable.
Now I target only the most aggressive bots with zipbombs and the rest get a 403. My new spam strategy seems to work, but I don't know if I should post it on HN again...
[0]: https://news.ycombinator.com/item?id=43826798
vachina
11/15/2025
They’re not scraping for php files, they’re probing for known vulns in popular frameworks, and then using them as entry points for pwning.
This is done very efficiently. If you return anything unexpected, they’ll just drop you and move on.
BigBalli
11/15/2025
I always had fail2ban but a while back I wanted to set up something juicier...
.htaccess diverts suspicious paths (e.g., /.git, /wp-login) to decoy.php and forces decoy.zip downloads (10GB), so scanners hitting common “secret” files never touch real content and get stuck downloading a huge dummy archive.
decoy.php mimics whatever sensitive file was requested by endless streaming of fake config/log/SQL data, keeping bots busy while revealing nothing.
aduwah
11/15/2025
I wonder if the abuse bots could be somehow made to mine some crypto to give back to the bills they cause
s0meON3
11/15/2025
What about using zip bombs?
https://idiallo.com/blog/zipbomb-protection
Surac
11/15/2025
I have just cut out up ranges that can not connect. I am blocking USA, Asia and Middle East to prevent most malicious accesses
localhostinger
11/15/2025
Interesting! It's nice to see people are experimenting with these, and I wonder if this kind of junk data generators will become its own product. Or maybe at least a feature/integration in existing software. I could see it going there.
holysoles
11/15/2025
I wrote a Traefik plugin [1] that controls traffic based on known bad bot user agents, you can just block or even send them to a markov babbler if you've set one up. I've been using nepenthes [2].
[1] https://github.com/holysoles/bot-wrangler-traefik-plugin
[2] https://zadzmo.org/code/nepenthes/
11/16/2025
ronsor
11/15/2025
These aren't scraper bots; they're vulnerability scanners. They don't expect PHP source code and probably don't even read the response body at all.
I don't know why people would assume these are AI/LLM scrapers seeking PHP source code on random servers(!) short of it being related to this brainless "AI is stealing all the data" nonsense that has infected the minds of many people here.
leadgrids
11/15/2025
[dead]
NoiseBert69
11/15/2025
Hm.. why not using dumbed down small, self-hosted LLM networks to feet the big scrapers with bullshit?
I'd sacrifice two CPU cores for this just to make their life awful.
re-lre-l
11/15/2025
Don’t get me wrong, but what’s the problem with scrapers? People invest in SEO to become more visible, yet at the same time they fight against “scraper bots.” I’ve always thought the whole point of publicly available information is to be visible. If you want to make money, just put it behind a paywall. Isn’t that the idea?

Messing with scraper bots

simondotau

DamnInteresting

lelanthran

simondotau

DamnInteresting

grishka

grishka

simondotau

grishka

simondotau

grishka

AJMaxwell

tesin

thephyber

wvbdmp

simondotau

akoboldfrying

SAI_Peregrinus

akoboldfrying

rvba

simondotau

dotancohen

poopiokaka

VladVladikoff

holysoles

ajsnigrutin

holysoles

amypetrik8

VladVladikoff

mattgreenrocks

iam-TJ

jcynix

palsecam

jcynix

kijin

palsecam

quesera

MadnessASAP

Kiro

grishka

hollow-moe

elashri

Analemma_

ArcHound

jojobas

firefoxd

vachina

BigBalli

aduwah

boxedemp

jeroenhd

s0meON3

lavela

LunaSea

jeroenhd

kalkin

renegat0x0

Surac

breppp

warkdarrior

bot403

testing22321

gessha

localhostinger

arbol

holysoles

ronsor

leadgrids

NoiseBert69

Findecanor

mnau

qezz

re-lre-l

georgefrowny

akoboldfrying

akoboldfrying

ryantgtg

georgefrowny

nrhrjrjrjtntbt