Messing with scraper bots
Mood: thoughtful
Sentiment: mixed
Category: tech
Key topics: web scraping, bot detection, security
The author experiments with scraper bots and explores ways to detect and deter them on their blog.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 2h after posting
Peak period: 46 comments in Day 1
Avg / period: 46
Based on 46 loaded comments
Key moments
- 01 Story posted: 11/15/2025, 7:38:18 AM (4d ago)
- 02 First comment: 11/15/2025, 9:27:42 AM (2h after posting)
- 03 Peak activity: 46 comments in Day 1 (hottest window of the conversation)
- 04 Latest activity: 11/15/2025, 5:20:19 PM (3d ago)
What you have here is quite close to a honeypot; sadly, I don't see an easy way to counter-abuse such bots. If the target isn't following their script, they move on.
As for the battle of efficiency: generating 4 kB of bullshit PHP is harder than running a regex.
I'd sacrifice two CPU cores for this just to make their life awful.
I would make a list of words from each word class, and a list of sentence structures where each item is a word class. Pick a pseudo-random sentence; for each word class in the sentence, pick a pseudo-random word; output; repeat. That should be pretty simple and fast.
I'd think the most important thing though is to add delays to serving the requests. The purpose is to slow the scrapers down, not to induce demand on your garbage well.
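A minimal sketch of that generator in Python (the word lists, sentence structures, and seed are illustrative, not anything from the thread), combined with the slow-drip serving idea so each request burns the scraper's time rather than your CPU:

import random
import time

# Tiny illustrative word lists, keyed by word class.
WORDS = {
    "DET":  ["the", "a", "every", "some"],
    "ADJ":  ["quiet", "legacy", "encrypted", "recursive"],
    "NOUN": ["endpoint", "payload", "teapot", "crawler", "honeypot"],
    "VERB": ["parses", "ignores", "rewrites", "throttles"],
}

# Sentence structures: each item is a sequence of word classes.
STRUCTURES = [
    ("DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"),
    ("DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"),
]

def garbage(seed: int, sentences: int = 50):
    """Yield pseudo-random but deterministic sentences for a given seed."""
    rng = random.Random(seed)  # same seed -> same garbage, so pages look stable
    for _ in range(sentences):
        structure = rng.choice(STRUCTURES)
        yield " ".join(rng.choice(WORDS[cls]) for cls in structure) + "."

# Serve it slowly: the point is to waste the scraper's time, not your CPU.
for sentence in garbage(seed=42, sentences=5):
    print(sentence)
    time.sleep(0.5)  # in a real handler, trickle chunks into the response instead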
RewriteEngine On
# Block requests that reference .php anywhere (path, query, or encoded)
RewriteCond %{REQUEST_URI} (\.php|%2ephp|%2e%70%68%70) [NC,OR]
RewriteCond %{QUERY_STRING} \.php [NC,OR]
RewriteCond %{THE_REQUEST} \.php [NC]
RewriteRule .* - [F,L]
Notes: there's no PHP on my servers, so if someone asks for it, they are one of the "bad boys" IMHO. Your mileage may differ.
# Nothing to hack around here, I'm just a teapot:
location ~* \.(?:php|aspx?|jsp|dll|sql|bak)$ {
    return 418;
}
error_page 418 /418.html;
No hard block; instead, reply to bots with the playful HTTP 418 code (https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...). That makes filtering logs easier. Live example: https://FreeSolitaire.win/wp-login.php (NB: /wp-login.php is the WordPress login URL, commonly requested blindly by bots searching for weak WordPress installs.)
> You have an image on your error page, which some crappy bots will download over and over again.
Most bots won’t download subresources (almost none of them do, actually). The HTML page itself is lean (475 bytes); the image is an Easter egg for humans ;-) Moreover, I use a caching CDN (Cloudflare).
https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
https://maurycyz.com/misc/the_cost_of_trash/#:~:text=throw%2...
Otherwise you can also chain compression methods like: "Content-Encoding: gzip gzip".
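The header is normally written as a comma-separated list ("Content-Encoding: gzip, gzip"), meaning the body was gzipped twice and a compliant client must decode both layers. A minimal Python sketch of why chaining pays off for highly repetitive payloads (exact sizes will vary by zlib version):

import gzip

# A highly compressible payload: 10 MB of zero bytes.
payload = b"\0" * (10 * 1024 * 1024)

# Each gzip layer corresponds to one entry in the Content-Encoding list.
once = gzip.compress(payload)   # roughly 10 KB
twice = gzip.compress(once)     # the gzip output of zeros is itself repetitive,
                                # so a second pass typically shrinks it a lot further

print(f"raw:    {len(payload):>10} bytes")
print(f"gzip:   {len(once):>10} bytes")
print(f"gzip*2: {len(twice):>10} bytes")

# Served as:  Content-Encoding: gzip, gzip
# A client honoring the header has to gunzip twice to recover the full 10 MB.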
I have a web crawler with both a scraping byte limit and a timeout, so zip bombs don't bother me much.
https://github.com/rumca-js/crawler-buddy
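A minimal sketch of that kind of defensive fetch (assuming the third-party requests library; the limits and function name are illustrative and not taken from crawler-buddy):

import time
import requests

MAX_BYTES = 2 * 1024 * 1024   # refuse to read more than 2 MiB of (decoded) body
MAX_SECONDS = 15              # overall wall-clock budget per fetch

def bounded_fetch(url: str) -> bytes:
    """Fetch a URL, but bail out once either the byte or the time budget is spent."""
    start = time.monotonic()
    body = bytearray()
    # stream=True keeps requests from buffering the whole (possibly huge) body at once
    with requests.get(url, stream=True, timeout=(5, 10)) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=64 * 1024):
            body.extend(chunk)
            if len(body) > MAX_BYTES:
                raise ValueError(f"{url}: body exceeded {MAX_BYTES} bytes, giving up")
            if time.monotonic() - start > MAX_SECONDS:
                raise TimeoutError(f"{url}: fetch exceeded {MAX_SECONDS}s, giving up")
    return bytes(body)

Because iter_content yields decompressed data, the byte cap counts post-inflation bytes, which is what matters against gzip bombs; the per-read timeout alone would not catch a server that trickles bytes forever, hence the wall-clock check.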
I think garbage blabber would be more effective.
AI scrapers will plagiarise your work and bring you zero traffic.
So to continue your analogy, I made my part of the beach accessible for visitors to enjoy, but certain people think they can carry it away for their own purpose ...
These scrapers drown peoples' servers in requests, taking up literally all the resources and driving up cost.
The line is between "I am technically able to do this" and "I am engaging with a system in good faith".
Public parks are just there, and I can technically drive up and dump rubbish there; if they didn't want me to, they should have installed a gate and sold tickets.
Many scrapers these days are sort of equivalent in that analogy to people starting entire fleets of waste disposal vehicles that all drive to parks to unload, putting strain on park operations and making the parks a less tenable service in general.
It is completely different if I am hitting it looking for WordPress vulnerabilities or scraping content every minute for LLM training material.
When you get paid big bucks to make the world worse for everyone, it's really easy to forget the "little details".
The tech people are all turning against scraping, independent artists are now clamoring for brutal IP crackdowns and Disney-style copyright maximalism (which I never would've predicted just 5 years ago, that crowd used to be staunchly against such things), people everywhere want more attestation and elimination of anonymity now that it's effectively free to make a swarm of convincingly-human misinformation agents, etc.
It's making people worse.
It would be useful to ban the IP for a few hours so the bot cools down for a bit and moves on to the next domain.
The default ban for traffic detected by your crowdsec instance is 4 hours, so that concern isn't very relevant in that case.
The decisions from the Central API from other users can be quite a bit longer (I see some at ~6 days), but you also don't have to use those if you're worried about that scenario.
.htaccess diverts suspicious paths (e.g., /.git, /wp-login) to decoy.php and forces decoy.zip downloads (10GB), so scanners hitting common “secret” files never touch real content and get stuck downloading a huge dummy archive.
decoy.php mimics whatever sensitive file was requested by endless streaming of fake config/log/SQL data, keeping bots busy while revealing nothing.
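A rough sketch of that streaming behavior (plain Python as a stand-in; the actual decoy.php is the commenter's own code, so the key names and pacing here are purely illustrative):

import random
import time

FAKE_KEYS = ["DB_PASSWORD", "AWS_SECRET_KEY", "SMTP_PASS", "API_TOKEN"]

def fake_config_lines():
    """Endless stream of plausible-looking (but worthless) config lines."""
    rng = random.Random()
    while True:
        key = rng.choice(FAKE_KEYS)
        value = "".join(rng.choices("abcdef0123456789", k=32))
        yield f"{key}={value}\n"
        time.sleep(0.2)  # drip-feed so the bot stays connected and keeps waiting

# In a real handler each line would be written to the response and flushed;
# printing a few here just shows the shape of the output.
for _, line in zip(range(5), fake_config_lines()):
    print(line, end="")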
This is done very efficiently. If you return anything unexpected, they’ll just drop you and move on.
[1] https://github.com/holysoles/bot-wrangler-traefik-plugin
About 10-15 years ago, the scourge I was fighting was social media monitoring services, companies paid by big brands to watch sentiment across forums and other online communities. I was running a very popular and completely free (and ad-free) discussion forum in my spare time, and their scraping was irritating for two reasons. First, they were monetising my community when I wasn’t. Second, their crawlers would hit the servers as hard as they could, creating real load issues. I kept having to beg our hosting sponsor for more capacity.
Once I figured out what was happening, I blocked their user agent. Within a week they were scraping with a generic one. I blocked their IP range; a week later they were back on a different range. So I built a filter that would pseudo-randomly[0] inject company names[1] into forum posts. Then any time I re-identified[2] their bot, I enabled that filter for their requests.
The scraping stopped within two days and never came back.
--
[0] Random but deterministic based on post ID, so the injected text stayed consistent (a sketch of this follows after these notes).
[1] I collated a list of around 100 major consumer brands, plus every company name the monitoring services proudly listed as clients on their own websites.
[2] This was back around 2009 or so, so things weren't nearly as sophisticated as they are today, both in terms of bots and anti-bot strategies. One of the most effective tools I remember deploying back then was analysis of all HTTP headers. Bots would spoof a browser UA, but almost none would get the full header set right: things like Accept-Encoding or Accept-Language were either absent or static strings that didn't exactly match what a real browser would send.
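A minimal sketch of the deterministic injection described in [0] (the brand list, injection rate, and sample text are illustrative):

import random

BRANDS = ["Acme Cola", "Globex", "Initech", "Umbrella Corp"]  # illustrative list

def inject_brands(post_id: int, text: str, rate: float = 0.05) -> str:
    """Deterministically sprinkle brand names into a post, keyed by its ID."""
    rng = random.Random(post_id)  # same post ID -> same injected output every time
    out = []
    for word in text.split():
        out.append(word)
        if rng.random() < rate:
            out.append(rng.choice(BRANDS))
    return " ".join(out)

# Enabled only for requests re-identified as the monitoring bot:
print(inject_brands(12345, "This thread is about fixing a broken dishwasher pump."))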
36 more comments available on Hacker News