Humanely Dealing with Humungus Crawlers
Posted 4 months ago · Active 4 months ago
flak.tedunangst.com · Tech · story
Key topics
Web Security
Captcha
Scalability
The article discusses strategies for mitigating the load caused by automated crawlers on a website, sparking a discussion on the trade-offs between security, performance, and user experience.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion · First comment: 56m after posting · Peak period: 38 comments in 0-12h · Avg per period: 5.8
Based on 46 loaded comments
Key moments
- Story posted: Sep 12, 2025 at 1:06 PM EDT (4 months ago)
- First comment: Sep 12, 2025 at 2:02 PM EDT (56m after posting)
- Peak activity: 38 comments in 0-12h (the hottest window of the conversation)
- Latest activity: Sep 18, 2025 at 5:20 PM EDT (4 months ago)
ID: 45224246 · Type: story · Last synced: 11/20/2025, 4:53:34 PM
+1000 I feel like so much bot detection (and fraud prevention against human actors, too) is so emotionally-driven. Some people hate these things so much, they're willing to cut off their nose to spite their face.
Misbehaving crawlers are a huge problem but bloggers are among the least affected by them. Something like a wiki or a forum is a better example, as they're in a category of websites where each page visit is almost unavoidably rendered on the fly using multiple expensive SQL queries due to the rapidly mutating nature of their datasets.
Git forges, like the one TFA is discussing, are also fairly expensive, especially as crawlers traverse historical states. When a crawler is poorly implemented, it will get stuck doing this basically forever. Detecting and dealing with git hosts is an absolute must for any web crawler because of this.
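A crawler-side filter for forge history URLs is one way to avoid that trap. The sketch below is illustrative only: the path patterns are assumptions about common forge layouts (cgit, gitweb, Gitea/Forgejo, GitLab-style routes), not an exhaustive rule set.

```python
import re

# Assumed patterns for URLs that browse historical repository states.
# Real forges differ; tune these for the hosts you actually crawl.
HISTORY_PATTERNS = [
    re.compile(r"/(commit|commits|blame|raw|diff|patch)/", re.IGNORECASE),
    re.compile(r"[?&]id=[0-9a-f]{7,40}"),          # cgit/gitweb ?id=<sha>
    re.compile(r"/(tree|blob)/[0-9a-f]{7,40}/"),   # browsing a pinned revision
]

def should_crawl(url: str) -> bool:
    """Skip URLs that expand one repo into millions of near-duplicate
    historical pages; crawl everything else."""
    return not any(p.search(url) for p in HISTORY_PATTERNS)

# Example: the commit view is skipped, the repo front page is not.
assert not should_crawl("https://example.org/repo/commit/abc123def4567")
assert should_crawl("https://example.org/repo/")
```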
I actually find this surprisingly hard to come by.
I just want static hosting (like Netlify or Firebase Hosting), but there aren't many hosts that offer that.
There are lots of providers where I can buy a VPS somewhere and be in charge of configuring and patching it, but if I just want to hand someone a set of HTML files and some money in exchange for hosting, not many hosts fit the bill.
I actually don't even want a free option. I want to pay a vendor that cares about keeping my website online. I'm fine paying $20-50/mo as long as it's bounded and they don't just take my site offline if I see a spike from HN.
Dreamhost! They're still around and still lovely after how many years? I even find their custom control panel charming.
Yeah, that's true, there aren't a lot of "I give you money and HTML, you host it" services out there, surprisingly. Probably the most mature, cheapest, and most reliable one today would be good ol' neocities.org (run by HN user kyledrake), which basically gives you 3TB/month for $5, pretty good deal :)
Sometimes when I miss StumbleUpon I go to https://neocities.org/browse?sort_by=random which gives a fun little glimpse of the hobby/curiosity/creative web.
So, I'm currently building pretty much this. After doing it on the side for clients for years, it's now my full-time effort. I have a solid and stable infrastructure, but not yet an API or web frontend. If somebody wants basically ssh, git, and static (or even not static!) hosting that comes with a sysadmin's contact information for a small number of dollars per month, I can be reached at sysop@biphrost.net.
Environment is currently Debian-in-LXC-on-Debian-on-DigitalOcean.
https://www.ovhcloud.com/en/web-hosting/compare/
If you're a bot which will ignore all the licenses I put on that content, then I don't want you to be able to reach that content.
No, any amount of monetary compensation is not welcome either. I use these licenses as a matter of principle, and my principles are not for sale.
That's all, thanks.
Right, I thought the conversation was about public websites on the public internet, but I think you're talking about this in the context of a private website now? I understand keeping tighter controls if you're dealing with private content you want accessible via the internet for others but not the public.
You're conflating the public internet with a legal concept that applies to areas that are shared, government owned, paid for by taxes, and that the government feels people should be able to access.
The web is closer to a shopping mall. You're on one person's property to access other people's stuff, and those people pay to be there. They set their own rules. If you don't follow those rules you get kicked out, charged with trespassing, and possibly banned from the mall entirely.
AI bots have been asked to leave. But, since they own the mall too, the store owners are more than a little screwed.
I see it more like I'm knocking on people's doors (issuing GET requests with my web browser) and people open their door for me (the server responds with something) or not. If you don't wanna open the door, fine, you do you, but if you do open the door, I'm gonna assume it was on purpose, as I'm not trying to be malicious, I'm just a user with a browser.
> AI bots have been asked to leave. But, since they own the mall too, the store owners are more than a little screwed.
I don't understand what you mean by this; what is the mall here? Are you saying that people have websites hosted at OpenAI et al? I'm not sure how the "mall owner" and the people running the AI bots are the same owners.
And finally: https://www.techspot.com/news/105769-meta-reportedly-plannin...
The internet runs on backhaul. A LOT of backhaul is now owned by FAANG. Along with that, most of those companies can financially ruin any business simply by banning them from the platform. So, the companies use their backhaul fiber and peering agreements to crawl everybody else. And nobody says anything because of "The Implication" that if you sue under the Computer Fraud and Abuse Act (among others) they'll just wholesale ban you.
A "door to door" analogy doesn't work because sidewalks are generally considered public. The best I can tweak that analogy is a gated neighborhood where everybody has "no soliciting" signs. (NB: at least in my area, soliciting when there's a no-soliciting sign is an actual crime, on top of being trespassing.)
Blocking misbehaving IP addresses isn’t new, and is another version of the same principle.
Absolutely, I agree that of course people are free to block whatever they want, misbehaving or not. Guess I'm just trying to figure out what sort of "collateral damage" people are OK with when putting up content on the public internet but want it to be selectively available.
> You have no innate right to access the content I share.
No, I guess that's true, I don't have any "rights" to do so. But I am gonna assume that if whatever you host is available without any authentication, protection or similar, you're fine with me viewing that. I'm not saying you should be fine with 1000s of requests per second, but since you made it public in the first place by sharing it, you kind of implicitly agreed for others to view it.
Crawling prevention is not new. Many news outlets and biggish websites were already preventing access by non-human agents in various ways for a very long time.
Now, non-human agents have improved and started to leech everything they can find, so the methods are evolving, too.
News outlets are also public sites on the public internet.
Source-available code repositories are also on the public internet, but said agents crawl and use that code, too, backed by fair-use claims.
While I understand that you may need a personal bot to crawl or mirror a site, I can't guarantee that I'll grant you access.
I don't like to be that heavy-handed in the first place, but capitalism is making it harder to trust entities you can't see and talk to face to face.
The author’s goal is admirable: “My primary principle is that I’d rather not annoy real humans more than strictly intended”. However, the primary goal for many people hosting content will be “block bots and allow humans with minimal effort and tuning”.
* Type this sentence, taken from a famous copyrighted work.
* Type Tiananmen protests.
* Type this list of swear words or sexual organs.
All that's to say that you can stop some of your website contents being quoted by the chatbots verbatim, but you can't prevent the crawlers using up all your bandwidth in the way you describe. You also can't stop your website contents being rehashed in a conceptual way by the chatbot later. So if I just write something copyrighted or taboo here in this comment, that won't stop an LLM being trained on the comment as a whole, but it might stop the chatbot based on that LLM from quoting it directly.
Everything is moving so quickly with AI that my comment is probably out of date the moment I type it... take it with a grain of salt :)
1998: I swear at the computer until the page loads
2025: I swear at the computer until the page loads
all of the stuff that's being complained-about is absolute 100% table-stakes stuff that every http server on the public internet has needed to deal with since, man, i dunno, minimum 15 years now?
as a result literally nobody self-hosts their own HTTP content any more, unless they enjoy the challenge in like a problem-solving sense
if you are even self-hosting some kind of captcha system you've already made a mistake, but apparently this guy is not just hosting but building a bespoke one? which is like, my dude, respect, but this is light years off of the beaten path
the author whinges about google not doing their own internal rate limiting in some presumed distributed system, before any node in that system makes any http request over the open internet. that's fair, and not doing so is maybe bad user behavior, but on the open internet it's the responsibility of the server to protect itself as it needs to; it's not the other way around
everything this dude is yelling about is immediately solved by hosting thru a hosting provider, like everyone else does, and has done, since like 2005
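If the server really is the party responsible for protecting itself, the standard self-defense is per-client rate limiting. A minimal token-bucket sketch follows; the rate and burst values are arbitrary assumptions, and a real deployment would usually do this in the reverse proxy rather than the application.

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-client limiter: `rate` requests/second sustained, bursts up to `burst`."""

    def __init__(self, rate: float = 2.0, burst: int = 20):
        self.rate, self.burst = rate, burst
        self.tokens = defaultdict(lambda: float(burst))   # tokens remaining per client
        self.last = defaultdict(time.monotonic)           # last refill time per client

    def allow(self, client_ip: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last[client_ip]
        self.last[client_ip] = now
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens[client_ip] = min(self.burst, self.tokens[client_ip] + elapsed * self.rate)
        if self.tokens[client_ip] >= 1.0:
            self.tokens[client_ip] -= 1.0
            return True
        return False

limiter = TokenBucket()
# In a request handler: answer 429 Too Many Requests when limiter.allow(remote_ip) is False.
```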
As far as Ted's article goes, the first thing that popped into my head is that most AI crawlers hitting my sites come from big datacenter cities: Dallas, Dublin, etc. I wonder if I could easily geo-block those cities or redirect them to pages with more checks built in. I just haven't looked into that on my CDNs, or in general, in a long time.
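A city-level check is straightforward to prototype. The sketch below assumes MaxMind's geoip2 package and a local GeoLite2-City database; the city list comes straight from the comment above, and whether the collateral damage is acceptable is exactly the trade-off debated elsewhere in the thread. (An ASN lookup against known cloud providers is often a sharper signal than city names.)

```python
import geoip2.database                           # pip install geoip2
from geoip2.errors import AddressNotFoundError

# Cities named above; purely an illustrative deny list.
DATACENTER_CITIES = {"Dallas", "Dublin"}

reader = geoip2.database.Reader("GeoLite2-City.mmdb")   # local DB path is an assumption

def looks_like_datacenter(ip: str) -> bool:
    """True if the client IP geolocates to one of the listed cities."""
    try:
        city = reader.city(ip).city.name
    except AddressNotFoundError:
        return False
    return city in DATACENTER_CITIES

# A handler could redirect such clients to a page with extra checks
# instead of blocking them outright.
```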
They also usually request files from popular PHP frameworks and other things like that. If you don't use PHP, you could autoban on the first request for a PHP page. Likewise for anything else you don't need.
Of the two, looking for .php is probably lightning quick with low CPU/RAM utilization in comparison.
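A sketch of that autoban idea, assuming an in-process ban set purely for illustration; a real setup would more likely log the hit and let fail2ban or the reverse proxy enforce the ban.

```python
BANNED_IPS: set = set()

# Extensions this (hypothetically non-PHP) site never serves; a request
# for one is treated as a probe or a misbehaving crawler.
PROBE_SUFFIXES = (".php", ".asp", ".aspx")

def check_request(remote_ip: str, path: str) -> bool:
    """Return True to serve the request, False to refuse it."""
    if remote_ip in BANNED_IPS:
        return False
    if path.lower().endswith(PROBE_SUFFIXES):
        BANNED_IPS.add(remote_ip)        # first .php request is an instant ban
        return False
    return True
```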
8 more comments available on Hacker News