How I Protect My Forgejo Instance From AI Web Crawlers
Key topics
The eternal struggle against AI web crawlers has sparked a lively debate, with one developer sharing their tactics for safeguarding their Forgejo instance. As commenters weighed in, it became clear that the issue isn't just about the volume of requests, but also the bandwidth and inefficiency of crawlers, with some noting that repeated downloads of whole repositories can waste resources. While some argued that a fast enough web server can render the scraping problem moot, others countered that this doesn't address the underlying issue of crawler inefficiency. The discussion also touched on potential solutions, including Cloudflare's "pay-per-crawl" feature, although some self-hosting enthusiasts balked at introducing cloud solutions into their setup.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 41m
Peak period: 45 comments (18-24h)
Avg / period: 9
Based on 99 loaded comments
Key moments
- 01 Story posted: Dec 21, 2025 at 9:46 AM EST (12 days ago)
- 02 First comment: Dec 21, 2025 at 10:27 AM EST (41m after posting)
- 03 Peak activity: 45 comments in the 18-24h window (hottest window of the conversation)
- 04 Latest activity: Dec 25, 2025 at 2:28 PM EST (8 days ago)
It's easy to assume "I received a lot of requests, therefore the problem is too many requests" but you can successfully handle many requests.
This is a clever way of doing a minimally invasive botwall though - I like it.
There is a point where your web server becomes fast enough that the scraping problem becomes irrelevant. Especially at the scale of a self-hosted forge with a constrained audience. I find this to be a much easier path.
I wish we could find a way to not conflate the intellectual property concerns with the technological performance concerns. It seems like this is essential to keeping the AI scraping drama going in many ways. We can definitely make the self hosted git forge so fast that anything short of ~a federal crime would have no meaningful effect.
It isn't just the volume of requests, but also bandwidth. There have been cases where scraping represents >80% of a forge's bandwidth usage. I wouldn't want that to happen to the one I host at home.
The market price for bandwidth in a central location (USA or Europe) is around $1-2 per TB and less if you buy in bulk. I think it's somewhat cheaper in Europe than in the USA due to vastly stronger competition. Hetzner includes 20TB outgoing with every Europe VPS plan, and 1€/TB +VAT overage. Most providers aren't quite so generous but still not that bad. How much are you actually spending?
That sounds like a bug.
https://blog.cloudflare.com/introducing-pay-per-crawl/
Our open-source system allows blocking IP addresses based on rules triggered by specific behavior.
Can you elaborate on what exact type of crawlers you would like to block? Like, a leaky bucket of a certain number of requests per minute?
1. https://github.com/tirrenotechnologies/tirreno
The simplest approach is to treat certain UAs as risky, or to flag multiple 404 errors or HEAD requests, and block on that. We already ship those rules out of the box.
But as it's open source, it's no trouble to write specific rules for rate limiting, hence my question.
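For reference, the leaky bucket being asked about is only a handful of lines. Here is a minimal standalone sketch in Python (this is not tirreno's rule API; the capacity and drain rate are made-up numbers):

```python
import time
from collections import defaultdict

# Minimal per-IP leaky bucket: allow bursts up to CAPACITY requests,
# draining at RATE requests per second (both values are placeholders).
CAPACITY = 60
RATE = 1.0  # roughly 60 requests/minute sustained

_buckets = defaultdict(lambda: (0.0, time.monotonic()))  # ip -> (level, last_seen)

def allow(ip: str) -> bool:
    level, last = _buckets[ip]
    now = time.monotonic()
    level = max(0.0, level - (now - last) * RATE)  # drain since the last request
    if level + 1 > CAPACITY:
        _buckets[ip] = (level, now)
        return False  # bucket full: rate limit this request
    _buckets[ip] = (level + 1, now)
    return True
```

Anything more elaborate (per-path budgets, state shared across workers) builds on the same shape.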
It's trivial to spoof UAs unfortunately.
I believe that if something is publicly available, it shouldn't be overprotected in most cases.
However, there are many advanced cases, such as crawlers that collect data for platform impersonation (for scams) or custom phishing attacks, or account brute-force attacks. In those cases, I use tirreno to see traffic through different dimensions.
Depends on the goal.
Author wants his instance not to get killed. Request rate limiting may achieve that easily in a way transparent to normal users.
The problem is, bots can easily resort to residential proxies, at which point you'll end up blocking legitimate traffic.
I believe there is a low chance that a real customer behind such a residential IP will ever visit your resource. And if you run an EU service, there is little harm in blocking Asian IPs, and vice versa.
What is really important here is that most people block IPs on autopilot without seeing the distribution of their actions, and this really matters.
Bad crawlers have been around since the very beginning. Some of them look for known vulnerabilities; some scrape content for third-party services. Most of them spoof their UAs to pretend to be legitimate bots.
Therefore, our current strategy for bad crawler mitigation is first of all to flag HEAD requests, then to check if a bot has multiple 404 errors or requests vulnerable URLs like wp-admin, etc., and finally to check for bad UAs (like Go-Client 1.0).
This is approximately 30–40% of traffic on any website.
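As a rough illustration of that strategy (not tirreno's actual implementation; the thresholds, scores, and request-record shape are all assumptions), the whole thing fits in a short scoring function:

```python
from collections import Counter

# Assumed heuristics from the comment above: HEAD probing, repeated 404s,
# vulnerability-scanner paths, and generic client UAs.
SUSPICIOUS_PATHS = ("/wp-admin", "/wp-login.php", "/.env", "/xmlrpc.php")
BAD_UA_HINTS = ("go-http-client", "python-requests", "curl/", "scrapy")

def score_client(requests):
    """requests: list of dicts with 'method', 'path', 'status', 'ua' keys."""
    score = 0
    statuses = Counter(r["status"] for r in requests)
    if any(r["method"] == "HEAD" for r in requests):
        score += 1                                    # HEAD probing
    if statuses[404] >= 5:
        score += 2                                    # many 404s: blind crawling
    if any(r["path"].startswith(SUSPICIOUS_PATHS) for r in requests):
        score += 3                                    # vulnerability scanning
    if any(h in r["ua"].lower() for r in requests for h in BAD_UA_HINTS):
        score += 1                                    # generic client UA
    return score                                      # e.g. block at score >= 3
```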
Thanks for this tip.
I hoped to keep them from getting stuck by using a robots.txt, but they refuse to obey it and keep hitting that page with various params. No problem for me, but they're going nowhere.
Hopefully it at least costs them a little bit more.
AI companies need to:
- have data to train on
- update the data more or less continuously
- answer queries from users on the fly
With a lot of AI companies, that generates a lot of scraping. Also, some of them behave terribly when scraping, or are simply bad at it.
Because big sites have decades of experience fighting against scrapers and have recently upped their game significantly so that they're the only ones that can train AI on their own data.
So now, when you're starting from scratch and your goal is to gather as much data as possible, targeting smaller sites with weak or non-existent scraping protection is the path of least resistance.
Because people are reporting constant traffic, which would imply that the site is being scraped millions of times per year. How does that make any sense? Are there millions of AI companies?
Git forges expose a version of every file at every commit in the project's history. If you have a medium-sized project consisting of, say, 1,000 files and 10,000 commits, the crawler will identify on the order of 1,000 × 10,000 = 10 million URLs, the same order of magnitude as English Wikipedia, just for that one project. This is also very expensive for the git forge, as it needs to reconstruct the historical files from a bunch of commits.
Git forges interact spectacularly poorly with naively implemented web crawlers, unless the crawlers put in logic to avoid exhaustively crawling git forges. You honestly get a pretty long way just excluding URLs with long base64-like path elements, which isn't hard but it's also not obvious.
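One way to implement that exclusion, sketched in Python: drop any URL whose path contains a long hex- or base64-looking segment (on a forge those are almost always commit hashes or blob IDs) or one of the history-expanding route names. The patterns below are assumptions, not Forgejo's canonical routes:

```python
import re
from urllib.parse import urlsplit

# Path segments that look like commit hashes or long base64-ish blob IDs.
# Crude by design: a very long plain repo name could be a false positive.
HEXISH = re.compile(r"^[0-9a-f]{12,40}$", re.IGNORECASE)
BASE64ISH = re.compile(r"^[A-Za-z0-9+/_=-]{24,}$")

def worth_crawling(url: str) -> bool:
    """Return False for URLs that explode a git forge's history into
    millions of near-duplicate pages (per-commit blobs, raw files, zips)."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    if any(HEXISH.match(s) or BASE64ISH.match(s) for s in segments):
        return False
    if any(s in ("raw", "blame", "commit", "archive") for s in segments):
        return False
    return True
```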
I think the idea that you can block bots and allow humans is fallacious.
We should focus on the specific behaviour that causes problems (like making a bajillion requests, one for each commit, instead of cloning the repo). To fix this, we should block clients that work in such ways. If these bots learn to request at a reasonable pace, who cares whether they are bots, humans, bots under the control of an individual human, or bots owned by a huge company scraping for training data? Once you make your code (or anything else) public, trying to limit access to only a certain class of consumers is a waste of effort.
Also, perhaps I'm biased, because I run SearXNG and Crawl4AI (and a few ancillaries like Jina reranking, etc.) in my homelab, so I can tell my AI to perform live internet searches and fetch just about any website. For code it has a way to clone repos, but for things like issues, discussions, and PRs it mostly goes to GitHub.
I like that my AI can browse almost like me. I think this is the future way to consume a lot of the web (except sites like this one that are an actual pleasure to use).
The models sometimes hit sites they can't fetch. For those I use Firecrawl. I use an MCP proxy that lets me rewrite the tool descriptions, so my models get access to both my local Crawl4AI and the hosted (and rather expensive) Firecrawl, but they're told to use Firecrawl only as a last resort.
The more people use these kinds of solutions, the more incentive there will be for sites not to block users who rely on automation. Of course they'll have to rely on alternative monetisation methods, but I think eventually these stupid captchas will disappear and reasonable rate limiting will prevail.
I think I see many prompt injections in your future. Like captchas with a special bypass solution just for AIs that leads to special content.
https://docs.netscaler.com/en-us/citrix-adc/current-release/...
Other vendors implement and name it slightly differently, but it's always built around the core observation that a "normal" browsing session lands on your site via search, a link, or a bookmark, and progresses from there. If the server sees requests for "random" URLs arriving in a sequence wildly different from that, they get blocked. There is a lot more nuance in implementing this properly without drowning in false positives, but that's the basic idea.
Related to "deeplink protection", but not quite the same.
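The core of that heuristic is small enough to sketch. This is a simplified illustration of the idea, not the NetScaler feature; the entry pages and the per-session record are assumptions, and a real deployment needs much more nuance (bookmarked deep links, shared URLs, and so on):

```python
# A "normal" session starts at an entry page, a search result, or a link
# (i.e., with a referrer) and walks from there; a scraper's first request
# often lands cold on a deep URL with no referrer.
ENTRY_PAGES = {"/", "/explore", "/login"}  # assumed entry points

def first_request_looks_organic(path: str, referrer: str) -> bool:
    """Evaluate the first request seen for a brand-new session."""
    if referrer:
        return True          # arrived via a link or a search result
    return path in ENTRY_PAGES
```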
Since I do actually host a couple of websites / services behind port 443, I can't just block everything that tries to scan my IP address on port 443. However, I've set up Cloudflare in front of those websites, so I log and block any non-Cloudflare traffic (identified by Cloudflare's ASN, 13335) coming into port 443.
I also log and block IP addresses attempting to connect on port 80, since that's essentially deprecated.
This, of course, does not block traffic coming via the DNS names of the sites, since that will be routed through Cloudflare - but as someone mentioned, Cloudflare has its own anti-scraping tools. And then as another person mentioned, this does require the use of Cloudflare, which is a vast centralising force on the Internet and therefore part of a different problem...
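A minimal sketch of the allow-list side of that setup, assuming you check against Cloudflare's published IP ranges (available at https://www.cloudflare.com/ips-v4 and /ips-v6) rather than resolving ASN 13335; the two ranges shown are examples from that list, not the full set:

```python
import ipaddress

# Allow only connections that come in through Cloudflare. In practice you
# would fetch and refresh the full published list rather than hard-code it.
CLOUDFLARE_RANGES = [
    ipaddress.ip_network("173.245.48.0/20"),
    ipaddress.ip_network("103.21.244.0/22"),
]

def from_cloudflare(remote_ip: str) -> bool:
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in net for net in CLOUDFLARE_RANGES)
```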
I don't currently split out a separate list for IP addresses that have connected to HTTP(S) ports, but maybe I'll do that over Christmas.
This is my current simple project: https://github.com/UninvitedActivity/UninvitedActivity
Apologies if the README is a bit rambling. It's evolved over time, and it's mostly for me anyway.
P.S. I always thought it was Yog Sothoth. Either way, I'm partial to Nyarlathotep. "The Crawling Chaos" always sounded like the coolest of the elder gods.
1. The websites I run get so little traffic it doesn't matter. They're mostly for my own entertainment / experimentation.
2. If they're allowing their IP address to be used for pricing scrapers then I consider that within the blurry definition of malicious anyway.
I don't mind if you disagree with me on point #2, and I grant that if I was running some super popular web service, maybe my tune would change.
I have four tiers of scanning paranoia, so I can ramp up and down if need be (I'm not sure if that's documented in GitHub though...)
[0] https://github.com/UninvitedActivity/UninvitedActivity/blob/...
Temubis
The deeper issue is that git forges are pathological for naive crawlers: every commit/file combo is a unique URL, so one medium repo explodes into Wikipedia-scale surface area if you just follow links blindly. A more robust pattern for small instances is to explicitly rate limit the expensive paths (/raw, per-commit views, “download as zip”), and treat “AI” as an implementation detail. Good bots that behave like polite users will still work; the ones that try to BFS your entire history at line rate hit a wall long before they can take your box down.
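Sketching that "rate limit the expensive paths" idea: classify each request by path and give the history-exploding endpoints a much tighter budget than ordinary pages. The URL patterns below are loose approximations of Forgejo/Gitea-style routes, and the budgets are made up:

```python
import re

# Assumed patterns for the expensive endpoints named above; everything
# else falls through to the generous default budget.
EXPENSIVE = [
    re.compile(r"/raw/"),                        # raw file at a specific ref
    re.compile(r"/commit/[0-9a-f]{7,40}"),       # single-commit views
    re.compile(r"/archive/.+\.(zip|tar\.gz)$"),  # whole-repo downloads
]

def limit_for(path: str) -> int:
    """Requests-per-minute budget for a path class (numbers are invented)."""
    if any(p.search(path) for p in EXPENSIVE):
        return 10      # tight budget for history-exploding endpoints
    return 120         # generous budget for normal browsing
```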
For some people it's an ideological one--we don't want AI vacuuming up all of our content. For those, "is this an AI user?" is a useful question to answer. However it's a hard one.
For many, the problem is simply "there is a class of users putting way too much load on the system and it's causing problems". Initially I was playing whack-a-mole with this and dealing with alerts firing on a regular basis because Meta was crawling our site very aggressively, not backing off when errors were returned, etc.
I looked at rate limiting but the work involved in distributed rate limiting versus the number of offenders involved made the effort look a little silly, so I moved towards a "nuke it from orbit" strategy:
Requests are bucketed by class C subnet (31.13.80.36 -> 31.13.80.x), and the request rate is tracked over 30-minute windows. If the request rate over that window exceeds a very generous threshold (one I've only seen a few very obvious and poorly behaved crawlers exceed), it fires an alert.
The alert kicks off a flow where we look up the ASN covering every IP in that range, look up every range associated with those ASNs, and throw an alert in Slack with a big red "Block" button attached. When approved, the entire ASN is blocked at the edge.
It's never triggered on anything we weren't willing to block (e.g., a local consumer ISP). We've dropped a handful of foreign providers, some "budget" VPS providers, some more reputable cloud providers, and Facebook. It didn't take long before the alerts stopped--both for high request rates and our application monitoring seeing excessive loads.
If anyone's interested in trying to implement something similar, there's a regularly updated database of ASN <-> IP ranges announced here: https://github.com/ipverse/asn-ip
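A minimal sketch of the bucketing step described above (not the commenter's actual pipeline; the threshold is invented): collapse each client IP to its /24, count requests per 30-minute window, and hand anything over the threshold to the ASN-lookup and "Block" flow.

```python
import ipaddress
import time
from collections import defaultdict

WINDOW = 30 * 60          # seconds
THRESHOLD = 5000          # requests per /24 per window (made-up number)

_counts = defaultdict(int)  # (subnet, window_start) -> request count

def record(ip: str) -> None:
    subnet = ipaddress.ip_network(f"{ip}/24", strict=False)  # 31.13.80.36 -> 31.13.80.0/24
    window = int(time.time() // WINDOW) * WINDOW
    _counts[(subnet, window)] += 1
    if _counts[(subnet, window)] == THRESHOLD:
        alert(subnet)     # hand off to the ASN lookup / "Block" button flow

def alert(subnet) -> None:
    print(f"ALERT: {subnet} exceeded {THRESHOLD} requests in a 30-minute window")
```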
I could justify it a number of ways, but the honest answer is "expiring these is more work that just hasn't been needed yet". We hit a handful of bad actors, banned them, have heard no negative outcomes, and there's really little indication of the behaviour changing. Unless something shows up and changes the equation, right now it looks like "extra effort to invite the bad actors back to do bad things" and... my day is already busy enough.
What exactly is the source of these mappings? Never heard about ipverse before, seems to be a semi-anonymous GitHub organization and their website has had a failing certificate for more than a year by now.
If they're targeting code forges specifically (because they're after coding use cases), there's lots of interesting information that you won't get by just cloning a repo.
It's not just the current state of the repo, or all the commits (and their messages). It's the initial issue (and discussion) that led to a pull request (and review comments) that eventually gets squashed into a single commit.
The way you code with an agent is much closer to that issue, comments, change, review, refinement sequence, which you only get by slurping the website.
Unfortunately, this kind of scraping seems to inconvenience the host way more than the scraper.
Another tangent: there probably are better behaved scrapers, we just don't notice them as much.
Having to use a browser to crawl your site will slow down naive crawlers at scale.
But it wouldn't do much against individuals typing "what is a kumquat" into their local LLM tool that issues 20 requests to answer the question. They're not really going to care nor notice if the tool had to use a playwright instance instead of curl.
Yet it's that use case that is responsible for nearly all of my AI bot traffic according to Cloudflare, which is 30x the traffic from direct human users. In my case, it made more sense to just block the traffic.
I assume most traffic comes from hosted LLM chats (e.g. chatgpt.com) where the provider (e.g. OpenAI) is making the requests from their own servers.
I would really like an easy way to serve some Markov chain nonsense to AI bots.
1. https://iocaine.madhouse-project.org/
2. https://poison.madhouse-project.org/
Feel free to test this with any classifier or cheapo LLM.
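iocaine (linked above) is the ready-made option; for the curious, the underlying trick is tiny. Here is a word-level Markov chain as a sketch (not iocaine's implementation):

```python
import random
from collections import defaultdict

def build_chain(corpus: str):
    """Word-level Markov chain: map each word to the words that follow it."""
    words = corpus.split()
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def babble(chain, length=50):
    """Generate plausible-looking nonsense, e.g. babble(build_chain(text))."""
    word = random.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            word = random.choice(list(chain))   # dead end: jump elsewhere
        else:
            word = random.choice(followers)
        out.append(word)
    return " ".join(out)
```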
This is all "legitimate" traffic in that it isn't about crawling the internet but in service of a real human.
Put another way, search is moving from a model of crawl the internet and query on cached data to being able to query on live data.
But when it comes to git repos, an LLM agent like claude code can just clone them locally for local crawling which is far better than crawling remotely.
Since I don't make this node publicly accessible, there's no need to worry about AI web crawlers :)
I wouldn't be surprised if all this AI stuff was just a global conspiracy to get everyone to turn on JS.
Then add some proof-of-work that works without JS and is a web standard (e.g., the server sends a header, the client sends the correct response and gets access to the archive), mainstream a culture of low-cost hosting for such archives, and you're done. Also make sure this sort of feature is enabled in the most basic configuration of every web server, and logged separately.
Obviously such a thing will never happen, because the web and culture went in a different direction. But if it were a mainstream thing, you'd get easy to consume archives (also for regular archival and data hoarding) and the "live" versions of sites wouldn't have their logs be bogged down by stupid spam.
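No such standard exists today, as the comment notes, but the handshake it describes is easy to picture. A hashcash-style sketch with hypothetical header names ("X-PoW-Challenge" / "X-PoW-Response") and an arbitrary difficulty:

```python
import hashlib
import itertools
import os

DIFFICULTY = 16  # required leading zero bits (arbitrary tuning knob)

def issue_challenge() -> str:
    """Server side: value for a hypothetical 'X-PoW-Challenge' header."""
    return os.urandom(16).hex()

def solve(challenge: str) -> int:
    """Client side: find a nonce whose hash clears the difficulty bar."""
    for nonce in itertools.count():
        if _leading_zero_bits(challenge, nonce) >= DIFFICULTY:
            return nonce

def verify(challenge: str, nonce: int) -> bool:
    """Server side: check the hypothetical 'X-PoW-Response' header."""
    return _leading_zero_bits(challenge, nonce) >= DIFFICULTY

def _leading_zero_bits(challenge: str, nonce: int) -> int:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return 256 - int.from_bytes(digest, "big").bit_length()
```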
For a single user or a small team this could be enough.
Or have a web-proxy that matches on the pattern and extracts the cookie automatically. ;-)