Guarding My Git Forge Against AI Scrapers
Key topics
The cat-and-mouse game between Git forge maintainers and AI scrapers just got more intense, with one developer sharing their clever tactics to deter these data-hungry bots. While some commenters cheered on the author's efforts, others debated the ethics and potential consequences of "poisoning" the data used to train large language models (LLMs). The discussion took a geopolitical turn when Russia's alleged involvement in "LLM grooming" was brought up, sparking a heated exchange about the country's intentions and reputation. Amidst the controversy, some commenters pointed out the article's claims were backed by multiple sources, while others remained skeptical.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 56m after posting
- Peak period: 90 comments in 0-12h
- Avg / period: 12.6 comments
Based on 139 loaded comments
Key moments
- Story posted: Dec 12, 2025 at 2:51 AM EST (22 days ago)
- First comment: Dec 12, 2025 at 3:47 AM EST, 56m after posting
- Peak activity: 90 comments in 0-12h, the hottest window of the conversation
- Latest activity: Dec 18, 2025 at 9:23 PM EST (15 days ago)
I'm also left wondering what other things you could do. For example, I have several friends who built their own programming languages; I wonder what the impact would be if you translated lots of repositories into your own language and hosted them for bots to scrape. Could you introduce sufficient bias in an LLM to make an esoteric programming language popular?
it's called "LLM grooming"
https://thebulletin.org/2025/03/russian-networks-flood-the-i...
> undermining democracy around the globe is arguably Russia’s foremost foreign policy objective.
Right, because Russia is such a cartoonish villain that it has no interest in pursuing its own development and good relations with any other country; all it cares about is annoying the democratic countries with propaganda about their own messed-up politics.
When did it become acceptable for journalists to make bold, generalizing claims against whole nations without a single piece of direct, falsifiable evidence for what they claim, and worse, claims that can be easily dismissed as obviously false by quickly looking at their policies and diplomatic interactions with other countries?!
That's actually pretty much spot on.
The fact that you think this is something to do with "both sides" instead of a simple question of facts really gives you away.
It's funny how trolls like you talk about the importance of listening to what Russians say, but flat out refuse to see and listen to what Russians who are not members of Putin's gang are saying.
Oh my god, you're just ignorant.
NATO itself said back in 2008 that Ukraine would eventually join the alliance.
> At the 2008 Bucharest summit, NATO declined to offer Ukraine a Membership Action Plan, but said that Ukraine would eventually join the alliance.
Source: https://en.wikipedia.org/wiki/Ukraine%E2%80%93NATO_relations
Also mentioned even in NATO's own website:
https://www.nato.int/en/what-we-do/partnerships-and-cooperat...
You can read the full statement from 2008 here: https://www.nato.int/en/about-us/official-texts-and-resource...
This was the trigger for the Russian invasion of Georgia, which was also mentioned in the above statement.
Do you think NATO is spreading Russian propaganda on its website??
I do agree with you that there are many news sites spreading misinformation, but I think that most of it is not coming from governments... and while governments are also doing this, most, I would think, do it with good intentions (they do believe the information is true and barely verify that when it favours their preconceived points of view). When propaganda spreads information you like, you tend to call it just news.
The way Western media currently dismisses anything at all that comes from Russian sources as lies and propaganda, however, is way overblown in my opinion. That's causing a huge blind spot in the public discourse, which just makes the fake news sources seem even more attractive, since they seem to be whistleblowers fighting against a campaign of silence from the mainstream media - which is not completely incorrect.
[1] https://www.newsguardtech.com/special-reports/john-mark-doug...
[2] https://medium.com/@amithnmbr/why-its-important-to-know-how-...
Wasn't there a study a while back showing that a small sample of data is enough to poison an LLM? So I'd say it's certainly possible.
That doesn't even sound all that bad if you happen to catch a human. You could even tell them pretty explicitly with a banner that they were browsing the site in no-links mode for AI bots. Put one link to an FAQ page in the banner, since that at least is easily cached.
Failing that, I would use Chrome / PhantomJS or similar to browse the page in a real headless browser.
Recognize that a website is a Git repo web interface. Invoke elaborate Git-specific logic. Get the repo link, git clone it, process the cloned data, mark it for re-indexing, and then keep re-indexing only the parts of the site that aren't included in the repo itself - like issues and pull request messages.
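A crawler that wanted to behave this way wouldn't need much. Here is a rough Go sketch of the idea; the forge-detection heuristic, the example URL, and the shallow-clone choice are illustrative assumptions, not a description of what any real scraper actually does:

```go
package main

import (
	"fmt"
	"net/http"
	"os/exec"
	"time"
)

// looksLikeGitRepo probes the "smart HTTP" endpoint that git servers
// (including Gitea/Forgejo web UIs) expose for fetches. Purely a heuristic.
func looksLikeGitRepo(repoURL string) bool {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(repoURL + "/info/refs?service=git-upload-pack")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK &&
		resp.Header.Get("Content-Type") == "application/x-git-upload-pack-advertisement"
}

func main() {
	repo := "https://git.example.org/alice/project" // hypothetical repo URL
	if looksLikeGitRepo(repo) {
		// One shallow clone replaces thousands of page loads; the web UI
		// then only needs crawling for what isn't in the repo (issues, PRs).
		out, err := exec.Command("git", "clone", "--depth", "1", repo).CombinedOutput()
		fmt.Println(string(out), err)
	}
}
```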
The scrapers that are designed with effort usually aren't the ones webmasters end up complaining about. The ones that go for quantity over quality are the worst offenders. AI inference-time data intake with no caching whatsoever is the second worst offender.
If you mean scrapers in terms of the bots, it is because they are basically scraping web content via HTTP(S) generally, without specific optimisations using other protocols at all. Depending on the use case intended for the model being trained, your content might not matter at all, but it is easier just to collect it and let it be useless than to optimise it away⁰. For models where your code in git repos is going to be significant for the end use, web scraping generally proves to be sufficient, so any push to write specific optimisations for git repos would come from academic interest rather than an actual need.
If you mean scrapers in terms of the people using them, they are largely akin to “script kiddies” just running someone else's scraper to populate their model.
If you mean scrapers in terms of the people writing them, then the fact that plain web scraping is sufficient, as mentioned above, is likely the significant factor.
> why the scrapers do not do it in a smarter way
A lot of the behaviours seen are easier to reason about if you stop considering scrapers (the people using scraper bots) to be intelligent, respectful, caring people who might give a damn about the network as a whole, or who might care about doing things optimally. Things make more sense if you consider them to be in the same bucket as spammers, who are out for a quick, lazy gain for themselves and don't care (or even have the foresight to realise) how much it might inconvenience¹ anyone else.
----
[0] the fact this load might be inconvenient to you is immaterial to the scraper
[1] The ones that do realise they might cause an inconvenience usually take the view that it is only a small one, and how could the inconvenience that little old them are imposing really be that significant? They don't take the extra step of considering how many people like them are out there thinking the same thing. Or they think that if other people are doing it, what is the harm in just one more? Or they simply take the view “why should I care if getting what I want inconveniences anyone else?”.
I highly encourage folks to put stuff out there! Put your stuff on the internet! Even if you don't need it, even if you don't think you'll necessarily benefit: leave the door open to possibility!
I have `REQUIRE_SIGNIN_VIEW=true` and I see nothing but my own traffic on Gitea's logs.
Is it because I'm using a subdomain that doesn't imply there's a Gitea instance behind?
REQUIRE_SIGNIN_VIEW=true means signin is required for all pages - that's great and definitely stops AI bots. The signin page is very cheap for Gitea to render. However, it is a barrier for the regular human visitors to your site.
'expensive' is a middle ground that lets normal visitors browse and explore repos, view the README, and download release binaries. Signin is only required for "expensive" pageloads, such as viewing file content at specific commits or browsing git history.
From Gitea's docs I was under the impression that it went further than "true", so I didn't understand why, because "true" was enough for me not to be bothered by bots.
But in your case you want a middle-ground, which is provided by "expensive"!
> Enable this to force users to log in to view any page or to use API. It could be set to "expensive" to block anonymous users accessing some pages which consume a lot of resources, for example: block anonymous AI crawlers from accessing repo code pages. The "expensive" mode is experimental and subject to change.
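For concreteness, this is roughly what the setting looks like in Gitea's `app.ini`. A minimal sketch, assuming a recent Gitea release where the experimental "expensive" value quoted above is available:

```ini
; app.ini
[service]
; "true"      = sign-in required for every page (stops anonymous bots entirely)
; "expensive" = anonymous users may still browse cheap pages, but
;               resource-heavy ones (file content at specific commits,
;               git history, ...) require sign-in
REQUIRE_SIGNIN_VIEW = expensive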
Forgejo doesn't seem to have copied that feature yet
do we want to change that? do we want to require scrapers to pay for network usage, like the ISPs were demanding from Netflix? is net neutrality a bad thing after all?
So much real human traffic that it brings their site down?
I mean yes it's a problem, but it's a good problem.
b) They have a complete lack of respect for robots.txt
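For reference, asking nicely looks like this: a typical robots.txt aimed at AI crawlers (the user-agent tokens are examples of commonly published crawler names; the complaint above is precisely that many scrapers never read the file at all):

```
# robots.txt - purely advisory; compliant crawlers honor it, abusive ones don't
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Bytespider
Disallow: /

User-agent: *
Crawl-delay: 10
```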
I'm starting to think that aggressive scrapers are part of an ongoing business tactic against the decentralized web. Gmail makes self hosted mail servers jump through arduous and poorly documented hoops, and now self hosted services are being DDOSed by hordes of scrapers…
When scraping was mainly used to build things like search indexes, which are ultimately mutually beneficial to both the website owner and the search engine, and the scrapers were not abusive, nobody really had a problem.
But, for generative AI training and access, with scrapers that DDoS everything in sight, and which ultimately cause visits to the websites to fall significantly and merely return a mangled copy of its content back to the user, scraping is a bad thing. It also doesn't help that the generative AI companies haven't paid most people for their training data.
But for some reason corporations don't want that; I guess they want to be allowed to just take from the commons and give nothing in return :/
In the article, quite a few listed sources of traffic would simply be completely unable to access the server if the author could get away with a geoblock.
It's funny observing their tactics though. On the whole, spammers have moved from the bare domain to various subdomains like @outreach.domain, @msg.domain, @chat.domain, @mail.domain, @contact.domain and most recently @email.domain.
It's also interesting watching the common parts before the @. Most recently I've seen a lot of marketing@, before that chat@ and about a month after I blocked that chat1@. I mostly block *@domain though, so I'm less aware of these trends.
I got accidentally locked out of my server when I connected over Starlink, whose IP geolocates to the US even though I was physically in Greece.
As practical advice, I would use a blocklist for commerce websites and an allowlist for infra/personal.
In the end I found another online store, paid $74, and got the device. So the better store lost the sale due to blocking non-US orders.
I don't know how much of a corner case this is.
I'm not saying don't block, just saying be aware of the unintended blocks and weigh them.
Every country has (at the very least) a few bad actors, it's a small handful of countries that actively protect their bad actors from any sort of accountability or identification.
But the numbers don't lie. In my case, I locked down to a fairly small group of European countries and the server went from about 1,500 bot scans per day down to 0.
The tradeoff is just too big to ignore.
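If you already serve HTTP from Go, a country allowlist at the application layer is only a few lines. A rough sketch using a local GeoLite2 database via github.com/oschwald/geoip2-golang; the country list and database path are assumptions for illustration, and dropping packets at the firewall, as other comments describe, avoids even the TLS handshake:

```go
package main

import (
	"log"
	"net"
	"net/http"

	"github.com/oschwald/geoip2-golang"
)

// allowed is an illustrative allowlist; pick whatever fits your audience.
var allowed = map[string]bool{"GR": true, "DE": true, "FR": true, "NL": true}

func geoFence(db *geoip2.Reader, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		host, _, err := net.SplitHostPort(r.RemoteAddr)
		if err != nil {
			host = r.RemoteAddr
		}
		ip := net.ParseIP(host)
		if ip == nil {
			http.Error(w, "bad address", http.StatusBadRequest)
			return
		}
		rec, err := db.Country(ip)
		if err != nil || !allowed[rec.Country.IsoCode] {
			http.Error(w, "not available in your region", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	db, err := geoip2.Open("GeoLite2-Country.mmdb") // path is an assumption
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello\n"))
	})
	log.Fatal(http.ListenAndServe(":8080", geoFence(db, mux)))
}
```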
Because of that, everyone has to come up with their own clever solution.
This will be in part people on home connections tinkering with LLMs at home, blindly running some scraper instead of (or as well as) using the common pre-scraped data-sets and their own data. A chunk of it will be from people who have been compromised (perhaps by installing/updating a browser add-in or “free” VPN client that has become (or always was) nefarious) and their home connection is being farmed out by VPN providers selling “domestic IP” services that people running scrapers are buying.
I recall that bot farms use pre-paid SIM cards for their data connections so that their traffic comes from a good residential ASN.
No client compromise required; it's a network abuse that gives you a good reputation if you use mobile data.
But yes, selling botnets made of compromised devices is also a thing.
Much more likely are those companies that pay people (or trick people) into running proxies on their home networks to help with giant scraping projects that want to rotate through thousands of "real" IPs.
If you're just blocking their IP, you could explain it, e.g. "we've had to block your access because your device, or a device on your network, appears to be sending abusive traffic to us. This sometimes happens because an app on your device is doing this without your knowledge".
If you've had to block a whole AS, you could name and shame them, e.g. "we've had to block your access because your internet provider, <name>, appears to be sending abusive traffic to us. consider letting them know, or moving providers".
Of course, this would almost certainly be commercially stupid. It would also only really work if everyone decided to do this, which realistically will never happen.
Hitting the backing git implementation directly within the request/response loop seems like a good way to burn CPU cycles and create unnecessary disk reads from .git folders, possibly killing your drives prematurely. Then stick a memcache in front and call it a day, no?
In the age of cheap and reliable SSDs (approaching memory read speeds), you should just be batch-rendering file pages from git commit hooks. Leverage external workers for rendering the largely static content. Web-hosted git code is more often read than written in these scenarios, so why hit the underlying git implementation or DB directly at all? Do that for POSTs, sure, but that's not what we're talking about (I think?)
Is this viable?
no
for many reasons
Anyway, test some scrapers and bots here [1] and let me know if they get through.
[1] - https://mirror.newsdump.org/bot_test.txt
My company runs our VPN from our datacenter (although we have our own IP block, which hopefully doesn’t get blocked)
Those with revenue-generating systems should capture TCP SYN traffic for a while, monitor access logs, and give it the old college try to correlate bots vs legit users with traffic characteristics. Sometimes generalizations can be derived from the correlation, and some of those generalizations can be permitted or denied. There really isn't a one-size-fits-all solution, but hopefully my example can give ideas in additional directions to go. Git repos are probably the hardest to protect since I presume many of the git libraries and tools are using older protocols and may look a lot like bots. If one could get people to clone/commit with SSH, there are additional protections that can be utilized at that layer.
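For the capture side, something as small as this is enough to get per-source SYN counts you can later correlate with access logs. A sketch using gopacket; the interface name, port, and reporting threshold are assumptions, and plain tcpdump plus a script works just as well:

```go
package main

import (
	"fmt"
	"log"

	"github.com/google/gopacket"
	"github.com/google/gopacket/pcap"
)

func main() {
	// "eth0" is an assumption; use the interface your service listens on.
	handle, err := pcap.OpenLive("eth0", 1600, true, pcap.BlockForever)
	if err != nil {
		log.Fatal(err)
	}
	defer handle.Close()

	// Only capture initial SYNs (SYN set, ACK clear): new inbound connection attempts.
	filter := "tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn and dst port 443"
	if err := handle.SetBPFFilter(filter); err != nil {
		log.Fatal(err)
	}

	counts := map[string]int{}
	src := gopacket.NewPacketSource(handle, handle.LinkType())
	for packet := range src.Packets() {
		nl := packet.NetworkLayer()
		if nl == nil {
			continue
		}
		srcIP := nl.NetworkFlow().Src().String()
		counts[srcIP]++
		// Crude signal: lots of fresh connections from one source is a
		// pattern worth cross-checking against your access logs.
		if counts[srcIP]%100 == 0 {
			fmt.Printf("%s has opened %d connections\n", srcIP, counts[srcIP])
		}
	}
}
```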
And what is the effect?
Our public web dataset goes back to 2008, and is widely used by academia and startups.
- How often is that updated?
- How current is it at any point in time?
- Does it have historical / temporal access, i.e. can you check the history of a page à la the Internet Archive?
- it's a historical archive, the concept of "current" is hard to turn into a metric
- not only is our archive historical, it is included in the Internet Archive's wayback machine.
>i am lux (it/they/she in English, ça/æl/elle in French)
This blog is written by an insane activist who's claiming to be an animal.
Sell it to someone else.
>Maybe you should grow the f*ck up?
Not very "personal" when this form of psychopathy oozes through on every step and happens to be on full public display. But that's the intent, for both of youse. And silent capitulation is the minimal subscription.
But that's not happening. Eat sh*t.
You actively had to go looking for something about this blog poster to be angry about. It must be painful to live that way, needing to find fault in people so desperately. Therapy might do you some good my friend.
We’re talking about a furry, not a war criminal. “Silent capitulation” lol. You’re hilarious.
passively confirming
And the b*stiality-adjacent fetish defender continues to cry out in pain as he tries to hit me. As if there was any doubt about -this- being part and parcel of your ideology.
VNPT is a residential / mobile ISP, but they also run datacentres (e.g. [1]) and offer VPS, dedicated server rentals, etc.
And Bunny Communications (AS5065) is a pretty obvious 'residential' VPN / proxy provider trying to trick IP geolocation / reputation providers. Just look at the website [2], it's very low effort. They have a page literally called 'Sample page' up and the 'Blog' is all placeholder text, e.g. 'The Art of Drawing Readers In: Your attractive post title goes here'.
Another hint is that some of their upstreams are server-hosting companies rather than transit providers that a consumer ISP would use [3].
[1] https://vnpt.vn/doanh-nghiep/tu-van/vnpt-idc-data-center-gia... [2] https://bunnycommunications.com/ [3] https://bgp.tools/as/5065#upstreams
I do wonder though. Content scrapers that truly value data would stand to benefit from deploying heuristics that value being as efficient as possible in the info per query space.
But there is a different class of player that gains more from nuisance maximization: dominant anti-bot/ddos service providers, especially those with ambitions of becoming the ultimate internet middleman.
My personal services are only accessible from my own LAN or via a VPN. If I wanted to share it with a few friends I would use something like Tailscale and invite them to my tailnet. If the number of people grows I would put everything behind a login-wall.
This of course doesn't cover services I genuinely might want to be exposed to the public. In that case the fight with the bots is on, assuming I decide I want to bother at all
"This is depressing. Profoundly depressing. i look at the statistics board for my reverse-proxy and i never see less than 96.7% of requests classified as bots at any given moment. The web is filled with crap, bots that pretend to be real people to flood you. All of that because i want to have my little corner of the internet where i put my silly little code for other people to see."
For what it's worth, most requests kept coming in for ~4 days after -everything- returned plain 404 errors. Millions of them. And there are still some now, weeks later...
That’s the kind of result that ensures we’ll be seeing anime girls all over the web in the near future.
> In August 2024, one of my roommates and partners messaged the apartment group chat, saying she noticed the internet was slow again at our place
Also, have you considered Captchas for first contact/rate-limit?
If you have smart scrapers, then good luck. I recall that bot farms use pre-paid SIM cards for their data connections so that their traffic comes from a good residential ASN. They also have a lot of IPs and overall well-made headless browsers with JS support. Then it's a battle of JS quirks where the official implementation differs from headless one.
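On the rate-limit suggestion: for a small self-hosted forge, a per-IP token bucket in front of the expensive routes is cheap to add. A minimal sketch with golang.org/x/time/rate; the limits are illustrative, and as noted above it mostly deters lazy scrapers, not residential-proxy fleets:

```go
package main

import (
	"net"
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// visitors maps client IP -> token bucket. A real deployment would also
// evict idle entries; omitted here to keep the sketch short.
var (
	mu       sync.Mutex
	visitors = map[string]*rate.Limiter{}
)

func limiterFor(ip string) *rate.Limiter {
	mu.Lock()
	defer mu.Unlock()
	l, ok := visitors[ip]
	if !ok {
		// Illustrative numbers: 1 request/second sustained, bursts of 10.
		l = rate.NewLimiter(1, 10)
		visitors[ip] = l
	}
	return l
}

func rateLimit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ip, _, err := net.SplitHostPort(r.RemoteAddr)
		if err != nil {
			ip = r.RemoteAddr
		}
		if !limiterFor(ip).Allow() {
			http.Error(w, "slow down", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})
	http.ListenAndServe(":8080", rateLimit(mux))
}
```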
I have the same problem, but I decided to maintain ASN lists of known spammers [1] and combine that with my eBPF-based firewall that just drops their connections before they reach the kernel [2].
So my websites, wikis and other things are protected by the same firewall architecture, for which I can deploy a unified "blockmap", so to speak. Probably gonna open source the dashboard for maintaining that over the holidays, too, as I'm trying to make everything combinable in a plug-and-play sense for Go backends, similar to my markdown editor UI [3].
I also open sourced my LPM hashset map library, which can process large quantities of prefixes because it's way faster than LPM tries (read as: it takes less than 100ms to process all RIR and WHOIS data, compared to around an hour with LPM tries) [4].
[1] https://github.com/cookiengineer/antispam
[2] https://github.com/tholian-network/firewall
[3] https://github.com/cookiengineer/golocron
[4] https://github.com/cookiengineer/golpm