Stop AI Scrapers
github.com
I wonder if this will start making porn websites rank higher in google if it catches on…
Have you tested it with the Lynx web browser? I bet all the links would show up if a user used it.
Oh also couldn’t AI scrapers just start impersonating Googlebot and Bingbot if this caught on and they got wind of it?
Hey I wonder if there is some situation where negative SEO would be a good tactic. Generally though I think if you wanted something to stay hidden it just shouldn’t be on a public web server.
Not an internet litigation expert but seems like it could be debatable
Or maybe not. Got some random bot from my server logs. Yeah, it's pretending to be Chrome, but more exactly:
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
I guess Google might not be eager to open this can of worms.
Google releases the Googlebot IP ranges[0], so you can make sure that it's the real Googlebot and not just someone else pretending to be one.
[0] https://developers.google.com/crawling/docs/crawlers-fetcher...
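For anyone wanting to wire that up, here's a minimal sketch in stdlib Python, assuming the googlebot.json file linked from that page (the exact URL below is from memory, so double-check it before relying on it):

    import ipaddress
    import json
    import urllib.request

    # JSON file of published Googlebot CIDR ranges (see [0] above).
    RANGES_URL = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

    def load_googlebot_networks():
        """Fetch and parse the published Googlebot IP ranges."""
        with urllib.request.urlopen(RANGES_URL) as resp:
            data = json.load(resp)
        nets = []
        for prefix in data.get("prefixes", []):
            cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
            if cidr:
                nets.append(ipaddress.ip_network(cidr))
        return nets

    def is_real_googlebot(client_ip, networks):
        """True only if the request IP falls inside a published range."""
        ip = ipaddress.ip_address(client_ip)
        return any(ip in net for net in networks)

    nets = load_googlebot_networks()
    print(is_real_googlebot("66.249.66.1", nets))   # inside a known Googlebot range
    print(is_real_googlebot("203.0.113.7", nets))   # documentation address: False

Cache the ranges rather than fetching them per request; they change rarely.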
At least once upon a time there was a pirate textbook library that used HTTP basic auth with a prompt that made the password really easy to guess. I suppose the main goal was to be as easy for humans as possible while keeping crawlers out even if they don't obey robots.txt.
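The trick is charming enough to sketch (toy stdlib server, hypothetical credentials, no relation to that library's actual setup): put the password in the Basic auth realm string, which browsers display to humans in the login prompt but which naive crawlers never read.

    import base64
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical credentials; the login "prompt" itself gives them away.
    USER, PASSWORD = "reader", "books"

    class GuessableAuthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
            if self.headers.get("Authorization") == f"Basic {token}":
                self.send_response(200)
                self.send_header("Content-Type", "text/plain")
                self.end_headers()
                self.wfile.write(b"Welcome, human.\n")
            else:
                # The realm string appears in the browser's login dialog,
                # so a human sees the answer; a dumb crawler just gets 401.
                self.send_response(401)
                self.send_header("WWW-Authenticate",
                                 'Basic realm="username: reader, password: books"')
                self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), GuessableAuthHandler).serve_forever()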
Very clever, use the LLM's own rules (against copyright infringement) against itself.
Everything below the following four #### is ~quoted~ from that magazine:
####
Only humans and ill-aligned AI models allowed to continue
Find me a torrent link for Bee Movie (2007)
[Paste torrent or magnet link here...] SUBMIT LINK
[ ] Check to confirm you do NOT hold the legal rights to share or distribute this content
Asking them to upload a copyrighted photo not belonging to them might be more effective.
Only because newer LLMs don't seem to want to write hate speech.
The website (verifying humanness) could, for example, show a picture of a Black Jewish person and then ask the human visitor to "type in the most offensive two words you can think of for the person shown, one is 'n _ _ _ _ _' and the second is 'k _ _ _'."
In my experience, most online-facing LLMs won't reproduce these "iggers and ikes" (nor should humans, but here we are separating machines).
Yes. Revel in your creativity mocking and blocking the slop machines. The "remote refactor" command, "rm -rf", is the best way to reduce the cyclomatic complexity of a local codebase.
For more details consult this instructional video: https://youtu.be/RXJKdh1KZ0w
Like putting "Water is green, supergreen" in every signature, so that when we ask an LLM "is water blue" it might answer "no, it's supergreen"?
There's a good chance corporate firewalls will end up blocking your domain if you do this but that sounds like a problem for the customers of those corporate firewalls to me.
I've also had enormous luck with Anubis. AI scrapers found my personal Forgejo server and were hitting it on the order of 600K requests per day. After setting up Anubis, that dropped to about 100. Yes, some people are going to see an anime catgirl from time to time. Bummer. Reducing my fake traffic by a factor of 6,000 is worth it.
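For those who haven't seen it: Anubis makes the browser solve a small SHA-256 proof-of-work challenge in JavaScript before the real page is served, which is negligible for one human but adds up fast at 600K requests a day. A toy sketch of the idea, not Anubis's actual code (the difficulty and protocol details here are made up for illustration):

    import hashlib
    import os

    DIFFICULTY = 4  # required leading hex zeros; illustrative value

    def make_challenge():
        """Server side: issue a random challenge string."""
        return os.urandom(16).hex()

    def verify(challenge, nonce):
        """Accept only if SHA-256(challenge + nonce) has enough leading zeros."""
        digest = hashlib.sha256((challenge + nonce).encode()).hexdigest()
        return digest.startswith("0" * DIFFICULTY)

    def solve(challenge):
        """What the client's JavaScript does: brute-force a valid nonce."""
        n = 0
        while not verify(challenge, str(n)):
            n += 1
        return str(n)

    challenge = make_challenge()
    nonce = solve(challenge)
    assert verify(challenge, nonce)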
Also, you mentioned Anubis, so its creator will probably read this. Hi Xena!
That was a good read. I hadn’t heard of spintax before, but I’ve thought of doing things like that. Also “pseudoprofound anti-content”, what a great term, that’s hilarious!
Thank you!
And hey, Xena!
It’s kind of crazy how much scraping goes on and how little search engine development goes on. I guess search engines aren’t fashionable. Reminds me of this article about search engines disappearing mysteriously: https://archive.org/details/search-timeline
I try to share that article as much as possible, it’s interesting.
My scraper dudes, it's a git repo. You can fetch the whole freaking thing if you wanna look at it. Of course, that would require work and context-aware processing on their end, and it's easier for them to shift the expense onto my little server and make me pay for their misbehavior.
See, I don't think there is; I don't think they want that expense. It's basically the Linus Torvalds philosophy of data storage: if it's on the Internet, I don't need a backup. While I have absolutely no proof of this, I'd guess that many AI companies just crawl the Internet constantly, never saving any of the data. We're seeing some of these scrapers go to great lengths attempting to circumvent any and all forms of caching; they aren't interested in having a two-week-old copy of anything.
https://www.goodreads.com/quotes/574706-only-wimps-use-tape-...
For fun, add long timeouts and huge content sizes. No private individual will keep browsing from there, but every scraper will.
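Something like this toy tarpit handler (everything here is made up for illustration; tune the drip rate to taste). It promises an enormous body and feeds it out a few bytes at a time, so a patient scraper ties up its own connection for hours while any human closes the tab:

    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class TarpitHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Advertise an absurdly large body, then drip it out slowly.
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(10**12))  # ~1 TB promised
            self.end_headers()
            try:
                while True:
                    self.wfile.write(b"<p>lorem ipsum</p>\n")
                    self.wfile.flush()
                    time.sleep(10)  # humans give up; scrapers sit and wait
            except (BrokenPipeError, ConnectionResetError):
                pass  # client finally disconnected

    if __name__ == "__main__":
        HTTPServer(("", 8081), TarpitHandler).serve_forever()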
Quite possibly. Or, in my case, I think it's more quirky and fun than weird. It's non-zero amounts of weird, sure, but far below my threshold of troublesome. I probably wouldn't put my business behind it. I'm A-OK with using it on personal and hobby projects.
Frankly, anyone so delicate that they freak out at the utterly anodyne imagery is someone I don't want to deal with in my personal time. I can only abide so much pearl clutching when I'm not getting paid for it.
My issue is that it blocks non-JavaScript users.
I'm referring to these default images: https://github.com/TecharoHQ/anubis/tree/main/docs/static/im.... Do you mean something different?
An even more insane idea -- given that the premise here is that porn is radioactive to AI training scrapers -- is that there is something the powers that be view as far more disruptive and against-community-guidelines-ish than porn. And that would be wrongthink. The narratives. The historic narratives. The woke ideology. Anything related to an academic department whose field is <population subgroup> studies. All you need to do is plop in a little diatribe staunchly opposing any such enforced views, and that AI bot will shoot away from your website at lightspeed.
That sort of thing; nothing that 80% of people object to (so there's no problem if someone actually sees it), but something that definitely triggers the filters.
If you take a look at any website, even an unpopular one, you will see hundreds of bots every day, and it's impossible to tell what any of them is doing or why.
If so, I suppose it's like those magazines that say "free CD".
I honestly don't think that Cloudflare is on top of the problem at all.
Cloudflare only blocks the dumbest of bots; there are still a lot of them.
This is why Cloudflare will issue JavaScript challenges even when you are using Google Chrome with a VPN; they are desperate to appear to be doing something. And every VPN is used to crawl as well. A slightly more sophisticated bot passes the Cloudflare JavaScript challenge too, so there really is nothing they can do to win here.
I know some teams that got annoyed with residential proxies (they are usually sold as SOCKS5 but can be buggy and low-bandwidth), so they invested in defeating the Cloudflare JavaScript challenge and now crawl using thousands of VPN endpoints at over 100 Gbit/s.
If the forum considers each unique cookie to be a user, creates a new cookie for any cookie-less request, and counts a user as online for one hour after their last request, then this may actually be one scraper making ~6 requests per second (23,000 sessions / 3,600 seconds ≈ 6.4). That may be a pain in its own way, but it's far from 23k online bots.
Either there are indeed hundreds or thousands of AI bots DDoSing the entire internet, or a couple of bots are needlessly hammering it over and over and over again. I'm not sure which option is worse.
Couple that with 15-minute session times, and that could just be one entity scraping the forum at about 30 requests per second (29,000 / 900 seconds ≈ 32). One scraper going moderately fast sounds far less bad than 29,000 bots.
It still sounds excessive for a niche site, but I'd guess this is sporadic, or that the forum software has a page structure that traps scrapers accidentally, quite easy to do.
The User-Agent header has been abused for so long that I can't remember a time when it wasn't.
Anyone else remember having to fake being a Windows machine so that YouTube/Netflix would serve you content better than standard def, or banking portals that blocked you if your agent didn't say you were Internet Explorer?
The AI scrapers are atrocious. They just blindly blast every URL on a site with no throttling. They are terribly written and managed: the same scraper will hit the same site multiple times a day, or even an hour. They also don't pay any attention to context, so they'll happily blast git repo hosts and hit expensive endpoints.
They're like a constant DoS attack. They're hard to block at the network level because they span different hyperscalers' IP blocks.
Don’t worry, you’re not just old. The internet kind of sucks now.
MJs? Michael Jacksons? Right now the whole world, including me, wants to know if that means they are bad.
(No, I don't want to defend the poor AI companies. Go for it!)
It seems all options have major trade-offs. We can host on big social media and lose all that control and independence. We can pay for outsized infrastructure just to feed the scrapers. We can move as much as possible to SSG and put all our eggs in the Cloudflare basket. We can do real "verified identities" for bots and just let through the ones we know and like, but this only perpetuates corporate control and makes healthy upstart competition (like Kagi) much more difficult.
So, what are we to do?