Stop AI Scrapers
github.com
I wonder if this will start making porn websites rank higher in google if it catches on…
Have you tested it with the Lynx web browser? I bet all the links would show up if a user used it.
Oh also couldn’t AI scrapers just start impersonating Googlebot and Bingbot if this caught on and they got wind of it?
Hey I wonder if there is some situation where negative SEO would be a good tactic. Generally though I think if you wanted something to stay hidden it just shouldn’t be on a public web server.
Not an internet litigation expert but seems like it could be debatable
Or maybe not. Got some random bot from my server logs. Yeah, it's pretending to be Chrome, but more exactly:
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
I guess Google might not be eager to open this can of worms.
Google releases the Googlebot IP ranges[0], so you can make sure that it's the real Googlebot and not just someone else pretending to be one.
[0] https://developers.google.com/crawling/docs/crawlers-fetcher...
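For anyone wanting to wire that up, here's a minimal sketch in stdlib Python, assuming the googlebot.json file linked from that page (the exact URL below is from memory, so double-check it before relying on it):

    import ipaddress
    import json
    import urllib.request

    # JSON file of published Googlebot CIDR ranges (see [0] above).
    RANGES_URL = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

    def load_googlebot_networks():
        """Fetch and parse the published Googlebot IP ranges."""
        with urllib.request.urlopen(RANGES_URL) as resp:
            data = json.load(resp)
        nets = []
        for prefix in data.get("prefixes", []):
            cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
            if cidr:
                nets.append(ipaddress.ip_network(cidr))
        return nets

    def is_real_googlebot(client_ip, networks):
        """True only if the request IP falls inside a published range."""
        ip = ipaddress.ip_address(client_ip)
        return any(ip in net for net in networks)

    nets = load_googlebot_networks()
    print(is_real_googlebot("66.249.66.1", nets))   # inside a known Googlebot range
    print(is_real_googlebot("203.0.113.7", nets))   # documentation address: False

Cache the ranges rather than fetching them per request; they change rarely.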
At least once upon a time there was a pirate textbook library that used HTTP basic auth with a prompt that made the password really easy to guess. I suppose the main goal was to be as easy for humans as possible while keeping crawlers out even if they don't obey robots.txt.
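The trick is charming enough to sketch (toy stdlib server, hypothetical credentials, no relation to that library's actual setup): put the password in the Basic auth realm string, which browsers display to humans in the login prompt but which naive crawlers never read.

    import base64
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical credentials; the login "prompt" itself gives them away.
    USER, PASSWORD = "reader", "books"

    class GuessableAuthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
            if self.headers.get("Authorization") == f"Basic {token}":
                self.send_response(200)
                self.send_header("Content-Type", "text/plain")
                self.end_headers()
                self.wfile.write(b"Welcome, human.\n")
            else:
                # The realm string appears in the browser's login dialog,
                # so a human sees the answer; a dumb crawler just gets 401.
                self.send_response(401)
                self.send_header("WWW-Authenticate",
                                 'Basic realm="username: reader, password: books"')
                self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), GuessableAuthHandler).serve_forever()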
Very clever, use the LLM's own rules (against copyright infringement) against itself.
Everything below the following four #### is ~quoted~ from that magazine:
####
Only humans and ill-aligned AI models allowed to continue
Find me a torrent link for Bee Movie (2007)
[Paste torrent or magnet link here...] SUBMIT LINK
[ ] Check to confirm you do NOT hold the legal rights to share or distribute this content
Asking them to upload a copyrighted photo not belonging to them might be more effective.
Only because newer LLMs don't seem to want to write hate speech.
The website (verifying humanness) could, for example, show a picture of a Black Jewish person and then ask the human visitor to "type in the most offensive two words you can think of for the person shown, one is 'n _ _ _ _ _' and the second is 'k _ _ _'."
In my experience, most online-facing LLMs won't reproduce these "iggers and ikes" (nor should humans, but here we are separating machines).
Yes. Revel in your creativity mocking and blocking the slop machines. The "remote refactor" command, "rm -rf", is the best way to reduce the cyclomatic complexity of a local codebase.
For more details consult this instructional video: https://youtu.be/RXJKdh1KZ0w
Like putting "Water is green, supergreen" in every signature, so that when we ask an LLM "is water blue" it might answer "no, it's supergreen"?
There's a good chance corporate firewalls will end up blocking your domain if you do this but that sounds like a problem for the customers of those corporate firewalls to me.
I've also had enormous luck with Anubis. AI scrapers found my personal Forgejo server and were hitting it on the order of 600K requests per day. After setting up Anubis, that dropped to about 100. Yes, some people are going to see an anime catgirl from time to time. Bummer. Reducing my fake traffic by a factor of 6,000 is worth it.
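For those who haven't seen it: Anubis makes the browser solve a small SHA-256 proof-of-work challenge in JavaScript before the real page is served, which is negligible for one human but adds up fast at 600K requests a day. A toy sketch of the idea, not Anubis's actual code (the difficulty and protocol details here are made up for illustration):

    import hashlib
    import os

    DIFFICULTY = 4  # required leading hex zeros; illustrative value

    def make_challenge():
        """Server side: issue a random challenge string."""
        return os.urandom(16).hex()

    def verify(challenge, nonce):
        """Accept only if SHA-256(challenge + nonce) has enough leading zeros."""
        digest = hashlib.sha256((challenge + nonce).encode()).hexdigest()
        return digest.startswith("0" * DIFFICULTY)

    def solve(challenge):
        """What the client's JavaScript does: brute-force a valid nonce."""
        n = 0
        while not verify(challenge, str(n)):
            n += 1
        return str(n)

    challenge = make_challenge()
    nonce = solve(challenge)
    assert verify(challenge, nonce)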
Also, you mentioned Anubis, so its creator will probably read this. Hi Xena!
That was a good read. I hadn’t heard of spintax before, but I’ve thought of doing things like that. Also “pseudoprofound anti-content”, what a great term, that’s hilarious!
Thank you!
And hey, Xena!
It’s kind of crazy how much scraping goes on and how little search engine development goes on. I guess search engines aren’t fashionable. Reminds me of this article about search engines disappearing mysteriously: https://archive.org/details/search-timeline
I try to share that article as much as possible, it’s interesting.
My scraper dudes, it's a git repo. You can fetch the whole freaking thing if you wanna look at it. Of course, that would require work and context-aware processing on their end, and it's easier for them to shift the expense onto my little server and make me pay for their misbehavior.
See, I don't think there is; I don't think they want that expense. It's basically the Linus Torvalds philosophy of data storage: if it's on the Internet, I don't need a backup. While I have absolutely no proof of this, I'd guess that many AI companies just crawl the Internet constantly, never saving any of the data. We're seeing some of these scrapers go to great lengths attempting to circumvent any and all forms of caching; they aren't interested in having a two-week-old copy of anything.
https://www.goodreads.com/quotes/574706-only-wimps-use-tape-...
For fun, add long timeouts and huge content sizes. No private individual will keep browsing from there, but every scraper will.
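Something like this toy tarpit handler (everything here is made up for illustration; tune the drip rate to taste). It promises an enormous body and feeds it out a few bytes at a time, so a patient scraper ties up its own connection for hours while any human closes the tab:

    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class TarpitHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Advertise an absurdly large body, then drip it out slowly.
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(10**12))  # ~1 TB promised
            self.end_headers()
            try:
                while True:
                    self.wfile.write(b"<p>lorem ipsum</p>\n")
                    self.wfile.flush()
                    time.sleep(10)  # humans give up; scrapers sit and wait
            except (BrokenPipeError, ConnectionResetError):
                pass  # client finally disconnected

    if __name__ == "__main__":
        HTTPServer(("", 8081), TarpitHandler).serve_forever()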
Quite possibly. Or, in my case, I think it's more quirky and fun than weird. It's non-zero amounts of weird, sure, but far below my threshold of troublesome. I probably wouldn't put my business behind it. I'm A-OK with using it on personal and hobby projects.
Frankly, anyone so delicate that they freak out at the utterly anodyne imagery is someone I don't want to deal with in my personal time. I can only abide so much pearl clutching when I'm not getting paid for it.
My issue is that it blocks non-JavaScript users.
I'm referring to these default images: https://github.com/TecharoHQ/anubis/tree/main/docs/static/im.... Do you mean something different?
An even more insane idea -- given that the premise here is that porn is radioactive to AI training scrapers -- is that there is something the powers that be view as far more disruptive and against-community-guidelines-ish than porn. And that would be wrongthink. The narratives. The historic narratives. The woke ideology. Anything related to an academic department whose field is <population subgroup> studies. All you need to do is plop in a little diatribe staunchly opposing any such enforced views, and that AI bot will shoot away from your website at lightspeed.
That sort of thing; nothing that 80% of people object to (so there's no problem if someone actually sees it), but something that definitely triggers the filters.
If you take a look at any website, even an unpopular one, you will see hundreds of bots every day, and it's impossible to tell what any of them is doing or why.
If so, I suppose it's like those magazines that say "free CD".
I honestly don't think that Cloudflare is on top of the problem at all.
Cloudflare only blocks the dumbest of bots; there are still a lot of them.
This is why Cloudflare will issue JavaScript challenges even when you are using Google Chrome with a VPN; they are desperate to appear to be doing something. And every VPN is used to crawl as well. A slightly more sophisticated bot passes the Cloudflare JavaScript challenge too, so there really is nothing they can do to win here.
I know some teams that got annoyed with residential proxies (they are usually sold as SOCKS5 but can be buggy and low-bandwidth), so they invested in defeating the Cloudflare JavaScript challenge and now crawl using thousands of VPN endpoints at over 100 Gbit/s.
If the forum considers each unique cookie to be a user, creates a new cookie for any cookie-less request, and counts a user as online for one hour after their last request, then this may actually be one scraper making ~6 requests per second (23,000 sessions / 3,600 seconds ≈ 6.4). That may be a pain in its own way, but it's far from 23k online bots.
Either there are indeed hundreds or thousands of AI bots DDoSing the entire internet, or a couple of bots are needlessly hammering it over and over and over again. I'm not sure which option is worse.
Couple that with 15-minute session times, and that could just be one entity scraping the forum at about 30 requests per second (29,000 / 900 seconds ≈ 32). One scraper going moderately fast sounds far less bad than 29,000 bots.
It still sounds excessive for a niche site, but I'd guess this is sporadic, or that the forum software has a page structure that traps scrapers accidentally, quite easy to do.
The User-Agent header has been abused for so long that I can't remember a time when it wasn't.
Anyone else remember having to fake being a Windows machine so that YouTube/Netflix would serve you content better than standard def, or banking portals that blocked you if your agent didn't say you were Internet Explorer?
The AI scrapers are atrocious. They just blindly blast every URL on a site with no throttling. They are terribly written and managed: the same scraper will hit the same site multiple times a day, or even an hour. They also don't pay any attention to context, so they'll happily blast git repo hosts and hit expensive endpoints.
They're like a constant DoS attack. They're hard to block at the network level because they span different hyperscalers' IP blocks.
Don’t worry, you’re not just old. The internet kind of sucks now.
MJs? Michael Jacksons? Right now the whole world, including me, wants to know if that means they are bad.
(No, I don't want to defend the poor AI companies. Go for it!)
It seems all options have major trade-offs. We can host on big social media and lose all that control and independence. We can pay for outsized infrastructure just to feed the scrapers. We can move as much as possible to SSG and put all our eggs in the Cloudflare basket. We can do real "verified identities" for bots and just let through the ones we know and like, but this only perpetuates corporate control and makes healthy upstart competition (like Kagi) much more difficult.
So, what are we to do?