The Web Does Not Need Gatekeepers: Cloudflare's New "Signed Agents" Pitch
Mood: heated
Sentiment: mixed
Category: other
Key topics: Cloudflare's new "signed agents" proposal aims to verify AI agents acting on behalf of humans, sparking debate on the need for gatekeepers on the web and the implications for security, privacy, and the open web.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 12m after posting
Peak period: 141 comments (Day 1)
Average per period: 40 comments
Based on 160 loaded comments
Key moments
- Story posted: Aug 29, 2025 at 12:35 PM EDT (3 months ago)
- First comment: Aug 29, 2025 at 12:47 PM EDT (12m after posting)
- Peak activity: 141 comments in Day 1 (the hottest window of the conversation)
- Latest activity: Sep 1, 2025 at 4:43 PM EDT (3 months ago)
It's not just computers anymore. Web-enabled CCTV and doorbell cameras are all culprits too.
That would be fine, you could walk away and go home, but if you're going to drive on their digital highways, you're going to need "insurance" just to protect you from everyone else.
Ongoing multi-nation WWIII-scale hacking and infiltration campaigns of infrastructure, AI bot crawling, search company and startup crawling, security researchers crawling, and maybe somebody doesn't like your blog and decides to rent a botnet for a week or so.
Bet your ISP shuts you off before then to protect themselves. (Happens all the time via BGP blackholing, DDoS scrubbing services, BGP FlowSpec, etc).
It's hard to assess the validity of this versus Cloudflare having a really good marketing department.
I've used neither, so I can't say, but I've also never seen anyone truly explain why/why-not.
Their pricing page says:
No-nonsense Free Tier
As part of the AWS free Usage Tier you can get started with Amazon CloudFront for free.
Included in Always Free Tier
- 1 TB of data transfer out to the internet per month
- 10,000,000 HTTP or HTTPS requests per month
- 2,000,000 CloudFront Function invocations per month
- 2,000,000 CloudFront KeyValueStore reads per month
- 10 Distribution Tenants
- Free SSL certificates
- No limitations, all features available
I don't care about any of those fancy serverless services. I am just talking about the cheapest CDN.
> 1 TB of data
Someone can rent a 1Gbps server for cheap (under $50 on OVH) and pull 330TB in a month from your site. That's about $30k of egress on AWS if you don't do anything to stop it.
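For anyone who wants to check that figure, the back-of-the-envelope math looks like the sketch below. The egress price is an assumption (roughly $0.09/GB is a commonly cited AWS internet egress rate); the point is only the order of magnitude.

```python
# Back-of-the-envelope: what one saturated 1 Gbps box can cost you in cloud egress.
GBPS = 1                                  # attacker's sustained bandwidth (gigabits/s)
SECONDS_PER_MONTH = 30 * 24 * 3600        # ~2.59 million seconds
EGRESS_PRICE_PER_GB = 0.09                # assumption: typical AWS internet egress rate

bytes_per_month = GBPS / 8 * 1e9 * SECONDS_PER_MONTH   # bits -> bytes
tb_per_month = bytes_per_month / 1e12
egress_cost = (bytes_per_month / 1e9) * EGRESS_PRICE_PER_GB

print(f"{tb_per_month:.0f} TB pulled, ~${egress_cost:,.0f} in egress")
# -> roughly 324 TB and ~$29,000, in line with the figures above
```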
1) hard block without having done any requests yet. No clue why. Same browser (Burp's built-in Chromium), same clean state, same IP address, but one person got a captcha and the other one didn't. It would just say "reload the page to try again" forever. This person simply couldn't use the site at all; not sure if that would happen if you're on any other browser, but since it allowed the other Burp Suite browser, that doesn't seem to be the trigger for this perma-ban. (The workaround was to clone the cookie state from the other consultant, but normal users won't have that option.)
2) captcha. I got so many captchas, like every 4th request. It broke the website (async functionality) constantly. At some point I wanted to try a number of passwords for an admin username that we had found and, to my surprise, it allowed hundreds of requests without captcha. It blocks humans more than this automated bot...
3) "this website is under construction" would sometimes appear. Similar to situation#1, but it seemed to be for specific requests rather than specific persons. Inputting the value "1e9" was fine, "1e999" also fine, but "1e99" got blocked, but only on one specific page (entering it on a different page was fine). Weird stuff. If it doesn't like whatever text you wrote on a support form, I guess you're just out of luck. There's no captcha or anything you can do about it (since it's pretending the website isn't online at all). Not sure if this was AWS or the customer's own wonky mod_security variant
I dread to think what it would be like if I were a customer of this place and urgently needed them (it's not a regular webshop but something you might need in a pinch) and the only thing it ever gave me was "please reload the page to try again". Try what again?? Give me a human to talk to, any number to dial!
Also mind that not every request we make is malicious. A lot of it is also seeing what's even there, doing baseline requests, normal things. I didn't get the impression that I got blocked more on malicious requests than normal browsing at all (see also the part where a bot could go to town on a login form while my manual navigation was getting captchas)
It wouldn't just randomly block something.
It must be based on something, no?
Also, cheaply rate limiting malicious web clients should be trivial to accomplish with competent web tooling (i.e., on your own servers). If this seems out of scope or infeasible, you might be using the wrong tools for the job.
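A minimal sketch of the kind of per-client rate limiting meant here, as a token bucket; the thresholds and the in-memory store are illustrative assumptions, not a production design:

```python
import time
from collections import defaultdict

class TokenBucketLimiter:
    """Tiny in-memory per-client token bucket (illustrative only)."""
    def __init__(self, rate: float = 5.0, burst: int = 20):
        self.rate = rate          # tokens refilled per second
        self.burst = burst        # maximum bucket size (allowed burst)
        self.buckets = defaultdict(lambda: (float(burst), time.monotonic()))

    def allow(self, client_ip: str) -> bool:
        tokens, last = self.buckets[client_ip]
        now = time.monotonic()
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.buckets[client_ip] = (tokens, now)
            return False          # over the limit: reply 429 or drop
        self.buckets[client_ip] = (tokens - 1, now)
        return True

limiter = TokenBucketLimiter(rate=5, burst=20)
print(limiter.allow("203.0.113.7"))   # True until the bucket drains
```

In practice most people would lean on their web server's or reverse proxy's built-in rate limiting rather than rolling their own, but the principle is the same.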
On one hand, these companies announce themselves as sophisticated, futuristic, and highly valued; on the other, we see rampant incompetence, to the point that webmasters everywhere are debating the best course of action.
That distinction requires you to take companies which benefit from amassing as much training data as possible at their word when they pinky swear that a particular request is totally not for training, promise.
If I couldn't easily cut off the majority of that bot volume I probably would've shut down the app entirely.
Gee, if only we had, like, one central archive of the internet. We could even call it the internet archive.
Then, all these AI companies could interface directly with that single entity on terms that are agreeable.
Not even news articles from the top 10 news websites in my country are usually indexed there.
To be fair, an accurate measurement would need to consider how many of those CPU cycles would be spent by the human user who is driving the bot. From that perspective, maybe the scrapers can "make up for it" by crawling efficiently, i.e. avoiding loading tracker scripts, images, etc. unless necessary to solve the query. This way they'll still burn CPU cycles, but at least it'll be fewer cycles than a human user with a headful browser instance.
However, what is more important to me than AI agents is that someone might want to download single files with curl, or use browsers such as Lynx, etc., and this should work.
>We need protocols, not gatekeepers.
But until we have working protocols, many webmasters literally do need a gatekeeper if they want to realistically keep their site safe and online.
I wish this weren't the case, but I believe the "protocol" era of the web was basically ended when proprietary web 2.0 platforms emerged that explicitly locked users in with non-open protocols. Facebook doesn't want you to use Messenger in an open client next to AIM, MSN, and IRC. And the bad guys won.
But like I said, I hope I'm wrong.
The funny thing is that this blog post is complaining about a proposed protocol from Cloudflare (one which will identify bots so that good bots can be permitted). The signup form is just a method to ask Cloudflare (or any other website owner/CDN) to be categorized as a good bot.
It's not a great protocol if you're in the business of scraping websites or selling people bots to access websites for them, but it's a great protocol for people who just want their website to work without being overwhelmed by the bad side of the internet.
The whitelist approach Cloudflare takes isn't good for the internet, but for website owners who are already behind Cloudflare, it's better than the alternative. Someone will need to come up with a better protocol that also serves the website owners' needs if they want Cloudflare to fail here. The AI industry simply doesn't want to cooperate, so their hand must be forced, and only companies like Cloudflare are powerful enough to accomplish that.
If this were Cloudflare inserting itself into some centralized routing layer of the internet and saying everything must do X, that would be a lot more alarming. But at the end of the day the internet is decentralized, and site owners are the ones choosing to use this capability.
Additionally, I don't think that I, as an individual website owner, would actually want to (or be capable of) keeping track of which agents are good and which are bad; Cloudflare doing this is helpful to me as a site owner as long as they act in good faith. And the moment they stop acting in good faith, I can disable them. This is definitely a problem right now: unrestricted access for bots means bad bots are taking up many cycles, raising costs and taking resources away from real users.
Perhaps a way to serve ads through the agents would be good enough. I'd prefer that to be some open protocol rather than something controlled by a company.
https://blog.cloudflare.com/perplexity-is-using-stealth-unde...
What did you say that relates to Perplexity being one of the reasons that Cloudflare and their customers have decided they need better protection from abusive scrapers?
Websites choose their own gatekeepers; Cloudflare is just one provider.
I also appreciate the AI search results a bit when I'm looking for something very specific (like what the YAML definition for a Docker Swarm deployment constraint looks like), because the AI just gives me the snippet, while the search results are 300 Medium blog posts about how to use Docker and none of them explain the variables or what each does. Even the official Docker documentation website is a mess to navigate and find anything relevant!
The problem isn't just that ads can't be served. It's that every technical measure to attempt to block their service produces new ways of misleading website owners and the services they use. Perplexity refuses any attempt at abuse detection and prevention from their servers.
None of this would've been necessary if companies like Perplexity would've just acted like a responsible web service and told their customers "sorry, this website doesn't allow Perplexity to act on your behalf".
The open protocol you want already exists: it's the user agent. A responsible bot will set the correct user agent, maybe follow the instructions in robots.txt, and leave it at that. Companies like Perplexity (and many (AI) scrapers) don't want to participate in such a protocol. They will seek out and abuse any loopholes in any well-intended protocol anyone can come up with.
I don't think anyone wants Cloudflare to have even more influence on the internet, but it's thanks to the growth of inconsiderate AI companies like Perplexity that these measures are necessary. The protocol Cloudflare proposes is open (it's just a signature); the problem people have with it is that they have to ask Cloudflare nicely to permit website owners to track and prevent abuse from bots. For any Azure-gated websites, your bot would need to ask permission there as well, as with Akamai-gated websites, and maybe even individual websites.
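For reference, the existing "protocol" described above (an honest user agent plus robots.txt) is about as lightweight as it gets. A minimal sketch of a well-behaved client, where the agent name and info URL are placeholders:

```python
import urllib.request
import urllib.robotparser
from urllib.parse import urlsplit

# Placeholder identity: a real bot would link to a page describing itself.
USER_AGENT = "ExampleAgent/1.0 (+https://example.com/bot-info)"

def polite_fetch(url: str) -> bytes | None:
    parts = urlsplit(url)

    # 1. Ask robots.txt whether our declared identity may fetch this path.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    if not rp.can_fetch(USER_AGENT, url):
        return None  # the responsible answer: tell the user the site opted out

    # 2. Identify ourselves honestly instead of spoofing a browser.
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

The complaint in this subthread is precisely that many scrapers skip both steps.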
A new protocol is a technical solution. Technical solutions work for technical problems. The problem Cloudflare is trying to solve isn't a technical problem; it's a social problem.
I’m not here to propose a solution. I’m here as an end-user saying I won’t go back to the old experience which is outdated and broken.
Cloudflare is not the gatekeeper; it's the owner of the site that blocks Perplexity who is "gatekeeping" you. You're telling me that's not right?
But they can't insert themselves without the owner directly adding them. So it's the owner that's doing the gatekeeping (regardless of whether it's Cloudflare or iptables rules).
I think all you AI people blaming Cloudflare are just trying to deflect from the actual problem which is more and more owners don't want AI crawlers going through their content.
If Cloudflare disappears, who are you going to blame next, the iptables developers, maybe Linus Torvalds?
Nobody is dying because artists are protecting their art
If these companies are adding extra code to bypass artists trying to protect their intellectual property from mimicry then that is an obvious and egregious copyright violation
More likely it will push these companies to actually pay content creators for the content they work on to be included in their models.
Centralization bad, yada yada. But if Cloudflare can get most major AI players to participate, then convince the major CDNs to also participate... ipso facto columbo oreo... standard.
You can see it in the web vs. mobile apps.
Many people may not see a problem with walled gardens, but the reality is that we have much less innovation on mobile than on the web, because anyone can spin up a web server, versus having to publish an app in the App Store (Apple).
Regulation.
Make it illegal to request the content of a webpage by crawler if the website operator doesn't explicitly allow it via robots.txt. Institute a government agency tasked with enforcement. If you as a website operator can show that traffic came from bots, you can open a complaint with the agency, and they take care of shaking painful fines out of the offending companies. Force cloud hosts to keep books on who was using which IP addresses. Will it be a 100% fix? No. Will it have a massive chilling effect if done well? Absolutely.
The companies buying these services are buying them from other companies. Countries or larger blocs like the EU can exert significant pressure on such companies by declaring the use of such services illegal when interacting with websites hosted in the country or bloc, or by companies in them.
> But how do you stop an EU company paying a Moldovan company (that has existed for 10 days) for "internet services", that pays a Brazilian company, that pays a Russian company to do the actual residential scraping?
The same comparison could be made with money laundering, and yes, it's a real and sizable issue. Yet the majority of money is not laundered. How does the EU company make sure it will not be held liable, especially the people who made the decision? Maybe on a technical level the perfect crime is possible, and not getting caught is possible or even likely given a certain approach. But the uncertainty around it will dissuade many, not all. The same goes for companies selling the services: you might think you have a foolproof way to circumvent the measures put in place, but what if not, and the government comes knocking?
The internet is too big and distributed to regulate. Nobody will agree on what the rules should be, and certain groups or countries will disagree in any case and refuse to enforce them.
Existing regulation rarely works, and enforcement is half-assed at best. Ransomware is regulated and illegal, but we see articles about major companies getting infected all the time.
I don't think registering with Cloudflare is the answer, but regulation definitely isn't the answer.
That said, what I am missing from these articles is an actual solution. Obviously we don't want Cloudflare to become an internet gatekeeper. It's a bad solution. But: it's a bad solution to an even worse problem.
Alternatives do exist, even decentralised ones, in the form of remote attestation ("can't access this website without secure boot and a TPM and a known-good operating system"), paying for every single visit or for subscriptions to every site you visit (which leads to centralisation because nobody wants a subscription to just your blog), or self-hosted firewalls like Anubis that mostly rely on AI abuse being the result of lazy or cheap parties.
People drinking the AI Kool-Aid will tell you to just ignore the problem, pay for the extra costs, and scale up your servers, because it's *the future*, but ignoring problems is exactly why Cloudflare still exists. If ISPs hadn't ignored spoofing, DDoS attacks, botnets within their network, """residential proxies""", and other such malicious acts, Cloudflare would've been an Akamai competitor rather than a middle man to most of the internet.
(What are some good alternatives to Cloudflare?)
Another way the situation is similar: email delivery is often unreliable and hard to implement due to spam filters. A similar thing seems to be happening to the web.
Not to mention the big cloud providers are unhinged with their egress pricing.
I always wonder why this status quo persisted even after Cloudflare. Their pricing is indeed so unhinged, that they're not even in consideration for me for things where egress is a variable.
Why is egress seemingly free for Cloudflare or Hetzner but feels like they launch spaceships at AWS and GCP every time you send a data packet to the outside world?
But the reality is: how can someone small protect their blog or content from AI training bots? E.g., do they just blindly trust that someone is sending an agent rather than a training bot and super-duper respecting robots.txt? Get real...
Or, fine, what if they do respect robots.txt, but they buy the data, which may or may not have been shielded through liability layers via "licensed data"?
Unless you're Reddit, X, Google, or Meta, with scary unlimited-budget legal teams, you have no power.
Great video: https://www.youtube.com/shorts/M0QyOp7zqcY
I would love that, and make it automated.
A single message from your IP to your router: block this traffic. That router sends it upstream, and it also blocks it. Repeat ad nauseam until the source changes ASN or (if the originator is on the same ASN) it reaches the router nearest the originator, routing table space notwithstanding. Maybe it expires after some auto-expiry, a day or a month or however long your IP lease exists. Plus, of course, a way to query what blocks I've requested and a way to unblock.
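Nothing like this exists as a standard today (operator-level tools such as BGP FlowSpec are the closest thing), but as a thought experiment, a hypothetical block-request message that a home router could forward hop by hop might look like the sketch below. Every field name here is invented for illustration.

```python
import json
import time

def make_block_request(requester_ip: str, offender_ip: str, ttl_seconds: int = 86400) -> str:
    """Hypothetical 'please drop this traffic upstream' message (illustrative only)."""
    return json.dumps({
        "type": "upstream-block-request",            # invented message type
        "requester": requester_ip,                   # who is asking for the block
        "offender": offender_ip,                     # source whose traffic should be dropped
        "expires": int(time.time()) + ttl_seconds,   # auto-expiry, as suggested above
    })

print(make_block_request("198.51.100.10", "203.0.113.99"))
```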
Corporations develop hostile AI agents,
Capable hackers develop anti-AI-agents.
This defeatist attitude of "we have no power".
In general Cloudflare has been pushing DRMization of the web for quite some time, and while I understand why they want to do it, I wish they didn't always show off as taking the moral high ground.
If anything, we've seen a rise in complaints about it just annoying average users.
Having said that, the solution is effective enough: a lightweight proxy component that issues proof-of-work tokens to such bogus requests works well, as various users on HN seem to point out.
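A minimal sketch of that proof-of-work idea, in the same spirit as tools like Anubis; the difficulty is an illustrative assumption:

```python
import hashlib
import os

DIFFICULTY_BITS = 20   # illustrative: ~1 million hash attempts on average per token

def issue_challenge() -> str:
    # Server side: a random challenge sent to the client along with the difficulty.
    return os.urandom(16).hex()

def solve(challenge: str) -> int:
    # Client side: find a nonce whose SHA-256 has DIFFICULTY_BITS leading zero bits.
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int) -> bool:
    # Server side: a single hash check for what cost the client ~a million hashes.
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0

challenge = issue_challenge()
assert verify(challenge, solve(challenge))
```

The asymmetry is the point: verification is one hash, while a client that wants to hammer out thousands of requests per second has to burn roughly a million hashes per token.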
Um, no? Where did you get this strange bit of info?
The original reports say nothing of that sort: https://news.ycombinator.com/item?id=42790252 ; and even the original motivation for Anubis was Amazon's AI crawler: https://news.ycombinator.com/item?id=42750420
(I've seen more posts with the analysis, including one which showed an AI crawler that would identify itself properly, but once it hit the rate limit would switch to a fake user agent from proxies... but I cannot find it now.)
This seems like slogan-based planning with no actual thought put into it.
And if you don’t want to self host, at least try to use services from organisations that aren’t hostile to the open web
To put that in perspective, even if they're sending empty TCP packets, "several billion" pps is 200 to 1800 gigabits of traffic, depending on what you mean by that. Add a cookieless HTTP payload and you're at many terabits per second. The average self hoster is more likely to get struck by lightning than encounter and need protection from this (even without considering the, probably modest, consequences of being offline a few hours if it does happen)
Edit: off by a factor of 60, whoops. Thanks to u/Gud for pointing that out. I stand by the conclusion though: less likely to occur than getting struck by lightning (or maybe it's around equally likely now? But somewhere in that ballpark) and the consequences of being down for a few hours are generally not catastrophic anyway. You can always still put big brother in front if this event does happen to you and your ISP can't quickly drop the abusive traffic
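For anyone who wants to redo that arithmetic, the implied bandwidth is just packets per second times bytes on the wire; the packet sizes and counts below are assumptions used to bracket the range, not measurements:

```python
def flood_bandwidth_gbps(packets_per_second: float, bytes_per_packet: int) -> float:
    """Rough wire bandwidth implied by a packet flood (illustrative arithmetic)."""
    return packets_per_second * bytes_per_packet * 8 / 1e9

# "Several billion" packets per second, from a bare 40-byte TCP/IP packet
# up to a ~84-byte minimum Ethernet frame.
print(flood_bandwidth_gbps(1e9, 40))   # ~320 Gbps
print(flood_bandwidth_gbps(3e9, 84))   # ~2000 Gbps
```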
But another argument against using the easiest choice, the near monopoly, is that we need a diverse, thriving ecosystem.
We don’t want to end up in a situation where suddenly Cloudflare gets to dictate what is allowed on the web.
We have already lost email to the tech giants, try running your own mail sometime. The technical aspect is easy, the problem is you will end up in so many spam folders it’s disgusting.
What we need are better decentralized protocols.
At first, you can use it for less serious stuff until you see how well it works.
Technically it's not very challenging. The problem is the total dominance of a few actors and a lot of spammers.
I have this argument every time self-hosting comes up, and every time I wonder if someone will do it to me to make a point. Or if one of the like million other comments I post upsets someone, or one of the many tools that I host. Yet to happen, idk. It's like arguing whether you need a knife on the street at all times because someone might get angry at a look. It happens, we have a word for it in NL (zinloos geweld, "senseless violence") and tiles in sidewalks (ladybug depictions) and everything, but no normal person actually carries weapons 24/7 (drug dealers surely do, yeah) or has people talk through a middle person.
I'd suspect other self-hosters just see more shit than I do, were it not that nobody ever says it happened to them. The only argument I ever hear is that they want to be "safe" while "self hosting with Cloudflare". Who's really hosting your shit then?
A web site owner published something he really shouldn't have and got hacked. I wound up being a "person of interest" in the resulting FBI investigation because I was the weirdest person in the chat room for the site. I think it drove them crazy I was using Tor so they got somebody to try to entrap me into sharing CP but (1) I'm not interested and (2) know better than that.
Will have to give this a second thought, but as a first one now that I read this: would Cloudflare have helped against the FBI, or against any foreign nation making a child-porn-related request to Cloudflare? Surely not?! A different kind of opsec is surely more relevant there, so I don't know if it's really relevant to "normal", legal self-hosting communities (as opposed to criminal, much less that level of unethical plus criminal), or if there's an aspect I'm missing here.
The only reason you even want to firewall 200 requests per second is that the code downstream of the firewall takes more than 5ms to service a request, so you could also consider improving that. And if you're only getting <5 and your server isn't overloaded then why block anything at all?
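A quick sanity check on where that 5 ms figure comes from, assuming a single worker:

```python
requests_per_second = 200
service_time_seconds = 0.005          # 5 ms per request
utilization = requests_per_second * service_time_seconds
print(utilization)  # 1.0 -> one worker is fully saturated once requests take 5 ms each
```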
How much additional tax money should I spend at work so the AI scum can make 200 searches per second?
Humans and 'nice' bots make about 5 per second.
It's getting really, really ugly out there.
[1]: https://codeberg.org/robots.txt#:~:text=Disallow:%20/.git/,....
You think Codeberg would sue you?
But it's the same thing with random software from a random nobody that has no license, or has a license that's not open-source: If I use those libraries or programs, do I think they would sue me? Probably not.
What legal teeth I would advocate would be targeted to crawlers (a subset of bot) and not include your usage. It would mandate that Big Corp crawlers (for search indexing, AI data harvesting, etc.) be registered and identify themselves in their requests. This would allow serverside tools to efficiently reject them. Failure to comply would result in fines large enough to change behavior.
Now that I write that out: if such a thing were to come to pass and it was well received, I do worry that Congress would foam at the mouth to expand it to bots more generally, Microsoft-Uncertified-Devices, etc.
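A sketch of what "efficiently reject them" server-side could look like under such a mandate; the Crawler-ID header and the blocklist are hypothetical, not an existing standard:

```python
# Hypothetical: a mandated "Crawler-ID" header plus a public registry of
# registered crawlers would make rejection a cheap dictionary lookup.
BLOCKED_CRAWLERS = {"example-ai-harvester", "example-search-indexer"}  # per-site policy

def handle_request(headers: dict[str, str]) -> int:
    crawler_id = headers.get("Crawler-ID")   # hypothetical mandated header
    if crawler_id is None:
        return 200                            # treat as an ordinary user agent
    if crawler_id.lower() in BLOCKED_CRAWLERS:
        return 403                            # registered, but not welcome here
    return 200

print(handle_request({"Crawler-ID": "example-ai-harvester"}))  # 403
```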
If it's too loose and similar to "wanted traffic is how the authors intend the website to be accessed, unwanted traffic is anything else", that's an argument that can be used against adblocks, or in favor of very specific devices like you mention. Might even give slightly more teeth to currently-unenforceable TOS.
If it's too strict, it's probably easier to find loopholes and technicalities that just lets them say "technically it doesn't match the definition of unwanted traffic".
Even if it's something balanced, I bet bigcorp lawyers will find a way to twist the definitions in their favor and set a precedent that's convenient for them.
I know this is a mini-rant rather than a helpful comment that tries to come up with a solution, it's just that I'm pessimistic because it seems the internet becomes a bit worse day by day no matter what we try to do :c
Robots.txt is meant for crawlers, not user agents such as a feed reader or git client
But not everyone thinks that's the purpose of robots.txt. Example, quoting Wikipedia[1] (emphasis mine):
> indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.
Quoting the linked `web robots` page[2]:
> An Internet bot, web robot, robot, or simply bot, is a software application that runs automated tasks (scripts) on the Internet, usually with the intent to imitate human activity, such as messaging, on a large scale. [...] The most extensive use of bots is for web crawling, [...]
("usually" implying that's not always the case; "most extensive use" implying it's not the only use.)
Also a quick HN search for "automated robots.txt"[3] shows that a few people disagree that it's only for crawlers. It seems to be only a minority, but the search results are obviously biased towards HN users, so it could be different outside HN.
Besides all this, there's also the question of whether web scraping (not crawling) should also be subject to robots.txt or not; where "web scraping" includes any project like "this site has useful info but it's so unusable that I made a script so I can search it from my terminal, and I cache the results locally to avoid unnecessary requests".
The behavior of alternative viewers like Nitter could also be considered web scraping if they don't get their info from an API[4], and I don't know if I'd consider Nitter the bad actor here.
But yeah, like I said I agree with your comment and your interpretation, but it's not the only interpretation of what robots.txt is meant for.
[1]: https://en.wikipedia.org/wiki/Robots.txt
[2]: https://en.wikipedia.org/wiki/Internet_bot
[3]: https://hn.algolia.com/?dateRange=all&query=automated%20robo...
[4]: I don't know how Nitter actually works or where it gets its data from; I just mention it so it's easier to explain what I mean by "alternative viewer".
- Legal threats are never really effective
Effective solutions are:
- Technical
- Monetary
I like the idea of web as a blockchain of content. If you want to pull some data, you have to pay for it with some kind of token. You either buy that token to consume information if you're of the leecher type, or get some by doing contributions that gain back tokens.
It's more or less the same concept as torrents back in the day.
This should be applied to emails too. The regular person sends what, 20 emails per day max? Say it costs $0.01 per mail; anyone could pay that. But if you want to spam 1,000,000 every day, that becomes prohibitive.
This seems flawed.
Poor people living in 3rd world countries that make like $2.00/day wouldn't be able to afford this.
>But if you want to spam 1,000,000 everyday that becomes prohibitive.
Companies and people with $ can easily pay this with no issues. If it costs $10,000 to send 1M emails that inbox but you profit $50k, it's a non-issue.
https://anchorbrowser.io/blog/page-load-reliability-on-the-t...
Here's to working together to develop a new protocol that works for agents and website owners alike.
Joking aside, I think the ideas and substance are great and sorely needed. However, I can only see the idea of a sort of token chain verification as running into the same UX problems that plagued (plagues?) PGP and more encryption-focused processes. The workflow is too opaque, requires too much specialized knowledge that is out of reach for most people. It would have to be wrapped up into something stupid simple like an iOS FaceID modal to have any hope of succeeding with the general public. I think that's the idea, that these agents would be working on behalf of their owners on their own devices, so it has to be absolutely seamless.
Otherwise, rock on.
329 more comments available on Hacker News