The Web Does Not Need Gatekeepers: Cloudflare's New "Signed Agents" Pitch
Mood: heated
Sentiment: mixed
Category: other
Key topics: Cloudflare's new "signed agents" proposal aims to verify AI agents acting on behalf of humans, sparking debate on the need for gatekeepers on the web and the implications for security, privacy, and the open web.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 12m after posting
Peak period: 141 comments (Day 1)
Average per period: 40 comments
Based on 160 loaded comments
Key moments
- Story posted: Aug 29, 2025 at 12:35 PM EDT (3 months ago)
- First comment: Aug 29, 2025 at 12:47 PM EDT (12m after posting)
- Peak activity: 141 comments in Day 1 (the hottest window of the conversation)
- Latest activity: Sep 1, 2025 at 4:43 PM EDT (3 months ago)
It's not just computers anymore. Web-enabled CCTV and doorbell cameras are all culprits too.
That would be fine, you could walk away and go home, but if you're going to drive on their digital highways, you're going to need "insurance" just to protect you from everyone else.
Ongoing multi-nation WWIII-scale hacking and infiltration campaigns of infrastructure, AI bot crawling, search company and startup crawling, security researchers crawling, and maybe somebody doesn't like your blog and decides to rent a botnet for a week or so.
Bet your ISP shuts you off before then to protect themselves. (Happens all the time via BGP blackholing, DDoS scrubbing services, BGP FlowSpec, etc).
It's hard to assess the validity of this versus Cloudflare having a really good marketing department.
I've used neither, so I can't say, but I've also never seen anyone truly explain why/why-not.
Their pricing page says:
No-nonsense Free Tier
As part of the AWS free Usage Tier you can get started with Amazon CloudFront for free.
Included in Always Free Tier
- 1 TB of data transfer out to the internet per month
- 10,000,000 HTTP or HTTPS requests per month
- 2,000,000 CloudFront Function invocations per month
- 2,000,000 CloudFront KeyValueStore reads per month
- 10 Distribution Tenants
- Free SSL certificates
- No limitations, all features available
I don't care about any of those fancy serverless services. I am just talking about the cheapest CDN.
> 1 TB of data
Someone can rent a 1Gbps server for cheap (under $50 on OVH) and pull 330TB in a month from your site. That's about $30k of egress on AWS if you don't do anything to stop it.
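For anyone who wants to check that figure, the back-of-the-envelope math looks like the sketch below. The egress price is an assumption (roughly $0.09/GB is a commonly cited AWS internet egress rate); the point is only the order of magnitude.

```python
# Back-of-the-envelope: what one saturated 1 Gbps box can cost you in cloud egress.
GBPS = 1                                  # attacker's sustained bandwidth (gigabits/s)
SECONDS_PER_MONTH = 30 * 24 * 3600        # ~2.59 million seconds
EGRESS_PRICE_PER_GB = 0.09                # assumption: typical AWS internet egress rate

bytes_per_month = GBPS / 8 * 1e9 * SECONDS_PER_MONTH   # bits -> bytes
tb_per_month = bytes_per_month / 1e12
egress_cost = (bytes_per_month / 1e9) * EGRESS_PRICE_PER_GB

print(f"{tb_per_month:.0f} TB pulled, ~${egress_cost:,.0f} in egress")
# -> roughly 324 TB and ~$29,000, in line with the figures above
```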
1) hard block without having done any requests yet. No clue why. Same browser (Burp's built-in Chromium), same clean state, same IP address, but one person got a captcha and the other one didn't. It would just say "reload the page to try again" forever. This person simply couldn't use the site at all; not sure if that would happen if you're on any other browser, but since it allowed the other Burp Suite browser, that doesn't seem to be the trigger for this perma-ban. (The workaround was to clone the cookie state from the other consultant, but normal users won't have that option.)
2) captcha. I got so many captchas, like every 4th request. It broke the website (async functionality) constantly. At some point I wanted to try a number of passwords for an admin username that we had found and, to my surprise, it allowed hundreds of requests without captcha. It blocks humans more than this automated bot...
3) "this website is under construction" would sometimes appear. Similar to situation#1, but it seemed to be for specific requests rather than specific persons. Inputting the value "1e9" was fine, "1e999" also fine, but "1e99" got blocked, but only on one specific page (entering it on a different page was fine). Weird stuff. If it doesn't like whatever text you wrote on a support form, I guess you're just out of luck. There's no captcha or anything you can do about it (since it's pretending the website isn't online at all). Not sure if this was AWS or the customer's own wonky mod_security variant
I dread to think what it would be like if I were a customer of this place and urgently needed them (it's not a regular webshop but something you might need in a pinch) and the only thing it ever gave me was "please reload the page to try again". Try what again?? Give me a human to talk to, any number to dial!
Also mind that not every request we make is malicious. A lot of it is also seeing what's even there, doing baseline requests, normal things. I didn't get the impression that I got blocked more on malicious requests than normal browsing at all (see also the part where a bot could go to town on a login form while my manual navigation was getting captchas)
It wouldn't just randomly block something.
It must be based on something, no?
Also, cheaply rate limiting malicious web clients should be trivial to accomplish with competent web tooling (i.e., on your own servers). If this seems out of scope or infeasible, you might be using the wrong tools for the job.
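A minimal sketch of the kind of per-client rate limiting meant here, as a token bucket; the thresholds and the in-memory store are illustrative assumptions, not a production design:

```python
import time
from collections import defaultdict

class TokenBucketLimiter:
    """Tiny in-memory per-client token bucket (illustrative only)."""
    def __init__(self, rate: float = 5.0, burst: int = 20):
        self.rate = rate          # tokens refilled per second
        self.burst = burst        # maximum bucket size (allowed burst)
        self.buckets = defaultdict(lambda: (float(burst), time.monotonic()))

    def allow(self, client_ip: str) -> bool:
        tokens, last = self.buckets[client_ip]
        now = time.monotonic()
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.buckets[client_ip] = (tokens, now)
            return False          # over the limit: reply 429 or drop
        self.buckets[client_ip] = (tokens - 1, now)
        return True

limiter = TokenBucketLimiter(rate=5, burst=20)
print(limiter.allow("203.0.113.7"))   # True until the bucket drains
```

In practice most people would lean on their web server's or reverse proxy's built-in rate limiting rather than rolling their own, but the principle is the same.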
On one hand, these companies announce themselves as sophisticated, futuristic, and highly valued; on the other, we see rampant incompetence, to the point that webmasters everywhere are debating the best course of action.
That distinction requires you to take companies which benefit from amassing as much training data as possible at their word when they pinky swear that a particular request is totally not for training, promise.
If I couldn't easily cut off the majority of that bot volume I probably would've shut down the app entirely.
Gee, if only we had, like, one central archive of the internet. We could even call it the internet archive.
Then, all these AI companies could interface directly with that single entity on terms that are agreeable.
Not even news articles from the top 10 news websites in my country are usually indexed there.
To be fair, an accurate measurement would need to consider how many of those CPU cycles would be spent by the human user who is driving the bot. From that perspective, maybe the scrapers can "make up for it" by crawling efficiently, i.e. avoiding loading tracker scripts, images, etc. unless necessary to solve the query. This way they'll still burn CPU cycles, but at least it'll be fewer cycles than a human user with a headful browser instance.
However, what is more important to me than AI agents is that someone might want to download single files with curl, or use browsers such as Lynx, etc., and this should work.
>We need protocols, not gatekeepers.
But until we have working protocols, many webmasters literally do need a gatekeeper if they want to realistically keep their site safe and online.
I wish this weren't the case, but I believe the "protocol" era of the web was basically ended when proprietary web 2.0 platforms emerged that explicitly locked users in with non-open protocols. Facebook doesn't want you to use Messenger in an open client next to AIM, MSN, and IRC. And the bad guys won.
But like I said, I hope I'm wrong.
The funny thing is that this blog post is complaining about a proposed protocol from Cloudflare (one which will identify bots so that good bots can be permitted). The signup form is just a method to ask Cloudflare (or any other website owner/CDN) to be categorized as a good bot.
It's not a great protocol if you're in the business of scraping websites or selling people bots to access websites for them, but it's a great protocol for people who just want their website to work without being overwhelmed by the bad side of the internet.
The whitelist approach Cloudflare takes isn't good for the internet, but for website owners who are already behind Cloudflare, it's better than the alternative. Someone will need to come up with a better protocol that also serves the website owners' needs if they want Cloudflare to fail here. The AI industry simply doesn't want to cooperate, so their hand must be forced, and only companies like Cloudflare are powerful enough to accomplish that.
If this were Cloudflare inserting itself into some centralized routing layer of the internet and saying everything must do X, that would be a lot more alarming. But at the end of the day the internet is decentralized, and site owners are the ones choosing to use this capability.
Additionally, I don't think that I, as an individual website owner, would actually want to (or be capable of) keeping track of which agents are good and which are bad; Cloudflare doing this is helpful to me as a site owner as long as they act in good faith. And the moment they stop acting in good faith, I can disable them. This is definitely a problem right now: unrestricted access for bots means bad bots are taking up many cycles, raising costs and taking resources away from real users.
Perhaps a way to serve ads through the agents would be good enough. I'd prefer that to be some open protocol rather than something controlled by a company.
https://blog.cloudflare.com/perplexity-is-using-stealth-unde...
What did you say that relates to Perplexity being one of the reasons that Cloudflare and their customers have decided they need better protection from abusive scrapers?
Websites choose their own gatekeepers; Cloudflare is just one provider.
I also appreciate the AI search results a bit when I'm looking for something very specific (like what the YAML definition for a Docker Swarm deployment constraint looks like), because the AI just gives me the snippet, while the search results are 300 Medium blog posts about how to use Docker and none of them explain the variables or what each does. Even the official Docker documentation website is a mess to navigate and find anything relevant!
The problem isn't just that ads can't be served. It's that every technical measure to attempt to block their service produces new ways of misleading website owners and the services they use. Perplexity refuses any attempt at abuse detection and prevention from their servers.
None of this would've been necessary if companies like Perplexity would've just acted like a responsible web service and told their customers "sorry, this website doesn't allow Perplexity to act on your behalf".
The open protocol you want already exists: it's the user agent. A responsible bot will set the correct user agent, maybe follow the instructions in robots.txt, and leave it at that. Companies like Perplexity (and many (AI) scrapers) don't want to participate in such a protocol. They will seek out and abuse any loopholes in any well-intended protocol anyone can come up with.
I don't think anyone wants Cloudflare to have even more influence on the internet, but it's thanks to the growth of inconsiderate AI companies like Perplexity that these measures are necessary. The protocol Cloudflare proposes is open (it's just a signature); the problem people have with it is that they have to ask Cloudflare nicely to permit website owners to track and prevent abuse from bots. For any Azure-gated websites, your bot would need to ask permission there as well, as with Akamai-gated websites, and maybe even individual websites.
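For reference, the existing "protocol" described above (an honest user agent plus robots.txt) is about as lightweight as it gets. A minimal sketch of a well-behaved client, where the agent name and info URL are placeholders:

```python
import urllib.request
import urllib.robotparser
from urllib.parse import urlsplit

# Placeholder identity: a real bot would link to a page describing itself.
USER_AGENT = "ExampleAgent/1.0 (+https://example.com/bot-info)"

def polite_fetch(url: str) -> bytes | None:
    parts = urlsplit(url)

    # 1. Ask robots.txt whether our declared identity may fetch this path.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    if not rp.can_fetch(USER_AGENT, url):
        return None  # the responsible answer: tell the user the site opted out

    # 2. Identify ourselves honestly instead of spoofing a browser.
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

The complaint in this subthread is precisely that many scrapers skip both steps.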
A new protocol is a technical solution. Technical solutions work for technical problems. The problem Cloudflare is trying to solve isn't a technical problem; it's a social problem.
I’m not here to propose a solution. I’m here as an end-user saying I won’t go back to the old experience which is outdated and broken.
Cloudflare is not the gatekeeper; it's the owner of the site that blocks Perplexity who is "gatekeeping" you. You're telling me that's not right?
But they can't insert themselves without the owner directly adding them. So it's the owner that's doing the gatekeeping (regardless of whether it's Cloudflare or iptables rules).
I think all you AI people blaming Cloudflare are just trying to deflect from the actual problem which is more and more owners don't want AI crawlers going through their content.
If Cloudflare disappears, who are you going to blame next, the iptables developers, maybe Linus Torvalds?
Nobody is dying because artists are protecting their art
If these companies are adding extra code to bypass artists trying to protect their intellectual property from mimicry then that is an obvious and egregious copyright violation
More likely it will push these companies to actually pay content creators for the content they work on to be included in their models.
Centralization bad, yada yada. But if Cloudflare can get most major AI players to participate, then convince the major CDNs to also participate... ipso facto columbo oreo... standard.
You can see it in the web vs. mobile apps.
Many people may not see a problem with walled gardens, but the reality is that we have much less innovation on mobile than on the web, because anyone can spin up a web server, versus having to publish an app in the App Store (Apple).
Regulation.
Make it illegal to request the content of a webpage by crawler if the website operator doesn't explicitly allow it via robots.txt. Institute a government agency tasked with enforcement. If you as a website operator can show that traffic came from bots, you can open a complaint with the agency, and they take care of shaking painful fines out of the offending companies. Force cloud hosts to keep books on who was using which IP addresses. Will it be a 100% fix? No. Will it have a massive chilling effect if done well? Absolutely.
The companies buying these services are buying them from other companies. Countries or larger blocs like the EU can exert significant pressure on such companies by declaring the use of such services illegal when interacting with websites hosted in the country or bloc, or by companies in them.
> But how do you stop an EU company paying a Moldovan company (that has existed for 10 days) for "internet services", that pays a Brazilian company, that pays a Russian company to do the actual residential scraping?
The same comparison could be made with money laundering, and yes, it's a real and sizable issue. Yet the majority of money is not laundered. How does the EU company make sure it will not be held liable, especially the people who made the decision? Maybe on a technical level the perfect crime is possible, and not getting caught is possible or even likely given a certain approach. But the uncertainty around it will dissuade many, not all. The same goes for companies selling the services: you might think you have a foolproof way to circumvent the measures put in place, but what if not, and the government comes knocking?
The internet is too big and distributed to regulate. Nobody will agree on what the rules should be, and certain groups or countries will disagree in any case and refuse to enforce them.
Existing regulation rarely works, and enforcement is half-assed at best. Ransomware is regulated and illegal, but we see articles about major companies getting infected all the time.
I don't think registering with Cloudflare is the answer, but regulation definitely isn't the answer.
That said, what I am missing from these articles is an actual solution. Obviously we don't want Cloudflare to become an internet gatekeeper. It's a bad solution. But: it's a bad solution to an even worse problem.
Alternatives do exist, even decentralised ones, in the form of remote attestation ("can't access this website without secure boot and a TPM and a known-good operating system"), paying for every single visit or for subscriptions to every site you visit (which leads to centralisation because nobody wants a subscription to just your blog), or self-hosted firewalls like Anubis that mostly rely on AI abuse being the result of lazy or cheap parties.
People drinking the AI Kool-Aid will tell you to just ignore the problem, pay for the extra costs, and scale up your servers, because it's *the future*, but ignoring problems is exactly why Cloudflare still exists. If ISPs hadn't ignored spoofing, DDoS attacks, botnets within their network, """residential proxies""", and other such malicious acts, Cloudflare would've been an Akamai competitor rather than a middle man to most of the internet.
(What are some good alternatives to Cloudflare?)
Another way the situation is similar: email delivery is often unreliable and hard to implement due to spam filters. A similar thing seems to be happening to the web.
Not to mention the big cloud providers are unhinged with their egress pricing.
I always wonder why this status quo persisted even after Cloudflare. Their pricing is indeed so unhinged, that they're not even in consideration for me for things where egress is a variable.
Why is egress seemingly free for Cloudflare or Hetzner but feels like they launch spaceships at AWS and GCP every time you send a data packet to the outside world?
But the reality is: how can someone small protect their blog or content from AI training bots? E.g., do they just blindly trust that someone is sending an agent rather than a training bot and super-duper respecting robots.txt? Get real...
Or, fine, what if they do respect robots.txt, but they buy the data, which may or may not have been shielded through liability layers via "licensed data"?
Unless you're Reddit, X, Google, or Meta, with scary unlimited-budget legal teams, you have no power.
Great video: https://www.youtube.com/shorts/M0QyOp7zqcY
I would love that, and make it automated.
A single message from your IP to your router: block this traffic. That router sends it upstream, and it also blocks it. Repeat ad nauseam until the source changes ASN or (if the originator is on the same ASN) it reaches the router nearest the originator, routing table space notwithstanding. Maybe it expires after some auto-expiry, a day or a month or however long your IP lease exists. Plus, of course, a way to query what blocks I've requested and a way to unblock.
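Nothing like this exists as a standard today (operator-level tools such as BGP FlowSpec are the closest thing), but as a thought experiment, a hypothetical block-request message that a home router could forward hop by hop might look like the sketch below. Every field name here is invented for illustration.

```python
import json
import time

def make_block_request(requester_ip: str, offender_ip: str, ttl_seconds: int = 86400) -> str:
    """Hypothetical 'please drop this traffic upstream' message (illustrative only)."""
    return json.dumps({
        "type": "upstream-block-request",            # invented message type
        "requester": requester_ip,                   # who is asking for the block
        "offender": offender_ip,                     # source whose traffic should be dropped
        "expires": int(time.time()) + ttl_seconds,   # auto-expiry, as suggested above
    })

print(make_block_request("198.51.100.10", "203.0.113.99"))
```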
Corporations develop hostile AI agents,
Capable hackers develop anti-AI-agents.
This defeatist attitude of "we have no power".
In general Cloudflare has been pushing DRMization of the web for quite some time, and while I understand why they want to do it, I wish they didn't always show off as taking the moral high ground.
If anything, we've seen a rise in complaints about it just annoying average users.
Having said that, the solution is effective enough: a lightweight proxy component that issues proof-of-work tokens to such bogus requests works well, as various users on HN seem to point out.
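A minimal sketch of that proof-of-work idea, in the same spirit as tools like Anubis; the difficulty is an illustrative assumption:

```python
import hashlib
import os

DIFFICULTY_BITS = 20   # illustrative: ~1 million hash attempts on average per token

def issue_challenge() -> str:
    # Server side: a random challenge sent to the client along with the difficulty.
    return os.urandom(16).hex()

def solve(challenge: str) -> int:
    # Client side: find a nonce whose SHA-256 has DIFFICULTY_BITS leading zero bits.
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int) -> bool:
    # Server side: a single hash check for what cost the client ~a million hashes.
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0

challenge = issue_challenge()
assert verify(challenge, solve(challenge))
```

The asymmetry is the point: verification is one hash, while a client that wants to hammer out thousands of requests per second has to burn roughly a million hashes per token.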
Um, no? Where did you get this strange bit of info?
The original reports say nothing of that sort: https://news.ycombinator.com/item?id=42790252 ; and even the original motivation for Anubis was Amazon's AI crawler: https://news.ycombinator.com/item?id=42750420
(I've seen more posts with the analysis, including one which showed an AI crawler that would identify itself properly, but once it hit the rate limit would switch to a fake user agent from proxies... but I cannot find it now.)
This seems like slogan-based planning with no actual thought put into it.
And if you don’t want to self host, at least try to use services from organisations that aren’t hostile to the open web
To put that in perspective, even if they're sending empty TCP packets, "several billion" pps is 200 to 1800 gigabits of traffic, depending on what you mean by that. Add a cookieless HTTP payload and you're at many terabits per second. The average self hoster is more likely to get struck by lightning than encounter and need protection from this (even without considering the, probably modest, consequences of being offline a few hours if it does happen)
Edit: off by a factor of 60, whoops. Thanks to u/Gud for pointing that out. I stand by the conclusion though: less likely to occur than getting struck by lightning (or maybe it's around equally likely now? But somewhere in that ballpark) and the consequences of being down for a few hours are generally not catastrophic anyway. You can always still put big brother in front if this event does happen to you and your ISP can't quickly drop the abusive traffic
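For anyone who wants to redo that arithmetic, the implied bandwidth is just packets per second times bytes on the wire; the packet sizes and counts below are assumptions used to bracket the range, not measurements:

```python
def flood_bandwidth_gbps(packets_per_second: float, bytes_per_packet: int) -> float:
    """Rough wire bandwidth implied by a packet flood (illustrative arithmetic)."""
    return packets_per_second * bytes_per_packet * 8 / 1e9

# "Several billion" packets per second, from a bare 40-byte TCP/IP packet
# up to a ~84-byte minimum Ethernet frame.
print(flood_bandwidth_gbps(1e9, 40))   # ~320 Gbps
print(flood_bandwidth_gbps(3e9, 84))   # ~2000 Gbps
```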
But another argument against using the easiest choice, the near monopoly, is that we need a diverse, thriving ecosystem.
We don’t want to end up in a situation where suddenly Cloudflare gets to dictate what is allowed on the web.
We have already lost email to the tech giants, try running your own mail sometime. The technical aspect is easy, the problem is you will end up in so many spam folders it’s disgusting.
What we need are better decentralized protocols.
At first, you can use it for less serious stuff until you see how well it works.
Technically it's not very challenging. The problem is the total dominance of a few actors and a lot of spammers.
I have this argument every time self-hosting comes up, and every time I wonder if someone will do it to me to make a point. Or if one of the like million other comments I post upsets someone, or one of the many tools that I host. Yet to happen, idk. It's like arguing whether you need a knife on the street at all times because someone might get angry at a look. It happens, we have a word for it in NL (zinloos geweld, "senseless violence") and tiles in sidewalks (ladybug depictions) and everything, but no normal person actually carries weapons 24/7 (drug dealers surely do, yeah) or has people talk through a middle person.
I'd suspect other self-hosters just see more shit than I do, were it not that nobody ever says it happened to them. The only argument I ever hear is that they want to be "safe" while "self hosting with Cloudflare". Who's really hosting your shit then?
A web site owner published something he really shouldn't have and got hacked. I wound up being a "person of interest" in the resulting FBI investigation because I was the weirdest person in the chat room for the site. I think it drove them crazy I was using Tor so they got somebody to try to entrap me into sharing CP but (1) I'm not interested and (2) know better than that.
Will have to give this a second thought, but as a first one now that I read this: would Cloudflare have helped against the FBI, or against any foreign nation making a child-porn-related request to Cloudflare? Surely not?! A different kind of opsec is surely more relevant there, so I don't know if it's really relevant to "normal", legal self-hosting communities (as opposed to criminal, much less that level of unethical plus criminal), or if there's an aspect I'm missing here.
The only reason you even want to firewall 200 requests per second is that the code downstream of the firewall takes more than 5ms to service a request, so you could also consider improving that. And if you're only getting <5 and your server isn't overloaded then why block anything at all?
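A quick sanity check on where that 5 ms figure comes from, assuming a single worker:

```python
requests_per_second = 200
service_time_seconds = 0.005          # 5 ms per request
utilization = requests_per_second * service_time_seconds
print(utilization)  # 1.0 -> one worker is fully saturated once requests take 5 ms each
```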
How much additional tax money should I spend at work so the AI scum can make 200 searches per second?
Humans and 'nice' bots make about 5 per second.
It's getting really, really ugly out there.
[1]: https://codeberg.org/robots.txt#:~:text=Disallow:%20/.git/,....
You think Codeberg would sue you?
But it's the same thing with random software from a random nobody that has no license, or has a license that's not open-source: If I use those libraries or programs, do I think they would sue me? Probably not.
What legal teeth I would advocate would be targeted to crawlers (a subset of bot) and not include your usage. It would mandate that Big Corp crawlers (for search indexing, AI data harvesting, etc.) be registered and identify themselves in their requests. This would allow serverside tools to efficiently reject them. Failure to comply would result in fines large enough to change behavior.
Now that I write that out: if such a thing were to come to pass and it was well received, I do worry that Congress would foam at the mouth to expand it to bots more generally, Microsoft-Uncertified-Devices, etc.
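A sketch of what "efficiently reject them" server-side could look like under such a mandate; the Crawler-ID header and the blocklist are hypothetical, not an existing standard:

```python
# Hypothetical: a mandated "Crawler-ID" header plus a public registry of
# registered crawlers would make rejection a cheap dictionary lookup.
BLOCKED_CRAWLERS = {"example-ai-harvester", "example-search-indexer"}  # per-site policy

def handle_request(headers: dict[str, str]) -> int:
    crawler_id = headers.get("Crawler-ID")   # hypothetical mandated header
    if crawler_id is None:
        return 200                            # treat as an ordinary user agent
    if crawler_id.lower() in BLOCKED_CRAWLERS:
        return 403                            # registered, but not welcome here
    return 200

print(handle_request({"Crawler-ID": "example-ai-harvester"}))  # 403
```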
If it's too loose and similar to "wanted traffic is how the authors intend the website to be accessed, unwanted traffic is anything else", that's an argument that can be used against adblocks, or in favor of very specific devices like you mention. Might even give slightly more teeth to currently-unenforceable TOS.
If it's too strict, it's probably easier to find loopholes and technicalities that just lets them say "technically it doesn't match the definition of unwanted traffic".
Even if it's something balanced, I bet bigcorp lawyers will find a way to twist the definitions in their favor and set a precedent that's convenient for them.
I know this is a mini-rant rather than a helpful comment that tries to come up with a solution, it's just that I'm pessimistic because it seems the internet becomes a bit worse day by day no matter what we try to do :c
Robots.txt is meant for crawlers, not user agents such as a feed reader or git client
But not everyone thinks that's the purpose of robots.txt. Example, quoting Wikipedia[1] (emphasis mine):
> indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.
Quoting the linked `web robots` page[2]:
> An Internet bot, web robot, robot, or simply bot, is a software application that runs automated tasks (scripts) on the Internet, usually with the intent to imitate human activity, such as messaging, on a large scale. [...] The most extensive use of bots is for web crawling, [...]
("usually" implying that's not always the case; "most extensive use" implying it's not the only use.)
Also a quick HN search for "automated robots.txt"[3] shows that a few people disagree that it's only for crawlers. It seems to be only a minority, but the search results are obviously biased towards HN users, so it could be different outside HN.
Besides all this, there's also the question of whether web scraping (not crawling) should also be subject to robots.txt or not; where "web scraping" includes any project like "this site has useful info but it's so unusable that I made a script so I can search it from my terminal, and I cache the results locally to avoid unnecessary requests".
The behavior of alternative viewers like Nitter could also be considered web scraping if they don't get their info from an API[4], and I don't know if I'd consider Nitter the bad actor here.
But yeah, like I said I agree with your comment and your interpretation, but it's not the only interpretation of what robots.txt is meant for.
[1]: https://en.wikipedia.org/wiki/Robots.txt
[2]: https://en.wikipedia.org/wiki/Internet_bot
[3]: https://hn.algolia.com/?dateRange=all&query=automated%20robo...
[4]: I don't know how Nitter actually works or where it gets its data from; I just mention it so it's easier to explain what I mean by "alternative viewer".
- Legal threats are never really effective
Effective solutions are:
- Technical
- Monetary
I like the idea of web as a blockchain of content. If you want to pull some data, you have to pay for it with some kind of token. You either buy that token to consume information if you're of the leecher type, or get some by doing contributions that gain back tokens.
It's more or less the same concept as torrents back in the day.
This should be applied to emails too. The regular person sends what, 20 emails per day max? Say it costs $0.01 per mail; anyone could pay that. But if you want to spam 1,000,000 every day, that becomes prohibitive.
This seems flawed.
Poor people living in 3rd world countries that make like $2.00/day wouldn't be able to afford this.
>But if you want to spam 1,000,000 everyday that becomes prohibitive.
Companies and people with $ can easily pay this with no issues. If it costs $10,000 to send 1M emails that inbox but you profit $50k, it's a non-issue.
https://anchorbrowser.io/blog/page-load-reliability-on-the-t...
Here's to working together to develop a new protocol that works for agents and website owners alike.
Joking aside, I think the ideas and substance are great and sorely needed. However, I can only see the idea of a sort of token chain verification as running into the same UX problems that plagued (plagues?) PGP and more encryption-focused processes. The workflow is too opaque, requires too much specialized knowledge that is out of reach for most people. It would have to be wrapped up into something stupid simple like an iOS FaceID modal to have any hope of succeeding with the general public. I think that's the idea, that these agents would be working on behalf of their owners on their own devices, so it has to be absolutely seamless.
Otherwise, rock on.
329 more comments available on Hacker News