The AI-Scraping Free-for-All Is Coming to an End
nymag.com · Tech · story
skeptical · mixed
Debate: 80/100
Key topics
Artificial Intelligence
Data Scraping
Licensing
Web Governance
The article discusses the emerging trend of websites restricting AI scraping through new licensing standards, sparking debate among commenters about the effectiveness and implications of such measures.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion, based on 66 loaded comments
First comment: 5m after posting
Peak period: 42 comments in 0-12h
Avg / period: 11
Key moments
- Story posted: Sep 14, 2025 at 11:01 AM EDT
- First comment: Sep 14, 2025 at 11:06 AM EDT (5m after posting)
- Peak activity: 42 comments in 0-12h (hottest window of the conversation)
- Latest activity: Sep 19, 2025 at 5:26 AM EDT
ID: 45240266 · Type: story · Last synced: 11/20/2025, 2:49:46 PM
You can see artifacts when their servers are under queue load and the URLs become visible: a few resources carry a JWT with the account details in the URL. IIRC the clear name on the account in the token is Masha Rabinovich, with the email address masha@dns.li, an identity that has cropped up in various investigations [1][2].
[1] https://gyrovague.com/2023/08/05/archive-today-on-the-trail-...
[2] https://webapps.stackexchange.com/questions/145817/who-owns-...
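For readers wondering how account details can be read out of a URL at all: a JWT's payload is only base64url-encoded JSON, not encrypted, so anyone who sees the token can read its claims. The sketch below is illustrative only. The URL, the `token` query parameter name, and the claim values are all made up, and nothing here reflects how the site in question actually structures its URLs.

```python
import base64
import json
from urllib.parse import urlparse, parse_qs


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding JWTs strip off."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def b64url_encode(raw: bytes) -> str:
    """Encode bytes as unpadded base64url, the form used inside a JWT."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def jwt_claims_from_url(url: str, param: str = "token") -> dict:
    """Read the (unverified) claims of a JWT carried in a query parameter.

    The parameter name is a hypothetical default. No signature check is
    done: the point is that the payload is readable without any key.
    """
    token = parse_qs(urlparse(url).query)[param][0]
    _header, payload, _signature = token.split(".")
    return json.loads(b64url_decode(payload))


def make_demo_url() -> str:
    """Build a made-up URL with a made-up token, so the example runs offline."""
    header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url_encode(
        json.dumps({"name": "Example User", "email": "user@example.com"}).encode()
    )
    return f"https://cdn.example.com/resource.png?token={header}.{payload}.fake-signature"


if __name__ == "__main__":
    print(jwt_claims_from_url(make_demo_url()))
```

The signature segment only proves who issued the token; it does nothing to hide the payload, which is why tokens leaking into visible URLs expose whatever the claims contain.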
I think a big difference is that there are no microtransactions or compulsory licensing for content, so it always feels patently unfair to buy a subscription to read one article.
While this likely has no legal weight (except for EU TDM for commercial use, where the law does take into account opt-outs), they are betting on using services like Cloudflare and Fastly to enforce this.
[1] https://www.investors.com/research/the-new-america/reddit-st...
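To make the "opt-out" idea concrete: the thread does not spell out the mechanics of the new licensing standard, so the sketch below uses plain robots.txt as a stand-in for any machine-readable opt-out. It uses Python's standard `urllib.robotparser`. The crawler name `ExampleAIBot` and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that opts one hypothetical AI crawler out of the
# whole site while leaving every other user agent allowed.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/some-story"
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot", url))     # opted-out crawler
    print(is_allowed(ROBOTS_TXT, "OrdinaryBrowser", url))  # everyone else
```

Note that the check runs on the crawler's side. A file like this only binds scrapers that choose to honor it, which is why the comment above expects CDN-level enforcement to do the real work.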
I wouldn't be quite so sure about that. The AI industry has entirely relied on 'move fast and break things' and 'old fart judges who don't understand the tech' as their legal strategy.
The idea that AI training is fair use isn't so obvious, and quite frankly is entirely ridiculous in a world where AI companies pay for the data. If it's not fair use to take reddit's data, it's not fair use to take mine either.
On a technological level the difference from prior ML is straightforward: A classical classifier system is simply incapable of emitting any copyrighted work it was trained on. The very architecture of the system guarantees that it produces new information derived from the training data rather than the training data itself.
LLMs and similar generative AI do not have that safeguard. To be practically useful they have to be capable of emitting facts from training data, but have no architectural mechanism to separate facts from expressions. For them to be capable of emitting facts they must also be capable of emitting expressions, and thus, copyright violation.
Add in how GenAI tends to directly compete with the market of the works used as training data in ways that prior "fair use" systems did not and things become sketchy quickly.
Every major AI company knows this, as they have rushed to implement copyright filtering systems once people started pointing out instances of copyrighted expressions being reproduced by AI systems. (There are technical reasons why this isn't a very good solution for curtailing copyright infringement by AI.)
Observe how all the major copyright victories amount to judges dismissing cases on grounds of "Well you don't have an example specific to your work" rather than addressing whether such uses are acceptable as a collective whole.
If this is intended to refer to Judge Alsup, it is extremely wrong.
Sure they do. Every time a bot searches, reads your site and formulates an answer, it does not replicate your expression. First of all, it compares across 20 to 100 sources. Second, it only reports what is related to the user query. And third, it uses its own expression. It's more like asking a friend who read those articles and getting an answer.
LLMs' ability to separate facts from expression is quite well developed, maybe their strongest skill. They can translate, paraphrase, summarize, or reword forever.
There should presumably be data showing the reliability of LLMs' knowledge to be quite high, then?
> Every time a bot searches
We are talking about LLMs by themselves, not larger systems using them.
> LLMs' ability to separate facts from expression is quite well developed
It is not. Whether you ask an LLM for an excerpt of the Bible, or an excerpt of The Lord of the Rings, the LLM does not distinguish. It has no concept of what is, and what is not, under copyright.
> Observe how all the major copyright victories amount to judges dismissing cases on grounds of "Well you don't have an example specific to your work" rather than addressing whether such uses are acceptable as a collective whole.
Well, all a judge can/should do is to apply current law to the case before them. In the case of generative AI, it seems that it's mostly going to be copyright and "right of publicity" (reproducing someone else's likeness/voice) that apply.
Copyright infringement is all about having published something based on someone else's work - AFAIK it doesn't have anything to say about someone/something having the potential to infringe (e.g. training an AI) if they haven't actually done it. It has to be about the generated artifact.
Of course copyright law wasn't designed with generative AI in mind, and maybe now that it is here we need new laws to protect creative content. For example, should OpenAI be able to copy Studio Ghibli's "trademark" style without requiring permission?
This is true, and I do not mean to suggest it is bad, but rather that it leaves uncertainty. These cases can all be struck down without reducing the possibility that, if one does stick, the entire industry is at stake.
> Copyright infringement is all about having published something based on someone else's work - AFAIK it doesn't have anything to say about someone/something having the potential to infringe (e.g. training an AI) if they haven't actually done it. It has to be about the generated artifact.
A notable problem here is that AI models are not "standalone products" but tools provided as a service. This complicates the situation.
Take Disney/Universal's case against Midjourney, which is about both the models and the provision of services.
Even if only the latter gets deemed illegal, that's ruinous for the big AI companies. What good is OpenAI if they can't provide ChatGPT? Who would license an LLM if the act of using it creates constant legal risks?
A “classical” classifier can regurgitate its training data as well. It’s just that Reddit never seemed to care about people training e.g. sentiment classifiers on their data before.
In fact a “decoder” is simply autoregressive token classification.
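The claim that decoding is just repeated classification can be sketched with a toy model. Everything below is invented for illustration: the four-word vocabulary, the hand-written score table standing in for a trained classifier, and the greedy argmax choice. A real decoder conditions on the whole context with a neural network, but the loop has the same shape.

```python
# Hand-written scores standing in for a trained next-token classifier.
# Keys are the previous token; values score each candidate next token.
SCORES = {
    "<bos>": {"the": 0.90, "cat": 0.05, "sat": 0.03, "<eos>": 0.02},
    "the":   {"the": 0.05, "cat": 0.80, "sat": 0.10, "<eos>": 0.05},
    "cat":   {"the": 0.05, "cat": 0.05, "sat": 0.70, "<eos>": 0.20},
    "sat":   {"the": 0.03, "cat": 0.02, "sat": 0.05, "<eos>": 0.90},
}


def classify_next(context: list[str]) -> dict[str, float]:
    """One classification step: score every vocabulary item as the next token.

    This toy only looks at the last token of the context.
    """
    return SCORES[context[-1]]


def greedy_decode(max_tokens: int = 10) -> list[str]:
    """Generate text by classifying, appending the winner, and repeating."""
    context = ["<bos>"]
    for _ in range(max_tokens):
        scores = classify_next(context)
        token = max(scores, key=scores.get)  # pick the top-scoring class
        if token == "<eos>":
            break
        context.append(token)  # the output becomes part of the next input
    return context[1:]


if __name__ == "__main__":
    print(" ".join(greedy_decode()))
```

Because the score table was effectively fit to a single sentence, the loop reproduces that sentence verbatim, which is a small-scale version of the regurgitation point made above.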
The right thing would be for the end users to receive the compensation Reddit is getting from AI companies.
Is there even one example of a “tech mega corp” that has grown to control more than 1/5 of its market without this circling back to hurt people in some way? A single example?
The licensing standard they're talking about will achieve nothing.
Anti-bot companies selling scraping protections will run out of runway: there's a limited set of signals, and none of them are robust. As the signals get used, they're also getting burned. And it's politically impossible to expand the web platform to have robust counter-abuse capabilities.
Putting the content behind a login wall can work for large sites, but not small ones.
The free-for-all will not end until adversarial scraping becomes illegal.
Syndication is the answer. Small artists are on Spotify, small video makers are on YouTube.
See the problem?
Well, with one alternative: Edge.
We don't need these institutions. We don't need these publishing platforms.
It's ok for them to die. They no longer provide value.
Adversarial scraping is not a thing, and it can't hurt you.
Fair use, however, is a thing, and what we need to be doing is totally overhauling copyright law such that it maximizes protections for individual creative types, and does away with the exploitable corporatized loopholes and bureaucracy.
99% of all sales for nearly all copyrighted products are done within the first 4 years of a work hitting the market. Give ironclad copyright to the creator for 5 years. The creator can assign their rights, explicitly, in writing, to a third party, for any particular work, or any particular fraction of their work, but each and every assignment of rights has to be explicitly documented and notarized.
No more DMCA automated bullshit. The creator can submit a copyright claim. They need to provide evidence. If the evidence of wrongdoing is false, they should be fined. If a third party files a claim, they should be fined, zero exceptions, even if they have assigned rights.
Artists and creators and writers should get the recognition - if someone creates a thing, they attach a name to it, and they can lease rights to corporations or the like.
After 5 years, extend fair use to something liberal and generous, requiring both acknowledgments of source works and royalties, no more than 15%, paid to the creator/s. If multiple post-5 year "fair use" creators are involved, the 15% is split between them. From 5-15 years, you have to give credit and pay a fair use royalty. If you're a trillion dollar company, you're shelling out a lot of royalties. If you're an artist reusing other art, or writing fanfic for profit, or whatever, you're buying other artists a coffee in tribute.
After 15 years, it becomes public domain.
Anything older than 5 years becomes fair game for training AI or otherwise using in software. You set aside 15% for distribution and reimbursement once a year, and notify any creator of your use of their material.
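The tiers proposed above (exclusive rights for the first 5 years, a royalty of up to 15% split between creators from 5 to 15 years, public domain after 15) reduce to a little arithmetic. The sketch below is one reading of that proposal, not a worked-out scheme: the exact boundary handling, the equal split, and the use of the full 15% cap are assumptions, and the function and field names are hypothetical.

```python
ROYALTY_PERCENT = 15       # the cap named in the proposal, used here as the rate
EXCLUSIVE_YEARS = 5        # creator holds full copyright before this age
PUBLIC_DOMAIN_YEARS = 15   # work is free to use from this age on


def usage_terms(age_years: int, revenue_cents: int, n_creators: int) -> dict:
    """Return the tier a work falls into and the royalty owed per creator.

    Amounts are integer cents. The equal split is an assumption, and any
    remainder from the integer division is simply dropped in this sketch.
    """
    if n_creators < 1:
        raise ValueError("a work needs at least one creator")

    if age_years < EXCLUSIVE_YEARS:
        # Exclusive period: no statutory royalty, use requires a negotiated license.
        return {"status": "exclusive", "per_creator_cents": 0}

    if age_years < PUBLIC_DOMAIN_YEARS:
        total = revenue_cents * ROYALTY_PERCENT // 100
        return {"status": "royalty", "per_creator_cents": total // n_creators}

    # Public domain: free to use, nothing owed.
    return {"status": "public_domain", "per_creator_cents": 0}


if __name__ == "__main__":
    # A 7-year-old work by three creators, reused in a product earning $1,000.
    print(usage_terms(age_years=7, revenue_cents=100_000, n_creators=3))
```

Integer cents are used so the split does not pick up floating-point rounding. A real scheme would also need a rule for who receives the leftover cents.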
We need something sane, that scales, that doesn't hand power to corrupt cadres of lawyers and middlemen who do nothing but leech from creatives and ruin innocent people's lives.
AI is here to stay. Let's set up a system in which they contribute back to the commons in a significant way, that doesn't favor byzantine licensing and gatekeeping schemes designed to keep lawyers fat and happy off the efforts of people actually contributing to the common good. Let's allow the corporate media platforms and publishing outfits to die off. We have much better ways of doing things and better ways of rewarding people for their work. We don't need lawyers sucking up 80% of the profits for "facilitating deals" or whatever it is they tell themselves to sleep at night.
Raze the old system and salt the ground. Simplify everything for the practical and creative people to maximize on the value all around, get people the credit and profit they deserve, and foster a vibrant public good. It doesn't need to be thousands of pages of technicalities and byzantine law and legal tradecraft. That game was built for the lawyers, and we should stop playing it.
VC will eventually run out, then comes the burst.
A user queries the "static" training data in the LLM; the LLM guesses something, then searches the internet in real time for data to support the guesses. This would be classified as "browsing" rather than trawling.
(The searched data then gets added back into the corpus, thus sadly sidestepping all the anti-AI trawling mechanisms.)
Kind of like the way a normal user would.
The problem is, as others have already mentioned, how would the LLMs know what is a good answer versus a bad one, when a "normal" user also has this issue?
Those things were afterthoughts because for the most part the experimental methods sucked compared to the real thing. If we were in mid-2016 and your LSTM was barely stringing together coherent sentences, it was a curiosity but not a serious competitor to StackOverflow.
I say this not because I don’t think law/ethics are important in the abstract, but because they only became relevant after significant technological improvement.