The Nonprofit Feeding the Internet to AI Companies
Key topics
The article discusses the nonprofit Common Crawl, which provides data for AI training, and the tension between making content available and protecting it behind paywalls, sparking debate among commenters about copyright, open source, and the role of corporations.
Snapshot generated from the HN discussion
Discussion Activity
Light discussion
- First comment: 2m
- Peak period: 4 (0-1h)
- Avg / period: 2.5
Key moments
- Story posted: Nov 4, 2025 at 7:29 AM EST
- First comment: Nov 4, 2025 at 7:31 AM EST (2m after posting)
- Peak activity: 4 comments in 0-1h, the hottest window of the conversation
- Latest activity: Nov 4, 2025 at 2:00 PM EST
Further in, the article admits that such content is already freely available: it sits inside the page source and is merely obscured by code that runs on the page.
US law has precedent (including a recent case specifically about AI training) that training, reading, and transforming are not illegal if the materials themselves were legally obtained. Wholesale duplication of copyrighted material is illegal, but AI companies have already shown in court that they do not duplicate material; they transform it, at great effort and expense.
It reminds me a bit of the dilemma faced by authors of open source software. Collectively, developers have surrendered to the corporations, given up on the GPL, and given their work away for free. They don't worry about it because they can have a profitable sideline working for those same corporations, but journalists, writers, and other artists don't have a lucrative fallback plan.