The Nonprofit Doing the AI Industry's Dirty Work
Posted about 2 months ago · Active about 2 months ago
theatlantic.com · Tech · story
Key topics
- AI Training Data
- Web Archiving
- Common Crawl
The article discusses Common Crawl, a nonprofit that archives the web for AI research, with some commenters questioning the characterization of their work as 'dirty'.
Snapshot generated from the HN discussion
Discussion Activity
Light discussion
First comment: 3h
Peak period: 2-3h (1 comment)
Avg / period: 1
Key moments
- 01 Story posted: Nov 4, 2025 at 3:35 PM EST (about 2 months ago)
- 02 First comment: Nov 4, 2025 at 6:28 PM EST (3h after posting)
- 03 Peak activity: 1 comment in the 2-3h window, the hottest stretch of the conversation
- 04 Latest activity: Nov 4, 2025 at 6:28 PM EST (about 2 months ago)
ID: 45815598 · Type: story · Last synced: 11/17/2025, 7:52:31 AM
Want the full context?
Read the primary article or dive into the live Hacker News thread when you're ready.
Unless something has changed since I was there, the crawler didn't intentionally bypass any paywalls.
The crawler obeyed robots.txt, throttled itself when visiting slow sites to avoid overloading them, and announced its user agent clearly, with a URL explaining what it was and how to block it if desired.
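For readers unfamiliar with what this kind of "polite" crawling looks like in practice, here is a minimal Python sketch of the behaviors the commenter describes: checking robots.txt before fetching, pacing requests, and announcing a descriptive user agent with an explanatory URL. This is not Common Crawl's actual code; the bot name, info URL, and delay values are made up for illustration, and the fixed pause is a simplification of the adaptive throttling described above.

```python
import time
import urllib.parse
import urllib.request
import urllib.robotparser

# Illustrative values only -- not Common Crawl's actual configuration.
USER_AGENT = "ExampleBot/1.0 (+https://example.org/bot-info)"  # URL tells site owners what the bot is and how to block it
DEFAULT_DELAY = 5.0  # seconds to wait before each request when robots.txt sets no Crawl-delay


def fetch_politely(url: str) -> bytes | None:
    """Fetch a URL only if robots.txt allows it, honoring any Crawl-delay."""
    parts = urllib.parse.urlsplit(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"

    # Check the site's robots.txt before touching the page itself.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    if not rp.can_fetch(USER_AGENT, url):
        return None  # the site has asked crawlers to stay away from this path

    # Pace requests: use the site's requested Crawl-delay, else a conservative default.
    time.sleep(rp.crawl_delay(USER_AGENT) or DEFAULT_DELAY)

    # Announce who we are via a descriptive User-Agent header.
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()


if __name__ == "__main__":
    body = fetch_politely("https://example.org/")
    print("blocked by robots.txt" if body is None else f"fetched {len(body)} bytes")
```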