The Nonprofit Doing the AI Industry's Dirty Work
Posted about 2 months ago · Active about 2 months ago
theatlantic.com · Tech · story
Key topics
- AI Training Data
- Web Archiving
- Common Crawl
The article discusses Common Crawl, a nonprofit that archives the web for AI research, with some commenters questioning the characterization of their work as 'dirty'.
Snapshot generated from the HN discussion
Discussion Activity
Light discussion
First comment: 3h
Peak period: 2-3h (1 comment)
Avg / period: 1
Key moments
- 01 Story posted: Nov 4, 2025 at 3:35 PM EST (about 2 months ago)
- 02 First comment: Nov 4, 2025 at 6:28 PM EST (3h after posting)
- 03 Peak activity: 1 comment in the 2-3h window, the hottest stretch of the conversation
- 04 Latest activity: Nov 4, 2025 at 6:28 PM EST (about 2 months ago)
ID: 45815598 · Type: story · Last synced: 11/17/2025, 7:52:31 AM
Want the full context?
Read the primary article or dive into the live Hacker News thread when you're ready.
Unless something has changed since I was there, the crawler didn't intentionally bypass any paywalls.
The crawler obeyed robots.txt, throttled itself when visiting slow sites to avoid overloading them, and announced its user agent clearly, with a URL explaining what it was and how to block it if desired.
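For readers unfamiliar with what this kind of "polite" crawling looks like in practice, here is a minimal Python sketch of the behaviors the commenter describes: checking robots.txt before fetching, pacing requests, and announcing a descriptive user agent with an explanatory URL. This is not Common Crawl's actual code; the bot name, info URL, and delay values are made up for illustration, and the fixed pause is a simplification of the adaptive throttling described above.

```python
import time
import urllib.parse
import urllib.request
import urllib.robotparser

# Illustrative values only -- not Common Crawl's actual configuration.
USER_AGENT = "ExampleBot/1.0 (+https://example.org/bot-info)"  # URL tells site owners what the bot is and how to block it
DEFAULT_DELAY = 5.0  # seconds to wait before each request when robots.txt sets no Crawl-delay


def fetch_politely(url: str) -> bytes | None:
    """Fetch a URL only if robots.txt allows it, honoring any Crawl-delay."""
    parts = urllib.parse.urlsplit(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"

    # Check the site's robots.txt before touching the page itself.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    if not rp.can_fetch(USER_AGENT, url):
        return None  # the site has asked crawlers to stay away from this path

    # Pace requests: use the site's requested Crawl-delay, else a conservative default.
    time.sleep(rp.crawl_delay(USER_AGENT) or DEFAULT_DELAY)

    # Announce who we are via a descriptive User-Agent header.
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()


if __name__ == "__main__":
    body = fetch_politely("https://example.org/")
    print("blocked by robots.txt" if body is None else f"fetched {len(body)} bytes")
```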