The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
Posted 4 months ago · Active 3 months ago
arxiv.org · Research · story
Sentiment: calm, positive · Debate: 40/100
Key topics
- AI Training Data
- Open Licensing
- Large Datasets
Researchers release The Common Pile v0.1, an 8TB dataset of public domain and openly licensed text, sparking discussion on its potential impact on AI development and the advantages it may give to certain models.
Snapshot generated from the HN discussion
Discussion Activity
Status: active discussion
- First comment: 4d after posting
- Peak period: 13 comments in 96-108h
- Avg / period: 5.7 comments
- Comment distribution: 17 data points (based on 17 loaded comments)
Key moments
1. Story posted: Sep 19, 2025 at 6:22 AM EDT (4 months ago)
2. First comment: Sep 23, 2025 at 5:19 AM EDT (4d after posting)
3. Peak activity: 13 comments in 96-108h (hottest window of the conversation)
4. Latest activity: Sep 24, 2025 at 3:40 AM EDT (3 months ago)
ID: 45299987 · Type: story · Last synced: 11/20/2025, 2:46:44 PM
This is why Gemini has such an advantage.
Also, link to explore data: https://huggingface.co/collections/common-pile/common-pile-v...
I think model collapse gets talked about so much because it is irresistible schadenfreude. The idea of models eating their own tails in a way that leads to their inevitable demise is captivating to a lot of people, especially AI skeptics.
More generally, a lot of these ideas were extrapolated from experiments on very tiny models in controlled settings, and they didn't pan out in real LLMs. There probably exists a minimal compute threshold for overcoming generalization traps.
It's popular because some people latched onto the idea, desperately wanting something to stop AI tech from advancing. It quite obviously doesn't stop anything.
Now, you can write an entire research paper on why model collapse happens or fails to happen. But a simple way to think of it is: looping AI onto itself multiple times amplifies that AI's own deficiencies, distortions and idiosyncrasies - until, after enough iterations, they come to completely dominate its outputs.
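That amplification is easy to demonstrate in a toy setting. The sketch below is an illustration of the general idea, not anything from the thread or the paper: repeatedly fit a Gaussian to samples drawn from the previous generation's fit, so that finite-sample estimation error compounds generation after generation.

```python
# Toy sketch of "looping a model on its own outputs":
# each generation refits a Gaussian to samples drawn from the previous fit.
# Finite-sample estimation error compounds, and the fitted parameters tend to
# drift (and the variance to shrink) over generations.
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 0.0, 1.0    # the "real" data distribution
n_samples = 200          # small sample size per generation

for generation in range(20):
    samples = rng.normal(mu, sigma, n_samples)   # generate from the current model
    mu, sigma = samples.mean(), samples.std()    # refit the model on its own output
    print(f"gen {generation:2d}: mu={mu:+.3f}, sigma={sigma:.3f}")
```

With only 200 samples per round, the drift is visible within a couple dozen generations; with more data per round it takes longer, which is the toy analogue of the "deficiencies come to dominate only after enough iterations" point above.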
This doesn't apply at all to training an LLM on Whisper outputs that are, in turn, based on human-generated videos. The LLM will inherit some Whisper quirks, but most of the data in Whisper outputs comes from the videos themselves.
There is some risk that Whisper transcribed inaccurately, but that’s less model collapse and more “the dataset is bad”.
For example, suppose you had a classifier that works at 95% precision, trained on carefully labeled data. Then, to train the next version, you download 1 TB of images, classify them with your previous model, and use those labels to retrain. Do you expect to get better than 95%, or are you poisoning your model?
I'm asking: can you do that with an LLM? Feed it data that's known to be 95% precise at best? I've done some Whisper transcription, and I usually get runs of words like "bye bye bye bye bye bye" even though the word was only said once. Should I use that kind of data to train an LLM?
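A common first-pass mitigation for that repetition artifact is to collapse runs of identical words before the transcript goes into a training set. The sketch below is a rough heuristic for illustration (the function name and the cutoff of two repeats are assumptions, not anything described in the thread):

```python
def collapse_repeats(text: str, max_repeats: int = 2) -> str:
    """Collapse runs of the same word (e.g. 'bye bye bye bye bye') down to at
    most `max_repeats` occurrences. A crude filter for Whisper-style repetition
    loops; it will also clip legitimate repetition, so treat it as a heuristic."""
    out = []
    for word in text.split():
        # count how many copies of this word already sit at the end of `out`
        run = 0
        while run < len(out) and out[-1 - run].lower() == word.lower():
            run += 1
        if run < max_repeats:
            out.append(word)
    return " ".join(out)

print(collapse_repeats("bye bye bye bye bye bye"))  # -> "bye bye"
```

Cleanup like this (plus dropping low-confidence segments) is closer to ordinary dataset hygiene than to avoiding model collapse per se, which matches the "the dataset is bad" framing above.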
I saw an experiment where an LLM was fed an image and asked to reproduce it, and the process was then repeated with each generated image. After ten or so cycles, the content (a photo of a human head) was barely recognizable.