FinePDFs: 3T Token Dataset Made From Internet PDFs
Posted 4 months ago · Active 4 months ago
Key topics
AI
Dataset
PDF Processing
FinePDFs, a new dataset of 3 trillion (3T) tokens built from PDFs found on the internet, has been released.
Snapshot generated from the HN discussion
Discussion Activity
Light discussion
- First comment: 6h after posting
- Peak period: 1 comment in 6-7h
- Avg / period: 1
Key moments
- 01 Story posted: Sep 7, 2025 at 3:19 AM EDT
- 02 First comment: Sep 7, 2025 at 9:28 AM EDT (6h after posting)
- 03 Peak activity: 1 comment in 6-7h (hottest window of the conversation)
- 04 Latest activity: Sep 7, 2025 at 9:28 AM EDT
Discussion (1 comment)
compressedgas
4 months ago
https://huggingface.co/datasets/HuggingFaceFW/finepdfs
ID: 45156093 · Type: story · Last synced: 11/17/2025, 6:02:42 PM