OpenDataLoader-PDF: An Open-Source Tool for Structured PDF Parsing
Source: github.com
Key topics
PDF Parsing
Structured Data Extraction
AI-Friendly File Formats
The OpenDataLoader-PDF project, an open-source tool for structured PDF parsing, was shared on HN, sparking discussions on its potential applications, comparisons to other tools, and the challenges of working with PDFs.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion. First comment: 1h after posting. Peak period: 22 comments in 0-12h. Average per period: 5.6. Based on 28 loaded comments.
Key moments
- Story posted: Sep 23, 2025 at 9:58 AM EDT
- First comment: Sep 23, 2025 at 11:22 AM EDT (1h after posting)
- Peak activity: 22 comments in 0-12h
- Latest activity: Sep 28, 2025 at 11:30 AM EDT
ID: 45347147 | Type: story | Last synced: 11/20/2025, 6:48:47 PM
The bigger question is how well they perform. There are needle-in-a-haystack benchmarks that test that, and models are mostly scoring quite highly on those now.
https://cloud.google.com/blog/products/ai-machine-learning/t... talks about that for Gemini 1.5.
Here are a couple of relevant leaderboards: https://huggingface.co/spaces/RMT-team/babilong and https://longbench2.github.io/
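For anyone who hasn't seen one, a needle-in-a-haystack test can be sketched roughly as below. This is only an illustration of the general idea, not how any of the linked benchmarks is implemented: `ask_model` is a hypothetical callable standing in for whatever model API you use, and the filler text, needle, and substring scoring rule are all assumptions.

```python
from typing import Callable, List, Sequence

# Hypothetical signature: (context, question) -> model answer as text.
AskModel = Callable[[str, str], str]


def build_haystack(filler: List[str], needle: str, depth: float) -> str:
    """Insert `needle` among the filler sentences at a relative depth in [0, 1]."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    position = int(depth * len(filler))
    return " ".join(filler[:position] + [needle] + filler[position:])


def run_trial(ask_model: AskModel, filler: List[str], needle: str,
              question: str, expected: str, depth: float) -> bool:
    """Return True if the model's answer contains the expected string."""
    context = build_haystack(filler, needle, depth)
    answer = ask_model(context, question)
    return expected.lower() in answer.lower()


def sweep(ask_model: AskModel, filler: List[str], needle: str,
          question: str, expected: str, depths: Sequence[float]) -> float:
    """Fraction of depths at which the needle was retrieved."""
    hits = sum(
        run_trial(ask_model, filler, needle, question, expected, d)
        for d in depths
    )
    return hits / len(depths)
```

Real benchmarks also vary the context length and use stricter scoring than a substring match, so treat this purely as a toy harness.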
I've been toying with the idea of a new format that stores text naturally and captures semantics (e.g. to help with table parsing), but also preserves formatting rules so you can still achieve fairly consistent rendering. This format could be easily converted to PDF, although the opposite conversion would have the usual challenges. The main challenge is distribution, of course.
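To make the idea a bit more concrete, here is a minimal sketch of what such a format might look like as data. Everything in it is hypothetical: the `Block` type, its field names, and the JSON serialization are assumptions chosen for illustration, not an existing specification.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Block:
    """One semantic unit of a document (hypothetical schema)."""
    kind: str                                             # e.g. "paragraph" or "table"
    text: str = ""                                        # natural text for paragraphs
    rows: List[List[str]] = field(default_factory=list)   # cell text for tables
    style: Dict[str, str] = field(default_factory=dict)   # rendering hints


def dumps(blocks: List[Block]) -> str:
    """Serialize blocks to JSON, keeping semantics and formatting side by side."""
    return json.dumps([asdict(b) for b in blocks], indent=2)


def loads(data: str) -> List[Block]:
    """Parse JSON produced by `dumps` back into Block objects."""
    return [Block(**item) for item in json.loads(data)]


doc = [
    Block(kind="paragraph", text="Quarterly summary", style={"font-weight": "bold"}),
    Block(kind="table", rows=[["Date", "Amount"], ["2025-09-23", "12.50"]]),
]
```

Because the table is stored as rows of cells rather than positioned glyphs, a reader never has to guess the table structure back from coordinates; the `style` hints are what a renderer would use to produce a consistent PDF.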
Your transactions are probably already available in CSV.
The JSON extract actually looks pretty good and seems to produce something usable in one shot, which is very good compared to all the other tools I've tried so far, but I still need to check it more in-depth.
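As a rough illustration of why a structured JSON extract is convenient, the sketch below pulls every table out of a nested tree. The layout is an assumption made for the example (nodes carrying a `type` key, with `children` and `rows` as made-up field names); the tool's actual output schema may differ, so check it before relying on any of these keys.

```python
import json
from typing import Any, Dict, Iterator


def iter_nodes(node: Any, wanted_type: str) -> Iterator[Dict[str, Any]]:
    """Yield every dict in a nested JSON tree whose "type" equals `wanted_type`."""
    if isinstance(node, dict):
        if node.get("type") == wanted_type:
            yield node
        # Recurse into all values so nesting depth and key names don't matter.
        for value in node.values():
            yield from iter_nodes(value, wanted_type)
    elif isinstance(node, list):
        for item in node:
            yield from iter_nodes(item, wanted_type)


# Hypothetical sample extract, not real tool output.
SAMPLE = json.loads("""
{
  "type": "document",
  "children": [
    {"type": "heading", "text": "Statement"},
    {"type": "table", "rows": [["Date", "Amount"], ["2025-09-23", "12.50"]]},
    {"type": "section", "children": [
      {"type": "table", "rows": [["A"]]}
    ]}
  ]
}
""")

tables = list(iter_nodes(SAMPLE, "table"))
```

Walking the tree generically like this keeps the code working even if tables appear at different nesting levels.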
Sharing here in case someone chimes in with "hey, doofus, $magic_project already solves this."
[1] https://github.com/camelot-dev/camelot
[0]: http://cermine.ceon.pl/about.html
I'm looking for a library that can extract data tables from PDF and can be called from a C++ program (for https://www.easydatatransform.com). If anyone can suggest something, I'm all ears.
Or, if anything, I'll add it to the projects-that-already-do-this-but-haven't-yet-found list.