OpenDataLoader-PDF: An Open-Source Tool for Structured PDF Parsing

Posted 4 months ago · Active 3 months ago

phobos44

109 points

28 comments

github.com · Tech story

supportive · positive

Debate

20/100

Key topics

PDF Parsing

Structured Data Extraction

AI-Friendly File Formats

The OpenDataLoader-PDF project, an open-source tool for structured PDF parsing, was shared on HN, sparking discussions on its potential applications, comparisons to other tools, and the challenges of working with PDFs.

Snapshot generated from the HN discussion

Discussion Activity

Very active discussion
First comment: 1h after posting
Peak period: 0-12h
Avg / period: 5.6
Based on 28 loaded comments

Key moments

01 Story posted: Sep 23, 2025 at 9:58 AM EDT (4 months ago)
02 First comment: Sep 23, 2025 at 11:22 AM EDT (1h after posting)
03 Peak activity: 22 comments in 0-12h (hottest window of the conversation)
04 Latest activity: Sep 28, 2025 at 11:30 AM EDT (3 months ago)

Discussion (28 comments)

clueless

4 months ago

2 replies

Given current LLM context-size limitations, what is the state of the art for feeding large documents or text blobs into an LLM for accurate processing?

simonw

4 months ago

1 reply

The current generation of models all support pretty long context now: the Gemini family has had 1M tokens for over a year, GPT-4.1 is 1M, GPT-5 is interestingly back down to 400,000, and Claude 4 is 200,000, though there's a mode of Claude Sonnet 4 that can do 1M as well.

The bigger question is how well they perform. There are needle-in-a-haystack benchmarks that test that, and the models are mostly scoring quite highly on those now.

https://cloud.google.com/blog/products/ai-machine-learning/t... talks about that for Gemini 1.5.

Here's a couple of relevant leaderboards: https://huggingface.co/spaces/RMT-team/babilong and https://longbench2.github.io/

clueless

4 months ago

1 reply

Sorry, I should have been clearer: I meant open-source LLMs. I guess the question is, how are closed-source LLMs doing it so well? And if OS OpenNote is the best we have...

simonw

4 months ago

1 reply

Mainly I think it's that you need a LOT of VRAM to handle long context - server-class hardware is pretty much a requirement to work with more than ~10,000 tokens.
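
As a rough back-of-envelope for why long context is memory-hungry, here is a minimal sketch that estimates the size of the attention key/value cache, which grows linearly with the number of tokens held in context. It assumes a standard transformer that caches one key and one value vector per layer, per KV head, per token. The model dimensions are hypothetical and do not describe any model named in this thread, and the estimate ignores model weights, activations, and any cache quantization.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Estimate KV-cache size in bytes for a given context length.

    The leading 2 accounts for storing both a key and a value tensor per layer.
    bytes_per_value=2 assumes 16-bit (fp16/bf16) cache entries.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
HYPOTHETICAL_MODEL = dict(n_layers=32, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for tokens in (10_000, 200_000, 1_000_000):
        gib = kv_cache_bytes(context_tokens=tokens, **HYPOTHETICAL_MODEL) / 2**30
        print(f"{tokens:>9,} tokens -> {gib:7.2f} GiB of KV cache")
```

Because the cache scales linearly, going from 10,000 to 1,000,000 tokens multiplies this part of the memory footprint by 100, whatever the actual model shape is.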

ranger_danger

3 months ago

On my i9 desktop with 128GB RAM and only 8GB VRAM, using llama.cpp I can split the work between both CPU/GPU and get the max 200k context to run on Qwen3 at a decent (human-reading) speed.

lysecret

4 months ago

1 reply

I generally use 2.5 Flash for this; it works incredibly well. So many traditionally hard things can now be solved by stuffing them into a pretty cheap LLM haha.

mekael

4 months ago

What do you mean by “traditionally hard” in relation to a PDF? Most if not all of the docs I’m tasked with parsing are secured, flattened, and handwritten, which can cause any tool (traditional or AI) to require a confidence score and manual intervention. It also might be that I just get stuck with the edge cases 90% of the time.

trevor-e

4 months ago

3 replies

I've been thinking lately that maybe we need a new AI-friendly file format rather than continuing to hack on top of PDF's complicated spec. PDF was designed for consistent, portable page rendering; being easily parseable was not a goal afaik, which is why we have to jump through these crazy hoops. If you've ever looked at how text is stored internally in PDF this becomes immediately obvious.

I've been toying with an idea of a new format that stores text naturally and captures semantics (e.g. to help with table parsing), but also preserves formatting rules so you can still achieve fairly consistent rendering. This format could be easily converted to PDF, although the opposite conversion would have the regular challenges. The main challenge is distribution of course.
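
To illustrate the point about how text is stored internally, here is a minimal sketch in Python. In a PDF content stream, a single visible line is often written with the `TJ` operator as an array of short string fragments interleaved with numeric kerning adjustments, with no explicit notion of words, paragraphs, or tables. The sample stream below is hand-written for illustration and not taken from any real file, and the extractor is deliberately naive: it assumes an uncompressed stream, a simple one-byte font encoding, and no escaped parentheses or hex strings, all of which real PDFs routinely violate.

```python
import re

# Hand-written sample in the style of a PDF content stream (hypothetical, not from a real file).
# BT/ET delimit a text object, Tf selects a font, Td positions the cursor,
# and TJ shows an array of string fragments with kerning numbers between them.
SAMPLE_STREAM = b"BT /F1 12 Tf 72 720 Td [(He) -20 (llo) 15 ( W) 80 (orld)] TJ ET"


def extract_tj_text(stream: bytes) -> str:
    """Naively join the literal-string fragments of every TJ array in a stream."""
    pieces = []
    for array in re.findall(rb"\[(.*?)\]\s*TJ", stream, flags=re.S):
        # Literal strings sit in parentheses; the numbers between them are skipped.
        for fragment in re.findall(rb"\((.*?)\)", array):
            pieces.append(fragment.decode("latin-1"))
    return "".join(pieces)


if __name__ == "__main__":
    print(extract_tj_text(SAMPLE_STREAM))  # prints "Hello World"
```

Even in this best case the phrase arrives in four fragments. Reading order, word spacing, and table structure all have to be inferred from coordinates, which is the work that parsing tools end up doing.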

Jaxan

4 months ago

2 replies

Wouldn’t it be better to invest in a human-friendly format first (which could also be AI-friendly)?

dotancohen

4 months ago

If you can convince your bank to make available your bank statement in Markdown, let us know.

Your transactions are probably already available in CSV.

trevor-e

4 months ago

Not really sure what you mean by a "human-friendly" file format, can you elaborate? File formats are inherently not friendly to humans, they are a bag of bytes. But that doesn't mean they can't be better consumed by tools which is what I mean by "AI friendly".

s0rce

4 months ago

1 reply

Doesn't LaTeX do this?

trevor-e

4 months ago

1 reply

Yeah, I think LaTeX is capable of much of this, but it's also cursed.

s0rce

4 months ago

Don't need to convince me. I typeset my wife's PhD thesis in LaTeX and it looks great, but it was so frustrating that afterwards I did mine in Word.

kykat

4 months ago

Sounds like you want XML

agsqwe

4 months ago

1 reply

How does it compare to docling?

favorited

4 months ago

1 reply

Docling primarily uses AI models to extract PDF content; this project looks like it uses a custom parser written in Java, built atop veraPDF.

brumar

4 months ago

1 reply

Correct me if I am wrong, but Docling can do both. Among other strategies, it also has a non-AI pipeline to determine the layout (based on qpdf, I believe). So these projects are not that different.

favorited

4 months ago

While it has a PDF parser, my understanding is that it is mainly used to break a PDF document into chunks, which are then handed off to various specialized models. From its docs: "The main purpose of Docling is to run local models which are not sharing any user data with remote services."

emilburzo

4 months ago

2 replies

I just tested it on one of my nemeses: PDF bank statements. They're surprisingly tough to work with if you want to get clean, structured transaction data out of them.

The JSON extract actually looks pretty good and seems to produce something usable in one shot, which is very good compared to all the other tools I've tried so far, but I still need to check it more in-depth.

Sharing here in case someone chimes in with "hey, doofus, $magic_project already solves this."

vortex_ape

4 months ago

Camelot[1] worked very well for me with bank statements. Disclaimer: I'm one of the core contributors.

[1] https://github.com/camelot-dev/camelot

dleeftink

4 months ago

For 'zoned' extraction, Cermine[0] may be of use as a pre-processing step. Mileage may vary as it's tailored towards papers.

[0]: http://cermine.ceon.pl/about.html

hermitcrab

4 months ago

1 reply

I got excited until I read that it was Java/Python based.

I'm looking for a library that can extract data tables from PDF and can be called from a C++ program (for https://www.easydatatransform.com). If anyone can suggest something, I'm all ears.

therealpygon

4 months ago

1 reply

What makes Java/Python not able to be called from C++, or did you mean you have other requirements that make the project unsuitable?

hermitcrab

4 months ago

I can fire up a Java program in a separate process. But it is slow and passing data backwards and forwards is clunky. Much better to be able to do it all in one process.

constantinum

4 months ago

There is also Unstract, which is open source: structured data extraction + ETL. https://github.com/Zipstack/unstract

fedeb95

4 months ago

Very cool. I'll probably use it, but not for AI. I have lots of PDFs for which an EPUB doesn't exist.

Or if anything I'll add it to the projects-that-already-do-this-but-havent-yet-found list.

4d66ba06

4 months ago

Just finished migrating to it to replace pdf2docx in a new project I’ve been working on and it is so much better. Thanks for open sourcing OpenDataLoader-PDF!

ID: 45347147 · Type: story · Last synced: 11/20/2025, 6:48:47 PM