Benchmarking the Most Reliable Document Parsing API
Posted about 2 months ago · Active about 2 months ago
tensorlake.ai · Tech · story
Sentiment: skeptical / mixed
Debate: 60/100
Key topics
- Document Parsing
- API Benchmarking
- OCR
The post benchmarks document parsing APIs, but the discussion is dominated by skepticism about the methodology and suggestions for alternative tools.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagementFirst comment
45m
Peak period
10
1-2h
Avg / period
3.5
Comment distribution14 data points
Loading chart...
Based on 14 loaded comments
Key moments
1. Story posted: Nov 6, 2025 at 1:12 PM EST (about 2 months ago)
2. First comment: Nov 6, 2025 at 1:58 PM EST (45m after posting)
3. Peak activity: 10 comments in the 1-2h window, the hottest stretch of the conversation
4. Latest activity: Nov 6, 2025 at 4:24 PM EST (about 2 months ago)
ID: 45838365 · Type: story · Last synced: 11/20/2025, 1:20:52 PM
On Gemini and other VLMs: we excluded these models because they don't do visual grounding, i.e., they don't provide page layouts or bounding boxes for elements on the page. This is a table-stakes feature for the use cases customers are building with Tensorlake; it wouldn't be possible to build citations without bounding boxes.
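To make the citations point concrete, here is a minimal sketch of how grounded output enables them. The `PageElement` structure and `cite` helper are hypothetical illustrations, not Tensorlake's actual response schema; any parser that returns text plus per-element bounding boxes would support the same pattern.

```python
from dataclasses import dataclass

@dataclass
class PageElement:
    """One layout element from a parser with visual grounding (hypothetical schema)."""
    page: int
    text: str
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates

def cite(answer_snippet: str, elements: list[PageElement]) -> list[PageElement]:
    """Return the grounded elements whose text supports a generated answer,
    so a UI can highlight the exact page regions as citations."""
    return [el for el in elements if answer_snippet in el.text]

# Usage: each match carries a page number and a box to highlight.
elements = [
    PageElement(page=3, text="Net revenue was $12.4M in Q3.", bbox=(72.0, 140.5, 412.3, 158.0)),
    PageElement(page=4, text="Headcount grew 8% quarter over quarter.", bbox=(72.0, 96.0, 398.7, 114.2)),
]
for el in cite("Net revenue was $12.4M", elements):
    print(f"cite page {el.page}, bbox {el.bbox}")
```

A plain-text-only model can tell you *what* it read but not *where*, which is why the comment treats bounding boxes as a hard requirement rather than a nice-to-have.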
On pricing: we are probably the only company offering pure on-demand pricing without any tiers. With Tensorlake, you get back markdown for every page, summaries of figures, tables, and charts, structured data, page classification, etc., in ONE API call. This means we are running a bunch of different models under the hood. If you add up the token count and the complexity of the infrastructure needed to build a comparable pipeline around Gemini plus other OCR/layout-detection models, I bet the price you end up with won't be any cheaper than what we provide :) Plus, doing this at scale is very, very complex; it requires building a lot of sophisticated infrastructure, which is another source of cost behind modern document ingestion services.
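As a rough illustration of the cost argument, here is a back-of-envelope sketch comparing the model cost of a DIY pipeline with a bundled per-page price. Every rate and token count below is a placeholder assumption, not a quoted price from Google, Tensorlake, or anyone else; note that the DIY figure covers model usage only, while the infrastructure and engineering effort the comment emphasizes sits entirely outside it.

```python
# Back-of-envelope comparison: DIY pipeline around a VLM vs. a bundled
# per-page document-parsing API. All numbers are placeholder assumptions.

PAGES = 10_000

# DIY pipeline: one VLM pass for markdown plus extra passes for
# figure/table summaries, plus a separate OCR/layout-detection step
# to get bounding boxes.
vlm_tokens_per_page = 1_500       # assumed input+output tokens per page
vlm_price_per_mtok = 0.50         # assumed $/1M tokens
extra_summary_passes = 2          # assumed additional passes for figures/tables
ocr_layout_per_page = 0.0015      # assumed $/page for OCR + layout model

diy_model_cost = PAGES * (
    vlm_tokens_per_page * (1 + extra_summary_passes) * vlm_price_per_mtok / 1e6
    + ocr_layout_per_page
)

# Bundled service: one call per page covering all of the above.
bundled_per_page = 0.01           # assumed $/page
bundled_cost = PAGES * bundled_per_page

print(f"DIY model cost only: ${diy_model_cost:,.2f}")  # excludes infra + eng time
print(f"Bundled API:         ${bundled_cost:,.2f}")
```

Under these made-up rates the raw model spend can look cheaper for DIY; the comment's claim is that once you add the pipeline, retries, scaling, and maintenance that the second line never counts, the gap closes or reverses.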
It seems like such a crowded space, with many tools doing document extraction. I wonder if there's anything in particular pulling more attention into it?