LLMs Solving Problems OCR+NLP Couldn't

Posted 5 months ago | Active 5 months ago

universesquid

19 points

16 comments

cloudsquid.substack.com | Tech Discussion | story

informative, positive

Debate: 40/100

Large Language Models, OCR, LLM, AI

Key topics

Large Language Models

OCR

LLM

The blog post "LLMs solving problems OCR+NLP couldn't" sparked a lively discussion. Many commenters called the article a thinly veiled advertisement for a SaaS product, and one joked that checking for such links has become a reflex. The author, universesquid, joined the thread to say that multimodal LLMs parse PDFs by taking multiple screenshots of each page rather than relying on OCR. Other commenters weighed the limitations and potential of LLMs in document processing: some noted that LLMs remain exposed to prompt injection and can hallucinate values, while one poked fun at the notion that an "attack" only counts as prompt injection if it comes from state-sponsored hackers.

Snapshot generated from the HN discussion

Discussion Activity

Active discussion

First comment: 56m

Peak period: 0-2h

Avg / period: 5.3

Key moments

01 Story posted: Aug 28, 2025 at 9:15 AM EDT (5 months ago)
02 First comment: Aug 28, 2025 at 10:11 AM EDT (56m after posting)
03 Peak activity: 11 comments in 0-2h (hottest window of the conversation)
04 Latest activity: Aug 29, 2025 at 4:32 AM EDT (5 months ago)

Discussion (16 comments)

behnamoh

5 months ago

2 replies

This is a nothing burger blog post that likely made it to the front page because it mentions "LLM" in the title. Worse yet, it's an ad actually.

OtherShrezzing

5 months ago

1 reply

The first thing I do on HN posts with lots of upvotes and few comments is scroll to the bottom and check if the closing paragraph has a link to some SaaS product. If it does, I close the tab.

thaeli

5 months ago

Ironically, this check would be a pretty good use for an LLM.

WesleyLivesay

5 months ago

You beat me to this comment, but you are absolutely correct.

tovej

5 months ago

1 reply

Are LLMs not NLP? They process natural language, no?

And I assume the multimodal tools still use OCR for text extraction, or am I missing something?

My understanding is that they're still doing OCR+NLP, just differently than traditional approaches.

universesquid (Author)

5 months ago

1.) Technically yes, though most models used for that task are NLP models but not LLMs in the modern sense.

2.) Actually, they don't. Multimodal LLMs parse PDFs by taking multiple screenshots of each page.
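To make the "multiple screenshots" idea concrete, here is a minimal sketch of one way a rendered page image could be split into overlapping crops before each crop is handed to a vision model. This is an illustration only, not a description of how any particular model or vendor does it: the function names `tile_page` and `_starts`, the 768-pixel tile size, and the 64-pixel overlap are all assumptions made for this example.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels of the rendered page image.
Box = Tuple[int, int, int, int]


def _starts(length: int, tile: int, overlap: int) -> List[int]:
    """Start offsets so that tiles of size `tile` cover [0, length)."""
    if length <= tile:
        return [0]
    step = tile - overlap
    starts = list(range(0, length - tile, step))
    starts.append(length - tile)  # last tile sits flush with the far edge
    return starts


def tile_page(width: int, height: int, tile: int = 768, overlap: int = 64) -> List[Box]:
    """Return crop boxes covering a width x height page image.

    Hypothetical helper: tile size and overlap are illustrative defaults,
    not values taken from any real model.
    """
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("need tile > 0 and 0 <= overlap < tile")
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in _starts(height, tile, overlap)
        for x in _starts(width, tile, overlap)
    ]


if __name__ == "__main__":
    # A US Letter page rendered at 150 DPI is 1275 x 1650 pixels.
    for box in tile_page(1275, 1650):
        print(box)
```

The overlap is there so that a line of text cut by one tile boundary appears whole in a neighbouring tile. Rendering the PDF page to an image in the first place would need a PDF library, which is left out of this sketch.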

Tractor8626

5 months ago

1 reply

OCR doesn't have a prompt injection problem

mattigames

5 months ago

It's only prompt injection if it comes from state sponsored hackers, otherwise it's just surprise prompt augmentation.

endymion-light

5 months ago

1 reply

I don't mind people doing blog posts advertising their own companies, but I feel like I'd like a little bit more substance within this topic. It is interesting in a way: I find I turn to things like Gemini 2.5 for simple OCR/NLP, and now more substantial image editing, rather than to specific models.

I think that's more because of the current state of the industry: a lot of those models are either internal, paywall-locked, or annoying to use. I don't want to waste effort trying to sign up for a 4-week trial of X service to perform a one-off task.

Unfortunately, this post didn't really elucidate or go into an interesting topic within this space.

I'm not expecting a research paper, but it would be great to get some stats, graphs, examples and meat on the bones. I opened this up expecting some actual examples of problems within OCR & NLP and showing how X multi-modal model solves them.

universesquid (Author)

5 months ago

1 reply

Cool, thanks for that comment. I might update this in a couple of weeks, since the topic seems to interest people but the general feedback is that it's too shallow. I wanted to give some high-level intuition I gained after working on document processing for a while now, as many people are still surprised that e.g. layouts aren't a real problem anymore, but I will take the hint that HN is a crowd that wants more depth! :)

endymion-light

5 months ago

Yeah, absolutely! Didn't mean for it to come across as snarky, more along the lines of I think this could be a really interesting subject to delve a little bit deeper on and would love to read that!

eithed

5 months ago

1 reply

OCR doesn't hallucinate outputs: if it says "212.99mm" on an architecture diagram, it doesn't suddenly turn into "2413m" on the other end because the LLM thought this feels better. I remember reading on HN about that happening in such a case (but sadly my Google-fu fails me and I can't find a link).

strangecasts

5 months ago

1 reply

The case you might be thinking of is the JBIG2 implementation bug [1, 2] in Xerox photocopiers where the pattern-matching would incorrectly treat certain characters as interchangeable, leading to numbers getting rewritten in spreadsheets.

[1] https://www.bbc.com/news/technology-23588202

[2] https://www.dkriesel.com/en/blog/2013/0810_xerox_investigati...

eithed

5 months ago

That's exactly it! Thank you!
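The worry raised in this sub-thread, a measurement silently changing between the source and the extracted text, can be partly guarded against with a cross-check. The sketch below assumes two text outputs are available for the same page: a reference (for example a traditional OCR pass or the PDF text layer) and a candidate (for example an LLM transcription). It reports measurements that appear in the candidate but not in the reference. The function names and the regular expression are assumptions for illustration, and as the JBIG2 case shows, the reference is not guaranteed to be correct either, so a mismatch is a prompt for human review rather than proof of an error.

```python
import re
from collections import Counter
from typing import List

# A number with an optional unit suffix, e.g. "212.99mm" or "2413m".
# Deliberately simple; real documents may need a stricter pattern.
_MEASUREMENT = re.compile(r"\d+(?:\.\d+)?[A-Za-z%]*")


def measurements(text: str) -> Counter:
    """Count each number-with-unit token found in the text."""
    return Counter(m.group(0) for m in _MEASUREMENT.finditer(text))


def unsupported_measurements(reference: str, candidate: str) -> List[str]:
    """Measurements present in `candidate` more often than in `reference`."""
    extra = measurements(candidate) - measurements(reference)
    return sorted(extra.elements())


if __name__ == "__main__":
    reference = "Wall length 212.99mm, height 3m"   # hypothetical OCR output
    candidate = "Wall length 2413m, height 3m"      # hypothetical LLM output
    print(unsupported_measurements(reference, candidate))
```

Swapping the two arguments lists the measurements that the candidate dropped, so running the check in both directions gives a fuller picture.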

tiahura

5 months ago

"I still believe that processing documents will be a solved problem in a couple years time."

Current 80/20-rule-ignoring AI dogma in a nutshell.

daft_pink

5 months ago

Really looking for something we can run locally in terms of an OCR LLM. I think a lot of people doing a lot of OCR and document extraction aren't looking to upload every file into the cloud, and the use is more narrow than typing into a chatbot.

While Gemini is nice, it would be nice to have a pipeline that works locally on a reasonably RAM’d unified memory Mac or Framework AMD board.

ID: 45051777 | Type: story | Last synced: 11/20/2025, 6:48:47 PM
