LLMs Solving Problems OCR+NLP Couldn't
Key topics
The blog post "LLMs solving problems OCR+NLP couldn't" sparked a lively discussion. Many commenters called the article a thinly veiled advertisement for a SaaS product, and one joked that checking for such links has become a reflex. The conversation turned when the author, universesquid, chimed in to clarify that multimodal LLMs parse PDFs by taking multiple screenshots rather than relying on OCR. As the debate unfolded, commenters weighed the limitations and potential of LLMs: some noted that they still struggle with issues like prompt injection, while others poked fun at the notion that certain "attacks" are only considered legitimate if they come from state-sponsored hackers. The exchange highlights ongoing interest in what LLMs can and cannot do in document processing.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion
First comment: 56m
Peak period: 11 comments in 0-2h
Avg / period: 5.3
Based on 16 loaded comments
Key moments
- 01 Story posted: Aug 28, 2025 at 9:15 AM EDT (5 months ago)
- 02 First comment: Aug 28, 2025 at 10:11 AM EDT (56m after posting)
- 03 Peak activity: 11 comments in 0-2h (hottest window of the conversation)
- 04 Latest activity: Aug 29, 2025 at 4:32 AM EDT (5 months ago)
And I assume the multimodal tools still use OCR for text extraction, or am I missing something?
My understanding is that they're still doing OCR+NLP, just differently than traditional approaches.
I think that's more because of the current state of the industry: a lot of those models are either internal, paywall-locked, or annoying to use. I don't want to waste effort signing up for a 4-week trial of X service to perform a one-off task.
Unfortunately, this post didn't really dig into any interesting topic within this space.
I'm not expecting a research paper, but it would be great to get some stats, graphs, examples and meat on the bones. I opened this up expecting some actual examples of problems within OCR & NLP and a demonstration of how X multi-modal model solves them.
[1] https://www.bbc.com/news/technology-23588202
[2] https://www.dkriesel.com/en/blog/2013/0810_xerox_investigati...
Current 80/20-rule-ignoring AI dogma in a nutshell.
While Gemini is nice, it would be great to have a pipeline that works locally on a unified-memory Mac or Framework AMD board with a reasonable amount of RAM.