Deepseek Ocr

Posted3 months agoActive2 months ago

pierre

1,003 points

244 comments

github.comTechstoryHigh profile

calmpositive

Debate

60/100

OcrLarge Language ModelsComputer Vision

Key topics

Ocr

Large Language Models

Computer Vision

DeepSeek-OCR is an open-source OCR model that uses LLMs to achieve high accuracy and compression, sparking discussion on its potential applications, comparisons to existing models, and the future of OCR technology.

Snapshot generated from the HN discussion

Discussion Activity

Very active discussion

First comment

18m

Peak period

142

0-12h

Avg / period

22.9

Comment distribution160 data points

Loading chart...

Based on 160 loaded comments

Key moments

01Story posted
Oct 20, 2025 at 2:26 AM EDT
3 months ago
Step 01
02First comment
Oct 20, 2025 at 2:44 AM EDT
18m after posting
Step 02
03Peak activity
142 comments in 0-12h
Hottest window of the conversation
Step 03
04Latest activity
Oct 27, 2025 at 7:10 PM EDT
2 months ago
Step 04

Generating AI Summary...

Analyzing up to 500 comments to identify key contributors and discussion patterns

Discussion (244 comments)

Showing 160 comments of 244

farseer

3 months ago

1 reply

How good is this compared to most commercial OCR software?

ozim

3 months ago

2 replies

Any vision model is better than commercial OCR software.

Etheryte

3 months ago

1 reply

I'm not really sure if that's an accurate summary of the state of the art, [0] is a better overview. In short, SOTA multi-modal LLMs are the best option for handwriting, nearly anything is good at printed text, for printed media, specialty models from hyperscalers are slightly better than multi-modal LLMs.

[0] https://research.aimultiple.com/ocr-accuracy/

ozim

3 months ago

1 reply

I see it confirms what I wrote state of art is “not using tessaract anymore” and I think bunch of commercial solutions are stuck with tessaract.

ares623

3 months ago

I assume Tesseract has the advantage of being able to give a confidence score?

dragonwriter

3 months ago

Since “commercial OCR software” includes VLM-based commercial offerings, that's clearly not correct.

piker

3 months ago

2 replies

This looks really cool for prototyping and playing around.

It seems to me though if one is building a modern application that needs to get image segmentation and/or text recognition right there are better APIs available than natural language? It seems like a lot of effort to make a production-scale CV application to weigh it down with all of an LLM’s shortcomings. Not a field I’m familiar with but I would assume that this doesn’t produce state of the art results—that would change the analysis.

randomNumber7

3 months ago

1 reply

Imagine you build an image segmentation model for a e.g. specific industrial application.

With this LLM approach you can at least create your training data from the raw images with natural language.

piker

3 months ago

That does make sense

CheeseFromLidl

3 months ago

As a hobby photographer, I organise everything for speedy retrieval but this would be amazing to search my collection.

krackers

3 months ago

6 replies

The paper is more interesting than just another VLM for OCR, they start talking about compression and stuff. E.g. there is this quote

>Our work represents an initial exploration into the boundaries of vision-text compression, investigating how many vision tokens are required to decode text tokens. The preliminary results are encouraging: DeepSeek-OCR achieves near-lossless OCR compression at approximately 10× ratios, while 20× compression still retains 60% accuracy.

(I guess you could say a picture token is worth 10 textual tokens...)

Could someone explain to a noob what the information-theoretic intuition is here? Why does this work, is it that text tokens are still too "granular"/repetitive and don't come close to the ideal entropy coding? Or is switching to vision tokens escaping the limitation of working "one word-ish at a time", allowing you to get closer to entropy (similar to the way that arithmetic encoding does compared to huffman codes)?

And then they start talking about handling long-context by literally(?) downscaling images, forming a correspondence between information loss in the textual domain and the image domain.

looobay

3 months ago

1 reply

LLMs are compute heavy with quadratic scaling (in compute) per tokens. They are trying to compress text tokens into visual tokens with their VLM.

Maybe they would render texts to an image before tokenizing to reduce the compute cost.

krackers

3 months ago

6 replies

But naively wouldn't you expect the representation of a piece of text in terms of vision tokens to be roughly the same number of bits (or more) than the representation as textual token? You're changing representation sure, but that by itself doesn't give you any compute advantages unless there is some sparsity/compressability you can take advantage of in the domain you transform to right?

So I guess my question is where is the juice being squeezed from, why does the vision token representation end up being more efficient than text tokens.

looobay

3 months ago

1 reply

Vision tokens are a good compression medium because with one vision token you have one vector of N elements, but with textual tokens you have M vectors of N elements, because one vision token represent multiple pixels (and possibly multiple words). This is why its a good compression medium for compute.

It will never be as precise as textual tokens but it can be really good as they show in the paper.

krackers

3 months ago

1 reply

>with one vision token you have one vector of N elements, but with textual tokens you have M vectors of N elements

Each vision token represents a 16x16 patch, but to fully cover a word you might need multiple vision tokens. So assuming that the embedding size of the vision token and text token is the same `d` (which I think has to be the case for multimodal models), then wouldn't the fair comparison be `x * d` elements for a sentence in terms of vision tokens, and `y * d` for the same sentence in terms of text tokens? I don't see how you could see a priori that x << y (especially by a factor of 10 as quoted in the paper).

That said, if I do experimentally try this by shrinking this very comment down to the smallest font size I can read it at, then seeing how many 16x16 tokens it takes, you can fit more text than I expected in each "vision token". So I can maybe buy that x is at least not greater than y. But it can't be as simple as "each vision token can cover more text", since that only enables better compression if the encoder can actually uncover some sort of redundancy within each token. (And presumably the type of redundancy it uncovers probably isn't something that "classical" compression techniques can exploit, otherwise it seems like it would have been tried by now?).

looobay

3 months ago

You should read the 6th page of the paper (and page 5 for architecture breakdown), they show that they are compressing the vision tokens with convolution to keep a strong semantic understanding and keep a small amount of tokens.

But I think it's still experimentall.

imjonse

3 months ago

1 reply

I wonder if text written using chinese characters is more compatible with such vision centric compression than latin text.

Werkzeug

3 months ago

I think it's not the case. Chinese characters have the highest information entropy of all writing systems. However, Chinese characters are all independent symbols, which means if you want the LLM to support 5000 Chinese characters, you need to put 5000 characters into the lookup table (obviously there's no root, prefix, and suffix in Chinese, you cannot split the character into multiple reusable word pieces). As a result, you may need fewer characters to represent the same meaning compared to latin languages, but LLMs may also need to activate more token embeddings.

f33d5173

3 months ago

2 replies

Vision is how humans see text. So text must have built in adaptations to protect from visual noise. For example, two words that look similar must never appear in similar contexts, or else they would be conflated. Hence we can safely reduce such words to the same token. Or something like that.

fxtentacle

3 months ago

1 reply

That also works purely on text and it's the trick I used in my German speech recognition engine ( https://arxiv.org/abs/2206.12693 ).

"I'm studying at Oxford Univ" has basically no loss in meaning even though "University" was truncated to less than half its characters.

UltraSane

3 months ago

This is like how many CLIs accept the shortest unique version of commands.

ffsm8

3 months ago

Is that really factual/true?

Lots of words have multiple meanings and can mean different things even if used in the same sentence/context just from the interpretation of the person reading it.

Heck, it'd argue that most (not all) dayjob conflicts are down to such differences in interpretation /miscommunications

psb217

3 months ago

1 reply

The trick is that the vision tokens are continuous valued vectors, while the text tokens are elements from a small discrete set (which are converted into continuous valued vectors by a lookup table). So, vision tokens can convey significantly more bits per token than text tokens. This allows them to pack the content of multiple text tokens into a single vision token.

mapleshamrock

3 months ago

1 reply

Couldn't you do something like add a bidirectional encoder after your embedding look up table to compress your text into some smaller token-count semantic space before feeding your transformer blocks to get a similar effect, then?

psb217

3 months ago

Yes, you can get good compression of a long sequence of "base" text tokens into a shorter sequence of "meta" text tokens, where each meta token represents the information from multiple base tokens. But, grouping a fixed number of base tokens into each meta token isn't ideal, since that won't align neatly with sensible semantic boundaries, like words, phrases, sentences, etc. So, the trick is how decide which base tokens should be grouped into each meta token....

This sort of "dynamic chunking" of low-level information, perhaps down to the level of raw bytes, into shorter sequences of meta tokens for input to some big sequence processing model is an active area of research. Eg, one neat paper exploring this direction is: "Dynamic Chunking for End-to-End Hierarchical Sequence Modeling" [1], from one of the main guys behind Mamba and other major advances in state-space models.

[1] - https://arxiv.org/abs/2507.07955

HarHarVeryFunny

3 months ago

A text token generally represents a portion of a single word, while a vision token represents a portion of the entire page, which may include multiple words. This is where the "compression factor" comes from.

The number of bits to represent a text or vision token is the same, since they are both represented as embeddings of a fixed number of dimensions defined by the Transformer (maybe a few thousand for a large SOTA model).

Whether a vision token actually contains enough information to accurately extract (OCR) all the text data from that portion of the image is going to depend on how many pixels that vision token represents and how many words were present in that area of the image. It's just like considering images of the same page of text at different resolutions - a 1024x1024 image vs a 64x64 one, etc. As the resolution decreases so will OCR accuracy. At some point the resolution is insufficient and the words become a blurry mess and OCR accuracy suffers.

This is what DeepSeek are reporting - OCR accuracy if you try to use a single vision token to represent, say, 10 text tokens, vs 20 text tokens. The vision token may have enough resolution to represent 10 tokens well, but not enough for 20.

numpad0

3 months ago

just a hunch but like, from something to do with Unicode?

miki123211

3 months ago

7 replies

Text tokens are quantized and represent subword units, vision tokens only exist in the embedding space.

The way text tokenization works in LLMs is that you have a "lookup table" of (small) token ids to (large) vector embeddings. To pass text to the LLM, you split it at token boundaries, convert strings to token ids, and then construct the "context", a matrix where each row is a vector taken from that lookup table.

Transmitting text token sequences can be relatively efficient, you just transmit the token IDs themselves[1]. They're small integers (~100k possible token ids is typical for large models). Transmitting the actual embeddings matrix would be far less efficient, as embeddings often consist of thousands of floating point numbers.

Images are encoded differently. After some basic preprocessing, image data is passed straight to a neural- network-based image encoder. That encoder encodes the image into vectors, which are then appended to the context. There are no token ids, there's no lookup table, we go straight from image data to token embeddings.

This means transmitting image tokens cannot be done as efficiently, as you'd have to transmit the embeddings themselves. Even though an image is encoded in fewer tokens, the most efficient representation of those tokens takes more bytes.

You can think of a text token as an integer between 0 and n, which we know how to map to a vector. This means you have `n` possible choices of tokens. In contrast, an image token is an array of m floating point numbers (the vector itself), each of which can take on many possible values. This means the "token space" of vision tokens is actually much larger.

There's also the issue of patterns. Text tokens correspond directly to a contiguous span of UTF-8 bytes, and most tokenizers won't create tokens that span word boundaries. This means they can't encode global patterns efficiently. You can't have a "Hamlet's monologue" or "the text that follows is in Spanish" token.

rco8786

3 months ago

3 replies

Great explanation, thanks. I was surprised to hear that models still only work with ~100k tokens, but after giving it some thought it makes sense. There's only so many words/subword units that get used in any given language. The entropy comes from all the billions of different ways those subwords can be ordered.

freeqaz

3 months ago

1 reply

There is also a tradeoff between different vocabulary sizes (how many entries exist in the token -> embedding lookup table) that inform the current shape of tokenizers and LLMs. (Below is my semi-armchair stance, but you can read more in depth here[0][1].)

If you tokenized at the character level ('a' -> embedding) then your vocabulary size would be small, but you'd have more tokens required to represent most content. (And context scales non-linearly, iirc, like n^3) This would also be a bit more 'fuzzy' in terms of teaching the LLM to understand what a specific token should 'mean'. The letter 'a' appears in a _lot_ of different words, and it's more ambiguous for the LLM.

On the flip side: What if you had one entry in the tokenizer's vocabulary for each word that existed? Well, it'd be far more than the ~100k entries used by popular LLMs, and that has some computational tradeoffs like when you calculate the probability of each 'next' token via softmax, you'd have to run that for each token, as well as increasing the size of certain layers within the LLM (more memory + compute required for each token, basically).

Additionally, you run into a new problem: 'Rare Tokens'. Basically, if you have infinite tokens, you'll run into specific tokens that only appear a handful of times in the training data and the model is never able to fully imbue the tokens with enough meaning for them to _help_ the model during inference. (A specific example being somebody's username on the internet.)

Fun fact: These rare tokens, often called 'Glitch Tokens'[2], have been used for all sorts of shenanigans[3] as humans learn to break these models. (This is my interest in this as somebody who works in AI security)

As LLMs have improved, models have pushed towards the largest vocabulary they can get away with without hurting performance. This is about where my knowledge on the subject ends, but there have been many analyses done to try to compute the optimal vocabulary size. (See the links below)

One area that I have been spending a lot of time thinking about is what Tokenization looks like if we start trying to represent 'higher order' concepts without using human vocabulary for them. One example being: Tokenizing on LLVM bytecode (to represent code more 'densely' than UTF-8) or directly against the final layers of state in a small LLM (trying to use a small LLM to 'grok' the meaning and hoist it into a more dense, almost compressed latent space that the large LLM can understand).

It would be cool if Claude Code, when it's talking to the big, non-local model, was able to make an MCP call to a model running on your laptop to say 'hey, go through all of the code and give me the general vibe of each file, then append those tokens to the conversation'. It'd be a lot fewer tokens than just directly uploading all of the code, and it _feels_ like it would be better than uploading chunks of code based on regex like it does today...

This immediately makes the model's inner state (even more) opaque to outside analysis though. e.g., like why using gRPC as the protocol for your JavaScript front-end sucks: Humans can't debug it anymore without other tooling. JSON is verbose as hell, but it's simple and I can debug my REST API with just network inspector. I don't need access to the underlying Protobuf files to understand what each byte means in my gRPC messages. That's a nice property to have when reviewing my ChatGPT logs too :P

Exciting times!

0: https://www.rohan-paul.com/p/tutorial-balancing-vocabulary-s...

1: https://arxiv.org/html/2407.13623v1

2: https://en.wikipedia.org/wiki/Glitch_token

3: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...

rco8786

3 months ago

Again, super interesting thanks!

> One area that I have been spending a lot of time thinking about is what Tokenization looks like if we start trying to represent 'higher order' concepts without using human vocabulary for them. One example being: Tokenizing on LLVM bytecode (to represent code more 'densely' than UTF-8)

I've had similar ideas in the past. High level languages that humans write are designed for humans. What does an "LLM native" programming language look like? And, to your point about protobufs vs JSON, how does a human debug it when the LLM gets stuck?

> It would be cool if Claude Code, when it's talking to the big, non-local model, was able to make an MCP call to a model running on your laptop to say 'hey, go through all of the code and give me the general vibe of each file, then append those tokens to the conversation'. It'd be a lot fewer tokens than just directly uploading all of the code, and it _feels_ like it would be better than uploading chunks of code based on regex like it does today...

That's basically the strategy for Claude's new "Skills" feature, just in a more dynamic/AI driven way. Claude will do semantic search through YAML frontmatter to determine what skill might be useful in a given context, then load that entire skill file into context to execute it. Your idea here is similar, use a small local model to summarize each file (basically dynamically generate that YAML front matter), feed those into the larger model's context, and then it can choose which file(s) it cares about based on that.

jerf

3 months ago

1 reply

Textual language is really, really amazing if you sit down and think about what it does versus the resources it consumes to do it.

It's a common pasttime for programmers to claim that our textual programming languages are just terrible and need to be replaced somehow with something visual, but I think this very often comes from a place of not understanding just how amazing textual languages are. Not they couldn't possibly be improved by something in at least some domains, and there are after all some successful niches for visual languages, but I think if you set out to wholesale replace textual languages without an understanding of and appreciation for the impressive nature of the competition they offer you're setting yourself up to fail.

mbando

3 months ago

This also touches on the contrast between how human beings and LLM's trade compression for nuance. Human beings have enormous resources devoted to long-tailed distribution of information, for example in lexical items. Word distributions follow Zipf's Law, so like in the million word FROWN corpus, roughly half the words only occur one time. Like when's the last time you use the word chrysanthemum, or corpulent? But did you have any difficulty recognizing them? So while human beings have limited scale compared to machines, we do have an enormous capacity for nuanced, communication and conception.

Whereas LLM's make the opposite trade-off. There are information centric theory limitations on the amount of information LM's can store (roughly 3.6 bits per parameter) so they aggressively compress information and trade away nuance (https://arxiv.org/abs/2505.17117).

davidguetta

3 months ago

Theres almost an infinity of chess games possible from barely 32 pièces and simple moves.

Already at move 3 you have bilions of positions possible

ttul

3 months ago

1 reply

This is a great summary. If you think about it a bit, text is an expanded representation of concepts meant for display on a two-dimensional surface that can then be read back by human eyes; our brains convert the two-dimensional information into concepts again.

So to me it’s not a surprise that you can transform the two-dimensional representation of the same information into concepts again without losing much.

The paper talks about using this approach to generate large amounts of LLM training data rapidly. That’s intriguing. It suggests that one of the best ways of training models on a wide variety of input data with very long context is to provide it with an image representation instead of text tokens.

miki123211

3 months ago

Text is actually one-dimensional, writing is two-dimensional.

To a pure LLM, characters 15 and 16 at line 1 are considered adjacent, but there's no relationship between character 15 of line 1 and character 15 of line 2.

A vision model (which considers text as squiggles, not UTF8 codepoints), such a relationship does exist.

krackers

3 months ago

1 reply

Thank you, this makes sense! As [1] puts it pithily

>Image-patch tokens make better use of the high-dimensional embedding space than text tokens do.

That seems to imply it's not necessarily something unique about images, just a byproduct of having better conversion from "raw input -> embeddings" [2]. Although there is a certain elegance of handling both images and text with the same method.

[1] https://twitter.com/c0mbinat0r/status/1980698103234891892 [2] https://twitter.com/Kangwook_Lee/status/1980709454522744902

krackers

2 months ago

(Just noting that https://news.ycombinator.com/item?id=45652952 and the article therein are also worth reading)

jph00

3 months ago

Actually there are VAEs which use a codebook approach to creating discrete tokens instead of float vectors. There has been some success in that direction in diffusion models for instance.

storus

3 months ago

That's not really true, the latest autoregressive image models create a codebook of patches that are then encoded as tokens and image is assembled out of them.

lubesGordi

3 months ago

So in terms of OCR, does the neural network 'map' the words into an embedding directly, or is it getting a bunch of words like "Hamlet's monologue" and mapping that to an embedding? Basically what I'm asking is if the neural network image encoder is essentially doing OCR 'internally' when it is coming up with the embedding (if that makes any sense).

isaacfung

3 months ago

Some models use vector quantized variational autoencoders to discretize images into sequences of discrete symbols from a fixed codebook.

https://grok.com/share/bGVnYWN5LWNvcHk%3D_572b4955-6265-4210...

runeblaze

3 months ago

each text token is often subword unit, but in VLMs the visual tokens are in semantic space. Semantic space obviously compresses much more than subword slices.

disclaimer: not expert, on top of my head

HarHarVeryFunny

3 months ago

I don't know if there is any common practice among multi-modal input "LLM"s as to how they encode image inputs - convert them into "vision tokens", but it's basically going to come down to splitting the image into a grid of regions and encoding those.

I'm not sure there's any information theoretic intuition to be had with DeepSeek's experiments - it seems to be more about what's the lowest resolution image resolution/grid you can get away with and still capture enough image detail to be able to accurately perform OCR on it.

It'd be cool if Karpathy would extend his NanoChat to be multi-modal to spread the knowledge of how this is typically done.

hendersoon

3 months ago

Exactly right, the OCR isn't the interesting part. 10x context compression is potentially huge. (With caveats, at only ~97% accuracy, so not appropriate for everything.)

ssivark

3 months ago

Surely the appropriate ratio depends on the resolution of each character, relative to the size of the vision token patch? That is the only way the number of text tokens needed to describe the output of OCR can be independent of the resolution of the image (as it should).

bugglebeetle

3 months ago

2 replies

Looks great, but looking at the benchmark, can’t help but think about how crazy good dots-ocr is as a model. Too bad they’re not as open as the Deepseek team because its so crazy good and would love to know how it was trained.

rfoo

3 months ago

1 reply

If you look you'd notice that it's the same Haoran Wei behind DeepSeek-OCR and GOT-OCR2.0 :p

bugglebeetle

3 months ago

Oh you’re right! Good catch!

bethekind

3 months ago

Did we read the same graph? DeepSeek Gundam 200 dpi appeared to get similar perf as dots-ocr, but with less tokens needed. The x axis is inverted, descending with distance from the origin.

k_sze

3 months ago

1 reply

It's interesting how they use "Gundam" in their variant names. I gather that Gundam-M and Gundam are their most powerful ones.

daemonologist

3 months ago

I think maybe to distinguish their dynamic resolution approach from the t-shirt sizes, which have a fixed input. (Although I don't know why "Gundam")

yoran

3 months ago

7 replies

How does an LLM approach to OCR compare to say Azure AI Document Intelligence (https://learn.microsoft.com/en-us/azure/ai-services/document...) or Google's Vision API (https://cloud.google.com/vision?hl=en)?

ozgune

3 months ago

2 replies

OmniAI has a benchmark that companies LLMs to cloud OCR services.

https://getomni.ai/blog/ocr-benchmark (Feb 2025)

Please note that LLMs progressed at a rapid pace since Feb. We see much better results with the Qwen3-VL family, particularly Qwen3-VL-235B-A22B-Instruct for our use-case.

CaptainOfCoit

3 months ago

Magistral-Small-2509 is pretty neat as well for its size, has reasoning + multimodality, which helps in some cases where context isn't immediately clear, or there are few missing spots.

cheema33

3 months ago

Omni OCR team says that according to their own benchmark, the best OCR is the Omni OCR. I am quite surprised.

numpad0

3 months ago

1 reply

Classical OCR still probably make undesirable su6stıtutìons in CJK from there being far too many of similar ones, even some absurd ones that are only distinguishable under microscope or by looking at binary representations. LLMs are better constrained to valid sequences of characters, and so they would be more accurate.

Or at least that kind of thing would motivate them to re-implement OCR with LLM.

fluoridation

3 months ago

1 reply

Huh... Would it work to have some kind of error checking model that corrected common OCR errors? That seems like it should be relatively easy.

colonCapitalDee

3 months ago

1 reply

It's harder then it first seems. The root problem is that for text like "hallo", correcting to "hello" may be fixing an error or introducing an error. In general, the more aggressive your error correction, the more errors you inadvertently introduce. You can try and make a judgement based on context ("hallo, how are you?"), which certainly helps, but it's only a mitigation. Light error correction is common and effective, but you can't push it to a full solution. The only way to fully solve this problem is to look at the entire document at once so you have maximum context available, and this is what non-traditional OCR attempts to do.

fluoridation

3 months ago

Okay, but there way more common errors that should be easy to fix. "He11o", "Emest Herningway", incorrect diacritics like the other person mentioned, etc.

stopyellingatme

3 months ago

Not sure about the others but we use Azure AI Document Intelligence and its working well for our resume parsing system. Took a good bit of tuning but we havent had to touch it for almost a year now.

make3

3 months ago

aren't all of these multimodal LLM approaches, just open vs closed ones

sandblast

3 months ago

Not sure why you're being downvoted, I'm also curious.

junto

3 months ago

Not sure how it compares but we did some trials with Azure AI Document Intelligence and were very surprised at how good it was. We had a document example which was a poor photograph of a document that had quite a skew, and it (too our surprise), also detected the customer’s human legible signature and extracted their name from that signature.

daemonologist

3 months ago

My base expectation is that the proprietary OCR models will continue to win on real-world documents, and my guess is that this is because they have access to a lot of good private training data. These public models are trained on arxiv and e-books and stuff, which doesn't necessarily translate to typical business documents.

As mentioned though, the LLMs are usually better at avoiding character substitutions, but worse at consistency across the entire page. (Just like a non-OCR LLM, they can and will go completely off the rails.)

CloseChoice

3 months ago

2 replies

It's deepseek so one can expect an open-source license but for anyone (like me) who wants to see that explicitly, since it's not obvious in the GitHub repo: https://huggingface.co/deepseek-ai/DeepSeek-OCR/blob/main/LI...

TLDR: It's MIT licensed

AndroTux

3 months ago

> since it's not obvious in the GitHub repo

Literally says MIT license on the right sidebar and in the readme tab and in the file called LICENSE

maxloh

3 months ago

Model weights are MIT too: https://huggingface.co/deepseek-ai/DeepSeek-OCR/blob/main/LI...

x______________

3 months ago

4 replies

  >先天下之忧而忧

How is this an example of a prompt?

Google translated this to "Worry about the world first" while Bing says "Worry before the worries of the world."

Can anyone shed some light on this saying or why it's in the article?

raincole

3 months ago

1 reply

It's a very famous (classical) Chinese phrase.

Both translations don't catch the meaning well though. It means: "worry before the rest of the world (notice that they have something to) worry." The next part is 後天下之樂而樂("be happy only after the rest of the world is happy.")

I don't know why it's a prompt example.

jdthedisciple

3 months ago

2 replies

Sibling comment has the second part as

后天下之乐而乐

which one is correct?

Y_Y

3 months ago

1 reply

It depends on who you think is the rightful successor to the Qing dynasty

emptyhandeddev

3 months ago

Wrong. It merely depends on whether the local policy maker before computer age prioritize reducing illiteracy and convenience over other considerations.

Macau, HK and Taiwan uses traditional Chinese character.

Mainland China, Singapore, Malaysia use simplified Chinese character.

Japan uses its own version, some simplified, some traditional, and also invented over 100 Japanese-made-Kanji following the same logic how Chinese characters are formed.

As a matter of fact, simplification of Chinese characters started when KMT/Republic of China was in control of the whole China. Politics gets in the way later and RoC stopped this simplification process while PRC kept it going, Macau & HK were not involved since the Portuguese and British colonial government doesn't care. Singapore and Malaysia pick the simplified version out of convenience.

raincole

3 months ago

Traditional vs Simplified Chinese.

There are two (modern) "spellings" of written Chinese. Basically colour vs color.

gudzpoz

3 months ago

1 reply

This clause is usually used together with the next sentence in the original poem:

> 先天下之忧而忧，后天下之乐而乐

> (put the world's worries before yours, and put your happiness after the world's) > edit: this translation is wrong, and raincole has a definitely better translation

Since the model is a language model, they probably use this to demonstrate the model's language capabilities – the model should be able to complete the whole sentence pair. The paper also mentions this:

> To ensure the model’s language capabilities, we introduced 10% of in-house text-only pretrain data.

So I believe it is just a text-only demonstration.

jdthedisciple

3 months ago

1 reply

Sibling comment has the second part as

後天下之樂而樂

Which one is correct?

numpad0

3 months ago

2 replies

  a) 后天下之乐而乐
  b) 後天下之樂而樂
  c) 後天下之楽而楽

a) is clearly Simplified Chinese from a sibling comment, b) is Traditional copied from your comment, and c) is as I just typed in my own language. Unicode Hanzi/Kanji are a mess and there are characters same or different, in appearance or in binary, depending on intended variants, languages, fonts, systems, keyboard, distance between Earth and Alpha Centauri, etc.

jdthedisciple

3 months ago

2 replies

Fascinating! That's exactly why I asked, so thank you.

Do people usually recognize all variants as valid and legible? Or does any particular set of letters/symbols prevail in practice?

numpad0

3 months ago

Traditional kinds are usually recognizable, but I'd be unsure or straight up wrong about most Simplified versions. Overall proportions and small details often feel "wrong" for both as well due to cultures converging at different points.

hank2000

3 months ago

Very location dependent. But when you learn to write the characters you understand the variants differently. They look like random strokes to an untrained eye. But they’re not. I’m not sure if that makes sense.

Take a lowercase a in English for example. This font writes it differently than a child. Or in cursive. Or probably than you would write it. But you recognize all of them and don’t really think about it.

emptyhandeddev

3 months ago

a) Simplified Chinese

b) Traditional Chinese

c) 楽 is a variation of 樂, which is now widely used in Japanese Kanji but deprecated in Traditional Chinese.

Note:

A variation means some people write 樂 as 楽 in ancient China, but not widely adopted.

Kanji is a Japanese word, means "Chinese Character".

fspeech

3 months ago

Google is closer. This is from a famous essay expressing tbe author's desire to bear the burden for the world. Essay is 岳阳楼记 by 范仲淹 in year 1046 https://zh.wikisource.org/zh-hans/%E5%B2%B3%E9%99%BD%E6%A8%9...

SequoiaHope

3 months ago

Ask a language model - ChatGPT says it’s a line from a famous poem “Memorial to Yueyang Tower” which expresses the Confucian ideal of selfless concern for people and society.

ellisd

3 months ago

5 replies

The paper makes no mention of Anna’s Archive. I wouldn’t be surprised if DeepSeek took advantage of Anna’s offer granting OCR researchers access to their 7.5 million (350 TB) Chinese non-fiction collection ... which is bigger than Library Genesis.

https://annas-archive.org/blog/duxiu-exclusive.html

ikamm

3 months ago

2 replies

Why do they need to grant access for people to use copies of books they don’t own?

JohnLocke4

3 months ago

1 reply

Not to rationalize it, but it appears that they're gatekeeping the dataset to get access to the OCR-scans from the people they choose to share it with. This is to improve their existing service by making the content of books (and not just their title/tags) searchable.

As per the blog post: >What does Anna’s Archive get out of it? Full-text search of the books for its users.

ikamm

3 months ago

Fair enough, it just seems like they're painting an even bigger target on their backs by restricting access to copyrighted material they don't own the rights to

est

3 months ago

> The books from Duxiu have long been pirated on the Chinese internet. Usually they are being sold for less than a dollar by resellers. They are typically distributed using the Chinese equivalent of Google Drive, which has often been hacked to allow for more storage space

Ownership laundering.

dev1ycan

3 months ago

2 replies

Oh great so now Anna's archive will get taken down as well by another trash LLM provider abusing repositories that students and researchers use, META torrenting 70TB from library genesis wasn't enough

sigmoid10

3 months ago

1 reply

Seems like they are doing fine:

https://open-slum.org

dev1ycan

3 months ago

Yeah, for now, Meta torrented 70TB and right after that they cut the rope for everyone else, mysteriously their hitman (US govenrment) hit both Libgen and Z-Lib shortly after.

c0balt

3 months ago

It appears this is an active offer from Anna's archive, so presumably they can handle the load and are able to satisfy the request safely.

singularfutur

3 months ago

Yes it means they will never release their dataset :(

throawayonthe

3 months ago

hahaha also immediately thought of this, wonder when the ocr'd dataset would be getting released

bluecoconut

3 months ago

Previous paper from DeepSeek has mentioned Anna’s Archive.

> We cleaned 860K English and 180K Chinese e-books from Anna’s Archive (Anna’s Archive, 2024) alongside millions of K-12 education exam questions. https://arxiv.org/abs/2403.05525 DeepSeek-VL paper

mrasong

3 months ago

1 reply

Kinda reminds me of PaddleOCR.

Would be awesome if DeepSeek OCR could be integrated into a mobile app someday. That’d make OCR way more convenient!

pzo

3 months ago

iOS already have on device both text detector and document scanner in apple Vision API. Hard to say how good are they compared to LLM based solutions. Similarly google had MLKit with OCR working on devices also for many years.

pietz

3 months ago

22 replies

My impression is that OCR is basically solved at this point.

The OmniAI benchmark that's also referenced here wasn't updated with new models since February 2025. I assume that's because general purpose LLMs have gotten better at OCR than their own OCR product.

I've been able to solve a broad range of OCR tasks by simply sending each page as an image to Gemini 2.5 Flash Lite and asking it nicely to extract the content in Markdown under some additional formatting instructions. That will cost you around $0.20 for 1000 pages in batch mode and the results have been great.

I'd be interested to hear where OCR still struggles today.

kbumsik

3 months ago

2 replies

> My impression is that OCR is basically solved at this point.

Not really in practice to me. Especially they still struggle with Table format detection.

coulix

3 months ago

2 replies

This.

Any complex parent table span cell relationship still has low accuracy.

Try the reverse, take a complex picture table and ask Chatgpt5, claude Opus 3.1, Gemini Pro 2.5 to produce a HTML table.

They will fail.

bobsmooth

3 months ago

1 reply

Maybe I misunderstood the assignment but it seems to work for me.

https://chatgpt.com/share/68f5f9ba-d448-8005-86d2-c3fbae028b...

Edit: Just caught a mistake, transcribed one of the prices incorrectly.

kbumsik

3 months ago

Right, I wouldn't use full table detection to VLM model because they tend to mistake with numbers in table...

pietz

3 months ago

2 replies

Maybe my imagination is limited or our documents aren't complex enough, but are we talking about realistic written documents? I'm sure you can take a screenshot of a very complex spreadsheet and it fails, but in that case you already have the data in structured form anyway, no?

kbumsik

3 months ago

1 reply

> realistic written documents?

Just get a DEF 14A (Annual meeting) filing of a company from SEC EDGAR.

I have seen so many mistakes when looking at the result closely.

Here is a DEF 14A filing from Salseforce. You can print it to a PDF and then try converting.

https://www.sec.gov/Archives/edgar/data/1108524/000110852425...

grosswait

3 months ago

1 reply

Historical filings are still a problem, but hasn’t the SEC required filing in an XML format since the end of 2024?

richardlblair

3 months ago

1 reply

It's not really about SEC filings, though. While we folks on HN would never think of hard copies of invoices, but much of the world still operates this way.

As mentioned above I have about 200 construction invoices. They are all formatted in a way that doesn't make sense. Most fail both OCR and OpenAI

KoolKat23

3 months ago

OpenAI has unusuably low image DPI. Try Gemini.

daemonologist

3 months ago

1 reply

Now if someone mails or faxes you that spreadsheet? You're screwed.

Spreadsheets are not the biggest problem though, as they have a reliable 2-dimensional grid - at worst some cells will be combined. The form layouts and n-dimensional table structures you can find on medical and insurance documents are truly unhinged. I've seen documents that I struggled to interpret.

KoolKat23

3 months ago

To be fair, this is problematic for humans too. My old insurer outright rejected things like that stating it's not legible.

(I imagine it also had the benefit of reducing fraud/errors).

In this day and age, it's probably easier/better to change the process around that as there's little excuse for such shit quality input. I understand this isn't always possible though.

richardlblair

3 months ago

I had mentioned this when the new QWEN model dropped - I have a stack of construction invoices that fail through both OCR and OpenAI.

It's a hard (and very interesting) problem space.

carschno

3 months ago

3 replies

Technically not OCR, but HTR (hand-written text/transcript recognition) is still difficult. LLMs have increased accuracy, but their mistakes are very hard to identify because they just 'hallucinate' text they cannot digitize.

sramam

3 months ago

1 reply

Interesting - have you tried sending the image and 'hallucinated' text together to a review LLM to fix mistakes?

I don't have a use case of 100s or 1000s of hand-written notes have to be transcribed. I have only done this with whiteboard discussion snapshots and it has worked really well.

lazide

3 months ago

Often, the review LLM will also say everything is okay when it’s made up too.

pietz

3 months ago

1 reply

We ran a small experiment internally on this and it looked like Gemini is better at handwriting recognition than I am. After seeing what it parsed, I was like "oh yeah, that's right". I do agree that instead of saying "Sorry, I can't read that" it just made up something.

CraigRood

3 months ago

1 reply

I have a thought that whilst LLM providers can say "Sorry" - there is little incentive and it will expose the reality that they are not very accurate, nor can be properly measured. That said, there clearly are use cases where if the LLM can't a certain level of confidence it should refer to the user, rather than guessing.

Rudybega

3 months ago

This is actively being worked on my pretty much every major provider. It was the subject of that recent OpenAI paper on hallucinations. It's mostly caused by benchmarks that reward correct answers, but don't penalize bad answers more than simply not answering.

E.g.

Most current benchmarks have a scoring scheme of

1 - Correct Answer 0 - No answer or incorrect answer

But what they need is something more like

1 - Correct Answer 0.25 - No answer 0 - Incorrect answer

You need benchmarks (particularly those used in training) to incentivize the models to acknowledge when they're uncertain.

mormegil

3 months ago

This. I am reading old vital records in my family genealogy quest, and as those are sometimes really difficult to read, I turned to LLMs, hearing they are great in OCR. It’s been… terrible. The LLM will transcribe the record without problems, the output seems completely correct, a typical text of a vital record. Just… the transcribed text has nothing to do with my specific record. On the other hand, transkribus.eu has been fairly usable for old vital record transcription – even though the transcribed text is far from perfect, many letters and words are recognized incorrectly, it helps me a lot with the more difficult records.

raincole

3 months ago

4 replies

If you can accept that the machine just make up what it doesn't recognize instead of saying "I don't know," then yes it's solved.

(I'm not being snarky. It's acceptable in some cases.)

jakewins

3 months ago

2 replies

But this was very much the case with existing OCR software as well? I guess the LLMs will end up making up plausible looking text instead of text riddled with errors, which makes it much harder to catch the mistakes, in fairness

wahnfrieden

3 months ago

3 replies

Existing ocr doesn’t skip over entire (legible) paragraphs or hallucinate entire sentences

KoolKat23

3 months ago

This must be some older/smaller model.

Davidzheng

3 months ago

rarely happens to me using LLMs to transcribe pdfs

criddell

3 months ago

I usually run the image(s) through more than one converter then compare the results. They all have problems, but the parts they agree on are usually correct.

rkagerer

3 months ago

Good libraries gave results with embedded confidence levels for each unit recognized.

wahnfrieden

3 months ago

2 replies

Do any LLM OCRs give bounding boxes anyway? Per character and per block.

dajonker

3 months ago

Try MinerU 2.5 with two-step parsing. It gives good results with bounding boxes per block. Not sure if you can get it to do more detailed such as word or character level.

kelvinjps10

3 months ago

Gemini does but it's not as good as Google vision, and the format it's différent Here it's the documentation https://cloud.google.com/vertex-ai/generative-ai/docs/boundi...

Also Simon Willison Made a blog post that might be helpful https://simonwillison.net/2024/Aug/26/gemini-bounding-box-vi...

I hope that this capability improves so I can use only Gemini API.

red75prime

3 months ago

Just checked it with Gemini 2.5 Flash. Instructing it to mark low-confidence words seems to work OK(ish).

KoolKat23

3 months ago

These days it does just that, it'll say null or whatever if you give it the option. When it does make it up, it tends to be limitation of the image qualify ( max dpi).

Blotchy text and specific typeface make 6's look like 8's, even to the non-discerning eye, a human would think it's an 8, zoom in and see it's a 6.

Google's image quality on uploads is still streets ahead of openai for instance btw.

peter-m80

3 months ago

3 replies

No way it's solved. try to make OCR over a magazine with creative layouts. Not possible. I have a collection of vintage computer magazines and from time to time I try to OCR them whith the state of the art mechanisms. All of them requiere a lot of human intervention

jmkni

3 months ago

1 reply

do you have an example of a particularly tricky one?

ekianjo

3 months ago

Just try old ads you will see how hard it gets

pietz

3 months ago

Could you provide an example that fails? I'm interested in this.

constantinum

3 months ago

I use LLMWhisperer[1] for OCR'ing old magazine ads. It preserves the layout and context. Example > https://postimg.cc/ts3vT7kG

https://pg.llmwhisperer.unstract.com/

cahaya

3 months ago

2 replies

Lot's of OCR/ LLM's (Even Gemini Pro 2.5) still struggle converting complex tables to markdown or HTML: Tables with multiple headers and merged cells that get mixed up, multiple columns with tick boxes get mixed up, multi page tables that are not understood correctly. Also Llamaindex fails miserably on those things.

Curious to hear which OCR/ LLM excels with these specific issues? Example complex table: https://cdn.aviation.bot/complex-tables.zip

I can only parse this table correctly by first parsing the table headers manually into HTML as example output. However, it still mixes up tick boxes. Full table examples: https://www.easa.europa.eu/en/icao-compliance-checklist

CaptainOfCoit

3 months ago

3 replies

> Lot's of OCR/ LLM's (Even Gemini Pro 2.5) still struggle converting complex tables to markdown or HTML:

But that's something else, that's no longer just OCR ("Optical Character Recognition"). If the goal suddenly changes from "Can take letters in images and make into digital text" to "Can replicate anything seen on a screen", the problem-space gets too big.

For those images you have, I'd use something like Magistral + Structured Outputs instead, first pass figure out what's the right structure to parse into, second pass to actually fetch and structure the data.

kmacdough

3 months ago

1 reply

> But that's something else, that's no longer just OCR ("Optical Character Recognition").

Lines often blur for technologies under such rapid evolution. Not sure it's helpful to nitpick the verbal semantics.

It is a fair question whether the OCR-inspired approach is the correct approach for more complex structured documents where wider context may be important. But saying it's "not OCR" doesn't seem meaningful from a technical perspective. It's an extension of the same goal to convert images of documents into the most accurate and useful digitized form with the least manual intervention.

CaptainOfCoit

3 months ago

Personally I think it's a meaningful distinction between "Can extract text" VS "Can extract text and structure". It is true that some OCR systems can handle trying to replicate the structure, but still today I think that's the exception, not the norm.

Not to mention it's helpful to separate the two because there is such a big difference in the difficulty of the tasks.

kmacdough

3 months ago

> But that's something else, that's no longer just OCR ("Optical Character Recognition").

Lines often blur for technologies under such rapid evolution. Not sure it's helpful to nitpick the verbal semantics.

It is a fair question whether the OCR-inspired approach is the correct approach for more complex structured documents. But saying it's "not OCR" do doesn't seem meaningful from a technical perspective.

eeixlk

3 months ago

htceaad t nofdnsy lyruuieo sieerrr t owcope?

pietz

3 months ago

I threw in the first image/table into Gemini 2.5 Pro letting it choose the output format and it looks like it extracted the data just fine. It decided to represent the checkboxes as "checked" and "unchecked" because I didn't specify preferences.

baobun

3 months ago

Chinese, especially handwritten.

sbinnee

3 months ago

Maybe for English. Other languages are very much not solved.

darkwater

3 months ago

So, the mug with inspirational text says "Bountiful Potential"?

llm_nerd

3 months ago

Complex documents is where OCR struggles mightily. If you have a simple document with paragraphs of text, sure OCR is pretty solved. If you have a complex layout with figures and graphs and supporting images and asides and captions and so on (basically any paper, or even trade documents), it absolutely falls apart.

And GP LLMs are heinous at OCR. If you are having success with FL, your documents must be incredibly simple.

There has been enormous advances in OCR over the past 6 months, so the SoTa is a moving, rapidly advancing target.

constantinum

3 months ago

Why PDF parsing is Hell[1]:

Fixed layout and lack of semantic structure in PDFs.

Non-linear text flow due to columns, sidebars, or images.

Position-based text without contextual or relational markers.

Absence of standard structure tags (like in HTML).

Scanned or image-based PDFs requiring OCR.

Preprocessing needs for scanned PDFs (noise, rotation, skew).

Extracting tables from unstructured or visually complex layouts.

Multi-column and fancy layouts breaking semantic text order.

Background images and watermarks interfering with text extraction.

Handwritten text recognition challenges.

[1] https://unstract.com/blog/pdf-hell-and-practical-rag-applica...

robotswantdata

3 months ago

VLLMs suck at complex layouts and there is a high risk of hallucination. Never use alone for contracts or health data.

Gazoche

3 months ago

There is no "solved" in computer vision, there is only "good enough" and what constitutes "good enough" depends on your problem domain.

Take an OCR model with 99.9% character-wise accuracy. Sounds pretty good, right? Well if your use case is, say, digitalizing old printed novels, then yeah it's probably good enough.

But what if your documents are personal records with millions of names, to insert in some administrative database? Now 1 out of 1000 persons will have their name misspelled. Ooops.

kelvinjps10

3 months ago

Google vision it's still better than Gemini at OCR, for example at getting bounding boxes.

Davidzheng

3 months ago

I think it'll be good to have an end-to-end pdf to latex converter for old math papers. For commutative diagrams almost all models still struggle. especially very complicated commutative diagrams.

simlevesque

3 months ago

> That will cost you around $0.20 for 1000 pages in batch mode and the results have been great.

Can you explain more about your setup ? I have a quarter million pages I want to OCR.

burpsnard

3 months ago

I've only used tesseract, 'recreationally', but i tried generating images of random chars to see what resolution/contrast/noise was minimally recognisable; shocked at how bad it was. heavily relies on language models of character sequences, pretty useless On 'line noise'

vintermann

3 months ago

OCR of printed text may be one thing, but handwriting OCR (a.k.a HTR) is very, very far from solved. It's actually hard to find a practical task general historical HTR is good enough to do usefully, even for state of the art models.

blindriver

3 months ago

I attempted OCR using all of the open source models available about 3 months ago, including Llama 4. These were pngs of text using a regular font. Most produced garbage except Llama 4, and even then it was only about 90% accurate. Using OpenAI or Gemini produced much better results but the open source models were really bad.

themanmaran

3 months ago

> OmniAI benchmark that's also referenced here wasn't updated with new models since February 2025. I assume that's because general purpose LLMs have gotten better at OCR than their own OCR product.

Benchmark author here. No, just pivoted away from OCR API as a product! Still use our API internally but have been lazy about updating benchmarks.

Gemini is definitely the best model for OCR. But it has a really high rate of "recitation" errors. Where it will determine the output token is too close to its training data and cut it off. Something like 10% of the time from our testing. Also it has this hilarious hallucination when you have a blank page in the document mix and it just makes up new info.

OpenAI is OK. GPT5 wasn't any better than 4o or 4.1. Main issues were: dropping content like headers/footers, loses it's mind on sideways pages, and will frequently refuse to read things like ID documents, health care forms, or things it judges to have too much PII.

KoolKat23

3 months ago

I agree, Gemini 2.5 models are excellent.

The fuss around old fashioned OCR seemed strange to me initially considering the above, but I selfishly forgot to consider addressing compute/offline requirements.

It would also be nice for there to be a good competitor.

cle

3 months ago

That will not work with many of the world's most important documents because of information density. For example, dense tables or tables with lots of row/col spans, or complex forms with checkboxess, complex real-world formatting and features like strikethroughs, etc.

To solve this generally you need to chunk not by page, but by semantic chunks that don't exceed the information density threshold of the model, given the task.

This is not a trivial problem at all. And sometimes there is no naive way to chunk documents so that every element can fit within the information density limit. A really simple example is a table that spans hundreds pages. Solving that generally is an open problem.

6gvONxR4sf7o

3 months ago

OCR for printed documents is super robust, but handwriting, low res, and aligned recognition (not just image to "hello world" but also having "h is here in space e is here in space...) are all still well behind "basically solved."

empressplay

3 months ago

This could be great for extracting text from old magazines; traditional OCR gives you a bit of a mess you have to clean up, but this looks like it can properly identify columns and track the flow accurately (and extract images!) It appears it can convert magazine layouts to markdown too

brightUiso

3 months ago

Please a bit of education, what does it do?

84 more comments available on Hacker News

View full discussion on Hacker News

ID: 45640594Type: storyLast synced: 11/26/2025, 1:00:33 PM

Want the full context?

Jump to the original sources

Read the primary article or dive into the live Hacker News thread when you're ready.

Open link View on HN