Scribeocr – Web Interface for Recognizing Text, Ocr, & Creating Digitized Docs

Posted 3 months ago · Active 3 months ago

atomicnature

114 points

18 comments

github.com

Key topics

Ocr

Text Recognition

Document Digitization

ScribeOCR is a web interface for OCR and document digitization that has garnered interest and discussion on its capabilities and limitations, with users sharing their experiences and suggestions for improvement.

Snapshot generated from the HN discussion

Key moments

01 Story posted: Oct 6, 2025 at 6:39 AM EDT
02 First comment: Oct 9, 2025 at 11:27 PM EDT (4d after posting)
03 Peak activity: 9 comments in Days 3-4 (hottest window of the conversation)
04 Latest activity: Oct 27, 2025 at 8:25 AM EDT

Discussion (18 comments)

aidenn0

3 months ago

2 replies

This is my first encounter with Scribe.js; since I have many book scans, I always try OCRing them when I see something like this. Compared to Tesseract (which is the best I have so far), it gets the words right slightly more often, but the paragraph segmentation is many times worse. On a book where every paragraph is indented, it reliably decides that two consecutive one-line paragraphs are the same paragraph, which is understandable, but a downgrade from Tesseract, which gets the paragraph segmentation as correct as possible. (It doesn't handle paragraphs that span page breaks, since I'm feeding it one page at a time.)
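As an editorial illustration of the indentation cue described above (not Scribe.js's or Tesseract's actual algorithm), here is a minimal sketch that starts a new paragraph at every indented line. The `Line` type, the `split_paragraphs` function, and the `indent_threshold` value are all hypothetical names and numbers chosen for the example; it assumes each OCR line comes with the x-coordinate of its left edge.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    left: float  # x-coordinate of the line's left edge, in pixels (assumed input)


def split_paragraphs(lines, indent_threshold=10.0):
    """Group lines into paragraphs, starting a new one at each indented line.

    A line counts as indented when its left edge sits more than
    `indent_threshold` pixels to the right of the leftmost line on the page.
    """
    if not lines:
        return []
    margin = min(line.left for line in lines)
    paragraphs = []
    for line in lines:
        indented = (line.left - margin) > indent_threshold
        if indented or not paragraphs:
            # An indented line (or the first line on the page) opens a paragraph.
            paragraphs.append([line.text])
        else:
            # A flush-left line continues the current paragraph.
            paragraphs[-1].append(line.text)
    return [" ".join(p) for p in paragraphs]
```

With this rule, two consecutive one-line paragraphs stay separate because each of them is indented. It would still fail on pages where every line is indented or where the page starts mid-paragraph.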

Elucalidavah

3 months ago

1 reply

> Tesseract (which is the best I have so far)

Have you looked at EasyOCR?

aidenn0

3 months ago

1 reply

EasyOCR is significantly worse than Tesseract for clean printed text, while being orders of magnitude slower; it is far better than Tesseract for low-quality scans and for extracting text from pictures (e.g. comics), which Tesseract does not do as well.

criddell

3 months ago

1 reply

Have you tried Abbyy FineReader? It's the best OCR package I've seen.

aidenn0

3 months ago

It doesn't seem to have a Linux version; I don't have a mac or windows machine.

zihotki

3 months ago

2 replies

Scribe is Tesseract. It uses tesseract.js, which is a WebAssembly port of Tesseract. So in theory they should be equal. In practice, custom settings or older versions could make a difference.

criddell

3 months ago

1 reply

What's the motivation for doing this in the browser? It seems like intentionally choosing a more difficult path to create an inferior result.

A native macOS or Windows application could use the OCR facilities of the operating system and, in my experience, both produce results that are far better than Tesseract.

Zardoz84

3 months ago

To generate the OCR on the fly, in the browser, when you do not have the proper OCR info. As someone who works on public web libraries, I find it useful (but wasteful).

aidenn0

3 months ago

This is only true in the "speed" mode; in the "quality" mode it claims better word recognition than Tesseract on clean scans (which matches my tests): https://github.com/scribeocr/scribe.js/blob/master/docs/scri...

zihotki

3 months ago

2 replies

According to what I read in the documentation, it uses Tesseract underneath. I've used Tesseract v3 in the past and it was a pain. Tesseract 4 uses an LSTM neural net. How good are the performance and recognition quality nowadays in v4? Could anyone share their experience?

graynk

3 months ago

1 reply

I use paperless-ngx for digitizing all my documents; it also uses Tesseract. The result is not perfect, but more than acceptable if I scan at 600 dpi.

oigursh

3 months ago

1 reply

There's https://github.com/icereed/paperless-gpt as a plugin

graynk

3 months ago

I've found local LLMs to be not good enough for OCR (while being a lot more resource hungry), and I want to avoid OpenAI models for privacy reasons. Default Tesseract does the job for me, since my only requirement for the results is "I can easily find what I need with full-text search". I rarely need to actually copy the text from the resulting PDFs.

btian

3 months ago

it's fine for simple use cases, but far inferior to the likes of GPT, Gemini or Mistral

fodkodrasz

3 months ago

I really like the idea, but unfortunately it could not cope with my use case.

I have some lecture slides as image-only PDFs (Hungarian language with a sprinkle of English and Latin (biology)). I tried the tool on them and had the following experience:

- Proofreading with the overlay seems like a good idea, but it is actually unusable when the original text has colors and you need to recognize diacritic marks. Being able to show the original in grayscale or black & white could help. (BW worked, but grayscale left everything colored.)

- For proofreading, the ebook mode was the most useful; I immediately spotted lots of errors that I could not see with the overlay. A quick switch between the modes would be useful.

- Editing text is not efficient when the error rate is high (Hungarian is not supported, which I guess caused most of it); the interface has high overhead for mass corrections.

Very good idea; I think after a little polish it would even fit my use case. For more traditional OCR use cases than mine it is probably already great.

ranger_danger

3 months ago

This is awesome. Only issue was I had to disable my JShelter extension because it would freeze the page using 100% CPU forever.

Zardoz84

3 months ago

If it would generate ALTO XML files... IF!
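For readers unfamiliar with the format, ALTO XML stores recognized text together with its position on the page. As an editorial illustration only, the following is a minimal sketch of the general shape of such a file, built with Python's standard library from a list of word boxes. The `words_to_alto` function and its tuple layout are hypothetical, and the output is not validated against the ALTO schema; a real file needs additional required metadata.

```python
import xml.etree.ElementTree as ET

# Namespace of ALTO version 4.
ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"


def words_to_alto(words, page_width, page_height):
    """Build a skeletal ALTO document for a single text line.

    `words` is a list of (text, x, y, width, height) tuples (assumed input).
    """
    ET.register_namespace("", ALTO_NS)

    def tag(name):
        return f"{{{ALTO_NS}}}{name}"

    root = ET.Element(tag("alto"))
    layout = ET.SubElement(root, tag("Layout"))
    page = ET.SubElement(
        layout, tag("Page"),
        ID="page_1", PHYSICAL_IMG_NR="1",
        WIDTH=str(page_width), HEIGHT=str(page_height),
    )
    space = ET.SubElement(page, tag("PrintSpace"))
    block = ET.SubElement(space, tag("TextBlock"), ID="block_1")
    line = ET.SubElement(block, tag("TextLine"), ID="line_1")
    for text, x, y, w, h in words:
        # Each recognized word becomes a String element with its bounding box.
        ET.SubElement(
            line, tag("String"),
            CONTENT=text, HPOS=str(x), VPOS=str(y),
            WIDTH=str(w), HEIGHT=str(h),
        )
    return ET.tostring(root, encoding="unicode")
```

Calling `words_to_alto([("Hello", 10, 20, 50, 12)], 600, 800)` returns the XML as a string with one `String` element inside `alto/Layout/Page/PrintSpace/TextBlock/TextLine`.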

constantinum

3 months ago

Anyone looking for an OCR tool or text pre-processor that maintains the layout (tables, forms), try LLMWhisperer: https://pg.llmwhisperer.unstract.com/

ID: 45489881 · Type: story · Last synced: 11/20/2025, 8:42:02 PM
