Questions tagged [tesseract-ocr]

Tesseract is an optical character recognition (OCR) engine.

Tesseract works on Linux, Windows (with VC++ Express or Cygwin), and Mac OS X.

Most of the work on Tesseract is sponsored by Google.

ReadMe - Installation and usage information.
Compiling - How to build Tesseract on a variety of platforms.
FAQ - Common questions and problems. Please check before filing a bug or consulting the forum.
Too many errors? - See the guidance on getting the best out of Tesseract.

48 questions

3 answers

OCR Tesseract, Empty page error?

I compiled it from source with Leptonica. This is a PNG image with a transparent background, which I edited by adding a blue color, and I still get this error: Tesseract Open Source OCR Engine v3.02.02 with Leptonica Empty page!! Empty page!! Here's the image…

ocr tesseract-ocr

asked Jan 18 '13 at 04:41

Jim

2 answers

How to leave pdf image unchanged while adding OCR to a pdf with pdfsandwich?

I am trying to add OCR to PDFs and am using pdfsandwich to do so. The problem is that pdfsandwich processes the image when doing OCR which changes what the document looks like. Is there any way to ensure that the PDF image remains completely…

pdf scanning ocr tesseract-ocr

asked Apr 25 '19 at 02:10

user3750888

1 answer

Use ffmpeg for JPEG to TIFF conversion

I would like to use Tesseract OCR with a video. With ffmpeg I can export some (.jpeg) images from a video. Can I convert a .jpeg into a valid .tiff, or export .tiff images directly from the video with ffmpeg?

ffmpeg jpeg tiff image-conversion tesseract-ocr

asked Oct 17 '14 at 07:27

Tenaciousd93

1 answer

Tesseract OCR : Unsupported image type

I converted the PDF to a TIF file using the following commands in the terminal: convert -density 300 -depth 4 lang.font-name.exp0.pdf lang.font-name.exp0.tif; convert lang.font-name.exp0.tif -colorspace rgb -type truecolor lang.font-name.exp0.tif Then I…

macos macports homebrew tesseract-ocr

asked Jul 02 '14 at 08:45

Nina

2 answers

OCR with non-language text

I am interested in using OCR to recognize text from a document that doesn't contain words. Rather, it is a document with a long string of "random" printed characters. I have been trying to use tesseract to scan the text, but it seems to be looking…

ocr tesseract-ocr

asked Aug 28 '13 at 15:00

Daniel

0 answers

How to compress Tesseract-encoded PDFs while maintaining embedded text from OCR?

I've been experimenting with using Tesseract to OCR my PDFs, and it has been mostly successful, particularly with German Fraktur texts (the old style gothic print), which tools like Adobe Acrobat can't recognize properly. The problem is that the…

pdf compression adobe-acrobat ocr tesseract-ocr

asked May 15 '16 at 23:52

Jason

2 answers

Tesseract 3.03 English language data

Tesseract 3.03 has been released recently and I have just installed it. However, the English language data is not provided with the download (from https://launchpad.net/ubuntu/+source/tesseract/3.03.03-1). On the Tesseract website, there is a…

tesseract-ocr

asked May 26 '14 at 11:44

MarAja

1 answer

Tesseract hocr and txt at the same time, or converting from Tesseract's hocr to txt

I've been playing around with Linux OCR software, and I really like Tesseract, especially in conjunction with gscan2pdf. Tesseract v3 or greater supports outputting in the hocr format, and gscan2pdf is able to make use of that in order to create…

linux pdf tesseract-ocr

asked May 16 '13 at 20:57

Petr Skocik

2 answers

What exactly is "tesseract"?

Like so many software companies that provide a free/open source version and also sell a "commercial" version, they make it as cryptic and unfriendly as possible to actually download and use the free one. Here is a typical example:…

windows pdf open-source tesseract-ocr

asked Oct 19 '20 at 17:22

Yashveer Hulsey

1 answer

Training Tesseract-OCR for English language fonts

I have about 3000 small images of single words that I am trying to convert to text. I have installed Tesseract on my Windows 7 machine using the installer and successfully managed to OCR images through cmd and PowerShell. tesseract.exe…

ocr tesseract-ocr

asked Jan 19 '11 at 19:51

andrew

0 answers

Tesseract: OCR hex and binary strings from old documents

I have some questions about Tesseract. Context: I am currently working on an old cryptographic algorithm from East Germany (GDR) which was developed in the 80s. I implemented the algorithm in C#. Now I have about 30 pages of test cases which I want…

ocr image-processing hexadecimal tesseract-ocr

asked Jul 24 '19 at 11:38

tassadarius

1 answer

Can dvdsub subtitles be converted to srt via command line?

Is there a way to convert dvdsub (image-based) subtitles to srt, for example with mencoder or ffmpeg combined with tesseract? I'm looking for something command-line based, and I'm OK with having to go through a couple of passes. I'm less keen on GUI…

ffmpeg mencoder tesseract-ocr

asked Sep 30 '17 at 14:59

simone

3 answers

Optimal font for Tesseract? (specifically the .NET wrapper)

I am using Tesseract as a means to convert printed text documents captured by my cell phone camera into text. The results are not great. The quality of the image is very good, far clearer than a fax, but it seems to have a very difficult time…

tesseract-ocr

asked Jul 03 '16 at 16:12

user613051

1 answer

How to extract Unicode characters from a .png file?

I want to extract Unicode characters from .jpg and .png files. I tried to do it using the following command: tesseract 1.png output.txt That command works for English characters, but when I try it for Unicode like Hindi, Marathi, or Devanagari script…

ocr tesseract-ocr

asked Feb 14 '16 at 16:01

Madhav Nikam

2 answers

How to avoid skewed results with the OCR tool pdfsandwich?

Usually, scanned pages need to be deskewed before applying an OCR tool. Here, my input is a straight scanned page, and the OCR output is sometimes skewed, either clockwise or counter-clockwise. In my use case of a 260-page English book, it happens…

pdf scanning ocr tesseract-ocr

asked Jan 19 '15 at 15:19

lalebarde
