Questions tagged [tesseract-ocr]

Tesseract is an optical character recognition (OCR) engine. It works on Linux, Windows (with VC++ Express or Cygwin), and Mac OS X.

Most of the work on Tesseract is sponsored by Google.

  • ReadMe - Installation and usage information.
  • Compiling - How to build Tesseract on a variety of platforms.
  • FAQ - Common questions and problems. Please check before filing a bug or consulting the forum.
  • Too many errors? - See the guidance on getting the best out of Tesseract.
48 questions
7 votes, 3 answers

OCR Tesseract, Empty page error?

I compiled it from source with Leptonica. The input is a PNG image with a transparent background, which I edited to add a blue color, and I still get this error: Tesseract Open Source OCR Engine v3.02.02 with Leptonica Empty page!! Empty page!! Here's the image…
Jim
6 votes, 2 answers

How to leave pdf image unchanged while adding OCR to a pdf with pdfsandwich?

I am trying to add OCR to PDFs and am using pdfsandwich to do so. The problem is that pdfsandwich processes the image when doing OCR, which changes what the document looks like. Is there any way to ensure that the PDF image remains completely…
user3750888
5 votes, 1 answer

Use ffmpeg for JPEG to TIFF conversion

I would like to use Tesseract OCR with a video. With ffmpeg I can export some (.jpeg) images from a video. Can I convert a .jpeg into a valid .tiff, or export .tiff images directly from the video with ffmpeg?
Tenaciousd93
5 votes, 1 answer

Tesseract OCR: Unsupported image type

I converted the PDF to a TIF file using the following commands in the terminal:

    convert -density 300 -depth 4 lang.font-name.exp0.pdf lang.font-name.exp0.tif
    convert lang.font-name.exp0.tif -colorspace rgb -type truecolor lang.font-name.exp0.tif

Then I…
Nina
5 votes, 2 answers

OCR with non-language text

I am interested in using OCR to recognize text from a document that doesn't contain words. Rather, it is a document with a long string of "random" printed characters. I have been trying to use Tesseract to scan the text, but it seems to be looking…
Daniel
5 votes, 0 answers

How to compress Tesseract-encoded PDFs while maintaining embedded text from OCR?

I've been experimenting with using Tesseract to OCR my PDFs, and it has been mostly successful, particularly with German Fraktur texts (the old-style Gothic print), which tools like Adobe Acrobat can't recognize properly. The problem is that the…
Jason
4 votes, 2 answers

Tesseract 3.03 English language data

Tesseract 3.03 was released recently and I have just installed it. However, the English language data is not provided with the download (from https://launchpad.net/ubuntu/+source/tesseract/3.03.03-1). On the Tesseract website, there is a…
MarAja
4 votes, 1 answer

Tesseract hocr and txt at the same time, or converting from Tesseract's hocr to txt

I've been playing around with Linux OCR software, and I really like Tesseract, especially in conjunction with gscan2pdf. Tesseract v3 or greater supports output in the hocr format, and gscan2pdf is able to make use of that in order to create…
Petr Skocik
4 votes, 2 answers

What exactly is "tesseract"?

Like so many software companies that provide a free/open source version and also sell a "commercial" version, they make it as cryptic and unfriendly as possible to actually download and use the free one. Here is a typical example:…
3 votes, 1 answer

Training Tesseract-OCR for English-language fonts

I have about 3000 small images of single words that I am trying to convert to text. I have installed Tesseract on my Windows 7 machine using the installer and successfully managed to OCR images through cmd and PowerShell. tesseract.exe…
andrew
3 votes, 0 answers

Tesseract: OCR hex and binary strings from old documents

I have some questions about Tesseract. Context: I am currently working on an old cryptographic algorithm from East Germany (GDR) which was developed in the 80s. I implemented the algorithm in C#. Now I have about 30 pages of test cases which I want…
3 votes, 1 answer

Can dvdsub subtitles be converted to srt via command line?

Is there a way to convert dvdsub (image-based) subtitles to srt, for example with mencoder or ffmpeg combined with tesseract? I'm looking for something command-line based, and I'm OK with having to go through a couple of passes. I'm less keen on GUI…
simone
3 votes, 3 answers

Optimal font for Tesseract? (specifically the .NET wrapper)

I am using Tesseract as a means to convert printed text documents captured by my cell phone camera into text. The results are not great. The quality of the image is very good, far clearer than a fax, but it seems to have a very difficult time…
user613051
3 votes, 1 answer

How to extract Unicode characters from a .png file?

I want to extract Unicode characters from .jpg and .png files. I tried the following command: tesseract 1.png output.txt That command works for English characters, but when I try it for Unicode like Hindi, Marathi, or Devanagari script…
Madhav Nikam
2 votes, 2 answers

How to avoid skewed results with the OCR tool pdfsandwich?

Usually, scanned pages need to be deskewed before applying an OCR tool. Here, my input is a straight scanned page, and the OCR output is sometimes skewed, either clockwise or counter-clockwise. In my use case of a 260-page English book, it happens…
lalebarde