Questions tagged [tesseract]

An open-source optical character recognition engine

Trained character data sets already exist for many scripts and languages, and the engine can be trained on additional (custom) data sets.

Tesseract's output will be of very poor quality if the input images are not preprocessed to suit it. Images (especially screenshots) must be scaled up so that the text x-height is at least 20 pixels. Any rotation or skew must be corrected, or no text will be recognized. Low-frequency changes in brightness must be high-pass filtered, or Tesseract's binarization stage will destroy much of the page. Dark borders must be manually removed, or they will be misinterpreted as characters.
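Two of the preprocessing steps above (scaling to a minimum x-height and removing dark borders) can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `upscale_factor` and `crop_dark_border`, the list-of-rows grayscale representation, and the brightness threshold of 64 are all hypothetical and are not part of Tesseract. Deskewing and high-pass filtering are not covered here; in practice an image library or ImageMagick would do the actual pixel work.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale pixel values (assumed layout)

TARGET_X_HEIGHT_PX = 20  # minimum x-height named in the tag description


def upscale_factor(measured_x_height_px: float,
                   target_px: float = TARGET_X_HEIGHT_PX) -> float:
    """Return the enlargement factor that brings the x-height up to target_px."""
    if measured_x_height_px <= 0:
        raise ValueError("x-height must be positive")
    # Never shrink: images that already meet the target are left alone.
    return max(1.0, target_px / measured_x_height_px)


def crop_dark_border(image: Gray, dark_below: int = 64) -> Gray:
    """Trim edge rows and columns whose mean brightness is below dark_below.

    The threshold is an illustrative assumption, not a Tesseract setting.
    """
    def mean(values):
        values = list(values)
        return sum(values) / len(values)

    # Walk inwards from the top and bottom edges while rows are dark.
    top, bottom = 0, len(image)
    while top < bottom and mean(image[top]) < dark_below:
        top += 1
    while bottom > top and mean(image[bottom - 1]) < dark_below:
        bottom -= 1
    rows = image[top:bottom]
    if not rows:
        return []

    # Then do the same for the left and right edges of the remaining rows.
    left, right = 0, len(rows[0])
    while left < right and mean(r[left] for r in rows) < dark_below:
        left += 1
    while right > left and mean(r[right - 1] for r in rows) < dark_below:
        right -= 1
    return [r[left:right] for r in rows]
```

For example, a screenshot whose text has an x-height of 8 pixels gives `upscale_factor(8) == 2.5`, so the image would need to be enlarged to 250% before being passed to Tesseract.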

17 questions

votes

1 answer

ocrfeeder doesn't detect anything

When I try to detect text in my JPEG, it correctly shows all the areas where it suspects text and images, but when I export it to ODT it only creates an ODT with empty text and image frames. Do I have to configure tesseract somehow? (I use Ubuntu 14.10…

ocr tesseract

asked Jul 03 '15 at 23:29

rubo77


votes

1 answer

How to improve tesseract performance?

By all accounts, tesseract is superb. However, my results are dismal. I need to convert text (digitally rendered, as opposed to scanned from a book) that I only have as a PNG. For instance: 2 3 academics 1 1711 2 3 Achlmbobelmann 211 191—2 1 3 Aoqusmono|Food…

command-line image-processing ocr tesseract

asked Jan 27 '14 at 07:27

katriel

votes

0 answers

Tesseract giving errors

This morning I tried to use tesseract and I'm getting the following error messages: $ tesseract --list-langs Error in pixReadMemTiff: function not present Error in pixReadMem: tiff: no pix returned Error in pixaGenerateFontFromString: pix not…

19.04 tesseract

asked Oct 07 '19 at 10:06

To Do


votes

1 answer

What program is suitable for making scanned PDF files searchable?

I would like to be able to scan paper documents to PDF files and make the text searchable. I believe the Tesseract program can assist this, but don't know how to begin, and don't know what would be the best program to use. Is anybody making…

pdf scanner search ocr tesseract

asked Jul 13 '23 at 10:45

Hedley Finger

votes

0 answers

OCR with two-page layout

I'm trying to do OCR on a pdf with a two-page layout - in a landscape-orientation page of the PDF, the left half is one (portrait-orientation) page, the right half is the next (portrait-orientation) page. Sometimes the layout messes up tesseract.…

pdf ocr tesseract

asked Feb 22 '21 at 18:39

Raffi

votes

2 answers

How to write bash script to run the same command for all files in a directory

I want to run this command for all files in a directory: tesseract /home/kong/Documents/input/248.jpg stdout --psm 1 --oem 1 --dpi 300 tsv >/home/kong/Documents/input/ocr_output/input/248.tsv The input and output should have the same number, like…

18.04 bash tesseract

asked Jul 31 '19 at 16:22

BloodThirst

vote

0 answers

I'm having trouble installing OCRopy; I want to use it to create training data for an old manuscript in Latin. What prerequisites are needed, and what commands should I run?

I am new to Ubuntu and I am trying to install OCRopy to create training data, with the end goal of producing a transcript of a 15th-century manuscript. So far I suspect that my problem may be a lack of prerequisites. I have installed python3…

opencv github ocr tesseract training

asked May 02 '21 at 03:48

mumbot

vote

1 answer

Cannot make .box files - Training Tesseract

I am trying to train Tesseract on Ubuntu 20.04.1 LTS. I have downloaded tesseract and the required training tools. For the training data I am using jTessBoxEditor. I have the .tiff files, but I am unable to make the .box files. When I type the following…

tesseract training

asked Aug 16 '20 at 13:38

Hula

vote

1 answer

Can Qt-box-editor be used for tesseract 4.0?

I am using tesseract 4.0 for character recognition. Many blogs say that Qt-box-editor can be used with tesseract 3.x. My question is: can Qt-box-editor be used with tesseract 4.0?

ocr tesseract

asked Jul 12 '19 at 05:19

Ashna Eldho

vote

3 answers

Tesseract --tessdata-dir option not working in Ubuntu 18.04

I am trying to use the best model from tesseract. However, I am getting the following error: tesseract sample.jpg stdout --tessdata-dir tessdata/ Error opening data file tessdata/eng.traineddata Please make sure the TESSDATA_PREFIX environment…

18.04 tesseract

asked Jan 19 '19 at 11:59

Monster

vote

3 answers

Ubuntu 18.04: error after installing tesseract

I've installed Ubuntu 18.04 and installed tesseract using sudo apt-get install tesseract-ocr. When I type tesseract -v, I get an error: tesseract: symbol lookup error: /usr/lib/x86_64-linux-gnu/libtesseract.so.4: undefined symbol:…

php ocr tesseract

asked Jan 11 '19 at 14:55

mayur panchal

votes

2 answers

How can I get Tesseract OCR to recognise the large digits of an electricity meter?

I want to use an OCR program on an RPi to recognise the digits from a photo of my electricity meter. The digits are large and are very obvious to me, but Tesseract appears unable to recognise them at all - at best it detects a few random wrong…

ocr tesseract

asked Aug 07 '17 at 20:13

Shaka Zulu

votes

0 answers

Aletheia equivalent for Ubuntu

Is there an Ubuntu/Linux equivalent to Aletheia for Windows, a program to analyze fonts and export points to XML? Mainly I would use it for OCR and training tesseract. I understand that ImageMagick can cover a lot of the ground for image cleanup. I need a…

fonts tesseract

asked Oct 25 '20 at 03:36

rearThing

votes

1 answer

Why does tesseract append ^L to the output

I am using tesseract to OCR some text in images. I have this version of tesseract on my Ubuntu 20.04: $ tesseract --version tesseract 4.1.1 leptonica-1.79.0 libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff…

command-line ocr tesseract

asked Sep 20 '20 at 08:29

izri_zimba

votes

0 answers

Tesseract with "fl" and "fi" characters

I started using tesseract yesterday. It worked very well, but apparently my original text (in the scanned image) had ligatures that combine fi into one single character and fl into another single character. And tesseract converts those into…

special-characters ocr tesseract

asked Jun 26 '20 at 14:07

Georgie
