Questions tagged [tesseract]

Tesseract is an open-source optical character recognition engine. Trained data sets already exist for various scripts and languages, and the engine can be trained on additional (custom) data sets.

Tesseract's output will be of very poor quality unless the input images are preprocessed to suit it. Images (especially screenshots) must be scaled up so that the text x-height is at least 20 pixels. Any rotation or skew must be corrected, or no text will be recognized. Low-frequency changes in brightness must be high-pass filtered, or Tesseract's binarization stage will destroy much of the page. Dark borders must be manually removed, or they will be misinterpreted as characters.
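
The 20-pixel x-height rule above comes down to a small calculation. The following is a minimal sketch in Python, assuming you have already measured the x-height of the text in your image by some other means. The function names `upscale_factor` and `scaled_size` are hypothetical and not part of Tesseract; the actual resizing would still be done in an image tool of your choice.

```python
import math

# Minimum x-height quoted in the tag description; treat it as a rule of thumb.
MIN_X_HEIGHT_PX = 20


def upscale_factor(measured_x_height_px, target_px=MIN_X_HEIGHT_PX):
    """Return the factor by which to enlarge an image so that its
    text x-height reaches at least target_px."""
    if measured_x_height_px <= 0:
        raise ValueError("x-height must be positive")
    # Never shrink: images already at or above the target keep a factor of 1.0.
    return max(1.0, target_px / measured_x_height_px)


def scaled_size(width, height, factor):
    """Return the enlarged (width, height), rounded up so the
    x-height does not end up just below the target."""
    return math.ceil(width * factor), math.ceil(height * factor)
```

For example, a screenshot whose lowercase letters are about 8 pixels tall would need to be enlarged by a factor of 2.5 before being passed to Tesseract.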

17 questions
4 votes, 1 answer

ocrfeeder doesn't detect anything

When I try to detect text in my JPEG, it correctly shows all the areas where it suspects text and images, but when I export to ODT it only creates an ODT with empty text and image frames. Do I have to configure tesseract somehow? (I use Ubuntu 14.10…
rubo77
3 votes, 1 answer

How to improve tesseract performance?

By all accounts, tesseract is superb. However, my results are dismal. I need to convert (digital, as opposed to from a book) text that I only have as a png. For instance: 2 3 academics 1 1711 2 3 Achlmbobelmann 211 191—2 1 3 Aoqusmono|Food…
katriel
3 votes, 0 answers

Tesseract giving errors

This morning I tried to use tesseract and I'm getting the following error messages: $ tesseract --list-langs Error in pixReadMemTiff: function not present Error in pixReadMem: tiff: no pix returned Error in pixaGenerateFontFromString: pix not…
To Do
2 votes, 1 answer

What program is suitable for making scanned PDF files searchable?

I would like to be able to scan paper documents to PDF files and make the text searchable. I believe the Tesseract program can assist with this, but I don't know how to begin or which program would be best to use. Is anybody making…
Hedley Finger
2 votes, 0 answers

OCR with two-page layout

I'm trying to do OCR on a PDF with a two-page layout: in a landscape-orientation page of the PDF, the left half is one (portrait-orientation) page and the right half is the next (portrait-orientation) page. Sometimes the layout messes up tesseract.…
Raffi
2 votes, 2 answers

How to write bash script to run the same command for all files in a directory

I want to run this command for all files in a directory. tesseract /home/kong/Documents/input/248.jpg stdout --psm 1 --oem 1 --dpi 300 tsv >/home/kong/Documents/input/ocr_output/input/248.tsv The input and output should have same number like…
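
One way to approach a batch run like this is to build one command per input file and derive the output name from the input name. This is only a sketch in Python rather than bash, assuming the inputs are `.jpg` files in a single directory. The helper `build_jobs` is a hypothetical name, and the tesseract options are copied from the command in the question.

```python
from pathlib import Path


def build_jobs(input_dir, output_dir, pattern="*.jpg"):
    """Pair each matching image with a tesseract argument list and
    the path of a .tsv file that shares the image's base name."""
    jobs = []
    for image in sorted(Path(input_dir).glob(pattern)):
        # Options taken from the question; "stdout" makes tesseract
        # write the TSV to standard output instead of a file.
        argv = ["tesseract", str(image), "stdout",
                "--psm", "1", "--oem", "1", "--dpi", "300", "tsv"]
        # 248.jpg -> 248.tsv: the stem is the number both files share.
        jobs.append((argv, Path(output_dir) / (image.stem + ".tsv")))
    return jobs
```

Each job could then be run with `subprocess.run(argv, stdout=fh, check=True)`, where `fh` is the output path opened for binary writing. That takes the place of the shell redirection in the original command.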
1 vote, 0 answers

I'm having trouble installing OCRopy, which I want to use to create training data for an old manuscript in Latin. What prerequisites are needed, and what commands should I run?

I am new to Ubuntu and I am trying to install OCRopy to make training data, with the end goal of creating a transcript of a 15th c. manuscript. So far I suspect that my problem may be a lack of prerequisites. I have installed python3…
mumbot
1 vote, 1 answer

Cannot make .box files - Training Tesseract

I am trying to train Tesseract in Ubuntu 20.04.1 LTS. I have downloaded tesseract and the required training tools. For the training data I am using jTessBoxEditor. I have the .tiff files but I am unable to make the .box files. When I type the following…
Hula
1 vote, 1 answer

Can Qt-box-editor be used for tesseract 4.0?

I am using tesseract 4.0 for character recognition. Many blogs say that Qt-box-editor can be used with tesseract 3.x. My question is: can Qt-box-editor be used with tesseract 4.0?
Ashna Eldho
1 vote, 3 answers

Tesseract --tessdata-dir option not working in Ubuntu 18.04

I am trying to use the best model from tesseract. However, I am getting the following error: tesseract sample.jpg stdout --tessdata-dir tessdata/ Error opening data file tessdata/eng.traineddata Please make sure the TESSDATA_PREFIX environment…
Monster
1 vote, 3 answers

Ubuntu 18.04: error installing tesseract

I've installed Ubuntu 18.04. I've installed tesseract using sudo apt-get install tesseract-ocr When I type: tesseract -v I get an error: tesseract: symbol lookup error: /usr/lib/x86_64-linux-gnu/libtesseract.so.4: undefined symbol:…
mayur panchal
0 votes, 2 answers

How can I get Tesseract OCR to recognise the large digits of an electricity meter?

I want to use an OCR program on an RPi to recognise the digits from a photo of my electricity meter. The digits are large and are very obvious to me, but Tesseract appears unable to recognise them at all - at best it detects a few random wrong…
Shaka Zulu
0 votes, 0 answers

Aletheia equivalent for Ubuntu

Is there an Ubuntu Linux equivalent to Aletheia for Windows, a program to analyze fonts and export points to XML? I would mainly use it for OCR and for training tesseract. I understand that ImageMagick can cover a lot of the ground for image cleanup. I need a…
rearThing
0 votes, 1 answer

Why does tesseract append ^L to the output

I am using tesseract to OCR some text in images, e.g. this one: I have this version of tesseract on my Ubuntu 20.04: $ tesseract --version tesseract 4.1.1 leptonica-1.79.0 libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff…
izri_zimba
0 votes, 0 answers

Tesseract with "fl" and "fi" characters

I started using tesseract yesterday. It worked very well, but apparently my original text (in the scanned image) had characters that combine "fi" into one single character and "fl" into another. And tesseract converts those into…
Georgie