I started using Tesseract yesterday. It worked very well, but my original text (in the scanned image) uses ligatures: fi is combined into one single character and fl into another. Tesseract converts those into special characters. How can I tell it to generate the separate letters "f i" or "f l" instead?
-
This question does not seem to be about Ubuntu or any of its flavors. Questions about programming should be asked on [Stack Overflow](https://stackoverflow.com/). Maybe you should apply some image processing before you pass the image to Tesseract. – singrium Jun 26 '20 at 14:09
-
Perhaps this will be helpful? [Ligatures in Tesseract OCR Output](https://mlichtenberg.wordpress.com/2015/09/11/ligatures-in-tesseract-ocr-output/) – steeldriver Jun 26 '20 at 14:10
-
@singrium Questions about using software on Ubuntu are on topic, ([to the extent that *about using software* is a reasonable description](https://chat.stackexchange.com/transcript/201?m=51223939#51223939), which applies here). – Zanna Jun 27 '20 at 01:16
-
@Zanna I agree with that. However, based on the description of the problem, it does not seem to be related to Ubuntu; it has more to do with image processing techniques and how to make the characters clear and readable (reducing the noise in the image). Hence, it is more related to *programming* than to *Ubuntu*. – singrium Jun 27 '20 at 01:34
-
OK. I am an Ubuntu user, but I understand that the question might belong elsewhere. I am asking about using Tesseract from the command line. Is that a programming problem that belongs on Stack Overflow? Or is there an even better place for the question? – Georgie Jun 28 '20 at 14:53
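
For the ligature problem described in the question, one possible workaround that does not depend on any Tesseract setting is to post-process the recognized text. The following is only a minimal sketch: it assumes the "special characters" are the Unicode Latin ligatures (U+FB00 to U+FB04, e.g. U+FB01 for fi and U+FB02 for fl), which the question does not confirm, and the names `LIGATURES`, `expand_ligatures`, and `filter_stream` are made up for illustration.

```python
# Assumption: the OCR output contains Unicode Latin ligature code points
# from the Alphabetic Presentation Forms block. Each one is mapped to
# the plain letters it stands for.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# Translation table built once and reused for every call.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Return text with the ligatures listed above spelled out as plain letters."""
    return text.translate(_TABLE)


def filter_stream(src, dst) -> None:
    """Copy text from src to dst line by line, expanding ligatures on the way."""
    for line in src:
        dst.write(expand_ligatures(line))
```

Calling `filter_stream(sys.stdin, sys.stdout)` (after `import sys`) from a small script would let this sit at the end of a pipe, for example after `tesseract scan.png stdout`, where `scan.png` is a placeholder file name. `unicodedata.normalize("NFKC", text)` would also expand these ligatures, but it rewrites many other compatibility characters as well, so the explicit table above is the more conservative choice.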