
I am using tesseract to OCR text in images, e.g. this one:

enter image description here

I have this version of tesseract on my Ubuntu 20.04:

$ tesseract --version
tesseract 4.1.1
 leptonica-1.79.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4

Invoking it as follows:

tesseract example.png output txt

However, when I open the output.txt file in vim, I see ^L at the last line as follows:

enter image description here

What does that character mean? Why is it appended after the last line? Is it possible to get rid of it?

I have looked in the man page of tesseract, but I can't find anything about that.

FedKad
izri_zimba
  • Ctrl+L is the "Form Feed" character. It is normally used to indicate the end of a page or the beginning of the next page. – FedKad Sep 20 '20 at 08:36
  • Can you try the option `-c page_separator=""` in `tesseract` command line? – FedKad Sep 20 '20 at 08:48

1 Answer


I assume that tesseract adds a new page (the ASCII "Form Feed") character to the end of the text. You can delete it using:

sed -i 's/^L//' output.txt

To enter the ^L character in the above command, first type Ctrl+V and then Ctrl+L.

With GNU sed, you can also use the hexadecimal escape instead of typing the control character:

sed -i 's/\x0c//' output.txt
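
If your sed does not understand `\x0c`, `tr` offers a portable way to delete the same byte. The following is only a minimal sketch: `sample.txt` and `sample.clean.txt` are made-up file names standing in for the tesseract output, and the sample content is generated with `printf` rather than by OCR.

```shell
# Build a sample file that ends with a form feed (0x0c),
# mimicking what the OCR output looks like.
printf 'Hello world\n\f' > sample.txt

# Delete every form feed. tr only reads stdin and writes stdout,
# so the result goes to a second file instead of editing in place.
tr -d '\f' < sample.txt > sample.clean.txt

# Dump the bytes of the cleaned file; no \f should be listed.
od -c sample.clean.txt
```

Unlike the sed commands, this does not edit the file in place, so move `sample.clean.txt` over the original afterwards if that is what you want.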

As a more straightforward method, you can use the -c option as follows:

tesseract -c page_separator="" example.png output txt

so you will not have any "page separator" in the output file.
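
To check whether a given output file still ends with the separator without opening it in an editor, you can look at its last byte. This is a rough sketch under assumptions: `with_ff.txt` and `without_ff.txt` are hypothetical stand-ins created with `printf`, not real tesseract output.

```shell
# Hypothetical stand-ins for two output files:
# one ending with a form feed, one without.
printf 'Some text\n\f' > with_ff.txt
printf 'Some text\n' > without_ff.txt

for f in with_ff.txt without_ff.txt; do
    # tail -c 1 extracts the last byte; od -An -tx1 prints it as hex
    # and tr strips the surrounding whitespace.
    last=$(tail -c 1 "$f" | od -An -tx1 | tr -d ' \n')
    if [ "$last" = "0c" ]; then
        echo "$f: ends with a form feed"
    else
        echo "$f: no trailing form feed"
    fi
done
```

Replace the stand-in names with your own `output.txt` to test a real result.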

FedKad
  • Maybe try the `-c` option. Take a look at this link: https://groups.google.com/g/tesseract-dev/c/VsgJ9R-cTQ0 –  Sep 20 '20 at 08:46
  • Can you try the option `-c page_separator="[PAGE SEPARATOR]"`, using an empty string for `[PAGE SEPARATOR]`, in other words `-c page_separator=""`? – FedKad Sep 20 '20 at 08:47
  • Please note that in the command you have to separate the basename from the extension with a space to get the expected filename, e.g. `output` `txt` to get `output.txt`; otherwise you will get `output.txt.txt`. – izri_zimba Sep 20 '20 at 08:55
  • Sorry. I corrected it. – FedKad Sep 20 '20 at 08:55