21

I have some PDF files that I want to split into TIFF files using convert (in order to OCR them with tesseract). This works well so far, except that to automate the whole process I need to set the DPI of the convert output. Right now, I am using a command like this:

convert -density 300 myFile.pdf -depth 8 -background white output-%04d.tiff

... which renders the PDF pages at 300 DPI. However, some PDF files contain lower-resolution images (e.g. 150 DPI), and rendering those at 300 DPI via convert only creates excessively large TIFF files without any additional information.

I know that there are ways to check the DPI of images in a PDF file by opening Adobe Acrobat and messing around in the "preflight" tools. However, is there a way to determine via the command line the DPI of a particular PDF file?

Jason

3 Answers

17

Main answer

I am interested in the same kind of job (not necessarily to OCR the PDF files directly, but to convert them to DjVu and then OCR them), and I found the existing responses to this question lacking. With them, I had to guess the DPI of the images from their pixel dimensions combined with the page size reported by pdfinfo, or use other tricks, not to mention that the images inside a single PDF may have different densities.

After some more research, I found that you can use pdfimages (from the poppler-utils package) as follows:

$ pdfimages -list deptest.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     100   100  gray    1   1  image  no         9  0    53    53  169B  14%
   2     1 image     100   100  gray    1   1  ccitt  no   [inline]      53    53  698B  56%

Notice the x-ppi and y-ppi columns in the listing above. The enc column also shows the format in which each image is stored in the PDF, which is handy (sometimes it is JBIG2, sometimes JPEG2000, etc.).
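To feed that value back into the convert command from the question, the x-ppi column can be picked out with awk. The following is only a sketch: the function names max_ppi, pick_density and pdf_to_tiff are made up for illustration, the 300 DPI cap and fallback are assumptions taken from the question, and it assumes a pdfimages version that prints the x-ppi and y-ppi columns.

```shell
# max_ppi: read `pdfimages -list` output on stdin and print the largest x-ppi.
# Fields are counted from the end of each row because the "object ID" pair
# collapses to a single "[inline]" field for inline images.
max_ppi() {
    awk 'NR > 2 && NF >= 4 {
             v = $(NF - 3) + 0          # x-ppi is the 4th field from the end
             if (v > max) max = v
         }
         END { printf "%d\n", max }'
}

# pick_density PPI [CAP]: use the detected PPI, but fall back to CAP
# (assumed default: 300, the value used in the question) when no image
# was detected or the detected value is larger than CAP.
pick_density() {
    ppi=$1
    cap=${2:-300}
    if [ "$ppi" -le 0 ] || [ "$ppi" -gt "$cap" ]; then
        echo "$cap"
    else
        echo "$ppi"
    fi
}

# pdf_to_tiff FILE: hypothetical wrapper around the convert command
# from the question, with the density chosen per file.
pdf_to_tiff() {
    density=$(pick_density "$(pdfimages -list "$1" | max_ppi)")
    convert -density "$density" "$1" -depth 8 -background white output-%04d.tiff
}
```

Calling pdf_to_tiff myFile.pdf would then render a 150 DPI scan at 150 DPI instead of 300. Taking the maximum over all images is a design choice; a PDF that mixes resolutions may call for per-page handling instead.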

Note: The file deptest.pdf used above is available from pdfsizeopt's repository.

The real action

After that, you can either extract the images with pdfimages itself or use pdftoppm (also from poppler-utils) to render entire pages in a format of your choice (e.g., TIFF, for OCR with tesseract).

You can use something like the following (assuming you have created a directory named imgs where you will put your images):

pdfimages -png Faraway-PRA.pdf imgs/prefix

The files will be created inside the directory imgs with names starting with prefix, as in:

$ ls imgs
prefix-000.png  prefix-047.png  prefix-094.png  prefix-141.png
prefix-001.png  prefix-048.png  prefix-095.png  prefix-142.png
prefix-002.png  prefix-049.png  prefix-096.png  prefix-143.png
prefix-003.png  prefix-050.png  prefix-097.png  prefix-144.png
(...)

You can then post-process the images as you see fit with tools like scantailor.
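For the page-rendering route with pdftoppm mentioned above, a minimal sketch could look like this. The function name render_pages and the 150 DPI value in the usage line are illustrative assumptions; -r and -tiff are the pdftoppm options for resolution and TIFF output.

```shell
# render_pages PDF DPI PREFIX: rasterize every page of PDF at DPI and
# write one TIFF file per page, named after PREFIX, using pdftoppm.
render_pages() {
    pdftoppm -r "$2" -tiff "$1" "$3"
}
```

For example, render_pages Faraway-PRA.pdf 150 imgs/page would render each page at 150 DPI into the imgs directory.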

More direct answer

If you just want to OCR a PDF file, you can use a program that is well-maintained and already packaged, namely ocrmypdf.
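As a rough sketch of how that could be automated over a directory: the basic ocrmypdf invocation takes an input path and an output path. The function name ocr_all and the -ocr.pdf naming scheme are assumptions made for this example.

```shell
# ocr_all DIR: run ocrmypdf on every PDF in DIR, writing NAME-ocr.pdf
# next to each input and skipping files that are already OCR output.
ocr_all() {
    for f in "$1"/*.pdf; do
        [ -e "$f" ] || continue                    # no PDFs: glob stayed literal
        case $f in *-ocr.pdf) continue ;; esac     # skip earlier results
        ocrmypdf "$f" "${f%.pdf}-ocr.pdf"
    done
}
```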

rbrito
  • Note that `x-ppi` (x resolution in DPI) and `y-ppi` (y resolution in DPI) are NOT shown on the older versions of `pdfimages` which come with Ubuntu 14.04, for instance. What is available on Ubuntu 18.04, however, does include these values. `pdfimages -v` on my Ubuntu 18.04 machine shows I have version 0.62.0, which *does* have these features. – Gabriel Staples Nov 10 '19 at 23:49
  • @GabrielStaples, thanks for pointing that out. I thought that Ubuntu 14.04 was already EOL'ed, but it "only" had its Standard Support ended July of 2019 according to https://wiki.ubuntu.com/Releases – rbrito Nov 12 '19 at 20:41
7

This technique also uses ImageMagick:

identify -format "%w x %h %x x %y" DAT_1.tif

The output is the image size in pixels followed by its horizontal and vertical resolution:

2480 x 3507 300 x 300
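One caveat when using these numbers in a script: the resolution is expressed in the file's own units, which identify can print with the %U escape, and a TIFF may store pixels per centimeter rather than per inch. The helper below is a sketch for normalizing to DPI. The names to_dpi and tiff_dpi are made up, and it assumes identify prints a bare number for %x, as in the output above.

```shell
# to_dpi VALUE UNITS: convert an ImageMagick resolution to dots per inch.
# Only PixelsPerCentimeter needs scaling (1 inch = 2.54 cm).
to_dpi() {
    awk -v v="$1" -v u="$2" 'BEGIN {
        if (u == "PixelsPerCentimeter") v *= 2.54
        printf "%.0f\n", v
    }'
}

# tiff_dpi FILE: hypothetical wrapper that prints the horizontal DPI
# of each image in FILE.
tiff_dpi() {
    identify -format "%x %U\n" "$1" | while read -r value units; do
        to_dpi "$value" "$units"
    done
}
```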
excyberlabber
  • I would add a new line to the end of format, in case you want to do *.pdf to process all pdfs in directory. "%w x %h %x x %y\n" – Hatoru Hansou Apr 06 '18 at 23:43
  • This does not do what the main question asks. This returns the whole document size and resolution – Prescol Jan 09 '22 at 09:04
  • The site seems to be broken. [the wayback machine only finds a domain parking from 6 May 2021](http://web.archive.org/web/20210506133836/http://www.wizards-toolkit.org/discourse-server/viewtopic.php?t=16110) – Cadoiz Jan 25 '23 at 09:57
2

I use the following command:

convert MyPDF.pdf -print "Size: %wx%h\n" /dev/null

and it returns:

Size: 380x380
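Since convert rasterizes PDFs at 72 DPI unless -density is given, this size should equal the page size in points (1 point = 1/72 inch). Combined with the pixel width of an embedded image, that gives an estimate of the image's DPI. The sketch below assumes the image spans the full page width; the function name dpi_from_size is made up for illustration.

```shell
# dpi_from_size PIXELS POINTS: effective DPI of an image PIXELS wide
# that covers POINTS of page width (72 points per inch).
dpi_from_size() {
    awk -v px="$1" -v pt="$2" 'BEGIN { printf "%.0f\n", px * 72 / pt }'
}
```

For instance, dpi_from_size 2480 595 prints 300 for a 2480-pixel-wide scan on a 595-point-wide (A4) page.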
Mahdi
  • Thanks - this gets the *size* of the pdf images (in your case, 380x380 as it is a square). The DPI is different. On my file that I just ran this command on, I get `Size: 595x842` although the DPI (checking in Acrobat) is around 130 – Jason Apr 23 '16 at 14:11