
Is there a command-line solution to extract highlighted text from a PDF?

I have a bunch of PDF documents that I annotated myself, and I was wondering whether there is a convenient way to automatically extract the highlighted text to a text file.

EDIT: This is not a duplicate question, because I am looking for a command-line solution, comparable to what ImageMagick offers for image processing.

Sathyajith Bhat
Alby
  • possible duplicate of [Print only the annotations of a pdf](http://superuser.com/questions/275964/print-only-the-annotations-of-a-pdf) – Ƭᴇcʜιᴇ007 Mar 20 '14 at 16:18
  • Related: [How do I extract highlighted text only from PDF files in Adobe Acrobat Pro version 9?](http://superuser.com/questions/620880/how-do-i-extract-highlighted-text-only-from-pdf-files-in-adobe-acrobat-pro-versi?rq=1) – Ƭᴇcʜιᴇ007 Mar 20 '14 at 16:18

2 Answers


Under Linux, you can use pdfgrep.

Pierre-Damien

I would recommend the nifty little Python tool pdfannots, which has exactly the capability you are looking for.

$ pdfannots document.pdf

Combined with a few other Bash commands, it can produce nicely formatted output. For example:

$ pdfannots document.pdf --no-condense |
# Remove duplicate lines while keeping the original order:
cat -n | sort -uk2 | sort -nk1 | cut -f2- |
# Tidy the formatting: squeeze whitespace, strip the leading "> ",
# and put a blank line before each page marker (\n in the replacement needs GNU sed):
awk '{$1=$1};1' | sed 's/^> //' | sed 's/* Page #/\n&/'
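
Since the question mentions a whole batch of PDFs, the single-file command can be wrapped in a loop. The following is only a minimal sketch: it assumes that pdfannots is on your PATH and prints its result to standard output, as in the commands above. The function name extract_all and the PDFANNOTS override variable are hypothetical names introduced here for illustration; they are not part of pdfannots.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run an annotation extractor on every *.pdf file in a
# directory and save each result next to the PDF as a .txt file.
# The extractor defaults to pdfannots; set PDFANNOTS to use another command.
extract_all() {
    local dir="${1:-.}"                    # directory to scan (default: current)
    local tool="${PDFANNOTS:-pdfannots}"   # assumed to write to standard output
    local pdf

    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue          # no PDFs: the glob stays unexpanded
        "$tool" "$pdf" > "${pdf%.pdf}.txt" # document.pdf -> document.txt
    done
}
```

After sourcing the file, calling extract_all ~/papers would write one text file per PDF in that directory. The formatting pipeline shown above could be added after "$tool" "$pdf" inside the loop if the cleaned-up output is preferred.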