Is there an automatic way to extract highlighted text from a PDF?

Question

Is there a command-line solution to extract highlighted text from a PDF?

I have a bunch of PDF documents that I personally annotated, and I was wondering if there is a convenient way to automatically extract the highlighted text to a text file.

EDIT: This is not a duplicate question, in that I am looking for a command-line solution, like ImageMagick for image processing.

possible duplicate of [Print only the annotations of a pdf](http://superuser.com/questions/275964/print-only-the-annotations-of-a-pdf) — Ƭᴇcʜιᴇ007, Mar 20 '14 at 16:18
Related: [How do I extract highlighted text only from PDF files in Adobe Acrobat Pro version 9?](http://superuser.com/questions/620880/how-do-i-extract-highlighted-text-only-from-pdf-files-in-adobe-acrobat-pro-versi?rq=1) — Ƭᴇcʜιᴇ007, Mar 20 '14 at 16:18

score 0 · Answer 1 · answered Jun 17 '19 at 20:40


Under Linux you can use `pdfgrep`.


Pierre-Damien


How is it possible to extract highlighted text with `pdfgrep`? – joelostblom Jul 04 '19 at 22:27
I was thinking there would be some sort of tag around highlighted text which could be parsed with pdfgrep... – Pierre-Damien Jul 16 '19 at 13:35
nope -- default highlighting is coordinate positions only – Mike M Oct 11 '20 at 13:26
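
To illustrate the last comment: a highlight annotation typically stores only a `QuadPoints` array (eight numbers per highlighted quadrilateral), not the text itself, so a tool has to match those coordinates against the positioned words on the page. Below is a minimal sketch of that matching step. The word list and coordinates are made up for illustration, and the helper names `quad_to_box` and `highlighted_words` are hypothetical; a real tool would get both the words and the quads from a PDF parsing library.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def quad_to_box(quad: Sequence[float]) -> Box:
    """Bounding box of one 8-number quad (four x, y pairs).

    Taking min/max avoids depending on the order in which a viewer
    wrote the four corners.
    """
    xs = quad[0::2]
    ys = quad[1::2]
    return (min(xs), min(ys), max(xs), max(ys))


def highlighted_words(words: List[Tuple[str, Box]],
                      quad_points: Sequence[float]) -> str:
    """Join the words whose centre falls inside any highlight quad's box."""
    boxes = [quad_to_box(quad_points[i:i + 8])
             for i in range(0, len(quad_points), 8)]
    picked = []
    for text, (x0, y0, x1, y1) in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        if any(bx0 <= cx <= bx1 and by0 <= cy <= by1
               for bx0, by0, bx1, by1 in boxes):
            picked.append(text)
    return " ".join(picked)


# Made-up page content: each word with its bounding box.
words = [
    ("Only", (72, 700, 100, 712)),
    ("these", (104, 700, 135, 712)),
    ("words", (139, 700, 172, 712)),
    ("matter", (176, 700, 214, 712)),
]
# One made-up highlight quad covering "these words".
quad_points = [103, 713, 173, 713, 103, 699, 173, 699]

print(highlighted_words(words, quad_points))  # these words
```

This is also why a plain text search tool cannot find highlights on its own: nothing in the page's text stream marks the words as highlighted.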

score 0 · Answer 2 · answered Nov 15 '22 at 12:35

I recommend the nifty little Python library pdfannots, which does exactly what you are looking for.

$ pdfannots document.pdf

If combined with some other Bash commands, it can produce nicely formatted output. For example:

$ pdfannots document.pdf --no-condense | \
# Removing duplicate lines:
cat -n | sort -uk2 | sort -nk1 | cut -f2- | \
# Improving output formatting:
awk '{$1=$1};1' | sed 's/^\(> \)//g' | sed 's/* Page #/\n&/'
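
If the GNU tools are not available, roughly the same clean-up can be sketched in Python. This is only an approximation of the pipeline above, not part of pdfannots: the function name `tidy` is hypothetical, the sample input is made up rather than real pdfannots output, and the text is assumed to have been captured into a string beforehand.

```python
import re


def tidy(raw: str) -> str:
    """Approximate the shell post-processing on already-captured text."""
    seen = set()
    out = []
    for line in raw.splitlines():
        # Drop repeated lines, keeping the first occurrence in order
        # (the cat -n | sort -uk2 | sort -nk1 | cut -f2- stage).
        if line in seen:
            continue
        seen.add(line)
        # Collapse runs of whitespace and trim, like awk '{$1=$1};1'.
        line = " ".join(line.split())
        # Strip a leading "> " quote marker, like the first sed.
        line = re.sub(r"^> ", "", line)
        # Put a blank line before each page entry, like the second sed.
        line = line.replace("* Page #", "\n* Page #", 1)
        out.append(line)
    return "\n".join(out)


# Made-up sample input for illustration only.
raw = (
    "* Page #1:\n"
    "> first  highlight\n"
    "> first  highlight\n"
    "* Page #2:\n"
    ">   second highlight"
)
print(tidy(raw))
```

Note that, as in the shell version, de-duplication compares whole lines, so repeated blank lines are also reduced to one.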
