
I have two collections of PDFs. One (collection1) is 1000+ PDFs, much larger in total file size (100+ GB), and split into illogical sections (think "pdf 1 (1)", "pdf 1 (3)", ... when it could and should just be one file). The other (collection2) is 300 files.

Collection2 is supposed to be a compressed and organized version of collection1. I used Adobe Acrobat to do the processing: condensing multiple PDFs into a single PDF, then applying compression (and Bates numbering). After doing a few, I had a junior staff member take over...

We've recently discovered that there are errors: sections missing compared to the original PDFs, and similar issues. This is a whopper of an error and something I'm hoping we can fix easily.

I'm not sure whether what I'm looking for here is really `diff`, as I'd need to compare multiple files against one single file.

If I could isolate the problem files, I could fix those easily. The best option I can figure right now is, perhaps surprisingly, Preview (macOS), which allows you to open multiple sets of files (and provides a page count). From there I can check the first page, the last, and several in the middle. If those are consistent and the page count is consistent, the files are likely solid, judging from the errors we've seen. This isn't the most thorough solution, however.

Answers to similar questions are here and here; however, they are either several years old, Windows-specific (which is fine if need be, but not preferred in this particular case), or not at the scale I need to operate at. No one on my team has advanced technical skills relative to the SU community, so a detailed answer, or links out to relevant prerequisite knowledge, would be much, much appreciated.

Thank you so much SU

Gryph
  • What about a more general solution? Why not use md5 or SHA sums on the files to compare them and see whether they are identical? A checksum will only tell you if the files are identical, but if you have multiple files, each with the same checksum, you can, for all practical intents, be sure they are the same. – davidgo Nov 13 '17 at 21:42
  • How would I do this with multiple files vs. a single file? And won't the comparison fail because they are different files, with different compression applied to them, different OCR, etc.? I've used SHA before, but never for something this detailed, and glancing over the technical documentation, it's a bit over my head. – Gryph Nov 13 '17 at 21:48
  • If the files are not identical, this won't work. (The way you would compare identical files would be to run the checksum algorithm on each one, and check that the resulting string is the same across all files). If your files use OCR and different kinds of compression you will struggle to find a non-manual way to do an accurate comparison - although you might be able to get some trivial indication by looking at the number of pages in each file - which won't help if the pages are blank or repeated with others missing. – davidgo Nov 13 '17 at 21:51
  • _Compressing_ a PDF modifies the content in non-trivial ways (rescaling images; removing invisible and cropped content, etc.). There is no way this can be automated. Easier to rerun it (and maybe save the command files?) – Aganju Nov 14 '17 at 00:41
  • @aganju Can you clarify "save the command files"? – Gryph Nov 14 '17 at 00:47
  • I was assuming that you - after having sorted out the input files and their sequence - feed them into your Acrobat exe by way of the command line. I don't know the syntax, but something like `Acrobat.exe -compress -combine file1.pdf file97.pdf file43.pdf ...`. Maybe even multiple commands to cut certain pages out of certain files, and then combine them, etc. The full command lines should be kept in a file, and could be corrected and then run again if there is an issue. – Aganju Nov 14 '17 at 02:25
  • oh I see! Interesting. I was using the GUI 'add files'. I will look into this. – Gryph Nov 14 '17 at 02:31

2 Answers


First of all, you absolutely need some way of mapping the 1000 files to the 300 files, in order.

In the simplest case, you will have say "CIDOC Ontology 2.0 (1).pdf", "CIDOC Ontology 2.0 (2).pdf" and "CIDOC Ontology 2.0 (3).pdf" on one hand, and "CIDOC ontology.pdf" on the other.

Now, the best approaches I can figure are these:

  1. Using pdftk or pdf2json, extract the number of pages from each file in the 1000-file group, and see whether the sums correspond to the page counts in the 300-file group:

    12, 9, 10  vs.   31   = OK
    12, 9, 10  vs.   22   = BAD (and you might suspect section 2 is missing)
    

    This method is quite basic and won't recognize sections that are present but out of order (a minimal sketch of this check appears after the list).

  2. Using pdf2ps and ps2ascii, create text versions of all the files. Depending on how the PDFs were produced, these might well be next to illegible, but that matters little: with a little luck, the tool used to coalesce the files will not have changed text metrics and grouping. If so, the concatenation of the three text files will be very, very much like the fourth file's text (and if not, you'll mark it as an anomaly). So these heuristics should work:

    • the sum of the outputs of `wc` for the three files will be equal (or very close) to the output for the fourth file.
    • running the three text files, or the fourth file, through `cat file1 file2 file3 | sed -e "s#\s#\n#g" | sort` should yield almost identical word lists (the output of `diff -Bbawd` should be no more than three or four lines; ideally, none). If you omit the `| sort` stage, sections out of order become recognizable: if the sorted check matches and the unsorted one does not, you're facing a section-out-of-order situation (a sketch of this comparison appears at the end of this answer).

The sed part splits the text into individual words, which might help even if the coalescing tool did alter the text somewhat. A change in kerning, with words having been split differently inside the PDF ("homeostasis" having become "ho meos tas is" rather than "home osta sis"), would render even this insufficient; but that's not very likely.
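
To make approach 1 concrete, here is a minimal Python sketch, assuming pdftk is installed and on the PATH; the file names and the three-to-one grouping are placeholders for whatever mapping you end up with:

    #!/usr/bin/env python3
    # Sketch for approach 1: compare the page counts of the split originals
    # against the combined file, using pdftk's dump_data output, which
    # contains a line of the form "NumberOfPages: 53".
    import re
    import subprocess

    def page_count(pdf_path):
        """Return the page count reported by `pdftk <file> dump_data`."""
        out = subprocess.run(["pdftk", pdf_path, "dump_data"],
                             capture_output=True, text=True, check=True).stdout
        return int(re.search(r"NumberOfPages:\s*(\d+)", out).group(1))

    originals = ["CIDOC Ontology 2.0 (1).pdf",   # collection1 pieces (placeholders)
                 "CIDOC Ontology 2.0 (2).pdf",
                 "CIDOC Ontology 2.0 (3).pdf"]
    combined = "CIDOC ontology.pdf"              # the corresponding collection2 file

    pieces_total = sum(page_count(p) for p in originals)
    combined_total = page_count(combined)
    print("OK" if pieces_total == combined_total
          else f"BAD: {pieces_total} pages in the pieces vs. {combined_total} combined")

For 1000+ files you would drive this from a mapping table (say, a CSV listing which originals belong to which combined file) rather than hard-coding the names.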

The difficulty I see is matching the raw files with the final ones. Given a sample of each, I could probably whip up a script to run the comparison.
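
As a rough starting point for such a script, here is a hedged Python sketch of the approach 2 comparison, assuming Ghostscript's pdf2ps and ps2ascii are available and that the grouping of originals to combined file is already known; every file name here is hypothetical:

    #!/usr/bin/env python3
    # Sketch for approach 2: convert each PDF to text with pdf2ps + ps2ascii,
    # split the text on whitespace (the equivalent of the sed step), and
    # compare the pooled word counts of the pieces against the combined file.
    import subprocess
    from collections import Counter
    from pathlib import Path

    def word_counts(pdf_path):
        """Extract text via pdf2ps and ps2ascii; return a Counter of words."""
        ps_file = Path(pdf_path).with_suffix(".ps")
        txt_file = Path(pdf_path).with_suffix(".txt")
        subprocess.run(["pdf2ps", pdf_path, str(ps_file)], check=True)
        subprocess.run(["ps2ascii", str(ps_file), str(txt_file)], check=True)
        return Counter(txt_file.read_text(errors="ignore").split())

    # Hypothetical mapping: three collection1 pieces -> one collection2 file.
    pieces = ["part (1).pdf", "part (2).pdf", "part (3).pdf"]
    combined = "combined.pdf"

    pieces_words = Counter()
    for p in pieces:
        pieces_words += word_counts(p)
    combined_words = word_counts(combined)

    missing = pieces_words - combined_words   # words in the originals, not the merge
    extra = combined_words - pieces_words     # words in the merge, not the originals
    if not missing and not extra:
        print("OK: word lists match")
    else:
        print(f"BAD: {sum(missing.values())} words missing, {sum(extra.values())} extra")

This reproduces the sorted (order-insensitive) check; diffing the unsorted concatenated text against the combined file's text would give you the order-sensitive variant.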

LSerni

You could use a sequence alignment process similar to DNA sequence analysis. Specifically, a dynamic programming approach to sequence alignment.

Extract the text of each PDF in each collection, and then attempt to align each individual text sequence from Collection 1 with each longer, concatenated sequence from Collection 2. A perfect match on a letter scores one, and a mismatch scores zero; the overall score is the number of matches between the aligned sequences. You can also allow for edits between the sequences by introducing gaps.
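
As an illustration only (not a full aligner with traceback), here is a minimal dynamic-programming scorer in Python using match = 1, mismatch = 0, and gaps allowed at no penalty; the toy strings stand in for text extracted from the PDFs:

    # Dynamic-programming alignment score: match = 1, mismatch = 0, gaps free.
    # Only the score is computed (no traceback), and only two rows of the
    # table are kept, so memory stays linear rather than quadratic.
    def alignment_score(a, b):
        prev = [0] * (len(b) + 1)
        for ch_a in a:
            curr = [0] * (len(b) + 1)
            for j, ch_b in enumerate(b, start=1):
                match = prev[j - 1] + (1 if ch_a == ch_b else 0)
                gap = max(prev[j], curr[j - 1])   # skip a letter in either sequence
                curr[j] = max(match, gap)
            prev = curr
        return prev[-1]

    # Toy example: a score close to len(piece) suggests the piece's text
    # survived intact somewhere inside the combined document.
    piece = "section two text"
    combined = "section one text section two text section three text"
    print(f"{alignment_score(piece, combined)} of {len(piece)} characters aligned")

The work grows with the product of the two text lengths, so comparing word tokens instead of individual letters would shrink it considerably.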

The algorithm isn't hard, but might take a while to run. Given the dataset size you mentioned, I'm guessing it would run in a few hours or overnight.

Here's a link to the algorithm on Wikipedia: https://en.m.wikipedia.org/wiki/Sequence_alignment

KirkD_CO