Compare/Check PDF filename and contents

Question

We currently have an application that generates a PDF document and automatically names it based on {UniqueID-DocCode-StartDate-StartTime}. All of this data comes from a database via our application. We have one major problem.

PDF contents and filenames are getting mixed up, e.g.

Filename: 123456-Doc001-28042017-1415.pdf

Contents: 987654-Doc002-28042017-1312

My problem is identifying the pdfs that have failed (contents != filename) and re-triggering them.

Each value in the filename should appear somewhere in the contents, but the contents are structured as a letter, so a direct comparison wouldn't work. The documents also vary dramatically in length depending on how complicated the contents are.

So, my wish list would be:

Ideally, check for each parameter from the filename. However, just being able to check UniqueID would be sufficient.
A way of moving failed files, renaming them, or reporting them back in a list.
Run as a scheduled job or continuously against a directory.

Let me know if there is any particular info you need and I should be able to get it to you.

So you are confirming the content belongs to the file by confirming at least one string is found that matches the file name minus the extension i.e. `123456-Doc001-28042017-1415.pdf` contains at least one match on one line in the file for `123456-Doc001-28042017-1415`?? — Vomit IT - Chunky Mess Style, Apr 28 '17 at 04:50
It'd be helpful to know what system this is running on or needs to run from such as Linux, Windows, etc. too. — Vomit IT - Chunky Mess Style, Apr 28 '17 at 04:53
Hi @Spittin'IT - At a high level the file would contain each of the parameters, but not together, split up around the contents of the file. e.g. Hi ID... on the spine of the pdf is DocCode, and date and time would be in the doc referencing to it. Running on windows. Would have access to powershell. — Taz, Apr 28 '17 at 05:02
Do you already use a method to search the documents (in a non-bulk manner) where you can find each of the parameters to build the file name separated by the dash? Have you confirmed if the PDF document content is in a searchable text format and if so can you confirm that all the DB fields values or parameters that make up the file name are all searchable? I assume these aren't PDF images content wise but text converted to PDF format, correct? — Vomit IT - Chunky Mess Style, Apr 28 '17 at 05:22
This sounds more like a coding issue with your application rather than the output needing to be checked. — Sorean, Apr 28 '17 at 06:15
If you generate the information based on the data you have and you end up with the wrong result you probably should think about fixing your generation. If you really don't want to do this, make sure to populate the meta information fields for those files with the correct information. That way you might have an easier time than actual handling the PDF ([PS example](https://social.technet.microsoft.com/Forums/ie/en-US/e1c1f26b-6f9d-45ae-bb8c-5f4d4e38058a/powershell-script-to-read-metadata-info-from-pictures?forum=winserverpowershell)). — Seth, Apr 28 '17 at 07:19
Thanks everyone for the feedback, we are very limited in what we can change in the application. The vendor that supports the application is withdrawing from AU so no further developments are taking place. Therefore, we're stuck with fixing the output, rather than the problem. — Taz, Apr 30 '17 at 23:59
@Spittin'IT can I have this reopened as I have found a solution. — Taz, May 09 '17 at 00:42

score 0 · Accepted Answer · answered May 11 '17 at 00:36

The PowerShell script below converts each PDF to text, which is stored in a temp.txt file, and then searches that text for part of the filename. The filename is split on a delimiter, and an index selects which part to compare. This runs for every file in the directory whose name ends with .pdf. Files that do not match are listed in error.log.

We had to use a third-party .exe (pdftotext.exe) to convert the PDF to text.

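# Folder that holds the PDFs, pdftotext.exe and the log files
$path = "C:\brokenPDFs"

$output    = Join-Path $path "output.log"
$errorpath = Join-Path $path "error.log"
$tempfile  = Join-Path $path "temp.txt"

# Create the log files if needed, then empty them for this run
"Start:" | Out-File $output
"Start:" | Out-File $errorpath

Clear-Content $output
Clear-Content $errorpath

# Third-party converter, expected to sit in the same folder
$exe = Join-Path $path "pdftotext.exe"

# Delimiter that separates the parts of the filename.
# The example names in the question use "-".
$delimiter = "^"

$errorcount = 0

$files = Get-ChildItem -Path $path -Filter *.pdf

foreach ($currentfile in $files)
{
    # BaseName is the filename without the .pdf extension
    $filename  = $currentfile.BaseName
    $splitname = $filename.Split($delimiter)
    $currentUR = $splitname[0]      # index of the part to compare

    # Remove the previous file's text so a failed conversion
    # cannot produce a false match
    if (Test-Path $tempfile) { Remove-Item $tempfile }

    # Convert the PDF to plain text
    & $exe $currentfile.FullName $tempfile

    # -SimpleMatch searches for the literal string instead of a regex
    $result = $false
    if (Test-Path $tempfile)
    {
        $result = Select-String -Path $tempfile -Pattern $currentUR -SimpleMatch -Quiet
    }

    $match = $currentfile.FullName

    if ($result)
    {
        "Match on string :  $currentUR  in file :  $match" | Out-File $output -Append
    }
    else
    {
        # Also reached when the conversion produced no text file
        "String not found:  $currentUR  missing from file :  $match" | Out-File $errorpath -Append
        Write-Host "ERROR: $($currentfile.Name) missing $currentUR"
        $errorcount++
    }
}

Write-Host "Total Errors: $errorcount"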
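
The script above only checks the first part of the filename and reports failures in a log. As a rough sketch of the other wish-list items (checking several parameters and moving failed files), something like the following might work. The function names `Get-MissingNameParts` and `Move-MismatchedPdf`, the `failed` subfolder and the `-` delimiter are assumptions made for illustration and are not part of the original script. It also assumes each checked value appears in the PDF text exactly as it does in the filename, which may not hold for the date and time, so only the UniqueID (index 0) is checked by default.

```powershell
# Hypothetical helper: returns the filename parts that do NOT occur in the text.
function Get-MissingNameParts {
    param(
        [string] $FileName,            # e.g. "123456-Doc001-28042017-1415.pdf"
        [string] $Text,                # text extracted from the PDF
        [string] $Delimiter = "-",     # assumption: parts are dash-separated
        [int[]]  $Indexes   = @(0)     # which parts to check; 0 = UniqueID
    )

    $baseName = [System.IO.Path]::GetFileNameWithoutExtension($FileName)
    $parts    = $baseName.Split($Delimiter)

    foreach ($i in $Indexes) {
        # String.Contains is a literal, case-sensitive search (no regex)
        if ($i -lt $parts.Count -and -not $Text.Contains($parts[$i])) {
            $parts[$i]
        }
    }
}

# Hypothetical wrapper: converts every PDF in $Path and moves mismatches
# into a "failed" subfolder (folder name is an assumption).
function Move-MismatchedPdf {
    param(
        [string] $Path,
        [int[]]  $Indexes = @(0)
    )

    $exe       = Join-Path $Path "pdftotext.exe"
    $tempfile  = Join-Path $Path "temp.txt"
    $failedDir = Join-Path $Path "failed"
    New-Item -ItemType Directory -Path $failedDir -Force | Out-Null

    foreach ($pdf in Get-ChildItem -Path $Path -Filter *.pdf -File) {
        # Start from a clean temp file so a failed conversion is not
        # compared against the previous PDF's text
        if (Test-Path $tempfile) { Remove-Item $tempfile }
        & $exe $pdf.FullName $tempfile

        $text = if (Test-Path $tempfile) { Get-Content $tempfile -Raw } else { "" }

        $missing = @(Get-MissingNameParts -FileName $pdf.Name -Text $text -Indexes $Indexes)
        if ($missing.Count -gt 0) {
            Write-Host "ERROR: $($pdf.Name) missing $($missing -join ', ')"
            Move-Item -Path $pdf.FullName -Destination $failedDir
        }
    }
}
```

Calling `Move-MismatchedPdf -Path "C:\brokenPDFs" -Indexes 0, 1` would check both the UniqueID and the DocCode. Saved as a script, this could be run on a schedule with Windows Task Scheduler.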
