How it works...

First, we import the argparse, datetime, and sys libraries along with the newly installed PyPDF2 module.

from __future__ import print_function
from argparse import ArgumentParser, FileType
import datetime
from PyPDF2 import PdfFileReader
import sys

This recipe's command-line handler accepts one positional argument, PDF_FILE, which represents the file path to the PDF to process. For this script, we need to pass an open file object to the PdfFileReader class, so we use the argparse.FileType handler to open the file for us.

parser = ArgumentParser(
    description=__description__,
    epilog="Developed by {} on {}".format(", ".join(__authors__), __date__)
)
parser.add_argument('PDF_FILE', help='Path to PDF file',
                    type=FileType('rb'))
args = parser.parse_args()
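To see what argparse.FileType hands back, the following minimal sketch parses a path and inspects the result. The temporary file, its contents, and the explicit argument list passed to parse_args() are assumptions made only so the example runs on its own; they are not part of the recipe's script.

```python
import os
import tempfile
from argparse import ArgumentParser, FileType

# Create a throwaway file to stand in for a real PDF (hypothetical content).
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n")
    sample_path = tmp.name

parser = ArgumentParser(description="FileType demonstration")
parser.add_argument("PDF_FILE", help="Path to PDF file",
                    type=FileType("rb"))

# Passing an explicit list keeps parse_args() from reading sys.argv.
args = parser.parse_args([sample_path])

# args.PDF_FILE is an already-open binary file object, not a path string.
mode = args.PDF_FILE.mode
header = args.PDF_FILE.read(5)

# Clean up the demonstration file.
args.PDF_FILE.close()
os.remove(sample_path)

print(mode, header)
```

Because the file is opened during argument parsing, a missing or unreadable path is reported by argparse as a usage error before any of the script's own logic runs.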

After providing the open file to the PdfFileReader class, we call the getXmpMetadata() method to provide an object containing the available XMP metadata. If this method returns None, we print a succinct message to the user before exiting.

pdf_file = PdfFileReader(args.PDF_FILE)

xmpm = pdf_file.getXmpMetadata()
if xmpm is None:
    print("No XMP metadata found in document.")
    sys.exit()

With the xmpm object ready, we begin extracting and printing relevant values. We extract a number of values, including the title, creators, contributors, subject, description, creation and modification dates, and event dates. These value definitions are from http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/xmp/pdfs/XMP%20SDK%20Release%20cc-2016-08/XMPSpecificationPart1.pdf. Even though many of these elements are different data types, we pass them to the custom_print() function in the same manner. Let's take a look at how this function works.

custom_print("Title: {}", xmpm.dc_title)
custom_print("Creator(s): {}", xmpm.dc_creator)
custom_print("Contributors: {}", xmpm.dc_contributor)
custom_print("Subject: {}", xmpm.dc_subject)
custom_print("Description: {}", xmpm.dc_description)
custom_print("Created: {}", xmpm.xmp_createDate)
custom_print("Modified: {}", xmpm.xmp_modifyDate)
custom_print("Event Dates: {}", xmpm.dc_date)

Since the XMP values stored may differ based on the software used to generate the PDF, we use a custom print handling function, creatively called custom_print(). This allows us, as presented here, to handle the conversion of lists, dictionaries, dates, and other values into a readable format. This function is portable and can be brought into other scripts as needed. Through a series of if-elif-else statements, the function uses the built-in isinstance() function to check whether the input value is a supported type and handles it appropriately. If the input value is an unsupported type, a warning naming that type is printed to the console instead.

def custom_print(fmt_str, value):
    # Lists are joined into a single comma-separated string
    if isinstance(value, list):
        print(fmt_str.format(", ".join(value)))
    # Dictionaries are flattened into key:value pairs
    elif isinstance(value, dict):
        fmt_value = [":".join((k, v)) for k, v in value.items()]
        print(fmt_str.format(", ".join(fmt_value)))
    # Strings and booleans can be printed as-is
    elif isinstance(value, (str, bool)):
        print(fmt_str.format(value))
    # Bytes are decoded to text first
    elif isinstance(value, bytes):
        print(fmt_str.format(value.decode()))
    # Dates are printed in ISO 8601 format
    elif isinstance(value, datetime.datetime):
        print(fmt_str.format(value.isoformat()))
    # Missing values are reported as N/A
    elif value is None:
        print(fmt_str.format("N/A"))
    else:
        print("warn: unhandled type {} found".format(type(value)))
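The same type dispatch can be exercised without a PDF. The sketch below uses a hypothetical variant, format_value(), that returns the formatted string instead of printing it, which makes the behavior of each branch easier to check. The function name and the sample values are illustrative assumptions and do not come from the recipe.

```python
import datetime


def format_value(value):
    """Return a readable string for a metadata value (hypothetical helper)."""
    if isinstance(value, list):
        return ", ".join(str(item) for item in value)
    if isinstance(value, dict):
        return ", ".join("{}:{}".format(k, v) for k, v in value.items())
    if isinstance(value, (str, bool)):
        return str(value)
    if isinstance(value, bytes):
        return value.decode()
    if isinstance(value, datetime.datetime):
        return value.isoformat()
    if value is None:
        return "N/A"
    # Anything else is reported rather than silently dropped.
    return "unhandled type {}".format(type(value).__name__)


# Made-up sample values covering each supported type.
samples = [
    ("Creator(s)", ["Alice", "Bob"]),
    ("Title", {"x-default": "Quarterly Report"}),
    ("Created", datetime.datetime(2017, 5, 1, 12, 30)),
    ("Publisher", None),
    ("Pages", 3.5),
]

for label, value in samples:
    print("{}: {}".format(label, format_value(value)))
```

Returning a string rather than printing keeps the formatting logic separate from the output step, so the same helper could feed a report or a log file.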

Our next set of metadata includes more details about the document's lineage and creation. The xmp_creatorTool attribute stores information about the software used to create the resource. Separately, we can also deduce additional lineage information based on the following two IDs:

The Document ID represents an identifier, usually stored as a GUID, that is generally assigned when the resource is saved to a new file. For example, if we create DocA.pdf and then save it as DocB.pdf, we would have two different Document IDs.
Following the Document ID is the second identifier, Instance ID. This Instance ID is usually generated once per save. An example of this identifier updating is when we update DocA.pdf with a new paragraph of text and save it with the same filename.

When editing the same PDF, you would expect the Document ID to remain the same while the Instance ID would likely update, though this behavior can vary depending on the software used.
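The relationship between the two identifiers can be sketched as a small comparison. The compare_ids() helper, the dictionary keys, and the ID strings below are all hypothetical and chosen for illustration; as noted above, real software may not follow these rules, so the result is a hint rather than proof.

```python
def compare_ids(first, second):
    """Describe how two files relate based on their XMP IDs (hypothetical)."""
    same_document = first["document_id"] == second["document_id"]
    same_instance = first["instance_id"] == second["instance_id"]
    if same_document and same_instance:
        return "same saved instance"
    if same_document:
        return "same document, different saves"
    return "different documents"


# Made-up identifiers standing in for values read from three PDFs.
doc_a_v1 = {"document_id": "uuid:1111", "instance_id": "uuid:aaaa"}
doc_a_v2 = {"document_id": "uuid:1111", "instance_id": "uuid:bbbb"}
doc_b = {"document_id": "uuid:2222", "instance_id": "uuid:cccc"}

print(compare_ids(doc_a_v1, doc_a_v1))
print(compare_ids(doc_a_v1, doc_a_v2))
print(compare_ids(doc_a_v1, doc_b))
```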

custom_print("Created With: {}", xmpm.xmp_creatorTool)
custom_print("Document ID: {}", xmpm.xmpmm_documentId)
custom_print("Instance ID: {}", xmpm.xmpmm_instanceId)

Following this, we continue extracting other common XMP metadata, including the language, publisher, resource type, and type. The resource type field should represent a Multipurpose Internet Mail Extensions (MIME) value and the type field should store a Dublin Core Metadata Initiative (DCMI) value.

custom_print("Language: {}", xmpm.dc_language)
custom_print("Publisher: {}", xmpm.dc_publisher)
custom_print("Resource Type: {}", xmpm.dc_format)
custom_print("Type: {}", xmpm.dc_type)
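If it is useful to flag values that do not follow these conventions, a rough check could look like the following. The helper names and the simplified MIME pattern are assumptions for this sketch, and it assumes the resource type is a single string while the type field is a list of strings. The set of names is the DCMI Type Vocabulary.

```python
import re

# Terms of the DCMI Type Vocabulary.
DCMI_TYPES = {
    "Collection", "Dataset", "Event", "Image", "InteractiveResource",
    "MovingImage", "PhysicalObject", "Service", "Software", "Sound",
    "StillImage", "Text",
}

# Simplified type/subtype pattern; not a full media type validator.
MIME_PATTERN = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]*/[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]*$"
)


def looks_like_mime(value):
    """Return True if value has the general shape of a MIME type."""
    return isinstance(value, str) and bool(MIME_PATTERN.match(value))


def unknown_dcmi_types(values):
    """Return the entries that are not DCMI Type Vocabulary terms."""
    return [value for value in values if value not in DCMI_TYPES]


print(looks_like_mime("application/pdf"))
print(looks_like_mime("pdf"))
print(unknown_dcmi_types(["Text", "Brochure"]))
```

A value that fails either check is not necessarily wrong; it may simply reflect how the authoring software chose to populate the field.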

Lastly, we extract any custom properties saved by the software. Since this value should be a dictionary, we iterate over its items and print each key and value directly, without our custom_print() function.

if xmpm.custom_properties:
    print("Custom Properties:")
    for k, v in xmpm.custom_properties.items():
        print("\t{}: {}".format(k, v))

When we execute the script, we can quickly see many of the attributes stored within the PDF. Notice how the Document ID does not match the Instance ID; this suggests that the document may have been modified from the original PDF.
