How it works...

First, we import the argparse and datetime libraries, followed by the xml.etree and zipfile libraries. The ElementTree module allows us to read an XML string into an object that we can iterate through and interpret.

from __future__ import print_function
from argparse import ArgumentParser
from datetime import datetime as dt
from xml.etree import ElementTree as etree
import zipfile

This recipe’s command-line handler takes one positional argument, Office_File, the path to the office file we will be extracting metadata from.

parser = ArgumentParser(
    description=__description__,
    epilog="Developed by {} on {}".format(", ".join(__authors__), __date__)
)
parser.add_argument("Office_File", help="Path to office file to read")
args = parser.parse_args()
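
To try the argument handler without a shell, a minimal sketch can pass an explicit list to parse_args() in place of the real command line. The description string and the sample.docx path are placeholders chosen for illustration; they are not values from the recipe.

```python
from argparse import ArgumentParser

# Placeholder description; the recipe supplies its own __description__ value.
parser = ArgumentParser(description="Read Office document metadata")
parser.add_argument("Office_File", help="Path to office file to read")

# An explicit list stands in for sys.argv[1:]; 'sample.docx' is hypothetical.
args = parser.parse_args(["sample.docx"])
print(args.Office_File)
```

The positional argument is exposed as an attribute with the same name, which is why the rest of the recipe refers to args.Office_File.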

Following our argument handling, we check to make sure the input file is a zipfile and raise an error if it is not. If it is, we open the valid ZIP file using the ZipFile class before accessing the two XML documents containing the metadata we are interested in. Though there are other XML files containing data describing the document, the two with the most metadata are named core.xml and app.xml. We will open the two XML files from the ZIP container with the read() method and send the returned string directly to the etree.fromstring() XML parsing method.

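# Check if input file is a zipfile and stop if it is not
if not zipfile.is_zipfile(args.Office_File):
    raise ValueError("{} is not a valid ZIP container".format(args.Office_File))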

# Open the file (MS Office 2007 or later)
zfile = zipfile.ZipFile(args.Office_File)

# Extract key elements for processing
core_xml = etree.fromstring(zfile.read('docProps/core.xml'))
app_xml = etree.fromstring(zfile.read('docProps/app.xml'))
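
The check, open, and parse sequence can be exercised without a real Office document. The following is a sketch only: it builds a tiny ZIP container in memory that holds a heavily trimmed, hypothetical core.xml, then reads it back the same way the recipe does. The sample XML content and the choice of ValueError are assumptions made for illustration.

```python
import io
import zipfile
from xml.etree import ElementTree as etree

# Hypothetical, heavily trimmed core.xml; real files carry more fields.
CORE_XML = (
    '<cp:coreProperties '
    'xmlns:cp="http://schemas.openxmlformats.org/package/2006/'
    'metadata/core-properties" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:creator>Example Author</dc:creator>'
    '</cp:coreProperties>'
)

# Build a small ZIP container in memory instead of reading one from disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("docProps/core.xml", CORE_XML)

# is_zipfile() accepts a path or a file-like object and returns a bool,
# so its result has to be tested explicitly.
buffer.seek(0)
if not zipfile.is_zipfile(buffer):
    raise ValueError("Input is not a ZIP container")

# read() returns the member's bytes, which fromstring() parses directly.
buffer.seek(0)
with zipfile.ZipFile(buffer) as zfile:
    core_xml = etree.fromstring(zfile.read("docProps/core.xml"))

print(core_xml.tag)
print([child.tag for child in core_xml])
```

Note that ElementTree reports each tag in {namespace}name form, which matters for how the later loops match tags against the mapping keys.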

With the prepared XML objects, we can start extracting data of interest. We set up a dictionary, called core_mapping, whose keys are the tag names we want to extract and whose values are the titles we want to display them under. This approach allows us to easily print only the values important to us, if present, with a friendly title. This XML file contains valuable information about the authorship of the file. For instance, the two authorship fields, creator and lastModifiedBy, can reveal cases where one user account modified a document created by another. The date values tell us when the document was created and last modified. Additionally, metadata fields like revision can give some indication of the number of versions of this document.

# Core.xml tag mapping 
core_mapping = {
    'title': 'Title',
    'subject': 'Subject',
    'creator': 'Author(s)',
    'keywords': 'Keywords',
    'description': 'Description',
    'lastModifiedBy': 'Last Modified By',
    'modified': 'Modified Date',
    'created': 'Created Date',
    'category': 'Category',
    'contentStatus': 'Status',
    'revision': 'Revision'
}

In our for loop, we iterate over the child elements of the core.xml root to access each of its tags. Using the core_mapping dictionary, we can selectively output specific fields if they are found. We have also added logic to interpret date values using the strptime() method.

for element in core_xml:
    for key, title in core_mapping.items():
        if key in element.tag:
            if 'date' in title.lower():
                text = dt.strptime(element.text, "%Y-%m-%dT%H:%M:%SZ")
            else:
                text = element.text
            print("{}: {}".format(title, text))
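
The loop above assumes every mapped tag has text and that every date uses the exact format passed to strptime(). As a hedged sketch, the variant below keeps the same substring matching but guards against empty tags and unexpected date formats. The format_value helper and the sample XML are hypothetical and exist only for illustration.

```python
from datetime import datetime as dt
from xml.etree import ElementTree as etree

# Hypothetical sample resembling a trimmed core.xml, with an empty title tag.
SAMPLE = (
    '<cp:coreProperties '
    'xmlns:cp="http://schemas.openxmlformats.org/package/2006/'
    'metadata/core-properties" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/" '
    'xmlns:dcterms="http://purl.org/dc/terms/">'
    '<dc:title/>'
    '<dc:creator>Example Author</dc:creator>'
    '<dcterms:created>2018-01-02T03:04:05Z</dcterms:created>'
    '</cp:coreProperties>'
)

core_mapping = {
    'title': 'Title',
    'creator': 'Author(s)',
    'created': 'Created Date',
}


def format_value(title, text):
    """Hypothetical helper: parse date fields, pass other text through."""
    if text is None:
        return None  # Empty tags have no text to interpret
    if 'date' in title.lower():
        try:
            return dt.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
        except ValueError:
            return text  # Keep the raw string if the format differs
    return text


results = {}
for element in etree.fromstring(SAMPLE):
    for key, title in core_mapping.items():
        if key in element.tag:
            results[title] = format_value(title, element.text)
            print("{}: {}".format(title, results[title]))
```

Returning the raw string on a parsing failure is a design choice for this sketch; in an examination it may be preferable to log the unexpected value instead.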

The next set of tag mappings focuses on the app.xml file. This file contains statistical information about the contents of the document, including total edit time and counts of words, pages, and slides. It also contains information about the company name registered with the software and hidden elements. To print these values to the console, we use the same nested for loops as we did with the core.xml file.

app_mapping = {
    'TotalTime': 'Edit Time (minutes)',
    'Pages': 'Page Count',
    'Words': 'Word Count',
    'Characters': 'Character Count',
    'Lines': 'Line Count',
    'Paragraphs': 'Paragraph Count',
    'Company': 'Company',
    'HyperlinkBase': 'Hyperlink Base',
    'Slides': 'Slide count',
    'Notes': 'Note Count',
    'HiddenSlides': 'Hidden Slide Count',
}
for element in app_xml:
    for key, title in app_mapping.items():
        if key in element.tag:
            if 'date' in title.lower():
                text = dt.strptime(element.text, "%Y-%m-%dT%H:%M:%SZ")
            else:
                text = element.text
            print("{}: {}".format(title, text))
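
Because the substring test checks whether a key appears anywhere in the tag, a key such as Slides also matches the HiddenSlides tag. One possible alternative, sketched here as an assumption and not as the recipe's method, is to strip the namespace and look up the exact tag name. The extract_fields helper and the sample XML are hypothetical.

```python
from xml.etree import ElementTree as etree

# Hypothetical, trimmed app.xml content for illustration only.
APP_SAMPLE = (
    '<Properties xmlns="http://schemas.openxmlformats.org/'
    'officeDocument/2006/extended-properties">'
    '<TotalTime>12</TotalTime>'
    '<Slides>10</Slides>'
    '<HiddenSlides>2</HiddenSlides>'
    '</Properties>'
)

app_mapping = {
    'TotalTime': 'Edit Time (minutes)',
    'Slides': 'Slide count',
    'HiddenSlides': 'Hidden Slide Count',
}


def extract_fields(root, mapping):
    """Return (title, text) pairs for exactly matching tags (hypothetical)."""
    found = []
    for element in root:
        # Tags look like '{namespace}Name'; keep only the part after '}'.
        name = element.tag.rsplit('}', 1)[-1]
        if name in mapping:
            found.append((mapping[name], element.text))
    return found


fields = extract_fields(etree.fromstring(APP_SAMPLE), app_mapping)
for title, text in fields:
    print("{}: {}".format(title, text))
```

With the exact lookup, each tag yields at most one line of output, and the same function could be reused for the core.xml mapping if date parsing were added.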

When we run the script against a sample Word document, as the following shows, a number of details about the document are displayed.

Separately, we can use the script on a PPTX document and review format-specific metadata associated with PPTX files:
