Chapter 12. Metadata

This chapter will explore the various ways in which metadata can be incorporated into a PDF file, from the simplest document-level strings to rich XML attached to individual objects.

The Document Information Dictionary

It was clear even with the original 1.0 version of PDF that the presence of metadata was a requirement for any file format, and certainly one that would be representing documents for electronic distribution and storage. For this purpose, the document information dictionary (or info dictionary, or even just info dict) was created (see Example 12-1).

As the name implies, the info dictionary is a standard PDF dictionary object. However, unlike every other object you’ve encountered so far, this object is referenced not from the catalog, but instead from the trailer. The original PDF 1.0 specification documented four (optional) keys for this dictionary, each one allowing only a string value encoded in PDFDocEncoding.

Author
The name of the person(s) who created the document.
CreationDate
The date and time the document was created, formatted as a date.

Note

Dates, as a type of string, were added to PDF in version 1.1, so very early PDF files may have the value of this key as a simple string.

Creator
The software used to author the original document that was used as the basis for conversion to PDF. If the PDF was created directly, the value may be left blank or may be the same as the Producer.
Producer
The name of the product that created the PDF.

In PDF 1.1, four additional (optional) keys were added, each allowing a string value encoded in PDFDocEncoding:

Title
The document’s title.
Subject
The document’s subject.
Keywords
Any keywords associated with this document.
ModDate
The date and time the document was most recently modified, formatted as a date.

While the PDF specification documented only those eight keys, developers and users were originally free to add additional keys to the dictionary whose values could be of any type. Later, the PDF specification (and now ISO 32000-1 itself) restricted the values to only those of type text string (see String Objects).

Note

Since ISO 32000-1 allows only values of type text string, developers cannot store more complex information, such as a dictionary or a stream, as the value of an info dictionary key.

Example 12-1. Example info dictionary
1 0 obj
<<
   /Title (PostScript Language Reference, Third Edition)
   /Author (Adobe Systems Incorporated)
   /Creator (Adobe FrameMaker 5.5.3 for Power Macintosh®)
   /Producer (Acrobat Distiller 3.01 for Power Macintosh)
   /CreationDate (D:19970915110347-08'00')
   /ModDate (D:19990209153925-08'00')
   /DEV1_CustomKey (Here is a sample custom key using a proper second class name)
   /CustomKey2 (Here is a sample custom key improperly using a first class name)
>>
endobj
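
The CreationDate and ModDate values in Example 12-1 use the PDF date form D:YYYYMMDDHHmmSS followed by an optional time zone offset. As a minimal sketch of how such a string might be converted for use in a program, the following Python function parses it into a datetime. The name parse_pdf_date is hypothetical, and the pattern assumes that everything after the year may be omitted (missing month and day default to 1, missing time fields to 0).

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional Z, +HH'mm' or -HH'mm'
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+\-])(?:(\d{2})'?(?:(\d{2})'?)?)?)?$"
)


def parse_pdf_date(value):
    """Parse a PDF date string into a datetime (hypothetical helper)."""
    m = _PDF_DATE.match(value)
    if m is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    year = int(m.group(1))
    # Fields that are absent fall back to their defaults.
    month, day, hour, minute, second = (
        int(g) if g is not None else default
        for g, default in zip(m.groups()[1:6], (1, 1, 0, 0, 0))
    )
    sign, off_h, off_m = m.group(7), m.group(8), m.group(9)
    if sign is None:
        tz = None  # no offset given: return a naive datetime
    elif sign in "Zz":
        tz = timezone.utc
    else:
        delta = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
        tz = timezone(delta if sign == "+" else -delta)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)
```

Calling parse_pdf_date("D:19970915110347-08'00'") on the CreationDate from Example 12-1 yields a time zone aware datetime for September 15, 1997, 11:03:47 at UTC-8.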

Metadata Streams

As you can see, the info dictionary has a number of limitations, including data typing and handling of complex structures (such as arrays or dictionaries), not to mention being associated with only the document as a whole and not individual objects in the PDF.

To address these concerns, a new type of metadata was introduced called a metadata stream (because, as you can probably guess, it’s stored as a stream object). These streams can be associated not only with the document but with any object in it (though some are more likely to have them than others). As these are streams, they can have any of the standard compression or encoding filters applied to the data. However, it is strongly recommended that at least the document-level metadata stream be stored in plain text. In fact, some of the PDF standards specifically require that the document-level metadata stream be stored in plain text (see Chapter 13).

Note

Although the reason for recommending plain-text metadata streams is to enable non-PDF-aware tools to examine, catalog, and classify documents, it turns out that doing so may actually be problematic if incremental updates are made to the document. When the document is updated, there will be a second (updated) metadata stream, which will confuse a non-PDF-aware tool.
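
To make the problem concrete, here is a rough sketch of what a non-PDF-aware tool might do: scan the raw bytes of a file for XMP packet wrappers (the xpacket processing instructions) without interpreting any PDF structure. The function name find_xmp_packets is hypothetical, and the approach only works when the metadata stream is stored as plain text.

```python
import re

# An XMP packet starts with an xpacket "begin" processing instruction
# and ends with an xpacket "end" processing instruction.
_PACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>"
    rb".*?"
    rb"<\?xpacket end=[\"'][rw][\"']\?>",
    re.DOTALL,
)


def find_xmp_packets(data):
    """Return every XMP packet found in the raw bytes, in file order."""
    return [m.group(0) for m in _PACKET.finditer(data)]
```

For a file that has been incrementally updated, a scan like this may return two or more packets. Nothing in the raw bytes alone tells the tool which one is current; only the PDF structure does.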

The contents of a metadata stream are in a specific Extensible Markup Language (XML) grammar known as the Extensible Metadata Platform (XMP), which has been standardized as ISO 16684-1.

XMP

In XMP, metadata consists of a set of properties. Properties are always associated with a particular entity (referred to as a resource). That is, the properties are “about” the resource. Any given property has a name and a value. Conceptually, each property makes a statement about a resource of the form, “The property_name of resource is property_value.” For example, “The author of Moby Dick is Herman Melville.” This statement is represented by metadata in which the resource is the book Moby Dick, the property name is author, and the property value is Herman Melville (see Example 12-2).

Example 12-2. Example XMP
<xmp:CreateDate>1851-08-18</xmp:CreateDate>
<xmp:CreatorTool>Ink and Paper</xmp:CreatorTool>
 <dc:creator>
    <rdf:Seq>
       <rdf:li>Herman Melville</rdf:li>
    </rdf:Seq>
 </dc:creator>
 <dc:title>
    <rdf:Alt>
       <rdf:li xml:lang="x-default">Moby Dick</rdf:li>
    </rdf:Alt>
 </dc:title>

All property, structure field, and qualifier names in XMP must be legal XML qualified names: well-formed XML names that belong to an XML namespace. This requirement is inherited from RDF (Resource Description Framework), the technology on which XMP is based.
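
As an illustration of namespace-qualified names, the properties from Example 12-2 could be generated with Python's standard xml.etree.ElementTree module. This is only a sketch: the helper names q and build_description are hypothetical, and wrapping the properties in an rdf:Description element is an assumption about the surrounding context that Example 12-2 omits.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
XMP = "http://ns.adobe.com/xap/1.0/"
XML = "http://www.w3.org/XML/1998/namespace"

# Ask ElementTree to use the conventional prefixes when serializing.
for prefix, uri in (("rdf", RDF), ("dc", DC), ("xmp", XMP)):
    ET.register_namespace(prefix, uri)


def q(namespace, local_name):
    """Build the {namespace}local form that ElementTree expects."""
    return f"{{{namespace}}}{local_name}"


def build_description():
    """Build an rdf:Description holding the Example 12-2 properties."""
    desc = ET.Element(q(RDF, "Description"), {q(RDF, "about"): ""})
    # Simple properties
    ET.SubElement(desc, q(XMP, "CreateDate")).text = "1851-08-18"
    ET.SubElement(desc, q(XMP, "CreatorTool")).text = "Ink and Paper"
    # Ordered array (rdf:Seq) of creators
    creator = ET.SubElement(desc, q(DC, "creator"))
    seq = ET.SubElement(creator, q(RDF, "Seq"))
    ET.SubElement(seq, q(RDF, "li")).text = "Herman Melville"
    # Language alternatives (rdf:Alt) for the title
    title = ET.SubElement(desc, q(DC, "title"))
    alt = ET.SubElement(title, q(RDF, "Alt"))
    item = ET.SubElement(alt, q(RDF, "li"), {q(XML, "lang"): "x-default"})
    item.text = "Moby Dick"
    return desc
```

Serializing the result with ET.tostring(build_description(), encoding="unicode") produces the same elements as Example 12-2, with the namespace declarations placed on the enclosing rdf:Description element.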

Schemas

An XMP Schema is a set of top-level property names in a common XML namespace, along with their data types and descriptive information. Typically, an XMP Schema contains properties that are relevant for particular types of documents or for certain stages of a workflow. There exists a set of standard schemas, as well as a mechanism for how to define new schemas.

Note

The term “XMP Schema” is used here to clearly distinguish this concept from other uses of the term “schema,” and notably from the W3C XML Schema language. An XMP Schema is typically less formal and defined by documentation instead of a machine-readable schema file.

An XMP Schema is identified by its XML namespace URI. It also has an associated namespace prefix that can take any value, though there are common ones in use. This use of namespaces avoids conflict between properties in different schemas that have the same name but different meanings. For example, two independently designed schemas might have a creator property: in one, it might mean the person who created a resource; and in another, the application used to create the resource.
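
The following sketch shows why the namespace URI, and not the prefix, is what identifies a property. It uses Python's xml.etree.ElementTree, which reports each element name in expanded form as {namespace-URI}local-name. The function name expanded_names, the r wrapper element, and the http://example.com/ns/app/ namespace are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET


def expanded_names(xml_text):
    """Return the expanded {uri}name of each child of the root element."""
    return [child.tag for child in ET.fromstring(xml_text)]


# Same namespace URI, different prefixes: the same property.
doc_a = '<r xmlns:dc="http://purl.org/dc/elements/1.1/"><dc:creator/></r>'
doc_b = '<r xmlns:dublin="http://purl.org/dc/elements/1.1/"><dublin:creator/></r>'

# Same local name, different namespace URI: a different property.
doc_c = '<r xmlns:app="http://example.com/ns/app/"><app:creator/></r>'
```

Here expanded_names(doc_a) and expanded_names(doc_b) return the same list even though the prefixes differ, while expanded_names(doc_c) returns a different name for its creator property.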

The term “top-level” distinguishes the root properties in an XMP Schema from the named fields of a structure within a property value. By convention, an XMP Schema defines its top-level properties, but the names of structure fields are part of the data type information.

The data types that can represent the values of XMP properties fall into three basic categories: simple types, structures, and arrays. Since XMP metadata is stored as XML, values of all types are written as Unicode strings. Example 12-3 shows a simple metadata stream.

Example 12-3. An example metadata stream
157 0 obj
<<
    /Length 4520
    /Subtype/XML
    /Type/Metadata
>>
stream
<?xpacket begin="﻿" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/"
 x:xmptk="Adobe XMP Core 5.1-c004 1.136136, 2010/05/14-18:06:40">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:xmp="http://ns.adobe.com/xap/1.0/">
         <xmp:ModifyDate>2010-07-06T19:33:16-03:00</xmp:ModifyDate>
         <xmp:CreateDate>2010-07-06T19:33:03-03:00</xmp:CreateDate>
         <xmp:MetadataDate>2010-07-06T19:33:16-03:00</xmp:MetadataDate>
         <xmp:CreatorTool>Acrobat PDFMaker 10.0 for Word</xmp:CreatorTool>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
         <xmpMM:DocumentID>uuid:483d2fca-113d-4c81-b650-a39a67866aa6</xmpMM:DocumentID>
         <xmpMM:InstanceID>uuid:83008e27-bcc3-4480-a03a-13dc46d7f1f5</xmpMM:InstanceID>
         <xmpMM:subject>
            <rdf:Seq>
               <rdf:li>127</rdf:li>
            </rdf:Seq>
         </xmpMM:subject>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/">
         <dc:format>application/pdf</dc:format>
         <dc:title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">ISO TC 171/SC 2/WG5</rdf:li>
            </rdf:Alt>
         </dc:title>
         <dc:description>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">ISO/WD 19005-2</rdf:li>
            </rdf:Alt>
         </dc:description>
         <dc:creator>
            <rdf:Seq>
               <rdf:li>Leonard Rosenthol</rdf:li>
            </rdf:Seq>
         </dc:creator>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
         <pdf:Producer>Adobe PDF Library 10.0</pdf:Producer>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/">
         <pdfx:SourceModified>D:20100706222950</pdfx:SourceModified>
         <pdfx:Company>AIIM</pdfx:Company>
         <pdfx:Manager>Betsy Fanning</pdfx:Manager>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:photoshop="http://ns.adobe.com/photoshop/1.0/">
         <photoshop:headline>
            <rdf:Seq>
               <rdf:li>ISO/WD 19005-2</rdf:li>
            </rdf:Seq>
         </photoshop:headline>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
endstream
endobj

In this example, you can see the various aspects of XMP that were mentioned previously: RDF, multiple namespaces (including dc, xmp, pdf, and xmpMM), and both simple (xmp:CreateDate) and array (dc:creator) types.
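
Reading such a packet back is largely a matter of namespace-aware XML parsing. The sketch below uses Python's xml.etree.ElementTree to look up one property by its qualified name, returning a string for a simple value and a list for an array. The function name read_property and the NS prefix table are hypothetical. The sketch also assumes that properties are written as child elements of rdf:Description (as in Example 12-3), and it ignores structures and qualifiers.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI table used for lookups; prefixes here are only local names
# for the namespace URIs, which are what actually identify each schema.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "xmp": "http://ns.adobe.com/xap/1.0/",
    "pdf": "http://ns.adobe.com/pdf/1.3/",
}


def read_property(xmp_xml, qname):
    """Return a simple value (str), an array (list of str), or None."""
    root = ET.fromstring(xmp_xml)
    for desc in root.iter(f"{{{NS['rdf']}}}Description"):
        elem = desc.find(qname, NS)
        if elem is None:
            continue
        # Arrays are wrapped in rdf:Seq, rdf:Bag, or rdf:Alt containers.
        items = []
        for container in ("rdf:Seq", "rdf:Bag", "rdf:Alt"):
            items.extend(elem.findall(f"{container}/rdf:li", NS))
        if items:
            return [li.text for li in items]
        return elem.text
    return None
```

Applied to the packet in Example 12-3, read_property(packet, "dc:creator") would return a one-item list containing the creator's name, and read_property(packet, "xmp:CreatorTool") would return a plain string.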

XMP in PDF

The primary metadata stream is for the document itself and is the value of the Metadata entry in the document catalog (see Example 12-4). In addition, any stream or dictionary object may have metadata attached to it via its Metadata entry. It is recommended that you place the Metadata entry on the dictionary or stream that represents the data itself (such as a font or image).

Along these lines, metadata may also be associated with marked content within a content stream. This association is created by including an entry in the property list dictionary whose key is Metadata and whose value is the metadata stream.

Example 12-4. Example catalog with metadata
485 0 obj
<<
    /Type/Catalog
    /Metadata 54 0 R
    /Pages 466 0 R
    /ViewerPreferences<</Direction/L2R>>
>>
endobj

54 0 obj
<<
    /Type/Metadata
    /Subtype/XML
    /Length 71746
>>
stream
<?xpacket begin="﻿" id="W5M0MpCehiHzreSzNTczkc9d"?>
    <x:xmpmeta xmlns:x="adobe:ns:meta/"
               x:xmptk="Adobe XMP Core 5.2-c001 63.139439, 2010/09/27-13:37:26        ">
        <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <rdf:Description rdf:about=""
               xmlns:xmp="http://ns.adobe.com/xap/1.0/">
        <xmp:CreateDate>2011-04-25T15:33:20Z</xmp:CreateDate>
        <xmp:CreatorTool>Microsoft PowerPoint</xmp:CreatorTool>
        <xmp:ModifyDate>2011-04-25T10:34:09-05:00</xmp:ModifyDate>
        <xmp:MetadataDate>2011-04-25T10:34:09-05:00</xmp:MetadataDate>
        </rdf:Description>
        <rdf:Description rdf:about=""
                         xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
            <pdf:Keywords/>
            <pdf:Producer>Adobe Mac PDF Plug-in</pdf:Producer>
        </rdf:Description>
        <rdf:Description rdf:about=""
                         xmlns:dc="http://purl.org/dc/elements/1.1/">
            <dc:format>application/pdf</dc:format>
            <dc:creator>
                <rdf:Seq>
                    <rdf:li>Rick</rdf:li>
                </rdf:Seq>
            </dc:creator>
            <dc:title>
                <rdf:Alt>
                    <rdf:li xml:lang="x-default">Presentation2.pptx</rdf:li>
                </rdf:Alt>
            </dc:title>
        </rdf:Description>
        <rdf:Description rdf:about=""
                         xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
        <xmpMM:DocumentID>uuid:51b92418-85ef-f843-ba4d-9c6ea3482287</xmpMM:DocumentID>
        <xmpMM:InstanceID>uuid:5506218e-628b-8046-8af4-f2eb28096824</xmpMM:InstanceID>
        </rdf:Description>
        </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
endstream
endobj
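
When writing a metadata stream like object 54 above, the one value that must be computed is Length, which counts the bytes of stream data between the stream and endstream keywords (the end-of-line marker just before endstream is not counted). The sketch below, with the hypothetical function name make_metadata_stream, wraps an XMP packet in an uncompressed metadata stream object. It assumes generation number 0 and leaves the cross-reference table and the catalog's Metadata entry to the caller.

```python
def make_metadata_stream(obj_num, xmp_packet):
    """Wrap XMP packet bytes in an uncompressed PDF metadata stream object."""
    header = (
        f"{obj_num} 0 obj\n"
        "<<\n"
        "    /Type /Metadata\n"
        "    /Subtype /XML\n"
        f"    /Length {len(xmp_packet)}\n"  # byte count of the packet only
        ">>\n"
        "stream\n"
    ).encode("ascii")
    # The newline before "endstream" is not part of the stream data.
    return header + xmp_packet + b"\nendstream\nendobj\n"
```

Because no filter is applied, the packet remains readable as plain text in the file, which is what the recommendation for document-level metadata asks for.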

XMP versus the Info Dictionary

Although the XMP-based document-level metadata referenced from the document’s catalog dictionary is the canonical metadata in the document, it may be superseded by updated metadata in the document information dictionary. This is to handle the case where an older PDF processor updates only the information dictionary.

Both XMP and the information dictionary provide a date stamp. If the date stamp in the XMP is equal to or later than the modification date in the document information dictionary, the XMP metadata is taken as authoritative. If, however, the modification date in the document information dictionary is later than the XMP metadata’s date stamp, the information stored in the document information dictionary will override any semantically equivalent items in the XMP metadata.
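
The rule above can be sketched in a few lines of Python. The function name reconcile is hypothetical, and for simplicity both metadata sets are represented as flat dictionaries of strings, which glosses over the fact that properties such as dc:title are arrays in real XMP. The INFO_TO_XMP table lists the commonly used correspondence between info dictionary keys and XMP properties; treat it as an assumption to verify against the specification you are targeting.

```python
from datetime import datetime, timedelta, timezone

# Commonly used correspondence between info dictionary keys and XMP
# properties (an assumption for this sketch; verify before relying on it).
INFO_TO_XMP = {
    "Title": "dc:title",
    "Author": "dc:creator",
    "Subject": "dc:description",
    "Keywords": "pdf:Keywords",
    "Creator": "xmp:CreatorTool",
    "Producer": "pdf:Producer",
    "CreationDate": "xmp:CreateDate",
    "ModDate": "xmp:ModifyDate",
}


def reconcile(xmp, xmp_date, info, info_mod_date):
    """Return the effective metadata, keyed by XMP property name."""
    merged = dict(xmp)
    # XMP wins when its date stamp is equal to or later than ModDate.
    if info_mod_date > xmp_date:
        for key, value in info.items():
            prop = INFO_TO_XMP.get(key)
            if prop is not None:  # custom keys have no XMP equivalent here
                merged[prop] = value
    return merged
```

Both dates should be time zone aware datetime values (or both naive) so that the comparison is meaningful.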

When you are writing metadata, always use XMP.

What’s Next

In this chapter, you learned how to incorporate metadata into a PDF at the document as well as the object level. In the next and final chapter, we will look at how PDF became an international standard, both in its entirety as well as various subsets of its full capabilities.
