Chapter 9. Designing metadata

As you have no doubt noticed, more information is available than ever before—on the Web, your company intranet, in your content management repository, and elsewhere. This is both exciting and problematic, and extremely frustrating when you can’t find what you’re looking for.

What is missing is information about the information (that is, labeling, cataloging, and descriptive information) that enables a computer to properly process and search the content elements. This information about information is known as metadata.

Although metadata has been a buzzword in the information technology and data warehousing business for some time, it has recently emerged as an important concept for those who are developing search and retrieval strategies for information in reference databases or on the Web, for authors of structured content, and for developers of enterprise content management and Web publishing solutions. With more complex authoring processes and information delivery requirements, you need some way of classifying and identifying all of the information or content “bits” so that they can be retrieved and combined in meaningful ways for users. Well-designed metadata can provide the classification and identification you need.

This chapter introduces the levels and types of metadata that will be appropriate to your unified content strategy. It also describes methods for defining metadata and suggests ways to ensure that your authors apply metadata consistently.

What is metadata?

Traditionally, metadata has been defined as “data about data.” Although this is true, metadata is actually much more. It is the encoded knowledge of your organization, described by David Marco as:

…all physical data (contained in software and other media) and knowledge (contained in employees and various media) from inside and outside an organization, including information about the physical data, technical and business processes, rules and constraints of the data, and structures of the data used by a corporation. ^[1]

This definition is significant because it includes the often-overlooked idea that metadata can be used to describe the data’s behavior, processes, rules, and structure. Describing information in this way is important when developing a sound metadata strategy for content search and retrieval, reuse, and dynamic content delivery, because you can determine not only what the content is, but who uses it, how it will be used, how it will be delivered, and when.

In a unified content strategy, metadata enables content to be retrieved, tracked, and assembled automatically. Metadata enables

Effective retrieval
Systematic reuse
Automatic routing based on workflow status
Tracking of status
Reporting

Properly defining and categorizing the types of metadata you want to use is extremely important to the success of your unified content strategy. Improperly identified metadata, or missed categories, can cause problems ranging from misfiled and therefore inaccessible content to far more serious failures. In the National Aeronautics and Space Administration’s (NASA’s) 1999 Mars Climate Orbiter mission, for example, data supplied in English units was processed as if it were metric; the unit of measure, itself a piece of metadata, was misidentified, and the spacecraft was lost at a cost of about $300 million. ^[2]

Benefits of metadata to a unified content strategy

Using metadata for retrieval and content management enables content to be retrieved, tracked, and assembled automatically, resulting in the following benefits:

Reduction of redundant content
If content is consistently labeled with metadata, authors can easily retrieve existing reusable content, and if multiple authors accidentally create the same piece of content, your content management system identifies that multiple versions of the same content exist. With systematic reuse, the system automatically populates a document with the appropriate reusable content. If the content is already in place when authors start to write, they are aware that they do not need to create it again.
Improved workflow
When you tag content with metadata that identifies its status, workflow automatically manages that content. For example, an element marked with “ready for review” can be compiled automatically into an information product such as a brochure, after which the brochure is automatically routed for review and approval.
Reduced costs
There are many ways in which metadata can help to reduce costs in a unified content strategy. For example, content is reusable only if it can be correctly identified with metadata and retrieved. If content exists and can be easily retrieved, the work required to create it again is eliminated. Metadata can also be used to automatically identify source elements that have changed. Triggering a translation process for the changed element saves the author or translator the time and energy of identifying the content to be translated. Additionally, if a reusable element is already translated, the metadata can be used to automatically populate the translation everywhere the source element is reused, so that the element is not translated again.

Types of metadata

Unified content requires two types of metadata: categorization and element. Users tend to retrieve information based on categorization metadata, whereas authors tend to retrieve information based on element metadata.

Categorization metadata

For years, libraries have used metadata to catalog and categorize documents. Originally they used card catalogs to provide information about the books stored in the library. The card catalog provided such information as title, author, publication date, subject, and a brief description (abstract) of what was contained within the document. In today’s world, the items on the card would be referred to as metadata. Without the card catalog and the Dewey Decimal system it would have been impossible to find content in a library. Without metadata it is nearly impossible to find content online (for example, on the Internet, a company intranet, or a content management system).

The increasing use of portals has encouraged organizations to make the portal the central location for access to organizational content. However, as each new piece of content is added, users’ ability to find content decreases. Corporate information needs to be just as accessible as library content, which means organizing content in a logical structure, categorizing it, and using the categories to add metadata to the information. Metadata is like the old card catalog, presenting information to users in context, and enabling them to quickly find relevant information. Metadata hierarchies or metadata taxonomies are used to organize the content.

Metadata hierarchies and metadata taxonomies are similar in appearance (both are represented as tree structures) but differ in design and usage. A hierarchy provides the content user with an understanding of how content is organized. Content may be organized under multiple categories to give the content user multiple ways to find the information. For example, in an index (which is a hierarchy), information on Boston may be found under Cities or under Massachusetts.

However, in a taxonomy, content may be categorized in only one place, not in multiple places. In the case of Boston, the taxonomy designer decided it can only be classified under Massachusetts.

Content users use hierarchies to retrieve content because hierarchies give them multiple “paths” to the same information, but taxonomies are used by authors to ensure that content is categorized in only one way, not in multiple ways. Categorizing content in multiple ways makes it difficult to retrieve.
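The distinction can be sketched in a few lines of Python. This is only an illustration built on the Boston example; the dictionaries and the helper names `paths_to` and `classify` are hypothetical, not part of any content management system.

```python
# Hypothetical data: category names are taken from the Boston example above.

# A hierarchy (like a book index) may list the same item under several parents.
hierarchy = {
    "Cities": ["Boston"],
    "Massachusetts": ["Boston"],
}

# A taxonomy assigns each item exactly one parent category.
taxonomy = {
    "Boston": "Massachusetts",
}


def paths_to(item, index):
    """Return every category in a hierarchy under which the item is listed."""
    return sorted(category for category, items in index.items() if item in items)


def classify(item, parent, tax):
    """Add an item to a taxonomy, refusing a second, conflicting parent."""
    existing = tax.get(item)
    if existing is not None and existing != parent:
        raise ValueError(f"{item!r} is already classified under {existing!r}")
    tax[item] = parent


print(paths_to("Boston", hierarchy))  # ['Cities', 'Massachusetts']
print(taxonomy["Boston"])             # Massachusetts
```

Calling `classify("Boston", "Cities", taxonomy)` raises an error, which mirrors the rule that a taxonomy allows only one classification per item.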

Categorizing content can be a difficult and time-consuming process. Frequently it involves a manual approach, with people finding, reviewing, and categorizing content. Often it is the job of a corporate librarian or information architect to manually identify and tag content appropriately. Corporate content can be any content the corporation creates, receives, or wants to make available to its employees, customers, or suppliers. This body of content is much broader than the content we refer to in this book; it encompasses email, reports, correspondence, strategic analysis, and much more. The volume of this content grows at a tremendous pace, making it difficult for organizations to maintain if they do it all manually. Vendors are starting to provide tools that can assist your organization in categorizing your content and automatically adding metadata (see the sidebar “Categorization metadata standards”); however, these tools will not assist you in the task of creating element metadata.

Some industries have created industry-specific taxonomies, sometimes known as vertical taxonomies. Vertical taxonomies have been developed to help save organizations from the task of having to create everything themselves (thereby creating inconsistent taxonomies from company to company), and to facilitate the sharing of content. For example, bookstores use a taxonomy that helps them shelve books so that readers can more easily locate their desired subject matter: nonfiction, reference, travel reference, European travel and so on. Vertical taxonomies have been created for such areas as IT, healthcare, telecommunications, HR, financial, legal, e-learning, sales and marketing, and geography, and more are being created daily.

In the absence of a vertical taxonomy, industries are creating standards for the format, structure, and syntax of metadata to enable different organizations and even different departments within an organization to share metadata. For more information on metadata standards see the sidebar “Categorization metadata standards.”

If you have a lot of content to categorize, check to see whether a vertical taxonomy already exists for your industry, and check with vendors to see whether they can support your information set. Categorization metadata is a large, sometimes costly, and intensive ongoing task. If you don’t have to do this task on your own, don’t try to. If you do decide to tackle the job, consider including corporate librarians or information architects on the team.

To begin the process of creating categorization metadata you need to understand your users. Understanding your users helps you to define the ways in which they will retrieve information. Ask the following questions:

Who is going to retrieve the content?
For example, customers will want to retrieve product information, marketing will want to retrieve reports, product information, and industry materials, and employees will want to retrieve policies and procedures.
What tasks are they trying to accomplish with the content?
For example, are they trying to complete a task or make a decision? You need to categorize content for tasks in areas such as procedures, while policies may be categorized separately.
What terms will they use when retrieving the content?
Anticipating the terms that people will use is always difficult. Everyone uses different terms and thinks about information in different ways. You will never be able to ensure that content will be accessible under everyone’s terms, but understanding how people will refer to content helps you to determine your taxonomy. After you develop a taxonomy you can educate your users to use the available terms.

Now you need to categorize your content and create a taxonomy. This involves:

Grouping or clustering related content
As you start to categorize your content, you need to start grouping like or similar content together. These groups create categories, which are then refined by individual items in the category. For example:
- Company benefits
  - Benefit policies
  - Benefit forms
  - Benefit frequently asked questions

Which can be simplified to:
- Company benefits
  - Policies
  - Forms
  - Frequently asked questions

Developing your taxonomy
As you group content, categorize it, and define the terms to be used to identify your content, you are automatically creating your taxonomy. Each term in your taxonomy becomes metadata.
Testing your taxonomy to ensure that it is appropriate and comprehensive
You need to ensure that the metadata you have created is appropriate and usable by your audience. Before even using the metadata electronically, categorize some sample content and ask users to perform a usability test to ensure your taxonomy is appropriate.

Categorization metadata standards

One of the most valuable components of a unified content strategy is the ability to reuse content. As long as you are using one system, sharing content among users of the system is relatively easy. However, many organizations have multiple systems because they have existing legacy systems, or because one system is unable to meet all the organization’s needs. Sharing content across multiple systems can be more problematic. The effective use of metadata requires common conventions for defining the semantics (meaning) of metadata. Typically, each system carries its own metadata with its own semantics and its own structure, and there are no matches or very few matches across the systems. In addition, if content is stored in multiple locations that use multiple metadata structures, the task of categorization metadata is even more difficult. Information retrieval becomes very complex because users need to learn multiple methods for retrieving content. To address this problem, standards for the structure and semantics of metadata and for sharing metadata are being created.

Dublin Core

The Dublin Core Metadata Initiative is an organization promoting the widespread adoption of interoperable metadata standards. The Dublin Core Metadata Element Set defines 15 elements of semantic metadata (Contributor, Coverage, Creator, Date, Description, Format, Identifier, Language, Publisher, Relation, Rights, Source, Subject, Title, and Type).

Dublin Core metadata has been designed to make it simple to understand and use. Dublin Core can be used to create metadata that can be used across a number of knowledge domains, including corporate content. Dublin Core is also extensible to allow for site-specific or application-specific metadata, which means that you can customize it. Dublin Core is now RDF-compliant (see the following).
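As a rough illustration of how the 15 elements and site-specific extensions might coexist in one record, consider the following Python sketch. The element names come from the list above; the sample record, its values, the `Product` extension, and the `split_record` helper are all assumptions made for this example.

```python
# The 15 element names come from the Dublin Core Metadata Element Set listed above.
DUBLIN_CORE_ELEMENTS = frozenset({
    "Contributor", "Coverage", "Creator", "Date", "Description",
    "Format", "Identifier", "Language", "Publisher", "Relation",
    "Rights", "Source", "Subject", "Title", "Type",
})


def split_record(record):
    """Separate core Dublin Core elements from site-specific extensions."""
    core = {k: v for k, v in record.items() if k in DUBLIN_CORE_ELEMENTS}
    extensions = {k: v for k, v in record.items() if k not in DUBLIN_CORE_ELEMENTS}
    return core, extensions


# Hypothetical record: the values and the "Product" extension are invented.
brochure = {
    "Title": "ABC product brochure",
    "Creator": "Marketing",
    "Language": "en",
    "Product": "ABC",
}

core, extensions = split_record(brochure)
print(sorted(core))   # ['Creator', 'Language', 'Title']
print(extensions)     # {'Product': 'ABC'}
```

Keeping the extensions separate makes it easy to see which metadata other Dublin Core systems will understand and which is specific to your organization.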

RDF

The Resource Description Framework (RDF) was developed by the World Wide Web Consortium (W3C). Unlike Dublin Core, RDF is a framework for describing and interchanging metadata; it does not actually define metadata. RDF is an application of XML: it imposes a structure that explicitly defines semantics, ensuring consistent encoding, exchange, and machine-readable processing of standardized metadata.

RDF provides a model for describing resources (content). Resources have properties (attributes or characteristics). A resource is an object that can be uniquely identified by a Uniform Resource Identifier (URI). RDF allows descriptions of web resources (any object with a URI as its address) to be made available in machine-readable form.

RDF helps to solve the problem industries have faced in exchanging metadata and its associated content among different systems. It does not define the metadata; rather, it enables organizations to define the metadata they need for their applications, yet still share that metadata with other RDF-compliant metadata applications.

Other standards such as Dublin Core and XMP have adopted RDF as the underlying structure for their standard.
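The resource/property/value model can be sketched without any RDF tooling. The Python below is not RDF syntax and does not use an RDF library; it only models statements as tuples. The namespace string is the published Dublin Core element namespace, while the document URI, the values, and the `properties_of` helper are hypothetical, and the sketch assumes each property appears only once per resource.

```python
from collections import namedtuple

# An RDF statement links a resource to a value through a property.
Triple = namedtuple("Triple", ["resource", "property", "value"])

# Namespace of the Dublin Core element set; properties are identified by URI.
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical resource URI and values, invented for illustration.
doc = "http://example.com/docs/abc-overview"
statements = [
    Triple(doc, DC + "title", "ABC product overview"),
    Triple(doc, DC + "creator", "Technical Publications"),
    Triple(doc, DC + "language", "en"),
]


def properties_of(resource, triples):
    """Collect the property/value pairs that describe one resource."""
    return {t.property: t.value for t in triples if t.resource == resource}


described = properties_of(doc, statements)
print(described[DC + "title"])  # ABC product overview
```

Because every property is named by a URI rather than a local label, two systems that use the same property URI can exchange the statement without agreeing on anything else.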

XMP

eXtensible Metadata Platform (XMP) is a metadata framework (a way of labeling content) created by Adobe. XMP provides a method for combining the metadata from a “document” and all its associated elements. The metadata for each element is preserved within the container document.

XMP can also be used to facilitate workflow. The labels of different types of content (for example, photograph versus text) can assist in appropriately directing content through workflow or to databases. Developers of workflow tools, particularly those designed to support the publishing process, are seriously looking at XMP as a potential metadata standard in their products.

XMP has been built on existing RDF standards. Adobe has made XMP public and extensible and has distributed it to developers of content creation applications, content management systems, database publishing systems, web-integrated production systems, and document repositories.

Crosswalks

The best way to share content is through a consistent content structure and through consistent metadata. RDF enables organizations to exchange metadata; however, the success of RDF requires that all the developers of the metadata for the multiple systems use RDF. If the systems pre-date the creation of RDF, or if RDF was not used, you may need to map one metadata schema to another metadata schema. The mapping of metadata is achieved through the use of “crosswalks,” which are essentially tables that map one metadata value to another.

For example, say you have a knowledge management (KM) system in your company, a customer relationship management (CRM) system, and now a content management system. A simple example of the differences in the three systems is in how a “document” is named. The CRM system refers to the name of a document such as a brochure with the metadata tag “Title.” The KM system calls it “Subject,” and the CMS uses a combination of “Information product” and “Title.” The CMS introduces further complexity because it uses “Title” in multiple ways, because a title can exist at many levels in an XML document (for example, document title, section title, and subsection title are all considered titles, but are clarified by their location in the document hierarchy). Table 9.1 illustrates a sample crosswalk:

Table 9.1. Sample crosswalk

CRM	KM	CMS
Title	Subject	Information Product Title ^[3]
^[3]Where the title is drawn from the highest level of title in the document hierarchy.

However, mapping metadata from one system to another may not be so straightforward. Some common problems include

Two or more elements in one system may be represented by a single element in another system.
There may be no comparable element in the other system.

These and other problems may make it difficult to retrieve and share content. If you do plan to set up a crosswalk, be sure to clearly identify the rules of conversion and track the decisions that were made.

You can also consider using crosswalks when your metadata terms change. For example, a product may have had a specific name for a year, but is renamed in a new product offering. You don’t want to have to try to retrieve both the old product name (particularly when new people start and don’t even know the old product name) and the new product name. Instead, you can use a crosswalk to map the old product name to the new product name.
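A crosswalk can be pictured as a pair of lookup tables: one for field names and one for renamed terms. The following Python is a minimal sketch under stated assumptions: the field names follow Table 9.1, while the common field name `title`, the product names, the sample record, and the `to_common` helper are invented for illustration.

```python
# Field-level crosswalk based on Table 9.1: each system's name for a document title.
# The common field name "title" is an assumption made for this sketch.
FIELD_CROSSWALK = {
    "CRM": {"Title": "title"},
    "KM": {"Subject": "title"},
    "CMS": {"Information Product Title": "title"},
}

# Value-level crosswalk for renamed terms; both product names are hypothetical.
TERM_CROSSWALK = {"WidgetPro": "WidgetMax"}


def to_common(system, record):
    """Translate one system's record into common field names.

    Fields with no comparable element are returned separately so the
    conversion decision can be reviewed and recorded.
    """
    mapping = FIELD_CROSSWALK[system]
    common, unmapped = {}, {}
    for field, value in record.items():
        value = TERM_CROSSWALK.get(value, value)  # map old terms to new ones
        if field in mapping:
            common[mapping[field]] = value
        else:
            unmapped[field] = value
    return common, unmapped


print(to_common("KM", {"Subject": "Spring brochure", "Product": "WidgetPro"}))
# ({'title': 'Spring brochure'}, {'Product': 'WidgetMax'})
```

Returning the unmapped fields instead of silently dropping them is one way to track the cases where there is no comparable element in the other system.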

Tools

There are a number of tools that provide automated categorization of content and application of metadata. These are some of the more common:

Applied Semantics Auto-Categorizer (http://www.appliedsemantics.com)
Quiver QKS Classifier (http://www.quiver.com)
Semio, Semio Tagger (http://www.semio.com)

Element metadata

Element metadata identifies your content at the element level, based on the elements defined in your information model (see Chapter 8, “Information modeling”). Authors use element metadata to help them manage content throughout the authoring process. There are three main types of element metadata:

Reuse metadata
Retrieval metadata
Tracking metadata

This section explains these three types of metadata.

Metadata for reuse

Metadata for reuse identifies the components of content that can be reused in multiple areas. For example, if an overview already exists for the ABC product, you can use metadata like “content type = overview, product = ABC” to help you find the correct content to reuse.

Before even beginning to write, authors can search the content management system by metadata for reusable content. Alternatively, the content management system can automatically search for appropriate reusable content (based on models and metadata) and deliver it (systematic reuse) to authors. In both cases, metadata is very important to correctly identify the elements of content.
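A search by reuse metadata amounts to matching every requested key/value pair against each element. The sketch below assumes a hypothetical in-memory repository; the element records, their IDs, and the `find_reusable` helper are illustrative only, and a real content management system would expose its own query interface.

```python
# Hypothetical repository: each element carries reuse metadata as key/value pairs.
elements = [
    {"id": "e1", "content type": "overview", "product": "ABC", "version": "2"},
    {"id": "e2", "content type": "warning", "product": "ABC", "version": "2"},
    {"id": "e3", "content type": "overview", "product": "EFG", "version": "1"},
]


def find_reusable(repository, criteria):
    """Return the elements whose metadata matches every criterion."""
    return [
        element for element in repository
        if all(element.get(key) == value for key, value in criteria.items())
    ]


matches = find_reusable(elements, {"content type": "overview", "product": "ABC"})
print([element["id"] for element in matches])  # ['e1']
```

The same function could serve systematic reuse: the system would supply the criteria from the model and template instead of waiting for the author to search.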

To determine what metadata you need to enable reuse, you need to determine the business result you are trying to achieve and build your metadata backward to achieve that result. Think about the following:

Where is content going to be reused?
Across products? Across information products? If you answered yes to either question, you need to create metadata to identify each kind of reuse. For example:
- Product, such as:
  - ABC
  - EFG
  - HIJ
- Information product, such as:
  - Brochure
  - Web
  - Help
  - User guide

Note that metadata such as information product can be derived from the template type.

What type of content is it?
You also need to know the element content type for which the content is valid. Your metadata might include
- Content type, for example:
  - Overview
  - Caution
  - Warning
  - Troubleshooting
  - Example

Note that metadata such as content type can be derived from your model or semantic tags.

What else do you need to know about the content to ensure that the correct piece of content is reused?
For example, you might also need to know to which version of the product the content applies:
- Version, for example:
  - 1
  - 2
  - 2.5

Furthermore, you may need to know the region or location where the product is being sold or used, so that you can identify content such as safety regulations, language, and configuration. In this case, your metadata might include
- Region, for example:
  - United States
  - Canada
  - South America
  - Europe
- Language, for example:
  - English
  - Spanish
  - French
  - Italian
  - German

Finally, you may need to know the audience so that appropriate content is provided for each audience.
- Audience, for example:
  - Consumer
  - Decision maker
  - Technical support

Metadata for retrieval

Metadata for retrieval is used to help authors retrieve content and may include much or all of the metadata used for reuse. However, metadata for retrieval is more extensive than metadata for reuse, providing additional information about an element that facilitates retrieval. Think about what other information would help you retrieve content more effectively. For example, your retrieval metadata might include

Title/Subject
This type of metadata can be entered by the author, or the system can use the title that appears in the content to create this metadata.
Author
The system usually automatically generates this type of metadata, based on the author information.
Date (creation, completion, modification)
The system usually automatically generates this metadata as it is checked into the content management system.
Keywords
This metadata can be entered by the author; however, it is preferable to provide the author with a list of keywords from which to choose. This way keywords will be used consistently (see the section “Creating a controlled vocabulary,” later in this chapter).
Security level (who can view the content)
This type of metadata is usually applied by the author from a selected list of options.

As you do with metadata for reuse, you identify your metadata for retrieval by determining the business result you are trying to achieve and building your metadata backward to achieve that result. Think about the following issues:

Who is going to retrieve your content?
You will probably have two levels of users interested in retrieving your content: authors and users. Understanding your users’ information requirements helps you to determine what kind of metadata they will use for retrieval. Authors will want to retrieve content at many levels of granularity (individual elements, sections, and whole information products). They will need metadata that enables them to identify the desired content at any level of granularity. Users probably don’t want to retrieve granular content; instead, they want to retrieve whole “documents” and will need categorization metadata such as date, author, and subject, in addition to retrieval metadata.
In what form do they want to retrieve it?
Authors probably want to retrieve content in a format that is appropriate for a particular authoring tool, whereas users want to retrieve content in the form in which the content was designed to be displayed (for example, PDF, HTML, or Windows Help). For authors, the metadata needs to define the source format and the desired format (for example, XML may be the source format, but Word is the desired format) so that the system can convert content appropriately. For users, the metadata needs to define the appropriate format for the content (for example, PDF or HTML) so the system can either retrieve the content in this format or, if it needs to be converted, convert to the appropriate format.
What permissions should users have for retrieving content?
Typically the permission to create, edit, or modify content is restricted. Authors and users may have restrictions on what content they are allowed to see or even to know exists. Each element, container, and information product needs to have appropriate security permissions expressed through metadata.
How are they going to specifically identify the desired content?
People articulate their desire for specific content in different ways, using different terms. You need to analyze the terms your authors and users will use, then determine what metadata the content should carry to enable a match between the search and the content. You may want to consider adding keywords to metadata to facilitate this retrieval; however, to ensure consistency, consider using a controlled list of keywords rather than author-created keywords.

Metadata for tracking (status)

Metadata for tracking is particularly useful when you are implementing workflow as part of your unified content strategy. By assigning status metadata to each content element, you can determine which elements are active. You can also control what can be done to an element and who can do it. Generally, status changes based on the metadata are controlled through workflow automation, not by end users. Sometimes an author will identify a status change such as “ready for review” because the system cannot automate this type of information. Status metadata can include such tracking items as:

Draft (under development by the author)
Ready for review
Reviewed
Approved
Final
Submitted
Published
Archived

Again, like your other metadata, you identify tracking metadata by determining the business result you are trying to achieve, and then build your metadata backward to achieve that result. Design your metadata for tracking after you have designed your workflow (see Chapter 11, “Designing workflow”). This enables you to identify what metadata needs to be applied to the content at each stage of workflow to enable the workflow system to manage it. For example, the metadata for the Review and Approval Workflow shown in Figure 11.3 could look like the following:

Content status
Indicates status of the content. Before it can be reviewed it must have the appropriate metadata attached to identify that it is ready for review. For example, the metadata could include:
- Draft
- Ready for review
- In review
- Final
- In approval
- Approved

When the content is ready for review, authors apply the “Ready for review” metadata. When the content includes the feedback from review and is ready for final approval, authors apply the “Final” metadata. When the final approval reviewers approve the content, they apply the “Approved” metadata.
The system needs to identify the status of the content at any point in time. When the content has been passed to review, its status is automatically changed to “In review” and later, when it is passed on to final approval, the status is changed to “In approval.”
Review status
Indicates the outcome of the review. A reviewer can either accept the content without changes or reject the content by asking for changes and returning it to the author. For example, the review status metadata could include:
- Accept
- Reject

If the metadata is “Accept,” the system moves the content to the final approval stage, but if the metadata is “Reject,” the content is routed back to the author for changes.
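The routing rules above can be sketched as a small lookup table plus a decision function. This Python is only one possible reading: the status values come from the lists above, but the transition table, the function names, and the choice to return rejected content to “Draft” are assumptions, not the behavior of a particular workflow product.

```python
# Status values follow the "Content status" list above; the transition table is
# an assumption about how a workflow engine might apply them.
SYSTEM_TRANSITIONS = {
    "Ready for review": "In review",   # system passes content to reviewers
    "Final": "In approval",            # system passes content to approvers
}


def route(status):
    """Return the status the system assigns after routing, if any."""
    return SYSTEM_TRANSITIONS.get(status, status)


def apply_review(status, review_status):
    """Apply a reviewer's decision to content that is in review."""
    if status != "In review":
        raise ValueError(f"cannot review content with status {status!r}")
    if review_status == "Accept":
        return "In approval"   # move on to the final approval stage
    if review_status == "Reject":
        return "Draft"         # assumed: back to the author for changes
    raise ValueError(f"unknown review status {review_status!r}")


print(route("Ready for review"))             # In review
print(apply_review("In review", "Accept"))   # In approval
print(apply_review("In review", "Reject"))   # Draft
```

Keeping the transitions in data rather than in scattered conditionals makes it easier to adjust the metadata when the workflow design changes.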

After you have designed your metadata to support your workflow, you need to identify other metadata that can help you to track your content. For example:

Who created the content (author)?
When was it created/modified (date)?
Who modified the content (editor)?
Who reviewed/approved the content (reviewer/approver)?
How long did it take to create/modify/review (time)?
Where has it been reused (information product, product)?
Has it been translated (content status)?

Most content management systems automatically create some of this metadata (for example, author, date), whereas other metadata may already be defined in retrieval metadata and reuse metadata, but you should go through this exercise to make sure that you have identified all the possible metadata you require for tracking and reports.

Creating a controlled vocabulary

Metadata needs to be consistent to facilitate reuse, retrieval, and tracking. This requires a controlled vocabulary. A controlled vocabulary reconciles all the various possible words that can be used to identify content and differentiates among all the possible meanings that can be attached to certain words. Using an unlimited or uncontrolled set of metadata terms leads to additional work for authors (they have to figure out the metadata each time they apply it) and reduces the percentage of content that can be effectively retrieved (different terms mean either using multiple terms to search or missing some content because retrievers are unaware of alternate terms). If authors can create their own metadata tags, there is a high probability they will create different metadata.

To create a controlled vocabulary:

Identify your metadata categories (for example, Content type, Product).
Identify the terms that make up that metadata category.
For example:
- Content status
  - Draft
  - Ready for review
  - In review
  - Final
  - In approval
  - Approved

In this example, “content status” is the metadata category and the controlled terms are “Draft,” “Ready for review,” and so on.

Uncontrolled metadata terms should be the exception to the rule. If possible, do not provide any metadata that can be defined by the author. If that is not possible, monitor the uncontrolled metadata terms to see whether patterns are emerging that could then be used to create a controlled vocabulary.
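Both ideas, enforcing a controlled vocabulary and watching uncontrolled terms for patterns, can be sketched briefly in Python. The “Content status” terms come from the example above; the `validate` helper and the sample author-entered keywords are hypothetical.

```python
from collections import Counter

# Category and terms taken from the "Content status" example above.
CONTROLLED_VOCABULARY = {
    "Content status": {
        "Draft", "Ready for review", "In review",
        "Final", "In approval", "Approved",
    },
}


def validate(category, term):
    """Return True if the term is an allowed value for the category."""
    return term in CONTROLLED_VOCABULARY.get(category, set())


# Hypothetical author-entered keywords, tallied to reveal emerging patterns.
uncontrolled_keywords = ["install", "Install", "setup", "install ", "set-up"]
pattern = Counter(keyword.strip().lower() for keyword in uncontrolled_keywords)

print(validate("Content status", "In review"))   # True
print(validate("Content status", "Reviewing"))   # False
print(pattern.most_common(1))                    # [('install', 3)]
```

A term that keeps surfacing in the tally, such as “install” here, is a candidate for promotion into the controlled vocabulary.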

Ensuring metadata gets used

Metadata can be very valuable and useful; however, it is only valuable if it gets used. Wherever possible, automate the application of metadata. Leaving the application of metadata up to authors adds yet another burden to the authoring process and leads to inconsistency. Some authors diligently apply the correct metadata, some apply some of it correctly and some of it incorrectly, and some don’t apply it at all. Unless it is applied appropriately all the time, your metadata could become useless.

Wherever possible, have the system apply the metadata. This can include automatic:

Categorization metadata based on the content
Metadata based on the template and model
Inheritance of metadata based on the parent (for example, if a container element is given a restricted security, all the elements within the container automatically have the same security metadata applied to them)
Metadata based on position in the workflow
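Inheritance from a parent is the easiest of these to picture in code. The sketch below assumes a hypothetical nested structure in which each element has a `metadata` dictionary and an optional list of `children`; the `apply_inheritance` function is illustrative and simply copies a container's security value down the tree.

```python
def apply_inheritance(element, inherited_security=None):
    """Copy a container's security metadata down to everything inside it."""
    if inherited_security is not None:
        element["metadata"]["security"] = inherited_security
    security = element["metadata"].get("security")
    for child in element.get("children", []):
        apply_inheritance(child, security)
    return element


# Hypothetical container with two child elements.
section = {
    "id": "section-1",
    "metadata": {"security": "restricted"},
    "children": [
        {"id": "para-1", "metadata": {}},
        {"id": "para-2", "metadata": {"security": "public"}},
    ],
}

apply_inheritance(section)
print([child["metadata"]["security"] for child in section["children"]])
# ['restricted', 'restricted']
```

In this sketch the container's value overrides whatever a child already carries, which matches the restricted-security example; a different policy (for instance, letting children be more restrictive than the parent) would need a different rule.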

If it is necessary for authors to add metadata, make it possible for them to add the metadata as they are authoring so that they don’t have to wait until the content is checked into the content management system. For example, if a step varies based on role, let them add the role metadata as soon as they finish writing the step. If it has to wait until the content is checked into the content management system, the system either has to prompt them to add the metadata for every single element (a very tedious process), or it may be up to the author to remember to add the metadata in all the relevant places (a recipe for missed metadata).

Summary

Metadata is critical to the success of your unified content strategy. It is more than just data about data; it is the encoded knowledge of your organization. Metadata can be used to describe the behavior, processes, rules, and structure of data, as well as to add descriptive information.

There are two types of metadata:

Categorization metadata
Categorization metadata categorizes your documents. Categorization metadata is usually used by content users to retrieve content.
Element metadata
Element metadata identifies your content at the element level. Element metadata is used by authors to retrieve elements of content. There are three kinds of element metadata:
- Metadata for reuse is used to identify the components of content that can be reused in multiple areas.
- Metadata for retrieval is used to retrieve content. It may consist of metadata for reuse as well as additional retrieval metadata.
- Metadata for tracking (status) is used to identify the status of your content in a workflow system.

To define your metadata, start by identifying the business result you want to achieve with your metadata and work backward to identify what metadata will achieve that result.

Use a controlled vocabulary for your metadata to ensure that metadata is named and applied consistently.

If you need to share metadata across systems, consider using RDF (a W3C framework for describing and interchanging metadata) to design your metadata. If RDF has not been used to create your metadata, consider using a crosswalk (a table that maps metadata from one structure to another) to provide a metadata interchange.

Automate as much of the application of metadata as possible to ensure that metadata gets used. Where authors must add metadata themselves, enable them to add it in the authoring tool rather than as they check the content back into the content management system.

^[1]Marco, David. Building and Managing the Meta Data Repository: A Full Lifecycle Guide. New York: John Wiley & Sons, Inc., 2000, p.5.

^[2]Marco, David. Building and Managing the Meta Data Repository: A Full Lifecycle Guide. New York: John Wiley & Sons, Inc., 2000.
