By Charles M. Dollar and Lori J. Ashley
Every organization (public, private, or not for profit) now has electronic records and digital content that it wants to access and retain for periods in excess of 10 years. This may be for regulatory or legal reasons, out of a desire to preserve organizational memory and history, or purely for operational reasons. But long-term continuity of digital information does not happen by accident; it takes information governance (IG), planning, sustainable resources, and a keen awareness of the information technology (IT) and file formats in use by the organization, as well as evolving standards and computing trends.
Information is universally recognized as a key asset that is essential to organizational success. Digital information, which relies on complex computing platforms and networks, is created, received, and used daily to deliver services to citizens, consumers and customers, businesses, and government agencies. Organizations face tremendous challenges in the 21st century to manage, preserve, and provide access to electronic records for as long as they are needed.
Digital preservation is defined as long-term, error-free storage of digital information, with means for retrieval and interpretation, for the entire time span the information is required to be retained. Digital preservation applies to content that is born digital as well as content that is converted to digital form.
Some digital information assets must be preserved permanently as part of an organization's documentary heritage. Dedicated repositories for historical and cultural memory, such as libraries, archives, and museums, need to move forward to put in place trustworthy digital repositories that can match the security, environmental controls, and wealth of descriptive metadata that these institutions have created for analog assets (such as books and paper records). Digital challenges associated with records management affect all sectors of society—academic, government, private and not-for-profit enterprises—and ultimately all citizens of all developed nations.
The term “preservation” implies permanence, but electronic records, data, and information retained for only 5 to 10 years are likely to face challenges related to storage media failure and computer hardware/software obsolescence. A useful point of reference for the definition of “long term” comes from the International Organization for Standardization (ISO) standard 14721, which defines long-term as “long enough to be concerned with the impacts of changing technologies, including support for new media and data formats, or with a changing user community. Long Term may extend indefinitely.”1
Long-term records are common in many different sectors, including government, health care, energy, utilities, engineering and architecture, construction, and manufacturing. During the course of routine business, thousands or millions of electronic records are generated in a wide variety of information systems. Most records are useful for only a short period of time (up to seven years), but some may need to be retained for long periods or permanently. For those records, organizations must plan for and allocate resources for preservation efforts to ensure that the data remains accessible, usable, understandable, and trustworthy over time.
In addition, there may be the requirement to retain the metadata associated with records even longer than the records themselves.2 A record may have been destroyed according to its scheduled disposition at the end of its life cycle, but the organization still may need its metadata to identify the record, its life cycle dates, and the authority or person who authorized its destruction.
Some electronic records must be preserved, protected, and monitored over long periods of time to ensure they remain authentic, complete, and unaltered and available into the future. Planning for the proper care of these records is a component of an overall records management program and should be integrated into the organization's information governance (IG) policies and technology portfolio as well as its privacy and security protocols.
The total capability to ensure access to authentic electronic records over time, while also addressing the challenges of technological obsolescence, is a sophisticated combination of policies, strategies, processes, specialized resources, and adoption of standards.
Enterprise strategies for sustainable and trustworthy digital preservation repositories have to take into account several prevailing and compound conditions: the complexity of electronic records, decentralization of the computing environment, obsolescence and aging of storage media, massive volumes of electronic records, and software and hardware dependencies.
The challenges of managing electronic records significantly increased with the trend of decentralization of the computing environment. In the centralized environment of a mainframe computer, prevalent from the 1960s to 1980s but also in use today, it is relatively easy to identify, assess, and manage electronic records. This is not the case in the decentralized environment of specialized business applications and office automation systems, where each user creates electronic objects that may constitute a formal record and thus will have to be preserved under IG policies that address record retention and disposition rules, processes, and accountability.
Electronic records have evolved from simple text-based word processing files or reports to include complex mixed media digital objects that may contain embedded images (still and animated), drawings, sounds, hyperlinks, or spreadsheets with computational formulas. Some portions of electronic records, such as the content of dynamic Web pages, are created on demand from databases and exist only for the duration of the viewing session. Other digital objects, such as electronic mail, may contain multiple attachments, and they may be threaded (i.e., related e-mail messages linked in send-reply chains). These records cannot be converted to paper or text formats for preservation without the loss of context, functionality, and metadata.
Electronic records are being created at rates that pose significant threats to our ability to organize, control, and make them accessible for as long as they are needed. This continued volume increase includes documents that are digitally scanned or imaged from a variety of formats to be stored as electronic records.
Electronic records are stored as representations of bits—1s and 0s—and therefore depend on software applications and hardware networks for the entire period of retention, whether it is 3 days, 3 years, or 30 years or longer. As information technologies become obsolete and are replaced by new generations, the capability of a specific software application to read the representations of 1s and 0s and render them into human-understandable form will degrade to the point that the records are neither readable nor understandable. As a practical matter, this means that the readability and understandability of the records can never be recovered, and there can be serious legal consequences.
Storage media are affected by the dual problems of obsolescence and decay. They are fragile, have limited shelf life, and become obsolete in a matter of a few years. Mitigating media obsolescence is critical to long-term digital preservation (LTDP) because the bitstreams of 1s and 0s that comprise electronic records must be kept “alive” through periodic transfer to new storage media.
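The periodic transfer described above is only safe if the copy can be shown to carry the same bitstream as the original. A minimal sketch of such a fixity-checked refresh follows; the function names `sha256_of` and `refresh` are hypothetical, and temporary directories stand in for the old and new storage media.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, target: Path) -> str:
    """Copy source to target and confirm the bitstream is unchanged."""
    before = sha256_of(source)
    shutil.copyfile(source, target)
    after = sha256_of(target)
    if before != after:
        raise IOError(f"fixity mismatch copying {source} to {target}")
    return after


# Temporary directories stand in for aging and replacement media.
with tempfile.TemporaryDirectory() as old_media, \
        tempfile.TemporaryDirectory() as new_media:
    record = Path(old_media) / "record.bin"
    record.write_bytes(b"\x00\x01" * 1024)
    original_digest = sha256_of(record)
    copied_digest = refresh(record, Path(new_media) / "record.bin")
```

In practice the digest recorded at the time of the transfer would be stored alongside the record so that later refresh cycles can be compared against it.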
In addition to these current conditions associated with technology and records management, organizations face tremendous internal change management challenges with regard to reallocation of resources, business process improvements, collaboration and coordination between business areas, accountability, and the dynamic integration of evolving recordkeeping requirements. Building and sustaining the capability to manage digital information over long periods of time is a shared responsibility of all stakeholders.
A number of known threats may degrade or destroy electronic records and data:
Threats to LTDP of records can be internal or external, from natural disasters, computer or storage failures, and even from the financial viability of an organization.
The impact on the preserved records can be gauged by determining what percentage of the data has been lost and cannot be recovered or, for the data that can be recovered, what the impact or delay to users may be.
It should be noted that threats can be interrelated and more than one type of threat may impact records at a time. For instance, in the event of a natural disaster, operators are more likely to make mistakes, and computer hardware failures can create new software failures.
The digital preservation community recognizes that open, technology-neutral standards play a key role in ensuring that digital records are usable, understandable, and reliable for as far into the future as may be required.
There are two broad categories of digital preservation standards. The first category involves systems infrastructure capabilities and services that support a trustworthy repository. The second category relates to open standard technology-neutral file formats.
Digital preservation infrastructure capabilities and services that support trustworthy digital repositories include the international standard ISO 14721:2003, 2012 Space Data and Information Transfer Systems—Open Archival Information System (OAIS)—Reference Model, which is a key standard applicable to LTDP.4
The fragility of digital storage media in concert with ongoing and sometimes rapid changes in computer software and hardware poses a fundamental challenge to ensuring access to trustworthy and reliable digital content over time. Eventually, every digital repository committed to LTDP must have a strategy to mitigate computer technology obsolescence. Toward this end, the Consultative Committee for Space Data Systems developed an Open Archival Information System (OAIS) reference model to support formal standards for the long-term preservation of space science data and information assets. OAIS was not designed as an implementation model.
The OAIS Reference Model defines an archival information system as an archive, consisting of an organization of people and systems that has accepted the responsibility to preserve information and make it available and understandable for a designated community (i.e., potential users or consumers), who should be able to understand the information. Thus, the context of an OAIS-compliant digital repository includes producers who originate the information to be preserved in the repository, consumers who retrieve the information, and a management/organization that hosts and administers the digital assets being preserved.
OAIS encapsulates digital objects into information packages. Each information package includes the digital object content (a sequence of bits) and representation information that enables rendering of an object into human usable information along with preservation description information (PDI) such as provenance, context, and fixity.
The OAIS Information Model employs three types of information packages: a submission information package (SIP), an archival information package (AIP), and a dissemination information package (DIP). An OAIS-compliant digital repository preserves AIPs and any PDI associated with them. A SIP encompasses digital content that a producer has organized for submission to the OAIS. After the completion of quality assurance and transformation procedures, an AIP is created, which is the focus of preservation activity. Subsequently, a DIP is created that consists of an AIP or information extracted from an AIP customized to the requirements of the designated community of users and consumers.
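The flow from SIP to AIP to DIP can be illustrated with a small data model. This is a sketch under stated assumptions, not an OAIS implementation: the class fields, the `ingest` and `disseminate` functions, and the choice of SHA-256 for fixity are all hypothetical, and the quality assurance step is reduced to a single emptiness check.

```python
from dataclasses import dataclass, field
import hashlib


@dataclass(frozen=True)
class SIP:
    """Submission package as organized by a producer."""
    producer: str
    content: bytes


@dataclass(frozen=True)
class AIP:
    """Archival package: content, representation information, and PDI."""
    content: bytes
    representation_info: str
    pdi: dict = field(default_factory=dict)


@dataclass(frozen=True)
class DIP:
    """Dissemination package derived from an AIP for a consumer."""
    content: bytes
    representation_info: str


def ingest(sip: SIP, representation_info: str) -> AIP:
    """Run a stand-in quality check and build an AIP with basic PDI."""
    if not sip.content:
        raise ValueError("empty submission")
    pdi = {
        "provenance": sip.producer,
        "fixity": hashlib.sha256(sip.content).hexdigest(),
    }
    return AIP(sip.content, representation_info, pdi)


def disseminate(aip: AIP) -> DIP:
    """Deliver the whole AIP content; a real DIP may be an extract."""
    return DIP(aip.content, aip.representation_info)


sip = SIP(producer="Records Office", content=b"minutes of meeting")
aip = ingest(sip, representation_info="text/plain; charset=utf-8")
dip = disseminate(aip)
```

Keeping the PDI on the AIP, and not on the DIP, mirrors the point that the AIP is the focus of preservation activity while the DIP is tailored to the designated community.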
The core of OAIS is a functional model that consists of six entities:
Figure 17.1 displays the relationships between these six functional entities.5
In archival storage, the OAIS reference model articulates a migration strategy based on four primary types of AIP migration that are ordered by an increasing risk of potential information loss: refreshment, replication, repackage, and transformation.6
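Because the four migration types are ordered by risk, a repository can encode that ordering and prefer the least risky option that meets its need. The sketch below assumes a hypothetical `Migration` enumeration and `lowest_risk` helper; the one-line descriptions in the comments are paraphrases, not text from the standard.

```python
from enum import IntEnum


class Migration(IntEnum):
    """AIP migration types, numbered by increasing risk of information loss."""
    REFRESHMENT = 1     # replace a media instance with an equivalent copy
    REPLICATION = 2     # copy to new media; packaging and content unchanged
    REPACKAGE = 3       # packaging information changes, content does not
    TRANSFORMATION = 4  # the content bits themselves are changed


def lowest_risk(options):
    """Return the candidate migration type with the least risk."""
    return min(options)


ordered_names = [member.name for member in sorted(Migration)]
choice = lowest_risk([Migration.TRANSFORMATION, Migration.REPLICATION])
```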
OAIS is the lingua franca of digital preservation. The international digital preservation community has embraced it as the framework for viable and technologically sustainable digital preservation repositories. An LTDP strategy that is OAIS-conforming offers the best means available today for preserving the digital heritage of all organizations, private and public.
ISO 18492 provides practical methodological guidance for the long-term preservation and retrieval of authentic electronic document-based information, when the retention period exceeds the expected life of the technology (hardware and software) used to create and maintain the information assets. It emphasizes both the role of open standard technology-neutral formats in supporting long-term access and the engagement of IT specialists, document managers, records managers, and archivists in a collaborative environment to promote and sustain a viable digital preservation program.
ISO 18492 takes note of the role of ISO 15489 but does not cover processes for the capture, classification, and disposition of authentic electronic document-based information. Ensuring the usability and trustworthiness of electronic document-based information for as long as necessary in the face of limited media durability and technology obsolescence requires a robust and comprehensive digital preservation strategy. ISO 18492 describes such a strategy, which includes media renewal, software dependence, migration, open standard technology-neutral formats, authenticity protection, and security:
ISO 14721 (OAIS) acknowledged that an audit and certification standard was needed that incorporated the functional specifications for records producers, records users, ingest of digital content into a trusted repository, archival storage of this content, and digital preservation planning and administration. ISO 16363 is this audit and certification standard. Its use enables independent audits and certification of trustworthy digital repositories and thereby promotes public trust in digital repositories that claim they are trustworthy. To date only a handful of ISO 16363 test audits have been undertaken; additional time is required to determine how widely adopted the standard becomes.
ISO 16363 is organized into three broad categories: organizational infrastructure, digital object management, and technical infrastructure and security risk management. Each category is decomposed into a series of primary elements or components, some of which may be more appropriate for digital libraries than for public records digital repositories. In some instances there are secondary elements or components. An explanatory discussion of each element accompanies “empirical metrics” relevant to that element. The “empirical metrics” typically include high-level examples of how conformance can be demonstrated. Hence, they are subjective high-level conformance metrics rather than explicit performance metrics.
Organizational infrastructure7 consists of these primary elements:
Digital object management,8 which is the core of the standard, comprises these primary elements:
Technical infrastructure and security risk management primary elements9 include these:
ISO 16363 represents the gold standard of audit and certification for trustworthy digital repositories. In some instances the resources available to a trusted repository may not support full implementation of the audit and certification specifications. Decisions about where full and partial implementation is appropriate should be based on a risk assessment analysis.
ISO 14721 specifies that preservation metadata associated with all archival storage activities (e.g., generation of hash digests, transformation, and media renewal) should be captured and stored in PDI. This high-level guidance requirement demands greater specificity in an operational environment.
Toward this end, the U.S. Library of Congress and the Research Libraries Group supported a new international working group called PREservation Metadata: Implementation Strategies (PREMIS)10 to define a core set of preservation metadata elements with a supporting data dictionary that would be applicable to a broad range of digital preservation activities and to identify and evaluate alternative strategies for encoding, managing, and exchanging preservation metadata. Version 2.2 was released in June 2012.11
PREMIS enables designers and managers of digital repositories to have a clear understanding of the information required to support the “functions of viability, renderability, understandability, authenticity, and identity in a preservation context.” PREMIS accomplishes this through a data model that consists of five “semantic units” (think of them as high-level metadata elements, each of which is decomposed into sub-elements) and a data dictionary that decomposes these “semantic units” into a structured hierarchy. The five semantic units and their relationships are displayed in Figure 17.2.
Note the arrows that define relationships between these entities:
The PREMIS Data Dictionary decomposes objects, events, agents, and rights into a structured hierarchical schema. In addition, it contains semantic units that support documentation of relationships between Objects. An important feature of PREMIS is an XML schema for the PREMIS Data Dictionary. The primary rationale for the XML schema is to support the exchange of metadata information, which is crucial in ingest and archival storage. The XML schema enables automated extraction of preservation-related metadata in SIPs and population of this preservation metadata into AIPs. In addition, the XML schema can enable automatic capture of preservation events that are foundational for maintaining a chain of custody in archival storage.
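To make the idea of automatically captured preservation events concrete, the following sketch records a single event as XML. It is a simplified illustration only: the element names are modeled on Event semantic units in the PREMIS Data Dictionary, but the structure is flattened, the namespace is omitted, and the output has not been validated against the PREMIS XML schema. The `build_event` helper and the identifier value are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def build_event(event_id, event_type, outcome, when=None):
    """Build a simplified, PREMIS-like event element (not schema-valid)."""
    when = when or datetime.now(timezone.utc)
    event = ET.Element("event")
    identifier = ET.SubElement(event, "eventIdentifier")
    ET.SubElement(identifier, "eventIdentifierType").text = "local"
    ET.SubElement(identifier, "eventIdentifierValue").text = event_id
    ET.SubElement(event, "eventType").text = event_type
    ET.SubElement(event, "eventDateTime").text = when.isoformat()
    ET.SubElement(event, "eventOutcome").text = outcome
    return event


event = build_event(
    "evt-0001",
    "fixity check",
    "pass",
    when=datetime(2012, 6, 1, tzinfo=timezone.utc),
)
event_xml = ET.tostring(event, encoding="unicode")

# Another system can read the event back without knowing how it was written.
parsed = ET.fromstring(event_xml)
```

A chain of custody is then simply the ordered sequence of such events stored with the AIP.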
A digital file format specifies the internal logical structure of digital objects (i.e., binary bits of 1s and 0s) and signal encoding (e.g., text, image, sound, etc.). File formats are crucial to long-term preservation because a computer can open, process, and render only file formats that it recognizes. Many file formats are proprietary (also known as native), meaning that digital content can be opened and rendered only by the software application used to create, use, and store it. However, as IT has changed, some software vendors have introduced new products that no longer support earlier versions of a file format. In such instances these formats become “legacy” formats, and digital content embedded in them can be opened only with computer code written expressly for this purpose. Other vendors, such as Microsoft, support backward compatibility across multiple generations of technology so Microsoft Word 2010 can open and render documents in Microsoft Word 95. Nonetheless, it is unrealistic to expect any software vendor to support backward compatibility for its proprietary file formats for digital content that will be preserved for multiple decades.
In the late 1980s, an alternative to vendor-supported backward compatibility emerged to mitigate dependence on proprietary file formats through open system interoperable file formats. Essentially, this meant that digital content could be exported from one proprietary file format and imported to one or more other proprietary file formats. Over time, interoperable file formats evolved into open standard technology-neutral formats that today have these characteristics:
Because even open standard technology-neutral formats are not immune to technology obsolescence, their selection must take into account their technical sustainability and implementation in digital repositories. The PRONOM program of the National Archives of the United Kingdom and the U.S. Library of Congress's work on the long-term sustainability of file formats both assess the sustainability of open standard technology-neutral formats.
The recommended open standard technology-neutral formats for nine content types listed in Table 17.1 are based on this ongoing work, along with preferred file formats supported by Library and Archives Canada and other national archives. Unlike PDF/A, several of these file formats (e.g., XML, JPEG 2000, and Scalable Vector Graphics [SVG]) were not explicitly designed for digital preservation. It cannot be emphasized too strongly that this list of recommended open standard technology-neutral formats (or any other comparable list) is not static and will change over time as technology changes.
The PDF/A file format was designed specifically for digital preservation.
PDF/A is an open standard technology-neutral format that enables the accurate representation of the visual appearance of digital content without regard for the proprietary format or application in which it was created or used. PDF/A is widely used in digital repositories as a preservation format for static textual and image content. Note that PDF/A is agnostic with regard to digital imaging processes or storage media. PDF/A supports conversion of TIFF and PNG images to PDF/A. There are two levels of conformance to PDF/A specifications. PDF/A-1a references the use of a “well-formed” hierarchical structure with XML tags that enable searching for a specific tag in a very large digital document. PDF/A-1b does not require this conformance, and as a practical matter, it does not affect the accurate representation of visual appearance.
Since its publication in 2005, there have been two revisions of PDF/A. The first revision, PDF/A-2, was aligned with the Adobe Portable Document Format 1.7 published specifications, which Adobe released to the public domain in 2011. The second revision, PDF/A-3, supports embedding documents in other formats, such as the original source document, in a PDF document.
XML is a markup language derived from Standard Generalized Markup Language (SGML) that logically separates the rendering of a digital document from its content to enable interoperability across multiple technology platforms. Essentially XML defines rules for marking up the structure of a document and its content in American Standard Code for Information Interchange (ASCII) text. Any conforming interoperable XML parser can render the original structure and content. XML-encoded text is human-readable because any text editor can display the marked-up text and content. XML is ubiquitous in IT environments because many communities of users have developed document type definitions unique to their purposes, including genealogy, math, and relational databases. Structured data elements work with relational databases, so this enables relational database portability.
Tagged image file format (TIFF) was initially developed by the Aldus Corporation in 1986 for storing black-and-white images created by scanners and desktop publishing applications. Over the next six years, several new features were added, including a wide range of color images and compression techniques, including lossless compression. The most recent version, TIFF 6.0, was released by Aldus in 1992. Subsequently, Adobe purchased Aldus and chose not to support any further significant revisions and updates. Nonetheless, TIFF is widely used in desktop scanners for creating digital images for preservation. With such a large base of users, it is likely to persist for some time, but Adobe's decision to discontinue further development of TIFF means that it will lack features of other current and future image file formats. Fortunately, there are tools available to convert TIFF images to PDF and PNG images.
The W3C and the Internet Engineering Task Force supported the development of Portable Network Graphics (PNG) as a replacement for the Graphics Interchange Format (GIF) because the GIF compression algorithm was protected by patent rights rather than being in the public domain, as many believed. In 2003, PNG became an international standard that supports lossless compression, grayscale, and true-color images with bit depths that range from 1 to 16 bits per channel, file integrity checking, and streaming capability.
Vector graphics images consist of two-dimensional lines, colors, curves, or other geometrical shapes and attributes that are stored as mathematical expressions, such as where a line begins, its shape, where it ends, and its color. Changes in these mathematical expressions will result in changes in the image. Unlike raster images, there is no loss of clarity of a vector graphics image when it is made larger. SVG images and their behavior properties are defined in XML text files, which means any named element in an SVG image can be indexed and searched. SVG images also can be accessed by any text editor, which minimizes dependence on a specific software application to render and edit the images.
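Because an SVG image is XML text, a general-purpose XML parser is enough to index its named elements. The sketch below uses a made-up two-shape drawing; the `id` values `border` and `seal` are hypothetical, and only the Python standard library is assumed.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# A tiny illustrative drawing: a square outline and a filled circle.
svg_text = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <rect id="border" x="0" y="0" width="100" height="100" fill="none" stroke="black"/>
  <circle id="seal" cx="50" cy="50" r="20" fill="red"/>
</svg>"""

root = ET.fromstring(svg_text)

# Index every element that carries an id attribute.
by_id = {element.get("id"): element for element in root.iter() if element.get("id")}

seal = by_id["seal"]
seal_radius = seal.get("r")
```

The same lookup would be impossible on a raster image of the drawing, where the circle exists only as pixels.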
JPEG 2000 is an international standard for compressing full-color and grayscale digital images and rendering them as full-size images and thumbnail images. Unlike JPEG, its predecessor, which supported only lossy compression, JPEG 2000 supports both lossy and lossless compression. Lossy compression means that during compression, bits that are considered technically redundant are permanently deleted. Lossless compression means no bits are lost or deleted. The latter is very important for LTDP because lossy compression is irreversible. JPEG 2000 is widely used in producing digital images in digital cameras and is an optional format in many digital scanners.
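The difference between the two kinds of compression can be shown on a few bytes of sample data. This is an illustration of the principle only: `zlib` stands in for a lossless codec and a crude bit-masking step stands in for lossy compression; neither is the wavelet-based method that JPEG 2000 actually uses, and the pixel values are invented.

```python
import zlib

# Invented 8-bit grayscale samples.
pixels = bytes([10, 11, 12, 13, 200, 201, 202, 203] * 64)

# Lossless: decompressing returns exactly the original bits.
compressed = zlib.compress(pixels)
restored = zlib.decompress(compressed)

# Crude lossy stand-in: discard the four low-order bits of each sample.
# Once discarded, those bits cannot be reconstructed from the result.
quantized = bytes(value & 0xF0 for value in pixels)

lossless_round_trip = restored == pixels
lossy_round_trip = quantized == pixels
```

Here `lossless_round_trip` is true and `lossy_round_trip` is false, which is the property that makes lossless compression the safer choice for LTDP.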
MPEG-2 is an international broadcast standard for lossy compression of moving images and associated audio. The major competitor for MPEG-2 appears to be Motion JPEG 2000, which is used in small devices, such as cell phones.
First issued by the European Broadcasting Union in 1997 and revised in 2001 (v1) and 2011 (v2), the Broadcast Wave Format (BWF) is a file format for audio data that extends the Microsoft Wave audio format. Its support of metadata ensures that it can be used for the seamless exchange of audio material between different broadcast environments and between equipment based on different computer platforms.
WebARChive (WARC) is an extension of the Internet Archive's ARC format to store digital content harvested through “Web crawls.” WARC was developed to support the storage, management, and exchange of large volumes of “constituent data objects” in a single file. Currently, WARC is used to store and manage digital content collected through Web crawls and data collected by environmental sensing equipment, among others.
Implementing a sustainable LTDP program is not an effort that should be undertaken lightly. Digital preservation is complex and costly and requires collaboration with all of the stakeholders who are accountable for or have an interest in ensuring access to usable, understandable, and trustworthy electronic records for as far into the future as may be required.
As noted earlier, ISO 14721 and ISO 16363 establish the baseline functions and specifications for ensuring access to usable, understandable, and trustworthy electronic records, whether this involves regulatory and legal compliance for a business entity, vital records, accountability for a government unit, or cultural memory for a public or private institution. Most first-time readers who review the functions and specifications of ISO 14721 and ISO 16363 are likely to be overwhelmed by the detail and complexity of almost 150 specifications.
A useful approach that both simplifies these specifications and provides explicit criteria regarding conformance to ISO 14721 and ISO 16363 is the Long-Term Digital Preservation Capability Maturity Model® (DPCMM).13 The DPCMM, which is described in some detail in this section, draws on functions and preservation services identified in ISO 14721 (OAIS) as well as attributes specified in ISO 16363, Audit and Certification of Trustworthy Repositories. It is important to note that the DPCMM is not a one-size-fits-all approach to ensuring long-term access to authentic electronic records. Rather, it is a flexible approach that can be adapted to an organization's specific requirements and resources.
DPCMM can be used to identify the current state capabilities of digital preservation that form the basis for debate and dialogue regarding the desired future state of digital preservation capabilities, and the level of risk that the organization is willing to assume. In many instances, this is likely to come down to the question of what constitutes digital preservation that is good enough to fulfill the organization's mission and meet the expectations of its stakeholders. The DPCMM has five incremental stages, which are depicted in Figure 17.3. In Stage 1, a systematic digital preservation program has not been undertaken or the digital preservation program exists only on paper, whereas Stage 5 represents the highest level of sustainable digital preservation capability and repository trustworthiness that an organization can achieve.
The Long-Term Digital Preservation Capability Maturity Model (DPCMM) systematically organizes high-level conformance to ISO 14721 and ISO 16363.
The DPCMM is based on the functional specifications of ISO 14721 and ISO 16363 and accepted best practices in operational digital repositories. It is a systems-based tool for charting an evolutionary path from disorganized and undisciplined management of electronic records, or the lack of a systematic electronic records management program, into increasingly mature stages of digital preservation capability.
The goal of the DPCMM is to identify at a high level where an electronic records management program stands in relation to optimal digital preservation capabilities and to report gaps, capability levels, and preservation performance metrics to resource allocators and other stakeholders, so that they can establish priorities for achieving enhanced capabilities to preserve and ensure access to long-term electronic records.
Stage 5 is the highest level of digital preservation readiness capability that an organization can achieve. It includes a strategic focus on digital preservation outcomes by continuously improving the manner in which electronic records life cycle management is executed. Stage 5 digital preservation capability also involves benchmarking the digital preservation infrastructure and processes relative to other best-in-class digital preservation programs and conducting proactive monitoring for breakthrough technologies that can enable the program to significantly change and improve its digital preservation performance. In Stage 5, few if any electronic records that merit long-term preservation are at risk.
Stage 4 capability is characterized by an organization with a robust infrastructure and digital preservation processes that are based on ISO 14721 specifications and ISO 16363 audit and certification criteria. At this stage, the preservation of electronic records is framed entirely within a collaborative environment in which there are multiple participating stakeholders. Lessons learned from this collaborative framework serve as the basis for adapting and improving capabilities to identify and proactively bring long-term electronic records under lifecycle control and management. Some electronic records that merit long-term preservation still may be at risk.
Stage 3 describes an environment that embraces the ISO 14721 specifications and other best practice standards and schemas and thereby establishes the foundation for sustaining an enhanced digital preservation capability over time. This foundation includes successfully completing repeatable projects and outcomes that support the enterprise digital preservation capability and enables collaboration, including shared resources, between record-producing units and entities responsible for managing and maintaining trustworthy digital repositories. In this environment, many electronic records that merit long-term preservation are likely to remain at risk.
Stage 2 describes an environment where an ISO 14721–based digital repository is not yet in place. Instead, a surrogate repository that satisfies some but not all of the ISO 14721 specifications is available to some records producers. Typically, the digital preservation infrastructure and processes of the surrogate repository are not systematically integrated into business processes or universally available, so the state of digital preservation is somewhat rudimentary and life cycle management of the organization's electronic records is incomplete. There is some understanding of digital preservation issues, but it is limited to relatively few individuals. There may be virtually no relationship between the success or failure of one digital preservation initiative and the success or failure of another one. Success is largely the result of exceptional (perhaps even heroic) actions of an individual or a project team. Knowledge about such success is not widely shared or institutionalized. Most electronic records that merit long-term preservation are at risk.
Stage 1 describes an environment in which the specifications of ISO 14721 and other standards may be known, accepted in principle, or under consideration, but they have not been formally adopted or implemented by the record-producing organization. Generally, there may be some understanding of digital preservation issues and concerns, but this understanding is likely to be reflected only in ad hoc electronic records management and digital preservation infrastructure, processes, and initiatives. Although there may be some isolated instances of individuals attempting to preserve electronic records on a workstation or removable storage media (e.g., DVD or hard drive), practically all electronic records that merit long-term preservation are at risk.
This capability maturity model consists of 15 components, or key process areas, that are necessary and required for the long-term preservation of usable, understandable, accessible, and trustworthy electronic records. Each component is identified and is accompanied by explicit performance metrics for each of the five levels of digital preservation capability.
The objective of the model is to provide a process and performance framework (or benchmark) against best practice standards and foundational principles of digital preservation, records management, information governance, and archival science. Figure 17.4 displays the components of the DPCMM.
Scope notes for each of the graphic elements in Figure 17.4 are provided next for additional clarity. Numbered components in the model are associated with performance metrics and capability levels described in the next section.
1. Digital preservation policy. The organization charged with ensuring preservation and access to long-term and permanent legal, fiscal, operational, and historical records should issue its digital preservation policy in writing, including the purpose, scope, accountability, and approach to the operational management and sustainability of trustworthy repositories.
2. Digital preservation strategy. The organization charged with the preservation of long-term and permanent business, government, or historical electronic records must proactively address the risks associated with technology obsolescence, including plans related to periodic renewal of storage devices, storage media, and adoption of preferred preservation file formats.
3. Governance. The organization has a formal decision-making framework that assigns accountability and authority for the preservation of electronic records with long-term and permanent historical, fiscal, operational, or legal value, and articulates approaches and practices for trustworthy digital repositories sufficient to meet stakeholder needs. Governance is exercised in conjunction with information management and technology functions and with other custodians and digital preservation stakeholders, such as records-producing units and records consumers, and enables compliance with applicable laws, regulations, record retention schedules, and disposition authorities.
4. Collaboration. Digital preservation is a shared responsibility. The organization with a mandate to preserve long-term and permanent electronic business, government, or historical records in accordance with accepted digital preservation standards and best practices is well served by maintaining and promoting collaboration among its internal and external stakeholders. Interdependencies between and among the operations of records producing units, legal and statutory requirements, IT policies and governance, and historical accountability should be addressed systematically.
5. Technical expertise. A critical component in a sustainable digital preservation program is access to professional technical expertise that can proactively address business requirements and respond to impacts of evolving technologies. The technical infrastructure and key processes of an ISO 14721/ISO 16363–conforming archival repository require professional expertise in archival storage, digital preservation solutions, and life cycle electronic records management processes and controls. This technical expertise may exist within the organization or be provided by a centralized function or service bureau or by external service providers, and should include an in-depth understanding of critical digital preservation actions and their associated recommended practices.
6. Open standard technology-neutral formats. A fundamental requisite for a sustainable digital preservation program that ensures long-term access to usable and understandable electronic records is mitigation of obsolescence of file formats. Open standard platform-neutral file formats are developed in an open public setting, issued by a certified standards organization, and have few or no technology dependencies. Current preferred open standard technology file format examples include:
Over time, new digital preservation tools and solutions will emerge that will require new open standard technology-neutral file formats. Open standard technology-neutral formats are backward compatible, so they can support interoperability across technology platforms over an extended period of time.
7. Designated community. The organization that has responsibility for preservation and access to long-term and permanent legal, operational, fiscal, or historical government records is well served through proactive outreach and engagement with its designated community. There are written procedures and formal agreements with records-producing units that document the content, rights, and conditions under which the digital repository will ingest, preserve, and provide access to electronic records. Written procedures are in place regarding the ingest of electronic records and access to its digital collections. Records producers will submit fully conforming ISO 14721/ISO 16363 SIPs while DIPs are developed and updated in conjunction with its user communities.
The most complete trustworthy digital repository is based on models and standards that include ISO 14721, ISO 16363, and generally accepted best digital preservation practices. The repository may be managed by the organization that owns the electronic records or may be provided as a service by an external third party. It is likely that many organizations initially will rely on surrogate digital preservation capabilities and services that approximate some but not all of the capabilities and services of a conforming ISO 14721/ISO 16363 trustworthy digital repository.
1. Electronic records survey. A trustworthy repository cannot fully execute its mission or engage in realistic digital preservation planning without a projected volume and scope of electronic records that will come into its custody. It is likely that some information already exists in approved retention schedules, but it may require further elaboration as well as periodic updates, especially with regard to preservation ready, near preservation ready, and legacy electronic records held by records-producing units.
2. Ingest. A digital repository that conforms to ISO 14721/ISO 16363 has the capability to systematically ingest (receive and accept) electronic records from records-producing units in the form of SIPs, move them to a staging area where virus checks and content and format validations are performed, transform electronic records into designated preservation formats as appropriate, extract metadata from SIPs and write it to PDI, create AIPs, and transfer the AIPs to the repository's storage function. This process is considered the minimal work flow for transferring records into a digital repository for long-term preservation and access.
3. Archival storage. ISO 14721 delineates systematic automated storage services that support receipt and validation of successful transfer of AIPs from ingest, creation of PDI for each AIP that confirms its “fixity”14 during any preservation actions through the generation of hash digests, capture and maintenance of error logs, updates to PDI including transformation of electronic records to new formats, production of DIPs from access, and collection of operational statistics.
4. Device and media renewal. No known digital device or storage medium is invulnerable to decay and obsolescence. A foundational digital preservation capability is ensuring the readability of the bitstreams underlying the electronic records. ISO 14721/ISO 16363 specify that a trustworthy digital repository's storage devices and storage media should be monitored and renewed (“refreshed”) periodically to ensure that the bitstreams remain readable over time. A projected life expectancy of removable storage media does not necessarily apply in a specific instance of storage media. Hence, it is important that a trustworthy digital repository have a protocol for continuously monitoring removable storage media (e.g., magnetic tape, external tape drive, or other media) to identify any that face imminent catastrophic loss. Ideally, this renewal protocol would execute renewal automatically after review by the repository.
5. Integrity. A key capability in conforming ISO 14721/ISO 16363 digital repositories is ensuring the integrity of the records in their custody, which involves two related preservation actions. The first action generates a hash digest for each electronic record by applying a hash algorithm to its bitstream (similar in purpose to a cyclic redundancy check, though the two are not the same thing), so that accidental or intentional alterations that can occur during device/media renewal and internal data transfers can be detected. The second action involves integrity documentation that supports an unbroken electronic chain of custody captured in the PDI in AIPs.
6. Security. Contemporary enterprise information systems typically execute a number of shared or common services that may include communication, name services, temporary storage allocation, exception handling, role-based access rights, security, backup and business continuity, and directory services, among others. A conforming ISO 14721/ISO 16363 digital repository is likely to be part of an information system that may routinely provide some or perhaps all of the core security, backup, and business continuity services, including firewalls, role-based access rights, data-transfer-integrity validations, and logs for all preservation activities, including failures and anomalies, to demonstrate an unbroken chain of custody.
7. Preservation metadata. A digital repository collects and maintains metadata that describes actions associated with custody of long-term and permanent records, including an audit trail that documents preservation actions carried out, why and when they were performed, how they were carried out, and with what results. A current best practice is the use of a PREMIS-based data dictionary to support an electronic chain of custody that documents authenticity over time as preservation actions are executed. Capture of all related metadata, transfer of the metadata to any new formats/systems, and secure storage of metadata are critical. All metadata is stored in the PDI component of conforming AIPs.
8. Access. Organizations with a mandate to support access to permanent business, government, or historical records are subject to authorized restrictions. A conforming ISO 14721/ISO 16363 digital repository will provide consumers with trustworthy records in “disclosure-free” DIPs redacted to protect privacy, confidentiality, and other rights, where appropriate, and searchable metadata that users can query to identify and retrieve records of interest to them. Production of DIPs is tracked, especially when they involve extractions, to verify their trustworthiness and to identify query trends that are used to update electronic accessibility tools to support these trends.
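The ingest, archival storage, integrity, and preservation metadata components above all depend on recording a hash digest and an audit trail in the PDI. The following Python sketch shows one way this could look. It is illustrative only: the `AIP` class, the layout of the `pdi` dictionary, and the function names are hypothetical, and the choice of SHA-256 is an assumption, since the text does not prescribe an algorithm. It is not a PREMIS-conforming or ISO 14721-conforming implementation.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


def sha256_digest(data: bytes) -> str:
    """Return the SHA-256 hash digest of a bitstream as a hex string."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class AIP:
    """Hypothetical, greatly simplified Archival Information Package."""
    record_id: str
    content: bytes
    # Preservation Description Information: fixity value plus audit trail.
    pdi: dict = field(default_factory=dict)


def log_event(aip: AIP, event_type: str, outcome: str) -> None:
    """Append an audit-trail entry (what, when, result) to the PDI."""
    aip.pdi.setdefault("events", []).append({
        "type": event_type,
        "when": datetime.now(timezone.utc).isoformat(),
        "outcome": outcome,
    })


def ingest(record_id: str, content: bytes) -> AIP:
    """Create an AIP from submitted content and record its fixity digest."""
    aip = AIP(record_id=record_id, content=content)
    aip.pdi["fixity"] = {
        "algorithm": "SHA-256",  # assumed algorithm, not mandated by the text
        "digest": sha256_digest(content),
    }
    log_event(aip, "ingest", "success")
    return aip


def verify_fixity(aip: AIP) -> bool:
    """Recompute the digest and compare it with the value stored at ingest."""
    ok = sha256_digest(aip.content) == aip.pdi["fixity"]["digest"]
    log_event(aip, "fixity check", "success" if ok else "failure")
    return ok


if __name__ == "__main__":
    aip = ingest("rec-001", b"minutes of the board meeting")
    print(verify_fixity(aip))       # unchanged content passes the check
    aip.content += b" (edited)"     # simulate an alteration during a transfer
    print(verify_fixity(aip))       # the altered content fails the check
    for event in aip.pdi["events"]:
        print(event["type"], event["outcome"])
```

In a real repository the fixity check would typically run after every device or media renewal and internal data transfer, and each result would be written to the PDI so that the chain of custody remains unbroken.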
Digital preservation performance metrics for each of the five levels of the model have been mapped to each of the 15 numbered components described in the previous section. The performance metrics are explicit empirical indicators that reflect an incremental level of digital preservation capability. The digital preservation capability performance metrics for digital preservation strategy listed in Table 17.2 illustrate the results of this mapping exercise.15
Conducting a gap analysis of its digital preservation capabilities using these performance metrics enables the organization to identify both its current state and desired future state of digital preservation capabilities. In all likelihood, this desired future state will depend on available resources, the organization's mission, and stakeholder expectations. “Good-enough” digital preservation capabilities will vary by organization; what is good enough for one organization is unlikely to coincide with what is good enough for another.
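A gap analysis of this kind can be sketched in a few lines of Python. The sketch assumes that each component is scored on the 0 to 4 scale used in Table 17.2 and that an unassessed component counts as level 0. The function name `gap_analysis` and the sample scores are hypothetical and are not drawn from any actual assessment.

```python
from typing import Dict, List, Tuple

MAX_LEVEL = 4  # assumption: capability levels run from 0 to 4, as in Table 17.2


def gap_analysis(
    current: Dict[str, int], target: Dict[str, int]
) -> List[Tuple[str, int, int, int]]:
    """Return (component, current, target, gap) rows for every component
    that is below its target level, largest gap first."""
    rows = []
    for component, goal in target.items():
        level = current.get(component, 0)  # unassessed components count as 0
        for value in (level, goal):
            if not 0 <= value <= MAX_LEVEL:
                raise ValueError(f"{component}: level {value} is outside 0-{MAX_LEVEL}")
        if goal > level:
            rows.append((component, level, goal, goal - level))
    # Sort by descending gap, then alphabetically for a stable report order.
    return sorted(rows, key=lambda row: (-row[3], row[0]))


if __name__ == "__main__":
    # Hypothetical scores for four of the 15 components.
    current_state = {
        "Digital preservation policy": 2,
        "Digital preservation strategy": 1,
        "Ingest": 0,
        "Integrity": 3,
    }
    desired_state = {
        "Digital preservation policy": 3,
        "Digital preservation strategy": 3,
        "Ingest": 2,
        "Integrity": 3,
    }
    for component, level, goal, gap in gap_analysis(current_state, desired_state):
        print(f"{component}: level {level} -> {goal} (gap {gap})")
```

Sorting by gap size is only one possible way to prioritize. An organization might instead weight components by risk or by available resources, consistent with its own definition of "good enough."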
Any organization with long-term or permanent electronic records in its custody must ensure that the electronic records can be read and correctly interpreted by a computer application, rendered in an understandable form to humans, and trusted as accurate representations of their logical and physical structure, substantive content, and context. To achieve these goals, a digital repository should operate under the mandate of a digital preservation strategy that addresses 10 digital preservation processes and activities:
Level | Capability Description |
0 | A formal strategy to address technology obsolescence does not exist. |
1 | A strategy to mitigate technology obsolescence consists of accepting electronic records in their native format with the expectation that new software will become available to support these formats. During this interim period, viewer technologies will be relied on to render usable and understandable electronic records. |
2 | Electronic records in interoperable “preservation-ready”* file formats and transformation of one native file format to an open standard technology-neutral file format are supported. Changes in information technologies that may impact electronic records collections and the digital repository are monitored proactively and systematically. |
3 | The organization supports transformation of selected native file formats to preferred/supported preservation file formats in the trustworthy digital repository. Records-producing units are advised to use preservation-ready file formats for permanent or indefinite long-term (e.g., case files, infrastructure files) electronic records in their custody. |
4 | Electronic records in all native formats are transformed to available open standard technology-neutral file formats. |
* The term “preservation-ready file formats” refers to open standard technology-neutral formats that the organization has identified as preferred for long-term digital preservation.
An alternative is to forgo this costly process in the hope that a future technology, such as emulation, will be widely available and relatively inexpensive. Meanwhile, the repository would rely on a file viewer technology, such as Inside Out, to render legacy electronic records into a format understandable to humans, with the exact logical and physical structure and representation they had at the time they were created and used.
A robust firewall that blocks unauthorized access with tightly controlled role-based permission rights will help protect the security of records in the custody of the repository.
A further enhancement to protect against a cataclysmic natural or man-made disaster is maintaining a backup copy of the repository's holdings at an off-site facility.
The design and implementation of a digital repository that operates under this digital preservation strategy can be carried out in several different ways. One way is to use internal expertise to build a stand-alone repository that conforms to these digital preservation strategy requirements. Typically, an internally built repository is costly, takes considerable time to implement, and may not meet all expectations because of technical inexperience. An alternative is to use the services and/or solutions offered by an external institution or supplier. A third-party solution is offered by Archivematica, a Vancouver, British Columbia, company that specializes in the use of open-source software and conformance to the specifications of ISO 14721. “Archivematica is a free and open-source digital preservation system that is designed to maintain standards-based, long-term access to collections of digital objects.”16 Another company, Tessella Technology & Consulting,17 has an ISO 14721-conforming digital preservation solution called Safety Deposit Box that has been implemented in a number of national archives. In June 2012, Tessella introduced Preservica,18 a cloud-based implementation of the Safety Deposit Box that runs on Amazon Web Services. It is likely that other repository solutions, preservation services, and cloud-based digital preservation services will emerge over the next few years. The digital preservation strategy discussed earlier can be used to assess the capabilities of these solutions.
Organizations face significant challenges in meeting their LTDP needs, especially organizations whose primary mission is to preserve and provide access to permanent records. They must collaborate with internal and external stakeholders, develop governance policies and strategies to govern and control information assets over long periods of time, inventory records in the custody of records producers, monitor technology changes and evolving standards, and sustain trustworthy digital repositories. The most important consideration is to determine what level of LTDP maturity is appropriate, achievable, and affordable for the organization and to begin working methodically toward that goal for the good of the organization and its stakeholders over the long term. In addition, organizations should focus on what is doable over the next 10 to 20 years rather than the next 50 or 100 years.
1. Consultative Committee for Space Data Systems, Reference Model for an Open Archival Information System (OAIS) (Washington, DC: CCSDS Secretariat, 2002), p. 1-1.
2. Kate Cumming, “Metadata Matters,” in Julie McLeod and Catherine Hare, eds., Managing Electronic Records, p. 48 (London: Facet, 2005).
3. David Rosenthal et al., “Requirements for Digital Preservation Systems,” D-Lib Magazine 11, no. 11 (November 2005), www.dlib.org/dlib/november05/rosenthal/11rosenthal.html.
4. “ISO 14721:2003, 2012 Space Data and Information Transfer Systems—Open Archival Information System—Reference Model,” www.iso.org/iso/catalogue_detail.htm?csnumber=24683 (accessed May 21, 2012).
7. See ISO 16363:2012 (E), sections 3.1–3.5.2.
8. See ibid., sections 4.1–4.6.2.1.
9. See ibid., sections 5.1–5.2.3.
10. For a useful overview of PREMIS, see Priscilla Caplan, “Understanding PREMIS,” Library of Congress, February 1, 2009, www.loc.gov/standards/premis/understanding-premis.pdf.
11. Library of Congress, “PREMIS Data Dictionary Version 2.2: Hierarchical Listing of Semantic Units,” September 13, 2012, www.loc.gov/standards/premis/v2/premis-dd-Hierarchical-Listing-2-2.html.
12. Library of Congress, PREMIS Data Dictionary for Preservation Metadata, Version 2.1 (January 2011).
13. Charles Dollar and Lori Ashley are codevelopers of this model. Since 2007 they have used it successfully in both the public and private sectors. The most recent instance is a digital preservation capability assessment for the U.S. Council of State Archivists (CoSA). For more information about the model, see “Digital Preservation Capability Maturity Model” at www.savingthedigitalworld.com (accessed December 12, 2013).
14. ISO 14721 uses “fixity” to express the notion that there have been no unauthorized changes to electronic records and associated Preservation Description Information in the custody of the repository. See ISO 14721:2003 (E): 1.6.
15. For information about digital preservation capability performance metrics, visit “Digital Preservation Capability Maturity Model.”
16. Archivematica, “What Is Archivematica?” October 15, 2012, www.archivematica.org/wiki/Main_Page.
17. Tessella, “Tessella SDB,” www.tessella.com/tag/safety-deposit-box/ (accessed June 28, 2012).
18. Tessella, “Preservica: Digital Preservation as a Service,” January 2011, www.digital-preservation.com/wp-content/uploads/Paas-Description-V3-Alternate-Web.pdf.
* Portions of this chapter are adapted from Chapter 17, Robert F. Smallwood, Managing Electronic Records: Methods, Best Practices, and Technologies, © John Wiley & Sons, Inc., 2013. Reproduced with permission of John Wiley & Sons, Inc.