Repository design
In this chapter, we introduce the basic concepts and elements that comprise a repository. Repositories encapsulate not only the content being managed but also the various metadata elements and infrastructure that support the IBM FileNet Content Manager functionality. In addition, we describe the repository design elements and guidelines for using these elements.
We discuss repository design goals, design approaches and processes, repository naming standards, and the organizational, design, and content objects that are used to populate a repository.
4.1 Repository design goals
Repositories are the central components of IBM FileNet Content Manager solutions. In this chapter, when we talk about repositories, we do not mean just the low-level databases, file systems, and other technical components where pieces of data reside. We mean a more encompassing definition that includes not only those things, but higher level constructs, such as access control, relationships among objects, different types of business objects, and various ways that users and applications interact with them. Some solutions might have repositories composed of a single object store. Other solutions are composed of dozens of object stores, file systems, storage devices, and other resources.
Repositories store content, such as office documents, images, records, and other types of electronic content along with associated metadata. FileNet Content Manager repositories are capable of storing billions of documents and records, providing a centrally accessible, enterprise-wide library of information that can be centrally managed and used in many different ways.
The decomposition of FileNet Content Manager solutions into the various repository elements facilitates not only the separation of logical and functional purposes but also a number of additional goals. The architectural framework offers features that support the specific goals of scalability, maintainability, securability, well-behaved enterprise citizenship, and flexibility for future function and growth. Each section in this chapter explains specific features of these elements.
Security features are present at almost every level and manifestation of the repository elements and, in many instances, in multiple ways. These features provide various security granularities, from broad to specific and individualized levels. See Chapter 5, “Security” on page 151 for additional details of security features.
4.2 Object-oriented design
IBM FileNet Content Manager follows an object-oriented design (OOD) paradigm. Every element represented in the system exists as an object. For example, an element can be a content object that contains the metadata and reference for a specific document, or it can be the definition of a document class that defines what the metadata objects look like.
The following design aspects are complementary high-level approaches to repository design:
•Content decomposition. Examine the types of content across the organization, the purpose of the content, and then group the content based on shared properties and metadata. For example, policy documents can be grouped together, separate from checks, statements, and claims. The perspective of this approach focuses on the content and types of content contained in specific documents.
•Grouping content by relationship. Examine specific relationships between documents, such as checks related to a specific claim, which in turn is related to a specific policy. The perspective of this approach focuses on formalized relationships between documents.
•How content is used and accessed. This analysis can reveal content relationships that are not formally captured in metadata. For example, a spreadsheet listing appraisers in a specific geography is usually accessed along with claims. Customer service representatives typically look at all documents relating to a specific user or geography. The perspective of this approach focuses on how people access and use content to complete their tasks.
•Business processes associated with the content. The business processes that use the content, the document lifecycles, and the workflows all give a perspective that is based on the functionality of the documents. The perspective of this approach allows the grouping of content based on what it does. Content can also be combined with other content to create new content. For example, a report combines data from various sources and is generated by a report generation process.
4.2.1 Design approaches
Two basic directions from which to approach repository design are bottom-up and top-down. Both approaches offer specific benefits and advantages, and each carries limitations that can make it unusable in a specific situation. Because solution design is an iterative process, and because it can include a reasonably large scope, it is common to apply both approaches to different areas of the design as appropriate. It is often best to employ both design approaches, playing to the strengths of each and reconciling their perspectives as they meet in the middle.
Regardless of which design approach is used, alone or in combination with others, the ultimate design goals remain the same: a specific set of clear business requirements must always drive the solution.
Bottom-up design approach
Approaching the design from the bottom up has the advantage of enabling the use of existing content, organization, business knowledge, and expertise that are either explicitly or implicitly present within the organization. Involving business users and subject matter experts (SMEs) greatly enhances the utility and usability of the resultant design.
Designing the repository from the bottom up means analyzing the existing content and processes in use by the organization and synthesizing abstract entities from this information. The design emerges by repeatedly grouping the resultant entities by a specific set of characteristics and then abstracting these groupings into the next layer up. Each level of organization allows a different facet of the design to be focused on and separated from the others. During the requirements gathering period, collect inputs from all groups that will be involved in using, building, testing, training, or operating the system.
Recommendations: Interview the business users and SMEs and record their comments as a set of initial requirements for the system. Use a proof-of-concept (POC) to validate the requirements and design in the early phases of the project.
The bottom-up approach has the advantage of allowing you to work with existing, well-understood content and with workers who have expert knowledge about that content. Often, as the design grows from the bottom up, it becomes more difficult, especially for the knowledge workers, to abstract further away from the concrete details with which they are used to working.
The bottom-up approach has the disadvantage of carrying forward all of the implicit assumptions about how the existing problem space is approached, including any artificial constructs that exist for historic or other reasons. Some of those might be contrary to a good design. It is frequently difficult to get past those legacy design decisions in order to understand the true underlying requirements.
Top-down design approach
Approaching the design from the top down has the advantage of allowing the design to be formalized from a clean start. Any existing faults in the system, historical processes, procedures that no longer add business value, and any preconceived notions of what is expected can be avoided. This allows the enterprise viewpoint to be fully exercised and elevates considerations such as future flexibility, growth, and overall integration structure.
This process typically starts not with SMEs but with solution domain experts, who understand the technology and architecture of enterprise content management systems. The top-down approach works its way down to the level where SMEs must be consulted for the final and essential details.
Designing from the top down involves understanding the global picture and decomposing the various levels of the design through clear design goals or specific design choices. It is also an iterative process, driving from the most abstract toward the most concrete levels. By designing from the top down, the specific order of design characteristics can be approached in the manner that makes the most strategic sense for the organization.
The top-down approach has the advantage of developing a design that does not include any artificial non-technical barriers, for example, historical organizational structures. It produces a design that emphasizes the strategic requirements of the solution. This often results in the most flexible and adaptable design for the future.
The disadvantage of the top-down approach is the difficulty of mapping existing content and processes into the new design. As the design iterations approach the more concrete aspects and must be mapped directly to concrete business entities, the process can become conceptually and politically difficult for knowledge workers, depending on the organization's history.
Recommendations: Even if you start with a top-down approach, keep SMEs informed and involved at an appropriate level. Let them know that they will be vitally involved as you get to the more concrete layers. An information vacuum can create genuine misunderstandings that can make everyone’s job more difficult.
4.2.2 Design processes
Producing the best possible design requires coordination and cooperation from all of the major areas that the solution touches. In addition, all of the major areas that will be directly affected by the solution need to be involved and committed to the goals. However, that is not always possible to achieve, so designing as close to a perfect solution as possible is the next best goal. There are a number of design processes and concepts that have been shown to be extremely useful in producing an effective repository design.
The two key elements necessary are the team that undertakes the design and the specialized pieces of information that are needed to make the correct design decisions.
The design team itself can consist of one or more architects with the specific responsibility of producing the design. Regardless of the number of individuals in the design team, there is a clear set of roles and responsibilities that must be represented. These roles cover both the technical facets of the design as well as the business facets. The team is usually led by a technical architect who has the direct responsibility for the overall solution. The team is either populated with architects and representatives from the following areas or it includes contacts in the following areas who can provide feedback and direction as needed to the team without being full-time team members:
•P8 Content Manager architect technical role
This is the architect who has the ultimate responsibility for the overall repository and solution design. This role must be filled by a full-time member of the design staff who has expert knowledge of the P8 Content Manager product.
•Enterprise architect technical role
This is the architect who is responsible for overseeing the technical fit of the solution into the existing solution portfolio. This role must be filled by someone who has expert understanding of the current technology across the enterprise.
•Application architect technical role
This is the architect who has direct responsibility for the specific application or applications being addressed at this phase of the design and who is responsible for tracing the business requirements into the solution space.
•Enterprise security technical role
This is someone who has expert understanding of the security environments and models that are used in the enterprise infrastructure. The purpose of this role is to ensure that all existing security policies are followed and to provide support as needed for security requirements outside of the P8 Content Manager solution itself.
•Legal business role
This role must be filled by someone who has expert knowledge of the legal requirements of the business sphere in which the solution exists. They provide guidance about requirements and restrictions on the system that are imposed for legal, as opposed to business value, reasons.
•Knowledge worker business roles
These roles represent the directly affected business workers whose content and processes are being integrated into P8 Content Manager and who have the inherent and implicit knowledge of the business that is not usually captured in any other manner.
4.3 Repository naming standards
Prior to designing the repository, give initial thought to the conventions that will be used to standardize naming across all of the types of objects that will be in the repository. At this stage, concern yourself with standardization across the organizational and design elements of the repository, as opposed to the content objects that users will place into the repository. A well-thought-out naming scheme avoids many potential problem areas at the beginning of the project, rather than discovering them during the lifetime of the repository. All objects that are created as organization and design objects need to be named as descriptively as possible. When there are hierarchical relationships between objects, it makes sense to capture the relationship in the naming standard as well. For example, Company XYZ has a site that is called Upper Bay. One of the virtual servers in that site is named Upper Bay-Accounting.
There are a number of standard considerations that apply to every naming decision. We present these first, followed by naming conventions for specific object types that have proven useful.
Recommendations: Put the naming standards in place prior to the creation of any design or organizational objects in a repository and ensure that they are adhered to throughout the lifetime of the repository. Ensure that names are as descriptive as possible with consideration for the consumer of the label.
4.3.1 Display name
The display name is the label that is displayed in user interface components and seen by the users of the system. These names are intended for human consumption and must have the proper white space and punctuation to make them the most meaningful to their intended audience. Display names are localizable, so consider whether you will localize the display names for your custom classes and properties.
4.3.2 Symbolic name
In contrast to display names, symbolic names are generally used programmatically to refer to particular objects. For that reason, symbolic names of various types must usually be unique within the type. For example, all symbolic names of classes within an Object Store must be unique. Symbolic names are not localizable.
Because of the uniqueness constraints for symbolic names, there is a convention for naming prefixes used by Content Manager and other ECM products. The purpose of this convention is to minimize the chances of a collision between names used in future product releases and your private symbolic names.
•Cm: Reserved for Content Platform Engine
•Dita: Reserved for Content Platform Engine for the FileNet P8 DITA Add-on
•RM: Reserved for IBM Enterprise Records
•EDM: Reserved for IBM eDiscovery Manager
•EDISC: Reserved for IBM eDiscovery Manager
•ICC: Reserved for IBM Content Collector
•Clb: Reserved for Social Collaboration
•CmAcm: Reserved for IBM Case Manager
•CmMcs: Reserved for Master Content Server
Check the product documentation for the latest list of reserved prefixes. You need to be aware that additional prefixes might be used by third parties who provide components for use with Content Platform Engine.
Recommendations: Use a unique prefix for symbolic names that you create. The choice of prefix is yours. It is typical to use something short but meaningful.
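As an illustration of the difference between the two kinds of names, the following Java fragment is a minimal sketch that uses the Content Engine Java API to read both names from a class description. The class name XyzClaim is a hypothetical custom class, and os is assumed to be an ObjectStore reference that the application has already fetched.

import com.filenet.api.core.Factory;
import com.filenet.api.core.ObjectStore;
import com.filenet.api.meta.ClassDescription;

// Inspect the display name and symbolic name of a class definition.
// "XyzClaim" is a hypothetical custom class symbolic name.
public static void printClassNames(ObjectStore os) {
    ClassDescription cd = Factory.ClassDescription.fetchInstance(os, "XyzClaim", null);
    System.out.println("Display name:  " + cd.get_DisplayName());   // localizable, for users
    System.out.println("Symbolic name: " + cd.get_SymbolicName());  // unique, for programs
}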
4.3.3 Uniqueness
Object names across the entire design generally have a requirement for uniqueness. Unique naming tracks with appropriate naming, that is, when proper consideration is given to naming objects, the uniqueness typically follows. Problems can arise when overly abstract names are given to an object where the same name more appropriately maps at a higher level in the hierarchy.
For example, a name such as Email implies use high in the naming hierarchy, whereas a name such as AgentCustomerEmail is a better choice at a lower level.
4.3.4 Taxonomy
Taxonomy is the establishment of categorization based on naming. Having a specific pattern that is applied to names, with well-understood definitions for each name part, facilitates an organized taxonomy. Giving initial thought to taxonomy and developing it before the actual naming simplifies the naming task and accentuates the self-descriptiveness of the names.
The best known example of a taxonomy is the scientific classification of living organisms, designating a name pattern that contains elements, such as species, family, phylum, and others.
The collection of metadata must achieve a balance between the benefit to the user and the effort required to generate it. Consider the following questions when collecting metadata:
•What problems are you trying to solve?
•What kind of content and metadata do you need to solve the problem?
•How will you collect the data?
The primary purpose of standards and controls for taxonomies and metadata is consistency across the organization in how content objects are described, which facilitates search, retrieval, and overall object control.
4.3.5 Consistency
Consistency is important: as the base of people who use the names broadens, consistent naming leads to better understanding and less confusion as the system grows in scope and age. Establishing good consistency standards is beneficial in the end, and consistency follows naturally from the complete application of the ideas already presented.
4.3.6 Object stores
Object stores are the highest point of naming for a given repository. Make sure that the name of an object store indicates the part of the solution that it represents.
For example, Company XYZ with a single object store can name its object store XYZ Enterprise. Another company, ZYX, has two object stores and can name them ZYX Operations and ZYX Support: one repository for all content pertaining directly to the business and internal administration of ZYX, and one repository for the support organization's content.
4.3.7 Storage areas
Storage areas are where the content is stored. There are various types of storage areas, including file system, database, and fixed content. Each type can represent a number of varieties, each with specific characteristics. Naming the storage areas in a manner that encapsulates the type and characteristics of the storage area is useful because the storage areas are accessed and applied throughout the lifetime of the system.
For example, Company XYZ has three storage areas in use for the Company XYZ repository. The first storage area is a file store hosted on the network accessible protected storage segment of a storage area network (SAN) by a Network File System (NFS) mount. The second storage area is a fixed storage area that links to the company’s image management system. The third storage area is a file store on a nearby set of inexpensive disk drives, also through an NFS mount. These three storage areas are named NFS-RAID, IMAGES, and NFS-CHEAP.
4.3.8 Document, custom object, and folder classes
When naming these objects, consider the inheritance hierarchy to both clarify the lineage of a specific object as well as to distinguish two leaf objects that might be the same type at first glance but have totally different lineages.
For example, memos from Engineering are classified under the XyzOpsDevCommunicate document class. Memos from Human Resources are classified under the XyzSupHrCommunicate document class.
Recommendations: Prefix the document class with the common prefix for the system, department name, and the purpose so that it is easy to find out the purpose of the class by looking at the name.
4.3.9 Property templates
There are special considerations for property templates because a certain property template can be widely used across many different objects. The names chosen for the property templates need to be self-descriptive of both the characteristics of the property template as well as the intended use of the template.
An example of three property templates is XyzAgencyName, XyzFirstName, and XyzLastName. These templates have multiple usages across different objects in a generic way.
Recommendations: Use the Category field in the property template for categorizing the properties.
Property template names need to follow the standard naming scheme and topology established at the enterprise level.
Property template names need to be generic enough that they can be used in a number of design classes but not so generic that they cannot be given a meaningful name.
4.3.10 Choice lists
Choice lists are a way to restrict the possible values of an integer or string property. They limit the entries that the user can fill in for a property template. Choice lists need descriptive, informative names.
Recommendations: If you create applications that use IBM Content Navigator, consider using External Data Services (EDS) to control user interface elements instead of choice lists. EDS is more flexible for application developers.
4.4 Populating a repository
In the solution domain, there are two major containers for data: the global configuration database (GCD) and the repositories, which are illustrated in Figure 4-1 on page 92. There is only a single GCD that encapsulates all of the configuration of the domain and at least one, but possibly many, repositories in the system.
Figure 4-1 Storage objects in a domain
A repository contains a single object store and potentially one or more storage areas, as shown in Figure 4-2 on page 93. An object store contains definitions, configuration information, and metadata for the content that is stored in the repository. The storage areas store the actual content.
Figure 4-2 Repository contents
There are four major stages involved in the population of a repository: three design stages and one production stage. The three design stages are organizational design, described in 4.5, “Repository organizational objects” on page 95; repository design, described in 4.7, “Repository design objects” on page 98; and repository content design, described in 4.8, “Repository content objects” on page 117. The final stage in repository population is the actual test or production usage of the repository. The following sections describe the design stages and their relationships.
During all of these design phases, there are certain commonalities that are universally, or nearly universally, utilized in the objects of the design.
4.4.1 Generic object system properties
Generic object system properties are properties that are found in the lowest level of the object-oriented hierarchy from which all other objects are extended. All of the system properties are available to all the objects and do not need to be replicated in any custom properties. Therefore, you must understand what is available in order to leverage these properties where applicable.
Here, we list several of the system properties that have potential application in other places of the design:
•Class description
The class description contains the description of the class from which this object is instantiated.
•Display name
This label is intended for display to the user.
•Descriptive text
This text describes the purpose and meaning intended for this object.
•Is hidden
This is a Boolean value that indicates whether the object is hidden in its current context. This property can affect the user interface. Hidden objects or classes are generally of interest to application developers for special purposes but not of interest to users.
•Symbolic name
This label is used for internal, programmatic references to the object.
•ID
This immutable global unique identifier (GUID) can be used to reference a specific object throughout its lifetime.
•Is content-based retrieval (CBR)-enabled
This is a Boolean value that indicates if content-based retrieval is enabled in the current context of the object.
In addition to the set of properties just covered that applies to all objects in the system, there is a set of properties that appears in many of the objects that is important to mention at this level. The following properties are present in most objects:
•Auditing enabled
This property indicates whether the object has its auditing enabled. This is a switch that enables and disables all audit logging for this specific object and its scope. Many events can be audited and controlled at a more granular level.
•Permissions
This property contains the access control list (ACL) for the object. An ACL consists of a number of access control entries (ACEs). A single ACE contains either an individual or group from the directory and the authorizations that entity has in relation to the object. See Chapter 5, “Security” on page 151 for more details.
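The following Java sketch shows how an application can read several of these system properties, including the permissions list, from an existing document through the Content Engine Java API. It is illustrative only; the object store reference and document Id are assumed to come from elsewhere in the application.

import java.util.Iterator;
import com.filenet.api.collection.AccessPermissionList;
import com.filenet.api.core.Document;
import com.filenet.api.core.Factory;
import com.filenet.api.core.ObjectStore;
import com.filenet.api.security.AccessPermission;
import com.filenet.api.util.Id;

// Read several generic system properties of an existing document,
// including its permissions (ACL).
public static void printSystemProperties(ObjectStore os, Id docId) {
    Document doc = Factory.Document.fetchInstance(os, docId, null);

    System.out.println("Id:            " + doc.get_Id());
    System.out.println("Symbolic name: " + doc.get_ClassDescription().get_SymbolicName());
    System.out.println("Name:          " + doc.get_Name());

    AccessPermissionList acl = doc.get_Permissions();
    for (Iterator<?> it = acl.iterator(); it.hasNext();) {
        AccessPermission ace = (AccessPermission) it.next();
        System.out.println("  ACE: " + ace.get_GranteeName()
                + " mask=" + ace.get_AccessMask());
    }
}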
4.4.2 Creating design elements
There are many design element types, such as document classes, custom object classes, folder classes, property templates, and GCD objects, that must be used in cooperation to achieve the best design. Each of these elements exists for a specific purpose and encapsulates a specific set of information. Because solutions are composed together from these elements, complex relationships can be created between them that must be maintained for system integrity and consistency. Most of the complexities of the relationships are handled by the underlying engine and removed from the concerns of administrators and application designers.
Modifying and removing design elements can be a tricky procedure, given the complex relationships that are possible. This is especially noticeable when attempting to remove a design element that might be used or referenced from a number of other design elements at differing levels of the design. It is always best to be as thorough as possible in the system design before actually creating the elements in the Content Manager. This thoroughness avoids most of these difficult situations.
Recommendations: Complete the design as much as possible prior to actually creating the design elements in the system.
4.5 Repository organizational objects
The solution space is divided into a number of logical divisions. Each division serves a specific purpose. The composition of all of these divisions provides a powerful solution that allows the requirements of any implementation to be clearly and succinctly decomposed.
Figure 4-3 on page 96 shows the logical relationships among the decomposition elements, domain, sites, virtual servers, server instances, and object stores.
Figure 4-3 Repository organizational objects
All of the logical elements comprising a domain can be administered and managed through Content Platform Engine Administration tools. Naturally, all of the elements can also be manipulated using the APIs, but it is generally simpler to use the administration tools for most situations.
All of these repository organizational objects were discussed in Chapter 3, “System architecture” on page 37.
4.6 Global configuration database (GCD)
All of the repository organizational objects are contained in the GCD. The GCD is the single container that encapsulates all of the configuration information for a domain. The GCD is the logical representation of the domain, and it contains the subsystem configuration, which consists of the other organizational elements for sites, virtual servers, and server instances. In addition, it contains the specific configuration information for each object store’s database space, directory configuration, text search server information, trace log configuration information, and other information that is accessible to all the object stores inside that domain.
Figure 4-4 gives a visual representation of the GCD layout.
Figure 4-4 Global configuration database contents
4.7 Repository design objects
There are a number of elements that constitute a repository design. Each of these elements encapsulates a specific view, purpose, and role in the complete design. The division of responsibility between some of these elements is clear while others are highly dependent on the specific environment and application. There are a large number of design decisions that must be made to achieve a final design that is both efficient and scalable.
4.7.1 Object stores
In a similar manner to how a domain encapsulates an entire repository solution, an object store is the basic component of a repository that contains not only all of the content that has been committed to P8 Content Manager, but also all of the additional information and functional objects associated with that content. The number, type, and location of object stores that are needed for an organization are important design considerations (see 4.10, “Considerations for multiple object stores” on page 133 for additional details). Any object store is associated with a specific site and the storage areas associated with that site. The object store contains definitions for various classes that structure metadata, as well as actual metadata objects along with their connections to the content where applicable. An object store can contain all of the content for the entire enterprise, or it can be segmented from the overall enterprise design and assigned a specific subset of the overall problem. Regardless of the purpose, the object store contains the entirety of the definitions required for use by users and any applications that will access it.
Figure 4-5 on page 99 shows a graphical representation of the scope of an object store.
Figure 4-5 Object store contents
Like all of the entities that make up a repository, an object store is conceptually an object with specific characteristics. Object stores are created through the use of administration tools. The best practice is to utilize the object store creation wizard, which simplifies the interface and ensures that all settings necessary at creation time are both set and synchronized where applicable.
Recommendations: If your design calls for more than a single object store, create a metastore that can contain all of the design objects that are common across all of the stores and replicate this as changes are made. If a meta object store is used, do not roll this store out into production, because it is strictly a development object store.
When creating an object store, always set the object store administrator to a valid administrator logon and grant the administrator all permissions.
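Although object stores are created with the administration tools, applications reach them through the Content Engine Java API. The following minimal sketch connects to the Content Platform Engine and fetches an object store by name; the connection URI, credentials, and object store name are placeholders for your environment.

import javax.security.auth.Subject;
import com.filenet.api.core.Connection;
import com.filenet.api.core.Domain;
import com.filenet.api.core.Factory;
import com.filenet.api.core.ObjectStore;
import com.filenet.api.util.UserContext;

// Connect to the Content Platform Engine and fetch an object store by name.
// The URI, credentials, and object store name are placeholder values.
public static void listObjectStore() {
    Connection conn = Factory.Connection.getConnection(
            "http://ceserver:9080/wsi/FNCEWS40MTOM/");
    Subject subject = UserContext.createSubject(conn, "ceadmin", "password", null);
    UserContext.get().pushSubject(subject);
    try {
        Domain domain = Factory.Domain.fetchInstance(conn, null, null);
        ObjectStore os = Factory.ObjectStore.fetchInstance(domain, "XYZ Enterprise", null);
        System.out.println("Connected to object store: " + os.get_DisplayName());
    } finally {
        UserContext.get().popSubject();
    }
}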
4.7.2 Storage areas
Storage areas for repositories can be hosted on a wide range of storage devices and mediums, from SCSI drives, to fiber-attached SAN devices, to secure immutable storage units, as well as others.
In addition to storage media type, there are a number of logical storage types, such as database stores, file stores, fixed content stores, and cached content temporary stores. Each of these logical types has implications for performance and functionality that must be considered when determining specifically where to store content.
Storage areas have specific features to optimize the storage for space and other enterprise requirements.
Content compression
Content that is uploaded to the storage area is compressed if content compression is enabled for the storage area. Only content that can be compressed beyond the compression threshold is compressed. Content compression uses blocked-compression technology to divide the uploaded content into distinct blocks, which are compressed in memory before being written to disk. If encryption is also enabled, each block is first compressed and then encrypted in memory before being written to disk. Enabling content compression on a storage area does not compress the existing content.
Encryption of content
The content stored in the storage area can be encrypted using the storage area configuration. Content encryption helps protect the confidentiality of the content if it is accessed outside of Content Manager. A new key is generated every time that the encryption is enabled on the storage area. The most recent key is used to encrypt the new content. These encryption keys are stored in the object store database in a secured way.
Enabling content encryption on a storage area will not encrypt the existing content. In content replication, the external repository receives the non-encrypted content. Also, the decrypted content will be submitted for indexing purposes.
By moving content from one storage area to another, you can enforce content encryption, re-encrypt with the latest key, or store non-encrypted content.
Suppression of duplicate content elements
The suppression of duplicate content can reduce the storage space that is required to store content. Content Platform Engine suppresses duplicate content by checking the existing content before adding new content to the storage area. If identical content exists, the new content is stored as a reference to the existing content. If no identical content exists, the new content is added in the normal manner.
Duplicate suppression is not available for fixed content devices, but those devices usually offer their own native duplicate suppression features.
Suppressing duplicate content might decrease storage space requirements but it also slightly increases processing time. To help you determine whether your space savings make the trade-off worthwhile, Content Platform Engine provides storage statistics for each storage area.
Database store
There is a single database store per object store; the database store is part of the same database as the object store itself. The database store can be used to store content, but the content is stored as a database binary large object (BLOB). Depending on the size of the content, this is not an effective use of the database and can seriously affect database performance, especially for large content.
Recommendations: Using the database store can help with operational efficiency in some cases, but it almost always represents a performance cost over file storage.
File store
There can be multiple file stores per object store, each one a separate directory structure on the server. The file store can be on local storage media or can be a mount point for remote, or networked, storage media. File stores are the typical location for content, with file stores on different media types used for different content where appropriate.
Recommendations: The media type and cost must be clearly understood from the file store name to eliminate content storage area errors.
Fixed content store
There can be multiple fixed content stores per object store. This content type is designed to provide access to other content storage systems, such as an image repository, while leveraging the power of the Content Manager metadata management system. Fixed content stores can represent a physical storage appliance or can represent federated content.
Content cache store
A content cache store is a special store that allows local caching of content that is permanently stored in another storage area. A content cache is typically used for storage areas in a different geographic site. It allows content that is frequently accessed, or in an active state of processing, to be available locally without degrading the network connection to the remote content, which increases the performance of these local operations. The cached content store provides a performance enhancement for remote content access but does not provide any type of high availability solution for the content. See Chapter 7, “Business continuity” on page 217 for more details about high availability solutions.
Although it is typically thought of as part of the configuration of a distributed environment, a content cache store can also give performance benefits locally if it is faster than the permanent storage for the content.
Recommendations: The Content Platform Engine guarantees that application access to cached content always follows the normal security controls, and it also guarantees that applications will not access stale content. Therefore, you can use a content cache area wherever it provides a benefit. You do not have to worry about compromising security or data integrity.
4.7.3 Document classes
Document classes are the design objects that, when instantiated, will contain most of the business content of the system. Most of the detailed design process is concerned with developing the correct set and hierarchy of document classes.
Document classes are inherited from a common top-level document class that contains all of the basic properties that the system needs. Although it is not technically necessary, it can be useful to create an immediate child subclass of document for enterprise-wide use. A top-level subclass for the enterprise can contain all of the metadata items that are the same across all document objects in the enterprise, either by requirement or policy.
The first level of document class design is concerned with the common enterprise objects, as opposed to specific application objects. The result of this first round of design is a hierarchical document class tree that contains all of the common enterprise document classes, which specific applications can leverage as they are brought into the Content Manager solution. A reasonable number of properties need to be defined in each class. It is easier to administer and expand a design where each document class is concerned with a specific aspect of the design. The resultant tree is typically neither extremely narrow nor extremely wide. A narrow tree usually indicates that the class design has focused too specifically on one aspect and has been too exclusive. A wide tree usually indicates that too many aspects of the design are encapsulated at one level.
Another test of the resultant design is to see how easily various changes can be made. If there are properties that have historically changed somewhat frequently, or properties that are projected to change, determine what changes need to be made to the design to accommodate them. Ideally, a change is addressed by a change to a single class, which is a good indication that you have the proper level of design encapsulation. The types of changes to consider are property redefinitions, property additions, property deletions, class additions, class modifications, class deletions, security updates, functional changes, and organizational changes.
Recommendations: Avoid making many subclasses for a custom document class. Changes to a higher level custom document class will be propagated to all its subclasses.
Adopting an enterprise perspective allows the document class designs to facilitate greater information sharing and collaboration across the enterprise. In addition to assisting in breaking down information silos, this makes the overall design much more usable as well. You must always take usability into consideration during all the design phases. The use of SMEs at this phase can greatly assist you in meeting the unspoken requirements and usability goals of users.
As a key design object in the system, there are lots of additional components on which the document classes are dependent. Most of these dependencies are covered in the specific sections for the dependent elements. Probably the most important dependency is the usage of the property templates in the class designs. This dependency underscores the need to be clear and concise in the property template definitions and consistent with naming and topology across the entire design.
Finally, try to avoid designing for the current organization without being modular enough to accommodate change. Avoid carrying over limitations of the current system that might have been design flaws in the current system or limitations of the tools that are used to support it. Take into account any current or future processes in which the content is utilized. That is, always consider business process automation in the design. Remember that there will always be additional applications and functional areas that the system will need to support that are not currently identified or even identifiable.
There are three focus areas that the document class design typically follows: design based on organization, design based on content, and design based on function. Although these are the major design approaches that are used, variations on these themes as well as modifications and combinations of these approaches are also successfully used. The correct approach to use is highly dependent on the specific details of your corporation and the application that is supported by P8 Content Manager:
•Design based on organization
Design based on organization starts with the first level of decomposition after the enterprise root document class, which is groupings around how the corporation is organized. This can be reflected in line of business (LOB) objects, support and business value objects, or any other high-level structure that represents your organization. The subsequent layers of the hierarchy then follow the organization down into smaller and smaller groupings. Each level can also have classes that capture content-specific aspects where the document content that they represent has consistency across the entire organization from that root down the hierarchy. Eventually, the lowest level represents document content classes that correspond to specific functional areas or specific content.
This facilitates future changes that occur at the organizational level by capturing these aspects as high in the tree as feasible and letting these properties and attributes be inherited down the hierarchy.
•Design based on content
Design based on content starts with the first level of decomposition after the enterprise root document class. The first level includes high-level abstractions of the content types that will be stored in P8 Content Manager. This often follows record plans where they have been established. Lower levels of the hierarchy allow the capture of more and more concrete aspects of the content types until the resultant leaf nodes are declared.
This approach facilitates communication across the enterprise, because all of the properties of the document classes will be the same regardless of where in the organization they are used. You do not capture the organizational aspects of the corporation. This design approach can have significant political ramifications dependent on the culture of your corporation.
•Design based on function
Basing design on function starts with the first level of decomposition after the enterprise root document class, consisting of abstractions of the functions that are carried out in the corporation without regard to the organizational structure. As the document class hierarchy extends down, more and more concrete functional aspects are captured, as well as content-specific aspects for the content types that will be used.
This approach captures many of the functional aspects of the corporation, which typically mirror the organizational structure, but in a more abstract perspective of focusing solely on the function, business value, and processes for which the content is used. This approach is sometimes viewed as a blending of the purely organizational approach and the purely content approach.
Document classes are created through a wizard interface in the Content Platform Engine administration tools in the metastore. After the metadata of the metastore is finalized, all the metadata is exported and imported to the other development, test, and production systems by using the deployment tool, as described in Chapter 9, “Deployment” on page 271.
Recommendations: There needs to be a single, top-level document class that extends the base document class and from which all other document classes will be derived.
All property templates, choice lists, storage policies, and storage areas need to be created prior to creating any document classes that utilize them.
Each document class encapsulates a single design aspect.
Create all the metadata in the meta object store and export and import the metadata from the metastore by using IBM FileNet P8 Deployment Manager as described in Chapter 9, “Deployment” on page 271.
Never skip the step of designing high-level abstract objects that are for aspect encapsulation and that most likely are never instantiated.
Document class characteristics
Document classes have the following characteristics:
•Have metadata
•Are containable
•Are versionable by both content and metadata
•Hold content
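The following Java sketch illustrates how an application instantiates a document class and sets property values at check-in time. The class and property names (XyzOpsDevCommunicate, XyzFirstName, XyzLastName) are hypothetical examples that follow the naming conventions described earlier, and os is an ObjectStore reference that the application has already fetched.

import com.filenet.api.constants.AutoClassify;
import com.filenet.api.constants.CheckinType;
import com.filenet.api.constants.RefreshMode;
import com.filenet.api.core.Document;
import com.filenet.api.core.Factory;
import com.filenet.api.core.ObjectStore;

// Create and check in a document of a hypothetical custom document class,
// setting custom properties that are defined by property templates.
public static Document createCommunication(ObjectStore os) {
    Document doc = Factory.Document.createInstance(os, "XyzOpsDevCommunicate");
    doc.getProperties().putValue("DocumentTitle", "Engineering memo 2014-07");
    doc.getProperties().putValue("XyzFirstName", "Jane");
    doc.getProperties().putValue("XyzLastName", "Doe");
    doc.checkin(AutoClassify.DO_NOT_AUTO_CLASSIFY, CheckinType.MAJOR_VERSION);
    doc.save(RefreshMode.REFRESH);
    return doc;
}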
4.7.4 Folder classes
Folder class objects are the design objects that, when instantiated, provide aggregation or containment for other objects. The characteristics and usage of folder objects must not be mistaken for, or confused with, the foldering features and concepts provided by a file system. P8 Content Manager folder classes provide containment by reference, which allows any specific object to be contained in multiple folders at the same time. Most of the same considerations that are given to creating document object classes (4.7.3, “Document classes” on page 102) also apply to designing folder classes:
•Single top-level class that all others are derived from.
•Single design aspect captured per class.
•Design with changes in mind.
•Design in modularity.
•Do not repeat any mistakes that the current system or processes have.
A key design decision that needs to be made is whether the main access mechanism for content follows the search paradigm (represented in Figure 4-6) or the browse paradigm (represented in Figure 4-7 on page 108). Both paradigms offer their own strengths and weaknesses, and this decision directly affects how folder classes will be used and instantiated.
Search paradigm
The model for the search paradigm is represented in Figure 4-6 as a dialog box requesting some information and returning a set of content that meets the criteria specified in the dialog. The best analogy is accessing a database: information is retrieved from a database by formulating a query, which returns a set of data elements that matches the criteria in the query.
Figure 4-6 Searching for content
The search paradigm is powerful, because it does not rely on the user needing to know where the content is in the system or the name of the object that contains the content. Searching also returns a set of objects as an atomic operation; the maximum size of this set can be controlled as well. This can include objects that are in diverse places in the repository. Effective use of the search paradigm requires the selection of meaningful distinguishing properties for the objects that have meaning to users. It also requires meaningful document classes that are understood by users as well.
The search paradigm can be fronted with various methods of compiling the search criteria and usually is best served by designing searches or through custom interfaces. It is usually a faster and more reliable method of finding content than is offered by the browse paradigm.
Recommendations: Use a search paradigm to search for documents with as much metadata information as possible to get a small result set. As much as possible, avoid using wildcard searches, which give you a large result set.
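The following Java sketch shows the search paradigm expressed through the Content Engine Java API query classes. The class and property names in the SELECT statement are hypothetical; restricting the WHERE clause to specific property values keeps the result set small, as recommended.

import java.util.Iterator;
import com.filenet.api.collection.RepositoryRowSet;
import com.filenet.api.core.ObjectStore;
import com.filenet.api.query.RepositoryRow;
import com.filenet.api.query.SearchSQL;
import com.filenet.api.query.SearchScope;

// Property-based search against a hypothetical custom document class.
public static void findMemos(ObjectStore os) {
    SearchSQL sql = new SearchSQL(
        "SELECT Id, DocumentTitle FROM XyzOpsDevCommunicate"
        + " WHERE XyzLastName = 'Doe' AND XyzAgencyName = 'Upper Bay'");
    SearchScope scope = new SearchScope(os);
    RepositoryRowSet rows = scope.fetchRows(sql, 50, null, Boolean.TRUE); // page size 50
    for (Iterator<?> it = rows.iterator(); it.hasNext();) {
        RepositoryRow row = (RepositoryRow) it.next();
        System.out.println(row.getProperties().getStringValue("DocumentTitle"));
    }
}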
Browse paradigm
The model for the browse paradigm is represented in Figure 4-7 as a typical file system structure. There is some meaningful relationship between the sets of folders that leads the user to sets of content in an understandable way. The best analogy is a file system tree structure. Although the analogy presented to help understand the browse paradigm is a file system structure, a file system folder is not the same as a P8 Content Manager folder, which supports multiple filed locations.
Figure 4-7 Browsing for content
The browse paradigm relies on the users who add the content to be thoughtful and knowledgeable in the manner in which the content is filed. This potentially includes filing the same content object in multiple folders. There is also a requirement that the name of the content object has a meaning in its context that is understood by users.
The browse paradigm can increase the time that it takes for the system to search for content, but it is well suited to users who all understand the basic concepts of foldering and are used to using foldering for file system access. The browse paradigm typically takes longer for users to find content than the search paradigm, and it requires users to have inherent knowledge to be able to reliably find content.
Recommendations: In most cases, the search paradigm offers a much better model for performance and maintenance. Avoid too many layers of too many folders (keep the total number to tens of folders, not hundreds), which can impact retrieval performance. There needs to be a single, top-level folder class that extends the base folder class and from which all other folder classes will be derived.
All property templates and choice lists must be created prior to creating any folder classes that utilize them. Create all the metadata in the metastore and export and import the metadata from the metastore using the deployment tool.
When using a browse paradigm, make sure to file the documents in a meaningful foldering hierarchy. Each folder class encapsulates a single design aspect.
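The following Java sketch illustrates containment by reference: an existing document is filed into a folder, and the same document could be filed into additional folders in the same way. The folder path and containment name are placeholders.

import com.filenet.api.constants.AutoUniqueName;
import com.filenet.api.constants.DefineSecurityParentage;
import com.filenet.api.constants.RefreshMode;
import com.filenet.api.core.Document;
import com.filenet.api.core.Factory;
import com.filenet.api.core.Folder;
import com.filenet.api.core.ObjectStore;
import com.filenet.api.core.ReferentialContainmentRelationship;

// File an existing document into a folder by reference. The folder path
// and containment name are placeholder values.
public static void fileIntoFolder(ObjectStore os, Document doc) {
    Folder folder = Factory.Folder.fetchInstance(os, "/Claims/2014/UpperBay", null);
    ReferentialContainmentRelationship rcr = folder.file(
            doc,
            AutoUniqueName.AUTO_UNIQUE_NAME,
            "Engineering memo 2014-07",
            DefineSecurityParentage.DO_NOT_DEFINE_SECURITY_PARENTAGE);
    rcr.save(RefreshMode.NO_REFRESH);
}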
Folder class characteristics
Folder classes have the following characteristics:
•Have metadata
•Are containable
•Are not versionable
•Are not content
•Are containers
4.7.5 Custom object classes
Custom object classes are design objects that contain metadata without content and provide no containment. They are designed to be versatile general-purpose objects that can be subclassed to perform a variety of functions, such as security proxies, or configuration objects for workflow processing. Custom objects are database objects that do not have any content.
Most of the same considerations that are given to creating document object classes (4.7.3, “Document classes” on page 102) also apply to designing custom object classes:
•Single top-level class from which all others are derived.
•Single design aspect captured per class.
•Design with changes in mind.
•Design in modularity.
•Do not repeat any mistakes that the current system or processes might have.
Recommendations: There must be a single, top-level custom object class that extends the base custom object class and from which all other custom object classes will be derived.
All property templates and choice lists need to be created prior to creating any custom object classes that use them. Each custom object class encapsulates a single design aspect.
Custom object class characteristics
Custom object classes have the following characteristics:
•Have metadata
•Are containable
•Are not versionable
•Hold no content
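The following Java sketch creates an instance of a hypothetical custom object subclass that serves as a security proxy. Because custom objects carry no content, only property values are set before the object is saved; the class and property names are illustrative only.

import com.filenet.api.constants.RefreshMode;
import com.filenet.api.core.CustomObject;
import com.filenet.api.core.Factory;
import com.filenet.api.core.ObjectStore;

// Create an instance of a hypothetical custom object subclass that acts
// as a security proxy. Custom objects carry metadata but no content.
public static CustomObject createSecurityProxy(ObjectStore os) {
    CustomObject proxy = Factory.CustomObject.createInstance(os, "XyzSecurityProxy");
    proxy.getProperties().putValue("XyzAgencyName", "Upper Bay");
    proxy.save(RefreshMode.REFRESH);
    return proxy;
}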
4.7.6 Custom root classes
Custom object subclasses are essentially collections of properties. The disadvantage of custom object class instances is that they are all stored in the Generic table in the object store database. This can cause performance degradation if instances of several disjoint custom object subclasses exist in the Generic table: querying for instances of one subclass might be slowed by the presence of a large number of instances of other subclasses. It is also not always practical to create an index on the Generic table, because many of the columns have null values where those columns relate to other classes.
To overcome the issues with custom objects, users can use the custom root classes. Every immediate subclass of the custom abstract root class has a separate table in the database. The table name will be generated from the symbolic name of the custom root class.
Three kinds of custom root classes can be created: abstract persistable, abstract queue entry, and abstract sequential. All these base custom root classes are abstract and cannot be instantiated directly. Also, Content Manager does not let users update the properties of these classes. Users create subclasses from these abstract classes in order to use them in the application. It is from the immediate subclasses that the tables are created. Any instances of subclasses of the custom root classes will be saved in the same table as the custom root class table:
•Abstract persistable
It is similar to a custom object class, which is a collection of properties without any content associated with it. The instances of this class cannot be filed into any folder. The immediate subclasses of abstract persistable are each saved in a different table.
•Abstract queue entry
The subclasses of abstract queue entry are intended for the queues managed by the sweep framework. This class has additional properties that are required for the sweep-based queue operations. The abstract queue entry classes will follow the security model for queue item and replication. Access is defined by the default instance permissions. There is no owner or permission property on these instances.
•Abstract sequential
Abstract sequential is for the external applications’ queue and log processing. It provides a single increasing sequence number property that can be used to process the entries in the order in which the transactions were created.
Important: Two subclasses created from a custom root class are disjoint because they are stored in separate tables. Deleting the class definition for a custom root class drops the associated tables.
4.7.7 Property templates
Property templates are used throughout the design as established containers for properties. A property template contains a name, a property data type, and a set of attributes. This enables the definition of common properties to occur once and be utilized throughout the design in a uniform manner. Properties, such as FirstName, LastName, and PolicyNumber, are typical generic property templates in a design.
There are two types of properties: the system properties that come preinstalled in P8 Content Manager and custom properties that you create for your specific installation. All of these properties can be utilized in any definitions as you think appropriate. Typically, there is a rich set of system properties associated with the base classes. The system properties that are, by default, associated with a class must be examined to both prevent duplication of information and to understand what is available to be leveraged by your class definitions.
Property templates must always have a data type associated with them. The data type can have a cardinality of either single value or multi-value for all data types.
Recommendations: Property templates need to follow a standardized naming scheme and topology established at the enterprise level. Property templates need to be generic enough that they can be used in a number of design classes, but not so generic that they cannot be given a meaningful name.
Avoid the creation of property templates that are named in such a manner that it might be confusing to know which template to use. Avoid the creation of property templates that encapsulate the same informational data but have distinct names.
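Property templates can also be created programmatically. The following Java sketch creates a single-valued string property template with a localized display name; the symbolic name follows the hypothetical Xyz prefix convention used in the earlier examples. After the template is saved, it can be added to class definitions with the administration tools.

import com.filenet.api.admin.LocalizedString;
import com.filenet.api.admin.PropertyTemplateString;
import com.filenet.api.collection.LocalizedStringList;
import com.filenet.api.constants.Cardinality;
import com.filenet.api.constants.RefreshMode;
import com.filenet.api.core.Factory;
import com.filenet.api.core.ObjectStore;

// Create a single-valued string property template with a localized
// display name. The names follow the hypothetical Xyz prefix convention.
public static void createFirstNameTemplate(ObjectStore os) {
    PropertyTemplateString pt = Factory.PropertyTemplateString.createInstance(os);
    pt.set_SymbolicName("XyzFirstName");
    pt.set_Cardinality(Cardinality.SINGLE);

    LocalizedString ls = Factory.LocalizedString.createInstance();
    ls.set_LocalizedText("First Name");
    ls.set_LocaleName("en-us");
    LocalizedStringList names = Factory.LocalizedString.createList();
    names.add(ls);
    pt.set_DisplayNames(names);

    pt.save(RefreshMode.NO_REFRESH);
}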
4.7.8 Choice lists
Choice lists limit users from entering free-form text or integer data into a property. They protect against typing mistakes and other human errors. A choice list is not always appropriate: it requires a well-understood, mostly static set of values that the property can take.
Choice lists can consist of levels of groupings of values to make it easier for the correct value to be selected. In multi-value properties, the user can select multiple entries from the choice list.
Recommendations: Group choice list elements logically with the user experience in mind. Limit the number of elements in each group to a small enough set that it can be easily displayed and scanned.
Avoid assigning the same value to more than one item in a choice list. Do not use choice lists for properties where the values are expected to change frequently.
4.7.9 Annotations
Annotations allow users to link additional information or comments to other objects, such as documents. These annotations can be in any format, such as text, audio, video, image, highlight, and sticky note. An annotation’s content does not have to be the same format as its parent document and can be published separately. Document annotations are uniquely associated with a single document version; they are not versioned or carried forward when the document version is updated and a new version is created.
You can modify and delete annotations independently of their annotated object. However, you cannot create versions of an annotation separately from the object with which it is associated. By design, the annotation will be deleted whenever its associated parent object is deleted. Annotations receive their default security from both the annotation’s class and the parent object. You can apply security to annotations that is different from the security applied to the parent.
The content of annotations is stored in a storage area, as defined by the default storage area for the annotation class. The storage area used by the annotation class needs to be appropriate for the type of content associated with the annotations. That is, large annotation content must not be stored in a database storage area.
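The following minimal sketch shows how an annotation might be created and attached to an existing document with the Content Engine Java API. It assumes an already connected ObjectStore named os and a fetched Document named doc; the annotation class, descriptive text, and content are illustrative.
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import com.filenet.api.collection.ContentElementList;
import com.filenet.api.constants.RefreshMode;
import com.filenet.api.core.Annotation;
import com.filenet.api.core.ContentTransfer;
import com.filenet.api.core.Document;
import com.filenet.api.core.Factory;
import com.filenet.api.core.ObjectStore;

public class AddAnnotation {
    // Attaches a simple text annotation to a document; the annotation content is stored
    // in the storage area configured for the annotation class.
    public static void annotate(ObjectStore os, Document doc, String note) {
        Annotation ann = Factory.Annotation.createInstance(os, null);   // default Annotation class
        ann.set_AnnotatedObject(doc);                                   // link to this document version
        ann.getProperties().putValue("DescriptiveText", "Reviewer comment");

        ContentTransfer ct = Factory.ContentTransfer.createInstance();
        ct.setCaptureSource(new ByteArrayInputStream(note.getBytes(StandardCharsets.UTF_8)));
        ct.set_RetrievalName("comment.txt");
        ContentElementList elements = Factory.ContentElement.createList();
        elements.add(ct);
        ann.set_ContentElements(elements);

        ann.save(RefreshMode.REFRESH);
    }
}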
4.7.10 Document lifecycles
Document lifecycles model the fact that a document passes through a number of states during its lifetime.
Figure 4-8 on page 114 shows a sample state diagram for the typical document lifecycle in the XYZ corporation.
Figure 4-8 Sample document lifecycle model
In this example, documents are in one of three states:
•Personal documents not being shared or collaborated on
•Workgroup documents that have a limited scope of sharing and are intended for collaboration
•Corporate records that have meaningful business value to the company
In the first two states, a document can be revised and remain in its current state, reach its end of life and be destroyed, or be promoted to a higher state. In the workgroup collaboration state, documents can also be processed in some automated way, such as through IBM Case Foundation (previously known as IBM FileNet Business Process Manager). In the final corporate document state, a document can also be demoted back to the workgroup for revisions and updates.
While the figure captures the states that a document can take and the transitions between them, it also illustrates how IBM FileNet Content Manager document lifecycles can be extremely useful. A document lifecycle defines the states in which a document can exist and can associate the document with a set of security templates that depend on the state that the document is in. This controls access to the document as it progresses from being a personal document, to a workgroup document, to a corporate document.
Document lifecycles are contained in two design classes: the lifecycle policy class and the lifecycle action class:
•Lifecycle policy class
The definition of the document’s states. The policy also identifies the lifecycle action that executes in response to the state changes.
•Lifecycle action class
Action that the system performs when a document moves from one state to another.
Document classes in the Content Platform Engine can have default lifecycle policies, and you can assign a default lifecycle policy to any new document class. When you create a document using a class with an associated lifecycle policy, the document uses that policy as its default. This can be overridden at creation time by assigning a different lifecycle policy to the document.
Recommendations: Assign lifecycle policies to a document class whenever possible, instead of assigning them to individual documents. This practice helps the operator select the correct policy by choosing the document class associated with the desired lifecycle policy. This practice also prevents problems that can occur if you need to delete a lifecycle policy.
4.7.11 Events and subscriptions
Content Platform Engine administration tools enable you to define events that extend the functionality of an object store. With events, you can configure objects to perform actions in response to specific activities that occur on objects defined on a Content Platform Engine server.
An event consists of an event action and a subscription. An event action describes the action to take place on an object. A subscription defines the object or class of objects to which the action applies, as well as which events trigger the action to occur. Subscriptions can be assigned to individual objects. However, it is more efficient if they are assigned to classes instead. Assigning subscriptions to classes ensures that a set of common objects is managed consistently. Assigning subscriptions to classes can also limit the number of subscriptions that run simultaneously, which can affect system performance.
Recommendations: Add properties to subclass event actions and subscriptions. Keep event actions short to ensure quick completion. This is especially true for synchronous subscriptions where the subscription processor waits for an event action to complete before moving on to subsequent processing.
Do not rely on priority to guarantee the order of execution for subscriptions. Ensure that you thoroughly test your events and subscriptions before implementing them.
Set up each event action with code stubs that specify each event trigger (Create, Update, Delete, CheckIn, CheckOut, File Event, Unfile Event, Mark for soft delete, and recover to restore from recycle bin), even if you do not define functions for every trigger. The subscription controls which of the triggers call an action. You need to prepare the action to handle all triggers gracefully.
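As an illustration of the code-stub approach, the following sketch shows the skeleton of a custom event action handler written against the Content Engine Java API. The handler class name and the per-trigger logic are illustrative; deploy and register the handler as described in the product documentation.
import com.filenet.api.engine.EventActionHandler;
import com.filenet.api.events.CreationEvent;
import com.filenet.api.events.DeletionEvent;
import com.filenet.api.events.ObjectChangeEvent;
import com.filenet.api.events.UpdateEvent;
import com.filenet.api.exception.EngineRuntimeException;
import com.filenet.api.util.Id;

// One handler prepared for several triggers; the subscription decides which triggers invoke it.
public class AuditEventHandler implements EventActionHandler {
    public void onEvent(ObjectChangeEvent event, Id subscriptionId) throws EngineRuntimeException {
        if (event instanceof CreationEvent) {
            // Keep synchronous work short; long-running work belongs in asynchronous subscriptions.
        } else if (event instanceof UpdateEvent) {
            // Handle updates here.
        } else if (event instanceof DeletionEvent) {
            // Handle deletions here.
        } else {
            // Handle the remaining triggers (check-in, check-out, file, unfile, and so on) gracefully.
        }
    }
}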
4.7.12 Marking sets
Marking sets are intended for records management applications. They allow access to objects to be controlled based on the values of specific properties. The ACL for an object with a marking set is a combination of the settings of its original ACL and the settings of the markings constraint mask for each marking that is applied to it. The result of this combination is the effective security mask. Note that marking sets are only subtractive in nature; that is, access can only be denied or removed through marking sets. Refer to 5.3.6, “Markings” on page 167 for more information about marking sets.
The general mechanisms of marking sets include:
•A marking set is defined that contains several possible values called markings.
•Each marking value contains an ACL that defines who can assign that specific value to an object property, who can modify that value, who can remove that specific value, and who will have access to the object to which it is assigned.
•The marking set is assigned to a property definition that is assigned to a class. All instances of that class must have this marking property set to one of the markings defined in the marking set.
•The value for the marking properties can only be assigned by users authorized by the associated marking.
•Markings do not replace conventional access permissions on an object, but rather are coequal with them in determining access rights. If an object has one or more markings applied to it in addition to one or more permissions in its ACL, access to that object is only granted if it is granted by the permissions and by the markings.
The number and size of markings in a single marking set are limited by available system memory. To perform an access check on a marked object, the entire marking set and all its markings must be loaded into memory, which does not scale to millions of markings. For this reason, limit the number of markings in a marking set to no more than 100.
Recommendations: Marking sets need to contain no more than 100 markings. Marking sets are domain objects, so having more marking sets affects the performance and the memory footprint of the Content Platform Engine server.
4.8 Repository content objects
Part of the repository design process also involves how the content will be organized and laid out in the repository. You need to decide how to structure the objects that are instantiated from the design classes.
This section describes points for you to consider when laying out the content in the object store.
4.8.1 Folder objects
Folder objects can participate on both sides of aggregation by reference. Because foldering in IBM FileNet Content Manager is done by reference, and a containable object can be referenced in multiple locations at the same time, folders can be an extremely powerful tool for meeting sophisticated requirements. Even when the search paradigm is followed, folder objects still serve as reference aggregations and as an additional layer of security for the aggregated objects. For many clients, a primary reason for installing a central repository is to bring scattered information into an organized structure.
Referential containment relationships
A folder is a container that can hold other objects. These objects can be custom objects, documents, and folders and their subclasses. Child folders are typically directly contained. That is, their containment model is a one-to-many relationship.
A containing folder can contain multiple child folders, but each child folder is directly contained within at most one parent folder. Custom objects and documents are always referentially contained. For referentially contained objects, their containment models a many-to-many relationship. A referentially contained object can be contained within multiple folders, and can also be contained multiple times in the same folder.
There are two types of referential containment relationships: static and dynamic. A static referential containment relationship is a relationship between a folder and a custom object, a folder, or a specific document version in a version series. A dynamic referential containment relationship is a relationship between a folder and the current version of a document. In this case, the current document version is the released version if one exists; otherwise, it is the current version or, failing that, the reservation version.
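The following minimal sketch shows how a document might be filed into a folder by reference through the Content Engine Java API. It assumes an already connected ObjectStore named os and an existing Document named doc; the folder path and containment name are illustrative.
import com.filenet.api.constants.AutoUniqueName;
import com.filenet.api.constants.DefineSecurityParentage;
import com.filenet.api.constants.RefreshMode;
import com.filenet.api.core.Document;
import com.filenet.api.core.Factory;
import com.filenet.api.core.Folder;
import com.filenet.api.core.ObjectStore;
import com.filenet.api.core.ReferentialContainmentRelationship;

public class FileDocument {
    // Files an existing document into a folder. The document is not moved or copied;
    // only a containment relationship (a reference) is created, so the same document
    // can be filed into additional folders in the same way.
    public static void fileIntoFolder(ObjectStore os, Document doc) {
        Folder folder = Factory.Folder.fetchInstance(os, "/Claims/2024", null);   // illustrative path
        ReferentialContainmentRelationship rcr = folder.file(
                doc,
                AutoUniqueName.AUTO_UNIQUE,        // avoid containment-name collisions in the folder
                "Claim statement",                 // name shown when browsing the folder
                DefineSecurityParentage.DO_NOT_DEFINE_SECURITY_PARENTAGE);
        rcr.save(RefreshMode.REFRESH);
    }
}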
Filed as opposed to unfiled
In an IBM FileNet P8 Content Manager (P8 Content Manager) repository, content objects can be added to a repository in two ways: without reference to a folder structure or into a particular folder (or set of folders). We refer to these options as unfiled and filed (see Figure 4-9). Unlike a file system, repository folders do not represent physical locations in the repository. Content is added, indexed, and is accessible whether or not it is filed in a folder.
Figure 4-9 Repositories can be filed or unfiled
One of the primary benefits of filing into a folder is browsing. Browsing allows users to traverse a folder structure and locate content inside a folder. Ideally, all the content in any specific folder relates to a particular activity or function. Another advantage is that in a P8 Content Manager repository, content can be filed in more than one folder at a time. There is one master copy of the content, and references filed in multiple folders point back to the single master.
Remember that with P8 Content Manager, users can always search for and view any content that meets search criteria whether or not the content is filed in a folder. Folders are simply a convenience for users who want to browse for repository content.
There are use cases where unfiled content makes sense.
Table 4-1 is a decision table comparing the filed and unfiled options.
Table 4-1 Folder options and their impact
Unfiled content (does not use folders):
•Content is only accessible by search.
•There is no need to organize repository content using folders.
•Transactions that add content are slightly faster.
•Appropriate for high-volume image applications where access will be by search only.
Filed content (uses folders):
•Users can browse or search for content.
•There is a need to organize the repository content.
•A single version of a content object can be filed in more than one folder.
•Appropriate for lower volume applications, or applications where users will be manually adding content.
Organizing unfiled content
The P8 Content Manager repository can act as a receptacle for high-volume archive systems for image (scanned paper) or email messages. For these applications, folders and an organization scheme are not a priority. The “add content” transaction in P8 Content Manager is slightly faster when foldering is not required. In this type of solution, transaction rates and efficient searching are the most important criteria.
In solutions of this kind, searching becomes the primary mechanism for content retrieval. For this reason, the metadata that identifies the content when it is added to the repository is vital.
The metadata set that is collected for each content item must include all properties necessary to identify and retrieve the content. This set must include the usual properties, such as content title, content subject, and date collected, in addition to application-specific properties, such as customer name, customer ID, and account number.
Organizational metadata elements
You must also consider another set of metadata. You can add metadata properties that provide organization for the content. Organizational metadata identifies the type of content, the division or department to which it belongs, and potentially, the record series that controls its retention. The following list shows examples of organizational metadata properties:
•Division
•Department
•Function
•Activity
•Document type
•Record type
Adding organizational metadata tags to repository content is a valid method of providing a central structure to repository content without using folders. You can add the same elements that create an efficient folder structure to unfiled repositories as organizational metadata properties.
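The following minimal sketch shows how an unfiled document might be added with organizational metadata through the Content Engine Java API. It assumes an already connected ObjectStore named os, and it assumes that a custom document class named Invoice and custom properties named Division, Department, and DocumentType have been defined in the object store; all names and values are illustrative.
import java.io.ByteArrayInputStream;

import com.filenet.api.collection.ContentElementList;
import com.filenet.api.constants.AutoClassify;
import com.filenet.api.constants.CheckinType;
import com.filenet.api.constants.RefreshMode;
import com.filenet.api.core.ContentTransfer;
import com.filenet.api.core.Document;
import com.filenet.api.core.Factory;
import com.filenet.api.core.ObjectStore;

public class AddUnfiledDocument {
    // Adds a document without filing it into any folder; it remains fully retrievable
    // by searching on its organizational metadata.
    public static Document addInvoice(ObjectStore os, byte[] content) {
        Document doc = Factory.Document.createInstance(os, "Invoice");     // assumed custom class
        doc.getProperties().putValue("DocumentTitle", "Invoice 2024-0042");
        doc.getProperties().putValue("Division", "Commercial Lines");      // assumed custom properties
        doc.getProperties().putValue("Department", "Accounts Payable");
        doc.getProperties().putValue("DocumentType", "Invoice");

        ContentTransfer ct = Factory.ContentTransfer.createInstance();
        ct.setCaptureSource(new ByteArrayInputStream(content));
        ct.set_RetrievalName("invoice-2024-0042.pdf");
        ContentElementList elements = Factory.ContentElement.createList();
        elements.add(ct);
        doc.set_ContentElements(elements);

        doc.checkin(AutoClassify.DO_NOT_AUTO_CLASSIFY, CheckinType.MAJOR_VERSION);
        doc.save(RefreshMode.REFRESH);
        return doc;
    }
}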
Repository folder structures
The design of a central repository is an opportunity to place scattered content into an organization-wide filing system. One of the primary functions of a repository is to offer ease of access; users must be able to quickly locate information with a minimum of effort.
Several parameters contribute to a well-designed repository folder structure:
•Is the structure self-explanatory? Is it easy to locate information?
•Does the structure work for all groups in your organization?
•What about groups that want to create their own folder structure?
•Does the structure avoid placing too many folders in a single subdirectory?
We will consider these questions as we move forward in this section.
An organization-wide folder structure
A central repository folder structure must make sense for all groups in your organization. During implementation, it is not necessary to build out the entire folder structure; the first three levels are sufficient. The goal for the first three folder hierarchical levels is a structure that is accessible at first glance to any member of your organization.
Recommendations: When designing a central repository folder structure, we suggest that you start with the first three levels of the structure. Build this out for your entire organization.
The first three levels of the folder hierarchy form the central organization scheme for your repository. Three levels are not an absolute rule; four or five levels might be necessary for large organizations. The idea is to create a structure that provides an organizational foundation. Depending on your organization, there are several approaches for organizational schemes. The best way to illustrate this concept is through examples.
Example: By organizational chart
The first example is a folder structure that follows a company organizational chart. In this organizational scheme, as shown in Figure 4-10, the folder levels represent:
(1) Department → (2) Activity → (3) Document type
Figure 4-10 An organizational folder structure
Example: By geographical location
Another example is a repository that stores construction project records. For this organization, construction projects are organized by location (see Figure 4-11). In this scheme, the folder levels are:
(1) Region → (2) Construction project → (3) Document type
Figure 4-11 A geographical folder structure
Example: By function
The next folder structure is based on function. This structure is appropriate for records systems, which are typically organized by the function of the document, the activity to which it belongs, and the record category under which it needs to be filed. In this scheme, as shown in Figure 4-12 on page 123, the folder levels are:
(1) Function → (2) Activity → (3) Document type
Figure 4-12 A functional folder structure
Recommendations: Create a folder structure that makes sense for your entire organization. Develop your folder structure to at least the third folder hierarchical level. This structure forms the framework for your repository organizational scheme.
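The following minimal sketch shows how the first levels of such a hierarchy might be created with the Content Engine Java API. It assumes an already connected ObjectStore named os; the folder names follow the functional example and are illustrative.
import com.filenet.api.constants.RefreshMode;
import com.filenet.api.core.Factory;
import com.filenet.api.core.Folder;
import com.filenet.api.core.ObjectStore;

public class CreateFolderHierarchy {
    // Creates a child folder under the given parent path, for example "/" -> "Finance".
    public static Folder createChild(ObjectStore os, String parentPath, String name) {
        Folder parent = Factory.Folder.fetchInstance(os, parentPath, null);
        Folder child = Factory.Folder.createInstance(os, null);   // default Folder class
        child.set_Parent(parent);
        child.set_FolderName(name);
        child.save(RefreshMode.REFRESH);
        return child;
    }

    // Builds the first three levels: (1) Function -> (2) Activity -> (3) Document type.
    public static void buildSample(ObjectStore os) {
        createChild(os, "/", "Finance");
        createChild(os, "/Finance", "Accounts Payable");
        createChild(os, "/Finance/Accounts Payable", "Invoices");
    }
}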
Beyond the third level
The goal of the three-level folder hierarchy is to impose an organization-wide structure on repository content. But many times, individual groups have their own requirements for folder structures and want to organize their content without system-enforced rules. For these groups, simply relax the organizational rules for any folders created below the third level.
Folder-inherited security allows repository administrators to restrict the creation of folders in the first, second, and third folder hierarchical levels and grant folder creation privileges to group owners below this level. This enforces the integrity of the organizational scheme, while still allowing individual departments to organize content to their own satisfaction.
In the example in Figure 4-13 on page 124, folder creation rights to levels 1 - 3 need to be reserved for system administrators only. Folder creation rights under the accounting folder need to be granted to the accounting group manager, and folder rights under projects need to be granted to the IT group manager.
Figure 4-13 Folder creation rights in an organization-wide folder structure
Avoiding an excessive number of subfolders
It is possible to create too many subfolders under a parent folder. For all implementations, avoid creating more than 100 - 200 subfolders under any specific repository folder.
In any foldering application, large numbers of subfolders create performance problems. The system slows down when users open the parent folder and an excessive number of subfolder entries must be queried and returned from the database. The design goal is to create a deeper hierarchy rather than an overly shallow structure.
Recommendations: Limit the number of subfolders at every folder level. Create a hierarchy of subfolder levels rather than using many folders at the same level. Do not create a folder that contains more than 100 - 200 subfolders. Use a search paradigm wherever appropriate, and limit the search result size.
4.8.2 Other objects
When you define other objects in the repository, consider how they are intended to be used and leverage their unique capabilities. Another aspect of repository layout is the storage media that the content will use. Try to provide a range of storage media and use each appropriately.
Recommendations: Try to give meaningful name properties to objects to assist users in navigating through collections of documents returned in a search or a browsing session.
Try to match content with storage media in a meaningful way. Internal memos and other short-lived pieces of content without lots of business value can be stored on a simple network-attached storage (NAS) device. Content that is critical to the business operation can utilize a high-speed, highly available storage subsystem that also has a higher cost associated with it.
4.9 Storage media
A P8 Content Manager repository stores data in two areas: the object store and the storage area. The object store is a relational database that stores repository configuration: object references, properties, choice lists, and object relationships. The storage area holds the actual content: electronic media files. Object stores can be configured to use three distinct types of storage (see Figure 4-14 on page 126):
•Database store
•File store
•Fixed store
Figure 4-14 P8 Content Manager object store storage options
When choosing a storage method for your content, remember that each of these storage methods can be configured on a per document class basis.
4.9.1 Catalog
The catalog is a relational database that is specified at installation time. The catalog can be created on any supported relational database management system (RDBMS). Refer to the product documentation for information about supported brands and versions.
The catalog database stores all of the P8 Content Manager configuration information. If you expand the object store view using Content Platform Engine Administration tools, the object tree that displays is pulled from the catalog database. The catalog stores the following information:
•Configuration information
•Object references
•Object properties
•Object security lists
•Choice lists
•Property values
•Document (content) links
•Search definitions
Database indexes for custom properties
P8 Content Manager does not support writing to the catalog database through direct SQL commands. Interaction with the catalog database must be handled by using Content Platform Engine Administration tools or through the application programming interface (API).
Important: P8 Content Manager does not support direct writes (updates) to the P8 Content Manager catalog database.
The exception to this rule is custom property indexes. You can create a database index for any class property except system-owned properties. These database indexes, also known as single indexes, are stored within the object store database. For properties that users search frequently, single indexes reduce the processing time for queries on those properties.
Note: When selecting a property to index, the object store search must be case-sensitive, or the index is not created correctly. You must create additional indexes in Oracle and DB2 to avoid full table scans.
4.9.2 Database stores
P8 Content Manager can be configured to store content inside a relational database. With this configuration, P8 Content Manager converts document content into binary large objects (BLOBs) for storage in the database.
Database storage areas are useful when the size of your object store is not large in terms of the number of documents and the sizes of those documents. Smaller documents of about 10 MB or less have performance advantages in a database storage area when compared to other storage area types. Do not store any document that is over 100 MB in a database storage area.
4.9.3 File stores
With a file store, P8 Content Manager stores content files on a shared network drive or a NAS device. A file store is the most common object store configuration. To organize the files on disk, P8 Content Manager sets up a managed hierarchy of directories on the specified drive.
Note: The file names of the content written to this directory will be based on Global Unique Identifiers (GUID) associated with the document.
Figure 4-15 File store directory structure
A file storage area consists of a hierarchy of folders on a local or shared network location:
Share The shared folder serves as the parent directory to one or more file storage areas.
Root The root directory of the file storage area is the top-level directory for content storage. A single parent shared folder can contain one or many file storage area root folders.
Content The directory where all committed content element files are stored in a large hierarchy of subfolders.
During the creation of the storage area, you have the option of creating a small (23 x 23 = 529 directories) or large (23 x 23 x 23 = 12,167 directories) file storage area. The choice of one or the other is typically determined by the anticipated growth and the need for physically grouping the documents for storage management, backup, or disaster recovery purposes. A large file storage area is more suited for storing a large number of small content elements that contain single-page scanned documents or small emails. A small file storage area is more suited for a smaller number of content elements with a larger average size, such as content element files with embedded images, spreadsheets, and graphics.
Documents are stored among the directories at the leaf level using a hashing algorithm. As a best practice, limit the number of content element files in a leaf directory to fewer than 5,000. For a small directory structure, the upper limit is therefore around 2,500,000 content element files; for a large directory structure, the limit is about 60,000,000 content element files. The actual capacity of a file store depends on several factors, such as the type of content being stored, the size of the content, and the type of file store being used. Beyond these limits, create multiple storage areas. With larger file stores, the following issues can arise:
•Larger file stores take more time to perform the weekly full backup. In that case, consider differential and synthetic backups to avoid full backups, or implement SAN/NAS-based snapshots and replication.
•The consistency checker takes a long time to run.
•P8 Content Manager does not have a hard limit on the number of files in the file store, but the more documents you have, the larger the constraint placed on the file system.
P8 Content Platform Engine administration tools offer the capability to set a limit on the number of content elements and the total size of a file storage area. When either limit is reached, the file storage area is closed and new content is directed to the next open file storage area in the storage policy. To help manage a large storage space across multiple storage areas, Content Platform Engine farmed or rolling storage areas can be implemented, as indicated earlier.
SAN versus NAS for file storage
Although NAS and SAN are used somewhat interchangeably in this book, there are operational issues with SAN that typically make it inappropriate for a file storage area. Before describing these issues, we provide a better definition of SAN versus NAS, as described by the IBM office of the CTO. From an operating system perspective, a SAN is seen the same as a local disk or directly attached storage (DAS).
“Both local SCSI disks and SAN are accessed at the block level, so SAN typically looks like ordinary locally attached SCSI disk to the operating system. Disk I/O is done at the block or sector level. A driver translates those SCSI calls into calls through a host bus adapter over a Fibre Channel network to the SAN device on the other end of the Fibre Channel. The Fibre Channel network is a specialized optical network used just for connecting SAN devices to one or many servers through the host bus adapters in those servers. A host bus adapter is what you called a Fibre Channel card.”
“Network file shares, on the other hand, are accessed at the file level with operating system file system calls. So, network shares look like a local file system, with folders and files, not like local disk, to applications. Underneath the operating system, a network protocol is used to extend local file system calls over a standard LAN to the network file server. The act of binding a remote network file system into the local file system folder hierarchy is called a mount. The network protocols most commonly used for extending file system calls over the network to the remote file server are CIFS for Windows based clients, and NFS for UNIX based clients. Network shares can be provided by a network file server, or by specialized storage devices called NAS devices. NAS devices plug directly to a LAN and are dedicated to providing network shares to other computers on that LAN.”
- IBM office of the CTO
A SAN Fibre Channel device, a logical unit number (LUN), can be physically connected to multiple server nodes at a time. However, you need to have a combined operating system/file system that supports concurrent access to a shared LUN to be able to share the physical device without using a network file system, such as NFS.
With standard (non-parallel) file systems (that is, 99% of them, such as UNIX file systems (UFS) and journaled file systems (JFS)), the Linux or IBM AIX® operating system is written to control the disk directly, and as efficiently as possible, assuming sole ownership of the device. It can do whatever it wants. Think of a SCSI disk or a directly attached device where there is normally only one SCSI master on a SCSI bus, and that is the computer/disk controller. Therefore, the OS uses cached copies of data structures that are on disk.
Typically, the file system information (inodes, files, and directories) is not necessarily constantly in synch with the state of the physical device as the caching occurs. However, the operating system tracks all the moving parts and ensures that everything stays consistent within the scope of what it controls. This assumes that no other system is attempting to access the same blocks from a different port. This leads to two issues:
•The OS might make several updates to a structure that is in memory without writing it out to disk, for efficiency purposes, for example, an inode. So, another computer reading the disk is unaware of the latest updates.
•Similarly, when the OS writes the updates out to disk, it simply writes out the whole block. If another computer has updated that block since it was first read in, the other computer’s updates will be overwritten when this computer finally does its write. This problem, in particular, rapidly results in a corruption of the file system that most likely causes both computers to crash.
So, even though Content Platform Engine can handle and control concurrent access at the file level, it cannot control all I/Os going to a certain device, and it has no access to a level lower than the file system, that is, to the physical disk block level. Content Platform Engine works by using the API provided by the file system. The fundamental question is whether the file system underlying that API supports concurrent access to the disk. This is why Content Platform Engine needs to rely on a network file system, such as NFS, to go beyond the scope of any one machine’s operating system and file system and handle concurrent access to a single physical device.
Thus, a SAN or network protocols simulating a SAN, such as iSCSI, cannot be used for a file storage area if the Content Platform Engine servers run on different host machines. A SAN cannot control concurrent write access to the same directory if the requests come from different host machines. A SAN can only be used as a file storage area if all the Content Platform Engine servers that are writing to it are on the same host machine. In this case, the operating system on the host machine can control the concurrent write access to a common file store directory structure.
4.9.4 About storage policies
A storage policy provides a mapping to a specific object storage area and is used to specify where content is stored for a specific content class. P8 Content Manager supports the mapping of storage policies to one or more storage objects; therefore, each storage policy can have one or multiple storage areas as its assigned content storage target (see Figure 4-16 on page 132). This concept is known as farming.
A storage policy can be used to distribute an I/O load through farming, and it can be used to provide continuous storage availability by pre-provisioning storage areas in a standby state. When a storage area referenced in a storage policy becomes full, a standby storage area, if available, is automatically opened to maintain the same number of open storage areas for the policy.
Figure 4-16 Storage policies
Recommendations: Assign storage policies to class definitions to determine where content is saved.
Farming
A storage area farm is a group of storage areas that acts as a single logical target for content storage. Storage area farms increase throughput by distributing the I/O load across all the open storage areas. With storage area activation, the Content Platform Engine maintains the same number of open storage areas for the policy by opening standby areas as needed, which further distributes the I/O load. With farming, Content Platform Engine provides load-balancing capabilities for content storage by transparently spreading the content elements across multiple storage areas. The storage policy therefore functions both as the mechanism for defining the membership of a storage area farm and as the means for assigning documents to that farm.
Create separate file storage areas to ensure efficient document management. For example, you can create a file storage area to group documents with the same deletion or backup requirements. Map storage areas with documents by modifying the storage policy property on document classes.
Recommendations: Use Content Platform Engine administration tools to configure storage policies and storage area farms.
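The following sketch shows how a storage policy might be looked up by name and assigned to a new document with the Content Engine Java API. It assumes an already connected ObjectStore named os; the policy display name and document title are illustrative.
import java.util.Iterator;

import com.filenet.api.admin.StoragePolicy;
import com.filenet.api.constants.AutoClassify;
import com.filenet.api.constants.CheckinType;
import com.filenet.api.constants.RefreshMode;
import com.filenet.api.core.Document;
import com.filenet.api.core.Factory;
import com.filenet.api.core.ObjectStore;
import com.filenet.api.query.SearchSQL;
import com.filenet.api.query.SearchScope;

public class AssignStoragePolicy {
    // Looks up a storage policy by display name and assigns it to a new document,
    // so the content is written to that policy's storage area farm.
    public static void createWithPolicy(ObjectStore os) {
        SearchSQL sql = new SearchSQL(
            "SELECT This FROM StoragePolicy WHERE DisplayName = 'High value content'"); // illustrative name
        Iterator it = new SearchScope(os).fetchObjects(sql, 1, null, Boolean.FALSE).iterator();
        if (!it.hasNext()) {
            throw new IllegalStateException("Storage policy not found");
        }
        StoragePolicy policy = (StoragePolicy) it.next();

        Document doc = Factory.Document.createInstance(os, null);    // default Document class
        doc.getProperties().putValue("DocumentTitle", "Board minutes");
        doc.set_StoragePolicy(policy);
        doc.checkin(AutoClassify.DO_NOT_AUTO_CLASSIFY, CheckinType.MAJOR_VERSION);
        doc.save(RefreshMode.REFRESH);
    }
}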
4.9.5 Using fixed storage devices
Fixed storage devices are large-capacity, third-party storage devices that feature hardware-level content protection. Examples of fixed storage devices are EMC Centera and NetApp SnapLock. Fixed content systems potentially provide extremely large storage capacity, as well as write-once hard drive technology.
Fixed content stores compared to file stores
Before deciding on a fixed content store, review the following considerations:
•Content stored in the fixed storage area is accessed via the Content Platform Engine using a third-party API rather than the file system API.
•Read/write access to the repository can be slower in a fixed content store than access to Content Platform Engine’s file storage area.
•For a fixed content store, the repository might be write-once, which does not allow any changes to the content. This is the same as normal file storage areas in that document content can never be changed after it is added to the repository; the document can only be revised, and new versions can be added to the repository.
•The repository might not allow the deletion of content except through third-party device tools. The repository can support a retention period for content, which means that the deletion of the content is not allowed until the retention period has expired.
•A fixed storage area requires a small file storage area to be used as a staging area before it is stored into a fixed device.
The fixed content system can limit the number of concurrent connections to the server, which means that fewer connections might be allowed than the concurrent read/write requests normally supported by the Content Platform Engine. This might result in decreased performance, but not error conditions.
For more information about content storage management and storage farming, see this developerWorks article:
4.10 Considerations for multiple object stores
There are several valid use cases for deploying multiple object stores. A single object store can handle a catalog containing over a billion objects, and by using multiple storage policies, storage capacity is virtually unlimited. Except under extreme conditions, size is not a factor in the decision to add additional object stores.
Multiple object stores are warranted in the following situations:
•An object store is subject to high ingestion rates or frequent update procedures and needs to be segregated for performance reasons.
•Content must be separated for security reasons.
•User groups are separated by a large geographic distance.
For performance reasons
If an object store will be the target of high-volume ingestion rates, such as those produced by Capture, Datacap, ICC, or Email Manager, it makes sense to separate that object store from others that are dedicated to document lifecycle use. Users who search for and check out documents for editing will experience better performance if the object store they use is not busy handling high-volume automated processes. There are two common examples of this situation where multiple object stores are used: email archiving and IBM FileNet Records Manager solutions (see Figure 4-17 on page 135).
The IBM Records Manager object store that hosts record information is subject to processing-intensive database activity during retention and disposition processing. In addition, record objects are small and best suited for database stores. For these reasons, records need to be stored in a separate object store.
Recommendations: Set up a separate database object store for IBM FileNet Records Manager. This object store is commonly called the file plan object store (FPOS).
Figure 4-17 Two solutions with multiple object stores
Use the Data Source sharing feature. For more information, see the “Sharing Data Sources” and “Creating a Database Connection” topics in Administering Content Platform Engine:
For security reasons
Another reason to implement multiple object stores is a requirement to strictly separate content for security reasons. Although it is possible to keep classified content secure by using marking sets and security policies, certain content must be kept absolutely separate. In these situations, install a second object store for classified content.
Here are a few situations where secure object stores are a solution:
•Board of director-level content
•Secret or top secret government content
•Public-facing Internet-accessible libraries
•Service companies that offer enterprise content management services to multiple customers
By geography
Many organizations have large offices in several countries. Wide area network (WAN) links are expensive over large distances and typically have low bandwidth and high latency. It is not always practical for offices in this situation to share the same P8 Content Manager system. One solution to this situation is two separate repositories managed by two separate P8 Content Manager systems, as shown in Figure 4-18.
Figure 4-18 Two separate repositories
In this solution, an organization has installed two separate P8 Content Manager systems in two distant offices, usually under a single P8 domain. Users in each office have high-speed access to the local repository for document retrieval and editing. Users in remote offices can still search and retrieve content in the remote office repository, but because this activity is less frequent than local access, traffic over the WAN link is reduced.
By functional group
In organizations with several functional units, each unit, such as Human Resources (HR), Legal, and Marketing, might want a separate object store. This separation of data offers the flexibility for each unit to control its documents and to implement security and access rights for its users. It also allows line of business (LOB) applications to use only the data pertaining to their business unit.
By the size of the object store
Content Platform Engine object stores can handle data from tens of millions to hundreds of millions of objects. The number of objects that an object store can handle depends on several factors, including the database used, storage areas, rate of ingestion of objects, the metadata defined, and the kinds and size of data stored. After the object store reaches its maximum capacity on any of the factors, it is advisable to create a new object store.
4.11 Retention management and automatic disposal
P8 Content Manager can be configured to set retention on instances of annotations, custom objects, documents, and folders. A retention date on an object prevents the deletion of the object until the retention date has passed. Organizations might have to keep documents for a certain period of time for legal, regulatory, and contractual reasons. Setting a retention value on those objects helps ensure that the objects are retained in the system until all the obligations are met. A retention date also removes the possibility of accidental deletion of objects until the retention date has passed.
Automatic disposal lets you run a disposition task that removes objects from Content Manager based on the criteria specified. The following sections describe retention management, automatic disposal, and best practices.
4.11.1 Retention management
To prevent an object from being deleted, you can specify a retention setting for the object.
A retention setting can be applied either statically or based on events. Static retention for an object is set once. After a fixed amount of time, the static retention requirement is satisfied and the object can be deleted. Event-based retention is dynamic, allowing you to change the retention setting for an object after the occurrence of a business event. Event-based retention closely aligns the retention policy of the data with the business requirements for that data.
The default retention can be set at the class level. Instance retention can be set at the object level in the Content Manager. Events, such as Content Platform Engine events, can be used to set and alter the retention.
Class-level retention
The retention period you specify at the class level is applied to newly created object instances and to document instances when the document is checked in. Class-level retention can be set to the annotation, custom object, document, and folder classes. Users need to have the modify retention permissions at the object-store level to set the class-level retention.
Object-level retention
The objects of annotation, custom object, document, and folder can be set with the retention date at the time of object creation. By default, objects inherit the retention value from the parent class. To set or change the retention date for an object, users need a special modify retention permission. These special permissions are not required if the retention is defaulting from the class.
For static retention, users specify the retention date at the time of creating the object, or the exact value is specified on the class as the default retention date. Event-based retention is a scheme in which you first set the retention to Indefinite, which means the object cannot be deleted but the expiration date is not yet defined. When a business event occurs, you initiate or trigger the retention by changing the retention value from Indefinite to a specific date. Another special value, Permanent, never allows the document to be deleted from the object store. The retention value can only be set to a greater value than the current value; Content Platform Engine does not allow you to reduce the retention period.
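The following minimal sketch shows how a static retention date might be set on a document through the Content Engine Java API. It assumes an already connected ObjectStore named os, a document Id, and that the retention date is exposed through the CmRetentionDate property; the seven-year period is illustrative. Event-based retention follows the same pattern, changing the value from Indefinite to a specific date when the business event occurs.
import java.util.Calendar;
import java.util.Date;

import com.filenet.api.constants.RefreshMode;
import com.filenet.api.core.Document;
import com.filenet.api.core.Factory;
import com.filenet.api.core.ObjectStore;
import com.filenet.api.util.Id;

public class SetRetention {
    // Static retention: keep the document for seven years from today.
    // The date can later be moved further out, but never reduced.
    public static void setStaticRetention(ObjectStore os, Id docId) {
        Document doc = Factory.Document.fetchInstance(os, docId, null);
        Calendar cal = Calendar.getInstance();
        cal.add(Calendar.YEAR, 7);
        Date retainUntil = cal.getTime();
        doc.getProperties().putValue("CmRetentionDate", retainUntil);   // assumed retention date property
        doc.save(RefreshMode.REFRESH);
    }
}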
Retention modes
There are two retention modes that are supported by storage areas in Content Platform Engine: aligned and unaligned. Only annotations and documents can have content in Content Manager. In aligned mode, the retention value is also reflected on the content stored on the fixed content device, so users cannot directly delete the content from the fixed store or from the Content Platform Engine. In unaligned mode, all retention features are supported by the Content Platform Engine, and the content in the fixed store is not stored under retention. In unaligned mode, users can delete the content directly on the fixed store, because retention is enforced by Content Manager, not by the fixed store.
Note: Use aligned mode when you want strict enforcement of retention on the content of the document. Use unaligned mode when all you need is enforcement by Content Platform Engine.
4.11.2 Automatic disposition
Automatic disposition is a process that runs in the background and deletes objects automatically after their retention dates have expired. Disposition uses the sweep framework in Content Manager. Many enterprise content management applications involve searching among a large set of objects for some subset that meets specific criteria, and then performing an operation on those objects that match the criteria. Content Manager calls that type of procedure a sweep.
The sweep framework is a policy-based framework. A sweep job is like a sweep policy, except it only executes a single sweep. A sweep policy executes repeated sweeps. A disposal policy has a target class that defines the set of objects that will be examined, a filter expression that defines the subset of the objects that will be disposed of, and a schedule that determines the time of day and the days of the week it will execute.
Automatic disposition is declared by using a disposition policy. A disposition policy requires a class, a filter, and a schedule to run, and it can run in one of three modes: normal mode, in which the disposition action is taken on the matching objects; preview mode, which shows the objects on which the policy would act; and preview only counters mode, which shows only the count of target objects. The disposition policy runs as it is scheduled.
Note: Initially, set the sweep to preview or preview only counters mode to determine the objects that are going to be disposed of. Use the FileNet System Monitor to monitor the sweep activity and sweep rate.
Note: A policy-based sweep combines all policies, including the retention update policy and disposal policy for a certain base class, into a single sweep.
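As a rough sketch only, the following code outlines how a disposal policy might be created through the Content Engine Java API. It assumes an already connected ObjectStore named os, that the sweep framework exposes a CmDisposalPolicy class, and that the FilterExpression and SweepTarget property names and the filter syntax shown here match your release; verify all of these against the product documentation before use. The target class Email is assumed to exist in the object store.
import com.filenet.api.constants.RefreshMode;
import com.filenet.api.core.Factory;
import com.filenet.api.core.ObjectStore;
import com.filenet.api.sweep.CmDisposalPolicy;

public class CreateDisposalPolicy {
    // Sketch: dispose of Email documents whose retention date has passed.
    public static void create(ObjectStore os) {
        CmDisposalPolicy policy = Factory.CmDisposalPolicy.createInstance(os, null);
        policy.getProperties().putValue("DisplayName", "Dispose expired email");
        // Filter expression limiting the swept instances (assumed syntax).
        policy.getProperties().putValue("FilterExpression", "CmRetentionDate < NOW()");
        // Class whose instances the sweep examines (assumed property name).
        policy.getProperties().putValue("SweepTarget",
                Factory.ClassDefinition.fetchInstance(os, "Email", null));
        policy.save(RefreshMode.REFRESH);
    }
}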
4.11.3 Retention update
The retention period for instances of the four classes that support retention can be altered by using either a retention update policy or a retention update job. A retention policy is used for recurring retention updates, for example, changing retention from Indefinite to five years from the creation date on instances that have been in the system for more than 90 days. A retention update job is used for one-time only operations. For example, extend the retention date by one year for instances of a certain class.
Retention update job
Retention update jobs allow the retention period to be altered on retainable objects based on the class and property state of a candidate object. On a retention update job, the new retention date can be explicitly specified. Or, it can be computed by specifying the name of a date property on the Sweep Target Class, an offset, and the time units in which the offset is expressed.
Retention reduction job
Normally, a retention update job is used to extend retention on instances. In some limited use cases, for example, to handle a change to a regulation, a job running in retention reduction allow mode can be used to reduce retention. A retention job supports two modes of operation: retention reduction allow and retention reduction prevent. The default mode of operation is retention reduction prevent.
Retention update policy
Retention update policy is a policy-based implementation of retention update that allows the retention period to be altered on retainable objects based on the class and property state of a candidate object. Retention update policy can be specified with the specific retention date or the new retention date can be computed from the base retention date, offset, and offset units.
Note: You can create a covering index for better performance of policy-based and job sweeps on the base table being swept. The covering index needs to include all the columns that are included in the selection list and define the object_id as a unique value.
The selection list for sweep is in the P8 database trace logs with Sweep.SQL. A sweep that includes the overflow table has to include a covering index for the overflow table.
4.12 P8 Content Manager searches
There are several methods of searching for content in the P8 Content Manager repository. The methods can be divided by the purpose of the search:
•User-invoked searches
•Content-based searches
•Repository maintenance searches
P8 Content Manager offers a set of tools for each purpose.
4.12.1 User-invoked searches
Users can create and invoke P8 Content Manager searches through FileNet Workplace XT and IBM Content Navigator, both of which are web applications. These applications offer two types of searches: search and stored searches.
Search
Search can be customized in FileNet Workplace XT and Content Navigator by individual users. Search appears when users log in to P8 Content Manager. The FileNet Workplace XT or Content Navigator search is an ideal tool for user-invoked ad hoc searches for repository content. Users can search any properties and can add any system or custom property to the criteria display.
Note: When users modify their search criteria, the system remembers the settings and will display them again on the next visit to the site.
Stored searches
FileNet Workplace XT and Content Navigator offer a tool for designing search templates for more sophisticated content searches. Search Designer offers the following enhanced features:
•Cross-object store searches
•Search criteria expressions (AND/OR options)
•Preset criteria for filtering search results
•Searches that appear as links on a browser favorites or bookmarks menu
Use Search Designer to create stored searches and cross-repository search. Stored searches can also be accessed as web links, which makes it easy to add the stored searches as favorites to the browser.
4.12.2 Content-based search
P8 Content Manager supports content-based retrieval (CBR) for documents, annotations, folders, custom objects, and their properties. With CBR, you can search an object store for objects that contain specific words or phrases embedded in document or annotation content. With CBR, you can also search an object store for objects that contain specific words or phrases embedded in string properties of objects that have been configured for full text indexing.
The Content Platform Engine uses the IBM Content Search Services (CSS) server for indexing and searching documents. Content-based searches can be performed from all P8 Content Platform Engine client search tools. Content Platform Engine can fail over during indexing and search, and it supports configurations with no single point of failure.
Indexing process
The indexing process begins at the Content Platform Engine when CBR-enabled objects, such as documents, are created or updated. The Content Platform Engine stores indexing data for the CBR-enabled objects in the indexes created and managed by the CSS servers. Each index is associated with a distinct index area in the object store. During an indexing process, the system can write to multiple indexes across the index areas. When an index’s capacity is reached, the index is automatically closed and a new index is created.
The Content Platform Engine queries the items from the index request table to identify documents that are queued for indexing and then groups index requests pertaining to the same target index into an index batch. The binary documents in this batch are converted to text by the text extraction processes, then the entire batch is submitted to a CSS server for indexing.
Text extraction happens in an external process running outside the Content Platform Engine. Text extraction runs on the Content Platform Engine server by default. All the Content Platform Engine servers in a site can dispatch the requests for indexing. This allows the Content Platform Engine to share the text extraction load among all the available servers because the text extraction processes can be CPU-intensive and disk I/O-intensive. Text extraction throughput can be configured by using the Content Platform Engine administrative tools. The text extraction processes are also known as text filters. The number of text filters for Content Platform Engine can be increased or decreased based on the CPU utilization and the available system memory of the server. During text extraction, the text filter process writes the intermediate temporary data to the text filter’s temporary directory. Having this temporary directory on a fast I/O device will increase the performance of the text filters.
Note: When IBM Content Collector is used with the Content Platform Engine, text extraction for the binary documents occurs on the CSS server via the IBM Content Collector plug-in.
After the batch of documents is processed by the text extraction processes, the text file batch is submitted to the CSS server. After the CSS index server receives the index batch, the preprocessing functions begin.
Preprocessing functions consist of these tasks:
•Document construction
•Language identification
Ensure accurate processing and optimal performance by specifying a default language for an object store. If one language cannot be definitively identified for the content of an object store, specify the list of languages that the object store's documents contain.
•Tokenization
Tokenization creates the tokens from the extracted text. The language of the document plays a key role in identifying the tokens. After the tokens are created from the document, the index for the document is updated or created with these tokens in its respective index area.
For indexing purposes, each full-text index in the Content Platform Engine is sticky to a specific IBM Content Search Services server. This stickiness does not change until the lease expires, that is, until the time since the sticky server last performed indexing on the full-text index reaches a threshold or the sticky content search server becomes unavailable. The Content Platform Engine then changes the stickiness of the index to the least loaded index server to balance the load among the available index servers.
Content Platform Engine can optionally be configured to group the CSS servers and the index areas that access a common shared index area root directory into an affinity group. Each index area can be attached to an affinity group, and a CSS server can optionally be assigned to a single affinity group. If all the servers in an affinity group are on the same machine accessing a local index area root directory, any failure of that machine causes all the servers to go down, and high availability for these indexes might be at risk. Load balancing for indexing happens between the servers in an affinity group; load balancing and failover also occur between the CSS servers that are not associated with any affinity group. For indexing purposes, it is important to assign multiple servers to an affinity group. The Content Platform Engine servers perform the indexing load balancing based on the active index servers and the workload on each server.
Important: Content Platform Engine load balances and fails over within the affinity group and among the indexes that are not part of any affinity group. Consider the high availability feature when using affinity groups.
Content Platform Engine performs index partitioning based on the partitioning property value configured on the object store. Index partitioning groups the indexing information for objects that have the same partitioning property value. Index partitioning improves the performance of CBR searches when the search criteria contain the partitioning property, because only the indexes that match the specified partitioning values are searched. Without partitioning, performance might be worse because the Content Platform Engine has to search a larger number of collections. For a CBR search to work well with partitions, consider the searches that are going to be run when you choose a partitioning property.
Important: Specify the complete list of languages used within the object store’s content. Identifying the correct language for a document improves the tokenization and search.
Use CSS in a dedicated mode, such as index or search. Avoid using it in the dual mode of index and search. By default, each CSS server is configured to use four CPUs for indexing. Consider this configuration when deciding how many CSS server instances to create on the system.
Search process
Users can submit CBR queries against the full-text index with its criteria: CBR-enabled properties or terms that exist in the content. Search requests are initiated through the Content Platform Engine administration tools or other client applications using the Content Engine API and include a full-text expression that is submitted to the CSS search server. The content-based search expression is highlighted in the following query:
SELECT d.This FROM Document d INNER JOIN ContentSearch c ON d.This = c.QueriedObject WHERE CONTAINS(d.*,'lion AND tiger')
The search server uses word stems, synonyms, and stop words to improve search efficiency and accuracy. It identifies the stem of each word term included in a full-text search expression. A stop word is a word or phrase that the search server ignores to avoid irrelevant search results caused by common expressions. The search server applies these definitions against the index and runs the full-text search. The results are returned to the Content Platform Engine server, which then joins them with the other tables in the query and completes the query. Stop words do not affect indexing; they still appear in the indexes created by the CSS server.
The Content Platform Engine server runs searches concurrently. The search server configuration on the domain allows full-text indexes to be searched in parallel to satisfy user queries. With content-based searches, you can search content based on words and phrases, string properties of a CBR-enabled object, and partitioned properties, and also by using XML and XPath queries.
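As a sketch only, assuming the DocumentTitle property has been CBR enabled on the Document class, a property-scoped full-text search takes this general form:
SELECT d.This FROM Document d INNER JOIN ContentSearch c ON d.This = c.QueriedObject WHERE CONTAINS(d.DocumentTitle,'tiger')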
Recommendations: If the query criteria include partitionable properties, consider using index partitions to reduce the number of indexes to search and thereby increase search speed. Index partitioning increases the number of indexes that are created. If there are no partitionable properties in the query criteria, we advise against using partitioning. Searching with order by rank reduces performance, so use it only when required. Ranking is determined by the CSS server.
Content-based searches tend to be slower when searches run concurrently with indexing. Dedicate servers to content search because I/O and memory are the important factors in the search operation. We advise up to 6 GB of memory for each CSS server. Content-based searches on properties perform better than content and XML searches.
Index areas
An index area is a file system directory that contains CSS indexes. Each object store can have multiple index areas, and each index area can have multiple indexes. An index contains the indexing information for objects that belong to the same indexable base class or to subclasses of that base class. Index areas can have different states: OPEN, CLOSED, STANDBY, or FULL. Regardless of the state of the index area, all the indexes in the index areas might be searched.
Important: Index area root directories need to be unique among index areas even if the root directory path is on the local disk.
An affinity group is a group of CSS index servers and index areas. The servers in a group access only the index areas in the same group. The servers that are not in a group access only the index areas that are not in a group. Although configuring affinity groups is optional, it is a good practice to assign multiple CSS servers to an affinity group and to keep the root directory local to the CSS servers that perform indexing. All the servers in the affinity group must have read/write access to the root directory.
Configure the number of index areas to be less than or equal to the number of CSS servers in indexing mode. During content ingestion, you might want an equal number of indexing servers and index areas to keep all the indexing servers busy. Consider having additional indexing servers for failover purposes.
Important: The location specified for the index areas needs to be accessible for read and write for all the CSS servers. If an index area is assigned to an affinity group, it needs to be accessible to all the servers in the affinity group.
4.12.3 Searches for repository maintenance
Content Platform Engine Administration tools feature a query tool that can be used for detailed report generation or for maintaining an object store repository. With the Query Builder tool, you can create a search query and apply bulk actions on the objects returned in the result set. With Query Builder, you can perform these functions:
•Find objects using property values as search criteria.
•Create, save, and run simple searches.
•Create and save search templates that will prompt for criteria when launched.
•Launch search templates that are provided with each Content Platform Engine and Content Platform Engine Administration tools installation. These templates are provided to assist with managing the size of your audit log and for managing entries in the QueueItem table.
•Create, save, and run SQL queries.
•Combine searches with bulk operations that include the following actions (available on the Query Builder Actions tab):
– Delete objects.
– Add objects to the export manifest.
– Undo checkout (for documents).
– Containment actions (for documents, custom objects, and folders): file in folder and unfile from folder.
– Run VBScripts or JScripts (Query Builder Script tab).
– Edit security by adding or removing users and groups.
– Lifecycle actions: set exception, clear exception, promote, demote, and reset.
In Query Builder, there are two ways to construct searches: Simple View and SQL View. Select a view from the toolbar to choose a view style:
•Simple View offers a point-and-click interface where you can select tables, classes, and criteria from drop-down lists.
•SQL View translates anything that you create in Simple View. This is a one-way translation; you cannot translate an SQL View back into a Simple View. SQL View presents the query in an SQL text window that you can edit directly, or you can load any *.qry files that you have saved on the network.
Both views construct a query that can be bundled with the other Query Builder features: bulk operations, scripts, and security changes. Both views support Search Mode and Template Designer Mode.
Tip: To aid administrators using SQL View, the P8 Content Manager help files contain P8 Content Manager database view schema.
Search templates and template designer mode
Search templates are like simple queries except that, when they are loaded from the Content Platform Engine Administration tools Saved Searches node, they prompt you for search criteria and ask whether you want to include any defined bulk operations.
IBM FileNet-provided search templates are installed with every Content Platform Engine or Content Platform Engine Administration tools-only installation into a folder on the local server named SearchTemplates. This folder is in the FileNet installation directory. Any queries placed in this folder appear in the Content Platform Engine Administration tools Saved Searches node as long as they have the .sch file name extension.
Querying object-valued properties
One of Content Platform Engine’s powerful search features is the ability to retrieve an object when provided another object that is a member of one of its object-valued properties. For example, you can find a document that has a particular security policy by using the ID of that security policy in the search criteria.
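For illustration only, using a placeholder ID, a query of this general shape retrieves documents whose SecurityPolicy object-valued property references a specific security policy:
SELECT d.This FROM Document d WHERE d.SecurityPolicy = OBJECT('{11111111-2222-3333-4444-555555555555}')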
Multiselect operations
Multiselect (or bulk) operations perform an operation on all objects returned in the search results from the query builder query. This feature is useful for object store maintenance activities.
With multiselect operations, you can perform the following actions on multiple files at the same time:
•Delete
•File to folder
•Unfile from folder
•Undo checkouts
•Change lifecycle states
•Add to security ACLs (you cannot delete existing entries)
•Run an event action script
For example, assume that several documents had been checked out by someone who left your company. Using multiselect operations, you can search for all documents that were left checked out by that person and undo these checkouts in one operation. To do this, you use the Query Builder to construct a search to find all documents currently checked out under the former employee’s system login name.
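As a sketch only (the user ID jsmith is a placeholder, and the numeric value 3 is assumed to represent the Reservation version status in your release), a SQL View query of this general shape locates the reservation objects created by the former employee so that the Undo checkout action can be applied to the result set:
SELECT d.This FROM Document d WHERE d.VersionStatus = 3 AND d.Creator = 'jsmith'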
4.12.4 CBR query optimization
CBR query optimization specifies how searches that contain both a content-based retrieval (CBR) search and a relational search on a database are executed. By default, the Content Platform Engine always performs the CBR search first and the database search second. The CBR-first approach is most efficient when there are few full-text hits. Efficiency decreases, however, when there are many full-text hits, and there are fewer database hits than full-text hits.
To provide control over how combined searches are executed, the CBRQueryOptimization property can be set on the object store. As an alternative to the default CBR-first option, you can set the property to the dynamic switching option. In dynamic switching mode, the Content Platform Engine dynamically determines whether to issue the CBR search first or the database search first, optimizing performance for these types of searches.
In dynamic switching mode, the Content Platform Engine switches from CBR first to database first based on an estimated number of CBR hits. The estimate is compared to a threshold value, set in the CBRQueryDynamicThreshold property. If the number of full-text estimated hits is less than or equal to the CBRQueryDynamicThreshold value, the CBR search is executed first (CBR-first search). If the number of full-text estimated hits is larger than the CBRQueryDynamicThreshold value, the database search is executed first. The dynamic switching operation is affected by various search options, including requests for rank ordering. The CBRQueryRankOverride property on the object store determines how the server responds to CBR search requests for rank order and can affect server performance.
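For illustration only, the following combined query contains both a full-text expression and database criteria (here on the DocumentTitle property); with dynamic switching enabled, the Content Platform Engine estimates the number of full-text hits for the CONTAINS clause and decides whether to run the CBR portion or the database portion first:
SELECT d.This FROM Document d INNER JOIN ContentSearch c ON d.This = c.QueriedObject WHERE CONTAINS(d.*,'policy renewal') AND d.DocumentTitle LIKE 'Claim%'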
Best Practice: To ensure that database-first searches execute efficiently, set database criteria on indexed properties. The database-first searches require database indexes on at least one property in the WHERE clause for good performance. Otherwise, queries perform inefficiently during the database-only portion of the search.
Specify limiting conditions on database criteria to achieve relatively small hit counts. Database-first searches are most effective when the database hit count is relatively small.
4.13 Conclusion
In this chapter, you learned about the basic concepts and elements that comprise a repository and repository design. While designing the system, ensure that someone is responsible for the design of the repository. Use a prefix for the symbolic names in the repository that uniquely identifies your solution and does not conflict with Content Platform Engine symbolic names and naming conventions. Create a meta object store and import the metadata from the meta object store into the other object stores. Create a prototype to validate your design before implementing it. Consider the best practices and performance considerations before finalizing the design.