IBM Content Assessment scenario
Organizations dealing with an explosion of unstructured content face increasing difficulty in organizing and decommissioning that content. IBM leads the market in addressing this issue with Content Assessment.
This chapter provides details about the IBM Content Assessment offering, of which IBM Content Analytics is a critical part. It introduces the content analytics and classification capabilities that empower an organization to decide which content to decommission while preserving and using the content-centric portions that have business value. It explains how to use analytics to intelligently unlock information, and to discover, explore, and automatically classify business content, increasing productivity and reducing the manual effort involved.
This chapter includes the following sections:
12.1, “Content Assessment offering”
12.2, “Overview of Content Assessment”
12.3, “Content Assessment workflow”
12.4, “Records management and email archiving”
12.5, “Preferred practices”
12.6, “Summary”
12.1 Content Assessment offering
In the normal course of business, you deal with huge amounts of unstructured information. How much of that information is relevant to the business or task at hand? A common estimate is that a large portion, around 80% or more, of your data is unnecessary. For example, it can be over-retained, irrelevant, redundant, or duplicated.
How can you know which data has business value and which data is unnecessary? What can you do to preserve only the part of the content that has business value, is risk-related, or requires life-cycle governance? What can you do to be prepared to quickly address legal requests to collect content that is currently unorganized?
In the context of Content Assessment, content decommissioning is defined as the action of deleting irrelevant business content. You assess the content of your data, decommission business-irrelevant content, and collect information for further legal compliance requirements, especially in the area of electronic discovery (eDiscovery). Consider the following approaches:
Preserve all the content. This approach is expensive and can expose your company to legal issues in the future.
Delete all the content. This approach violates compliance regulations and can expose your company to legal issues in the future.
Smartly identify, select, and preserve only what you need. This method is the best approach, but it is complex because you must deal with such a large amount of content.
Content Assessment empowers content decommissioning through exploration and insight. It helps you in the process of accessing, visualizing, and analyzing content across the organization to evaluate whether it is necessary to your business.
At the time of writing, the Content Assessment offering includes the following products, listed with the functionality that each contributes to the Content Assessment use case:
Content Analytics V2.1
 – Crawl and analyze content from various enterprise content sources.
 – Allow exploration of content from these various sources.
 – Export meaningful subsets of documents.
IBM Classification Module V8.7
 – Integrate into Content Analytics to enhance analytics by providing categorization and clustering.
 – Integrate into Content Collector for File Systems to enhance collection-time decision making for each document.
IBM Content Collector V2.1.1
 – Gather the documents that are exported from Content Analytics into an IBM Enterprise Content Management (ECM) repository.
12.1.1 Concepts and terminology
To understand the Content Assessment offering, you must understand the following key concepts and terminology associated with the offering and technology:
Dynamic analysis
A broad marketing term for the interactive exploration capabilities of Content Analytics. Companies are interested in content assessment simply to understand their information better so that they can improve their decisions about what to do with their content.
Dynamic collection
The ability to identify and preserve files that are crawled by Content Analytics and collected by using Content Collector within the context of an eDiscovery scenario. This concept addresses specific requests for a company to collect and preserve files related to a specific case.
eDiscovery readiness
Within the context of Content Assessment, the state of being proactive, rather than reactive, about the governance of your content. Use offerings and products, such as Content Assessment, Content Collector, Classification Module, and Enterprise Records, to proactively govern and control your information. In this state, you can be confident and ready to respond to legal inquiries.
Content decommissioning
Within the context of Content Assessment, the action of deleting irrelevant business content.
12.2 Overview of Content Assessment
Content Assessment assists you in discovering valuable business information that is buried beneath irrelevant, obsolete, and duplicate content. It also helps you preserve and prepare your content for efficient eDiscovery.
With Content Assessment, you can identify the necessary information and decommission the unnecessary information by using the following steps:
1. Dynamically analyze what you have.
Make rapid decisions about business value, relevance, and disposition.
2. Decommission content that is unnecessary.
Save costs and reduce risk by eliminating obsolete, over-retained, duplicative, and irrelevant content and the infrastructure that supports it.
3. Preserve and use the content that matters.
Collect valued content to manage, trust, and govern throughout its life span in an enterprise-grade ECM platform. Uncover new business value and insight by integrating with other content analytics solutions. Collect content in response to legal requests for information.
Through text analysis, Content Assessment gives you the tools to efficiently investigate the content and make informed decisions. You can use the analysis capabilities of Content Assessment to help you achieve the following goals:
Aggregate. Gather data from multiple content sources and types by using the large variety of crawlers that Content Analytics brings into Content Assessment.
Correlate. Use the deep analysis of content that surfaces trends, relationship patterns, concepts, and anomalous associations, a set of unique capabilities that Content Analytics brings to Content Assessment. The correlations rely on additional advanced text analyses provided by the Unstructured Information Management Architecture (UIMA) pipeline. They include the text classification and information extraction provided by Classification Module.
Visualize. Use the easy-to-use, feature-rich views of Content Analytics to quickly dissect a large corpus of content and derive insight from it.
Explore. Interactively investigate content with faceted navigation, and drill down to surface new insights and understanding.
This section explains these Content Assessment capabilities through the following scenarios:
Content decommissioning
Dynamically analyzing and collecting your content
12.2.1 Content decommissioning scenario
This section describes a scenario in which you must use Content Assessment to decommission your content. The goal of the scenario is to reduce costs and risk by retaining only the necessary content.
To illustrate the power and benefits of Content Assessment, this scenario highlights a large company, Fictitious Software Company A (referred to as the Company), that has stored data and content for many years.
The Company is a software company that produces thousands of different products, with a significant number of releases. Over time, the Company’s product releases change, and some of them are discontinued. In many cases, the product names change over time. The Company has stored several terabytes of document content over the last 20 years.
To comply with new regulations, the Company is required to store valuable business data for a certain period, depending on the nature of the documents. The Company spends a lot of time and money locating the documents that its lawyers request to support a pending lawsuit.
The Company is aware that it has a huge amount of content that is irrelevant to its current business and to compliance regulations. The Company wants to decommission the content in a responsible way, so that, for example, important business-related information is not disposed of.
The Company decides to prepare itself more proactively for other potential lawsuits and to reduce the effort it invests today in finding the necessary documents and making them available to its lawyers.
The Company decides to acquire an ECM compliance suite of products to help it manage its large amount of content and reduce the risks that it faces today with its unorganized content. Its problem then becomes deciding what data to organize and how to organize it so that it is compliant and easier to manage.
To solve their problem, the Company uses the Content Assessment offering as follows:
1. Preclean the data set to remove documents that do not contain analyzable text. This pre-analysis can be performed through different methods:
 – Filter the files by their extensions, and remove the file types that cannot be analyzed, such as images. The remaining subset of documents has the potential to be text-analyzable. (A minimal sketch of this type of extension filter follows this procedure.)
 – Set up a search collection in Content Analytics to identify the nontext data that cannot be analyzed by using metadata information such as the file type, file size, last access date, version, and directory. After the nontext data is identified, remove these files from the data set.
2. With some of the unnecessary documents removed, create a text analytics collection to further decommission documents. When creating the text analytics collection, enable the following functionality:
a. Select the Add default facets for content assessment check box to automatically add the following default facets that are displayed in the text miner application:
 • File Extension
 • File Size
 • Last Modified Date
 
Default content assessment facets: After the collection is created, the default content assessment facets can no longer be enabled. Enable them when you create the collection.
b. Under Advanced options (Figure 12-1 on page 489), complete the following selections for the collection:
i. In the Terms of interest field, select Enable automatic identification of terms of interest. With this function, you can identify relationships between nouns and nearby verbs and adverbs that exist in your content.
ii. In the Duplicate document detection field, select Enable duplicate document detection. This function identifies documents that might contain the same or nearly the same content. The duplicated documents are good candidates for being decommissioned and removed from your data set.
For details about duplicate document detection, go to the IBM Content Analytics Information Center at the following address, and search on duplicate document detection:
Figure 12-1 Enabling the duplicate document detection and terms of interest functions
3. Create the crawler for your content source. By using the crawler capabilities of Content Analytics, the Company accesses various content sources, making the content available for inspection. Alternatively, instead of using crawlers, you can use the import from comma-separated value (CSV) file functionality to import your filtered data into Content Analytics without having access to a repository. For more information, see 10.1, “Importing CSV files” on page 388.
4. Set up the clustering functionality for the collection. The clustering functionality is used as a method to gain quick insight into the content. For details about clustering, see 8.3, “Document clustering” on page 343.
5. Open the text miner application to gain a deeper view of your content:
 – Use the Facets, Deviations, and Trends views as explained in Chapter 6, “Text miner application: Views” on page 217, to better understand your content.
 – Enable the Named Entity Recognition annotator, and discover location names, person names, and company names in the Company’s content.
 – Define custom facets, and assign them values by using word lists.
 – Extract information by using patterns, such as social security numbers.
In this scenario, the Company uses all the power of text analytics and the integration with Classification Module that Content Analytics offers. As a result, the Company gets a series of facets, such as the following examples, that reveal information about their content:
 – Intellectual Property
 – Legal Concepts
 – Sentiment Analysis
 – Products Analysis
 – Personal and Corporate Information (such as credit card numbers and various dates)
6. Inspect the content, and view it from different angles by using the text miner application. When the Company decides which data to preserve, it exports that data by using the Content Analytics export capabilities, which produce XML files along with the original binary content.
7. Reiterate this process from time to time in order to collect data to be further stored in an ECM repository.
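To make step 1 of this procedure concrete, the following minimal sketch shows an extension-based precleaning pass of the kind described there. It is an illustration only: the extension list, the file share paths, and the quarantine approach are assumptions for this example, not part of the Content Assessment products, and you would adapt them to your own data set.

import os
import shutil

# Illustrative set of extensions that typically carry no analyzable text.
# Adjust this list to match your own data set and crawler configuration.
NON_TEXT_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tif",
                       ".exe", ".dll", ".zip"}

def preclean(root_dir, quarantine_dir):
    """Set aside files whose extensions indicate non-analyzable content."""
    moved = 0
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            _, ext = os.path.splitext(name)
            if ext.lower() not in NON_TEXT_EXTENSIONS:
                continue
            src = os.path.join(dirpath, name)
            # Preserve the relative path so that set-aside files do not collide.
            dst = os.path.join(quarantine_dir, os.path.relpath(src, root_dir))
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.move(src, dst)  # set aside rather than delete outright
            moved += 1
    return moved

if __name__ == "__main__":
    count = preclean(r"\\fileserver\share", r"\\fileserver\quarantine")
    print(f"Set aside {count} non-analyzable files")

The remaining files are the subset that is worth crawling into the text analytics collection in step 2.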
In this scenario, the Company has two business requirements:
 – Decommission content. The Company will invest a one-time effort to discover business-important data and preserve the data in the ECM repository. The Company will then dispose of the data on the old file shares and other old and obsolete data.
 – Periodically collect data and make it ready for legal eDiscovery if needed in the future. The Company needs to declare records according to the Company’s file plan that is defined during the Content Assessment process.
In Content Assessment, all activities can be interrelated, for example, content decommissioning, on-demand dynamic analysis for legal cases, periodic dynamic collection of content for eDiscovery readiness, and records declaration.
12.2.2 Dynamically analyzing and collecting your content
This section describes two scenarios in which you must use Content Assessment to dynamically analyze and collect your content.
In the first scenario, the dynamic analysis scenario, you must react rapidly to an eDiscovery request for legal compliance. You invest a lot of effort and resources in finding and retrieving the necessary documents. With proper eDiscovery enablement, you can reduce this effort significantly.
You start a content analysis task when needed, based on legal demand. The goal is to reduce eDiscovery costs and risk by performing eDiscovery collection across a broad range of enterprise sources in response to legal demands.
In the second scenario, the dynamic collection scenario, Content Analytics is used to collect content that is not currently being managed in an ECM repository. Content Analytics is also used to search and analyze the content according to an eDiscovery task order, or according to a broad search criteria, and export the content for ingestion. Content Collector is used to ingest the content and make it available for IBM eDiscovery tools, such as IBM eDiscovery Manager and IBM eDiscovery Analyzer.
The clustering and classification techniques of Classification Module are used in a support role to supplement the capabilities of both Content Analytics and Content Collector. The goal is to quickly investigate your data and address the legal requirement by providing the required data.
12.3 Content Assessment workflow
A major effort in planning for compliance readiness with Content Assessment is identifying what is important and has value for your business and disregarding what is unnecessary. Content Assessment helps you to explore your content and assists you in making informed decisions.
Figure 12-2 on page 492 illustrates a typical Content Assessment content decommissioning workflow:
1. Crawl, analyze, and index the content using Content Analytics.
2. Organize the content into meaningful hierarchies by using Classification Module, which can be deployed within Content Analytics.
3. Search and mine content by using the text miner application.
4. Export a resulting subset of documents, including the associated metadata, to a file share.
5. Use Content Collector to collect this exported content, including the associated metadata, into the ECM repository.
6. Dispose of the original content.
Figure 12-2 Typical content decommissioning workflow
12.3.1 Decommissioning content
This section describes the procedures for content decommissioning by using Content Assessment. Content Analytics is used to gather and analyze content and to export the content with real business value. Content Collector is used to ingest the content that is exported by Content Analytics into an ECM repository. Classification Module is used to categorize the content for Content Analytics, providing additional facets for analytics. Classification Module is also used to categorize content for Content Collector, so that the content can be targeted to appropriate locations within the ECM repository or assigned additional metadata. Next, the unnecessary content is decommissioned.
Crawling, parsing, and indexing content with Content Assessment
To decommission content with Content Assessment, you must first crawl, parse, and index the content by using Content Analytics:
1. Identify the content sources that you want to explore. In this scenario for the Company, all documents are stored in the file system.
2. With the Content Analytics component of Content Assessment, crawl the content sources. Take a small sample of the content in the initial exploration step:
a. Create a text analytics collection. In the collection name field, type Fictitious Software Company A.
b. Configure the crawler. In this scenario, we configure a “Windows file system” crawler to crawl a sample of the Company’s document stores on different file shares. Point the file crawler to the root location because Content Analytics will recursively crawl the subfolders. You can control the depth of the recursion. Figure 12-3 shows the Crawler Configuration Summary window.
Figure 12-3 Content Analytics Crawling Configuration Summary window
c. Enable the terms of interest feature when creating the text analytics collection from the text analyzable content.
With the terms of interest functionality, you can identify relationships between nouns and nearby verbs and adverbs in the text. To see how to enable the terms of interest function, see Figure 12-1 on page 489. For details about terms of interest, see 8.1, “The power of dictionary-driven analytics” on page 322, and 8.2, “Terms of interest” on page 326. Also, go to the IBM Content Analytics Information Center at the following address, and search on terms of interest:
 
Export capability: If you plan to export, enable the export capability, and specify the full path for the destination folder. For more information about export, see Chapter 10, “Importing CSV files, exporting data, and performing deep inspection” on page 387.
3. Configure the Parse and Index options of Content Analytics.
Before starting the crawler, set up Content Analytics to add some of the file system metadata as facets so that they can be used in the analysis tools. For Content Assessment, use the Named Entity Recognition annotator in the parsing pipeline.
To configure the Parse and Index options, follow these steps:
a. Go to the Parse and Index tab of the Administration application.
b. Turn on the Named Entity Recognition annotator if you think your data contains proper names, company names, or places. You can first inspect a small amount of your data to help you understand whether running this particular analysis can be beneficial. See 11.1.4, “Named Entity Recognition annotator” on page 452.
c. Configure the Search fields. Go to the Field Definition page, and configure the existing fields or create new ones.
d. Configure facets, and map them with the corresponding search fields. In this scenario, we add four facets and map each of them to a search field. In the Add a facet section (Figure 12-4), complete these steps:
i. For Facet path, enter DocInfo.
ii. For Facet name, enter Document Characteristics.
iii. Select Visible in the text miner.
iv. Click Add.
Figure 12-4 Adding the Document Characteristics facet
v. Add four facets under the Document Characteristics facet: Size, Extension, Filename, and Directory (Figure 12-5). For information about how to add facets, see “Creating facets and mapping search fields to facets” on page 106.
Figure 12-5 Facet tree with four document characteristics
e. Map the facets and the search fields as follows:
 • Size (facet) to filesize (search field)
 • Extension (facet) to extension (search field)
 • Filename (facet) to filename (search field)
 • Directory (facet) to directory (search field)
4. Run the crawler, parser, and indexer.
Collecting content is the first step for content decommissioning and dynamic collection. For dynamic collection, you might want to set additional filters in the crawlers to look for specific types of content. Usually, the gathering of content for both use cases is the same.
5. Explore the data by using the text miner application of Content Analytics. See Chapter 5, “Text miner application: Basic features” on page 143, for information about how to use the text miner application.
Organizing content with Classification Module
You can use the Classification Module annotator to supplement the native capabilities of Content Analytics by providing additional metadata for analytics and adding new facets. For this scenario, we use the following Classification Module knowledge bases and decision plans to initiate Content Assessment discovery:
Content Assessment decision plan (Figure 12-6)
Figure 12-6 Decision plan in Classification Module
Knowledge bases:
 – CodeKB
 – EnterpriseKB
 – Products
 – Legal Concepts
 – SentimentKB
Figure 12-7 shows the knowledge bases in Classification Module.
Figure 12-7 Knowledge bases in Classification Module
 
Goal: From the content, discover facts about the existing products that must be reviewed. For example, analyze the documents to discover legal concepts that are relevant for the Company’s management, and identify personal and corporate information, such as credit card numbers and various dates, within the unstructured content.
To use the Classification Module annotator, follow these steps:
1. Verify that the Classification Module services are running. Make the Classification Module knowledge bases and decision plans available.
2. Enable the Classification Module integration. See Chapter 9, “Content analysis with IBM Classification Module” on page 357.
a. Configure the connection of Content Analytics to Classification Module in the Content Analytics administration console.
i. Click the Parse and Index tab.
ii. Click the Edit mode icon.
iii. Click Configure Classification Module.
iv. For the URL of the Classification Module server field, type the web service URL of your Classification Module, and click Next.
b. For Decision plan, select Content Assessment.
c. For Classification Module fields, select the following fields:
 • ContentCategory
 • IsCandidateForDeletion
Each Classification Module field that you select on this page is mapped to a search field with the same name.
d. Map the Content Analytics facets to the new search fields.
Add new facets to present the text analytics information. For this example, we add patterns, such as Social Security Number (SSN).
e. Add a facet under the Document Characteristics facet called IsDeleteCandidate. The label is “Deletion Candidate.”
When a user selects the Deletion Candidate facet in the text miner application, the system selects documents that have been identified as deletion candidates by Classification Module.
f. Under the Edit a Facet section, map the remaining fields that you added earlier. When you are finished, click OK to go back to the Parse and Index configuration tab.
3. Enable Classification Module in the parsing pipeline for Content Analytics. Click My collection to enable the Classification Module annotator. Click OK.
4. Index your data. Deploy the resources and reindex for the changes to the annotators and dictionaries to take effect.
Searching and mining content using the text miner application
Explore the content by using the text miner application of Content Analytics:
1. Start the Search Engine.
2. Using the text miner application, explore the content by running queries, and navigate through the facets. You can view the data from different aspects, such as entities extracted from the content (for example, people’s names, grammatical phrases, or regular expressions such as SSNs).
Go to the facet navigation pane to select which facets (or entities) you want to analyze. These facets can be extracted directly from the content itself or mapped to categories defined by IBM Classification Module, as described in Chapter 11, “Configuring annotators” on page 449.
When exploring the data in this example, you see the following facets among others:
 – SSN
 – Phone Numbers
 – Credit Card Numbers
For example, in the Facet Navigation pane (Figure 12-8), expand the Document Characteristics facet and expand the document content.
Figure 12-8 Expanding the Document Characteristics facet
You see different credit card values. This information is extracted by the Classification Module decision plan. You can review the respective documents. These documents have business value for the Company in this example. Therefore, we must preserve them. (A simplified sketch of the kinds of patterns behind such facets follows this procedure.)
3. Export both the binary files and the XML documents by using the Export functionality (Figure 12-9), which is explained in Chapter 10, “Importing CSV files, exporting data, and performing deep inspection” on page 387.
Figure 12-9 Options to Export Searched Documents panel
Requirements for exporting data to a network file share: To export data to a network file share that is accessible by Content Collector (read/write permission), you must configure Content Analytics. The Content Collector and FileNet P8 machines must be in an Active Directory domain, and the Content Collector service must run under a domain account.
4. Query your data set, and navigate through the facets to better understand your content.
5. Decide if you need to add more informational entities. Define more facets, and perform the following steps as necessary:
a. Enhance the dictionary for the existing facets.
b. Add more facets and associate them with the new dictionaries.
c. Add more facets and define the new pattern matching rules. See 11.1.5, “Dictionary Lookup and Pattern Matcher annotators” on page 452.
d. Enhance the pattern matching rules.
6. Use Classification Module to apply text classification and advanced metadata extraction.
7. Use the text miner application to identify related concepts. Refine the rules as needed.
8. Inspect a data sample with the Classification Module tools, and build a knowledge base and decision plan that incorporate all the feedback gathered by the subject matter expert analyst.
9. Review the results. Reiterate steps as necessary.
10. When you reveal content with interesting business value that you decide to preserve, export it.
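The pattern-based facets used in this procedure (for example, SSN, Phone Numbers, and Credit Card Numbers) come from pattern matching rules and the Classification Module decision plan. The following sketch only illustrates the kind of expressions involved, using deliberately simplified regular expressions that are assumptions for this example; production rules need validation logic and context checks that this sketch omits.

import re

# Deliberately simplified, illustrative patterns. Real rules add checksum
# validation, country-specific formats, and proximity to keywords.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "Phone Number": re.compile(r"\b\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b"),
    "Credit Card Number": re.compile(r"\b(?:\d[ -]?){15}\d\b"),
}

def extract_pattern_facets(text):
    """Return a mapping of facet name to the values found in the document text."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}

sample = "Call 555-867-5309 about SSN 123-45-6789 and card 4111 1111 1111 1111."
print(extract_pattern_facets(sample))

Running the sketch on the sample string returns one value per facet, which is conceptually what the corresponding facets show in the text miner application.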
Decommissioning the content
In this example, the people who manage the content have decided that one of the web content servers is full of unnecessary or redundant content. The only content that they want to keep is the set of pages that deal with the petrochemical industry. The Company wants to copy this content to a newer server, and then the old content server can be taken out of service or decommissioned.
For this scenario, perform the following steps for content decommissioning:
1. Search the entire crawled content set. Use the IsDelete field to classify any document that is no longer necessary.
The Classification Module was trained with a set of documents that are related to the petrochemical industry. For those documents that are not related to the petrochemical industry, the Classification Module sets the isDelete field of the documents to Yes. The isDelete field from the Classification Module is then mapped to the Deletion Candidate facet in Content Analytics.
By checking on the documents under the Deletion Candidate in the Content Analytics view (Figure 12-10 on page 501), you see a list of documents that are candidates for deletion.
Figure 12-10 The Deletion Candidates facet
2. In the Content Analytics view (Figure 12-11), sort the content by the Deletion Candidate criterion. This particular facet has only one category, Yes, meaning that the documents are not related to the petrochemical industry and are candidates for deletion.
Figure 12-11 Visualizing the results for isDelete documents
3. As shown in Figure 12-12, expand the Category facet, and click Legal. Then click the Facets tab.
Figure 12-12 Viewing documents related to the Legal facet
Notice the four different legal categories. Classification Module has been trained to identify content that belongs to these categories. Therefore, you see a breakdown of all documents relating to the Petrochemical industry that are considered different types of legal documents.
You can use the Export icon (shown in Figure 12-12) to export this information to an external XML file. An application can then read this XML file and move the content associated with it to a different server, making it possible to decommission the existing server. In this task, do not click Export.
12.3.2 Performing dynamic analysis
Dynamic analysis is a major use case of Content Assessment. Dynamic analysis refers to locating content across various repositories for eDiscovery.
Current eDiscovery tools require that content to be analyzed be stored in a content repository such as FileNet P8 or IBM Content Manager. This type of storage can be feasible for email messages. But what if your content is spread across multiple repositories, such as SharePoint, web servers, Lotus Domino, or Documentum? What if the total volume of content is so large that you do not want to replicate it all in your FileNet P8 or IBM Content Manager repository?
 
The Company scenario: Assume that the Company is involved in litigation over the leaking of intellectual property information that occurred in its headquarters in early 2008. The Company’s lawyers have asked you to find all content that might be applicable and to store it in a FileNet P8 repository where they can use eDiscovery Analyzer for further discovery.
To do dynamic collection, follow these steps:
1. Clear any previous queries, and start the text miner application.
2. Click the Facets tab (Figure 12-13). In the Facet Navigation pane, expand the Category facet, and select Legal documents.
Figure 12-13 Reviewing documents categorized under the Legal facet
3. As shown in Figure 12-14, select the content related to Intellectual Property, and then click the Add to search with Boolean AND icon.
Figure 12-14 Choosing documents with the Intellectual Property keywords
4. Find the geographies that might be mentioned in the content for Intellectual Property. In the Facet Navigation pane (Figure 12-15), expand the Named entity facet, and select the Location facet.
Figure 12-15 Viewing documents within the Location facet
5. Focus on the documents related to the Company’s main office. Select NY, and click the Add to search with Boolean AND icon.
6. Click the Time Series tab (Figure 12-16) to see how these documents are distributed by time. Change Time scale from Year to Month.
Figure 12-16 Viewing documents for NY over time
7. Because you only want the content from the first quarter of 2008, draw a box around the bars for the first three months, as shown in Figure 12-17. The bars turn a darker color when they are selected. Click the Add to search with Boolean AND icon.
Figure 12-17 Focusing on documents from first quarter of 2008
Now you have the subset of the content that you want to store in the trusted repository for further analysis by the eDiscovery tools.
8. Click the Export icon.
9. From the drop-down list, select Crawled content and parsed content with analysis results. We want the crawled content so that we can collect it and store it in FileNet P8. We also want the analysis results in case that information can be used to determine details such as record plans or storage locations. Leave the Schedulable radio button set to No.
The Export utility is configured to store the exported content and metadata in the C:\Export directory.
10. Navigate to the C:\Export directory (Figure 12-18) by using Windows Explorer. There are two subdirectories. A new directory with the current date and time has been created to store the exported metadata.
Navigate the directory until you see a list of XML files.
Figure 12-18 Exporting a directory for content and metadata
Figure 12-19 shows an example XML file and the type of data stored within it. The tag called Id contains the path to the original file, which is where Content Collector collects the original file from.
Figure 12-19 Content Analytics exported XML file
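As a concrete illustration of how the exported metadata can be consumed, the following sketch walks the export directory and prints the original file path recorded in each metadata XML file. The element name Id follows the description above; everything else about the XML structure is an assumption for this example, so adjust the lookup to the files that your installation actually produces.

import os
import xml.etree.ElementTree as ET

EXPORT_DIR = r"C:\Export"  # export destination configured in Content Analytics

def original_file_paths(export_dir):
    """Yield the original file path recorded in each exported metadata XML file."""
    for dirpath, _dirnames, filenames in os.walk(export_dir):
        for name in filenames:
            if not name.lower().endswith(".xml"):
                continue
            tree = ET.parse(os.path.join(dirpath, name))
            # Assumed lookup: an element whose local name is Id holds the path
            # to the original file. Namespaces and nesting may differ.
            for element in tree.iter():
                if element.tag.split("}")[-1] == "Id" and element.text:
                    yield element.text.strip()
                    break

if __name__ == "__main__":
    for path in original_file_paths(EXPORT_DIR):
        print(path)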
12.3.3 Preserving and using business data
This section explains how to preserve and use valuable business data. After inspecting the content and identifying the relevant business data, you need to preserve it and make it ready for further exploration and use.
To assess the content and preserve the important data, follow these steps:
1. Search and mine the content:
a. Crawl data from various data sources, which is especially important for the dynamic analysis scenario, by using the crawling function in Content Analytics.
b. Narrow down specific subsets of documents that you want to preserve by using different search queries in Content Analytics.
c. Identify the specific area in the data that you want to preserve and store for further eDiscovery operations by using the text miner application in Content Analytics.
2. Preserve the content by using the Export function in Content Analytics. Choose one of the following export options depending on the content inspection action you took previously:
 – Search result documents, to export the results of the search queries that led to the content that you want to preserve and store in an ECM repository.
 – Analyzed documents to export documents along with their metadata, textual data, and any annotations added during the document processing pipeline. Examples include dictionary-based facets, classification-based facets, and pattern-based facets. The metadata and annotations can be further used when storing the documents in the ECM repository to populate the required field or to take advanced action, such as declaring them as records.
 – Crawled documents, which are documents that have been crawled but not parsed or analyzed. You can export metadata, binary content, or both. The binary document is the format that is stored in the ECM repository. The metadata, along with the analyzed data, is used to populate specific fields in the ECM repository.
You have now used Content Analytics to access your content by using crawlers, and you have viewed and analyzed the content by using the text miner application and Classification Module. After deciding what data is important to preserve, you export the relevant data by using the export capabilities of Content Analytics.
During the process of refining the text miner resources and the Classification Module resources, you can gradually build new dictionaries, rules, decision plans, and knowledge bases with Classification Module. When moving to the next step of using your data, you use Content Collector to collect the content and store it in FileNet P8 or IBM Content Manager (CM8).
The content decommissioning scenario is primarily a one-time activity. However, dynamic collection might need to run continuously. In this scenario, the user has a specific request to collect and preserve files related to a specific case. In this case, after the initial dynamic analysis, the user sets up a Content Collector task route and invokes Classification Module to collect data and store it in the ECM repository while analyzing it along the way. If you refined your Classification Module decision plans and knowledge bases, make them available for the data collection.
Configuring Content Collector to collect exported content
To configure Content Collector to collect exported content, complete the following tasks as explained in the sections that follow:
Preparing resources
Creating a task route in Content Collector for File Systems
Preparing resources
To prepare for the resources, follow these steps:
1. Verify that the Content Collector services are up and running.
2. Use Content Collector for File Systems to upload the content into the FileNet P8 or IBM Content Manager repositories. Content Collector for File Systems, with Classification Module, uploads the documents that are exported by using Content Analytics and organizes them in the ECM content repository based on the recommendations of Classification Module.
3. Use the Classification Module resources, such as the decision plans and knowledge bases that you built and refined previously. See Chapter 9, “Content analysis with IBM Classification Module” on page 357, for information. The knowledge base or decision plan can be derived from the initial resources used during the information gathering performed with the data inspection in Content Analytics. Alternatively, you can build new Classification Module resources as a result of your understanding of the content.
Creating a task route in Content Collector for File Systems
To create a task route in Content Collector for File Systems, follow these steps:
1. Start the Content Collector Configuration Manager (Figure 12-20).
Figure 12-20 Content Collector Configuration Manager
2. Define the file system metadata that is used to help collect content. Use the elements in the XML files (exported from Content Analytics) to direct Content Collector where to find the files it needs to collect:
a. Go to the Navigation panel (lower-left corner of the Content Collector Administration GUI) and select Metadata and Lists (Figure 12-21).
Figure 12-21 Selecting Metadata and Lists
b. In the left panel, select File System Metadata, and click Add to add the new file system metadata (Figure 12-22).
Figure 12-22 Adding the new file system metadata
c. Type a name for the new file system metadata, such as Customer Metadata List for Content Assessment.
d. Change Format Type to XML.
e. Select the XML elements that you want to use. Click the Wizard icon to start the XML file wizard.
f. Follow the Content Collector V2.1.1 documentation for defining the XML fields mapping. Figure 12-23 shows an example of such a mapping. Save your metadata mapping.
Figure 12-23 XML field mapping
3. Create the task route. You can create a task route from scratch or use one of the existing sample task routes as a starting point.
For this example, we follow these steps:
a. From the New Task Route window, select From Task Route file.
b. In the Task Route Name field, type Task Route to Collect Data.
c. From the Template list, select FS to P8 Archiving (Associate Metadata) - Complete as a starting point.
d. From the Detected Dependencies section, select the metadata mapping that you created for the File System Metadata mapping. In this case, for the FileNet P8 4.x connection, we select P8 4.x ICC Connection. If you do not have a valid FileNet P8 4.x connection to select, follow the Content Collector documentation instructions to fix this problem.
Figure 12-24 shows the configuration. Notice that the letters in the figure correspond to the substeps for step 3.
Figure 12-24 Defining a new task route
4. Modify the task route:
a. In the Designer pane, click the FSC Collector icon. This icon represents the part of the file system from which you will collect content.
b. Select the Collection Sources tab, and click Add.
c. Browse to the C:\Export directory, and select Monitor sub-folders. For the Folder depth field, type 4, indicating that the system should monitor subfolders at least four levels deep (Figure 12-25). Click OK.
Figure 12-25 Adding a collection source folder and specifying subfolder depth
5. Define the tasks in the Task Route:
a. Add a decision point.
b. Define the FSC Associate Metadata task.
c. Define a decision point, and connect the FSC Associate Metadata task to the P8 4.x Create Document task (Figure 12-26).
Figure 12-26 The FSC Associate Metadata task
d. Click the part of the link that connects the decision point to the P8 4.x Create Document task. In the Rule definition panel (Figure 12-27), in the Name field, type the name of the rule, Is not an XML file. Under Evaluation Criteria, select Configure Rule, and then click Add. (A conceptual sketch of this routing logic follows this procedure.)
Figure 12-27 Defining the ‘Is not an XML file’ rule
e. In the Edit Condition Clause window (Figure 12-28), complete these steps:
i. In the Metadata Type field, select Custom Metadata List for Content Assessment.
ii. For the Property field, select extension.
iii. For the Operator field, select Not equal.
iv. For the Value field, enter XML.
v. Click OK.
Figure 12-28 Editing the conditional clause for the rule
f. Add the FSC Post Processing task:
i. Select FSC Post Processing from the list of possible FSC Collector tasks in the toolbar.
ii. Drag the task to the far left side of the Designer window.
iii. Right-click the decision point, and select Add Rule. A second link is displayed, called “2. Rule,” that comes from the decision point.
iv. Click the 2. Rule link, and drag it down so that it connects to the FSC Post Processing task that you just dragged to the Designer.
Figure 12-29 shows the changes.
Figure 12-29 Adding the FSC Post Processing task
g. Click the 2. Rule link (shown in Figure 12-29) that connects the decision point to the FSC Post Processing task. From the Rule pane on the right, change Rule name to Is an XML file. Select Configure rule, and click Add.
h. In the Edit Conditional Clause window (Figure 12-30), complete these steps:
i. In the Metadata field, select Custom Metadata List for Content Assessment.
ii. For the Property field, select extension.
iii. For the operator, select Equal.
iv. For the Value field, select Literal, and enter XML.
v. Click OK.
Figure 12-30 Adding the ‘Is XML’ rule
i. Select the FSC Post Processing task that is part of the “Is an XML file” rule, and edit the task:
i. Select Delete File from the Post Processing Options on the right side.
ii. Clear the Replace the file with a shortcut check box.
j. Select the P8 4.x Create Document task that is part of the “Is not an XML file” rule. Then edit the task as shown in the P8 4.x Create Document window (Figure 12-31):
i. Select the Set content retrieval name check box option.
ii. In the Retrieval name metadata mapping field, select FSC Metadata from the left box, and select Metadata file path from the right box.
Figure 12-31 Content retrieval name metadata mapping
k. Store the collected content in a FileNet P8 directory by updating the FileNet P8 4.x File Document in Folder task:
i. Select the FileNet P8 4.x File Document in Folder task. Select the Create folder if does not exists option, and click Add (Figure 12-32).
Figure 12-32 Selecting the folder path to store content
ii. Select Demo Folder, and click OK.
iii. Under File in Folder Options (Figure 12-33), select the sample folder options definition, and click Remove to remove it.
Figure 12-33 Selecting folders to store file content
l. Configure the audit setting for the FSC Post Processing task:
i. Click the Audit Log icon that is connected to the task.
ii. For auditing, select Customer Metadata Lists for Content Assessment Lab, FSC Metadata, and P8 4.x Create Document.
iii. Right-click the FSC Post Processing task that is part of the “Is an XML File” rule (on the left), and select Add Link Out.
iv. Click the newly added link, and connect it to the Audit Log task that is previously defined (Figure 12-34).
Figure 12-34 Linking the Audit Log task to the task route
6. Optional: To complete the Task Route, use Classification Module for additional classification.
Figure 12-35 shows an example of a completed Task Route that you can use for Content Assessment, especially for the dynamic analysis and dynamic collection scenario.
Figure 12-35 Completed task route
When invoking Classification Module, you must map the output fields from the decision plan to the Content Collector metadata fields, as shown in Figure 12-36.
Figure 12-36 Mapping decision plan fields with Content Collector fields
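The task route that you just built implements one simple decision: exported XML metadata files are consumed and then deleted by the FSC Post Processing task, while every other file is created as a document in FileNet P8, carries its associated metadata, and is filed in the target folder. The following sketch is only a conceptual model of that routing under assumed inputs (one metadata record per collected file, and the Demo Folder target used in this example); it is not Content Collector code.

import os

def route_item(file_path, metadata):
    """Conceptual model of the task route: decide what happens to one collected item."""
    _, ext = os.path.splitext(file_path)
    if ext.lower() == ".xml":
        # "Is an XML file" rule: the metadata file is not archived;
        # FSC Post Processing deletes it once its values have been associated.
        return {"action": "post-process delete", "file": file_path}
    # "Is not an XML file" rule: create the document in the FileNet P8 object
    # store with the associated metadata, and file it in the configured folder.
    return {
        "action": "create P8 document",
        "file": file_path,
        "properties": metadata,
        "folder": "Demo Folder",
    }

print(route_item(r"C:\Export\metadata.xml", {}))
print(route_item(r"C:\Export\contract.pdf", {"extension": "pdf"}))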
The next step is to run Content Collector to collect the data in FileNet P8.
Collecting the exported Content Analytics data
The final step is to run the appropriate Content Collector services so that the task route that you just defined collects and stores the data exported by Content Analytics.
Open Services from the Windows toolbar, and start the IBM ICC Task Routing Engine service (Figure 12-37).
Figure 12-37 Running the Content Collector service
You can view the collected data in FileNet P8 by connecting to the repository with FileNet Workplace XT.
12.4 Records management and email archiving
An important part of a compliance ECM project is defining the Records Management framework, which includes the records classes and the file plan that governs them. In general, this project requires a good knowledge of regulatory rules, your company policies, and your content. You use IBM Content Collector integrated with Classification Module to inject the valuable business documents selected by the content assessment process into an ECM system and to declare records where necessary.
As part of the Content Assessment “preservation and organization” step, you define your file plan and the specific rules that govern the Records Declaration process for the content items. During the analysis step of the assessment process, you identify specific content groups (subsets) that you plan to organize in the ECM repository. At this stage, you use two components of Content Assessment, Content Collector and Classification Module, to proceed with Records Management, as explained in the following steps:
1. Build a task route in Content Collector to implement the rules that you have defined for your Records Management.
2. Define a Classification Module decision plan with logic that relies on the content. As part of the Content Assessment phase, you have already identified and exported a sample group of documents to illustrate the different categories of content:
a. Export a sample set of documents for each category to a separate subfolder. Use this folder structure to build a Classification Module knowledge base that is to be used as part of the decision process.
b. Build a Classification Module decision plan to incorporate the business logic that you decided to use to declare records. You define a series of meaningful fields to make the records declaration decision.
3. By using the Classification Module task, map the Classification Module result fields that contain the content-based decisions to the Content Collector metadata fields.
4. By using the Records Declaration task, execute the declaration of records at document injection time.
 
Email archiving: For email archiving, you can use a similar workflow, assuming that you have Content Collector for Email. The Content Assessment focuses on crawling the email messages and analyzing them by using Content Analytics and Classification Module. After the assessment process is finalized, you can build the Classification Module decision plan, knowledge base, and the Content Collector task route in Content Collector for Email.
12.5 Preferred practices
Content Assessment is a complex task. Businesses are faced with a constant explosion of unstructured content, and companies frequently run into barriers because of the significant burden that manual assessment places on users. The text analysis function of Content Assessment gives you the tools to efficiently investigate the content and make informed decisions. The text miner application of Content Analytics (bundled with Content Assessment) plays a major role in this activity.
Consider the following tips to help you cope with this complex task:
Use the Content Analytics component of Content Assessment to crawl the content sources. Take a small (if possible, random) sample set of content in the initial exploration step so that you can get a quick view of the nature of your data. Follow the iterative process provided in 12.3, “Content Assessment workflow” on page 491.
Inspect the data and try to determine the easy targets:
 – Understand the distribution of the content based on the file extension.
 – See if the file names have a pattern that can be used to further organize the content.
 – Activate Named Entity Recognition, and explore the facets it populates to understand if interesting business information can be inferred.
 – Activate terms of interest to identify relationships between nouns and nearby verbs and adverbs that exist in the content.
 – Activate duplicate document detection to identify documents that might contain the same or nearly the same content. Documents that have duplicates are good candidates for decommissioning and removal from your data set. (A simplified sketch of exact-duplicate detection follows this list.)
 – Engage Document Clustering to easily obtain immediate insight into your content.
For initial inspection, use Content Analytics and the Classification Module annotator with the default knowledge bases and decision plans that are deployed, such as “Personal versus Business Content” or “Content Assessment.”
Inspect the exported sample data with the additional text miner tools provided by Classification Module. For example, use Taxonomy Proposer for Clustering with Classification Workbench to build your decision plan and train your knowledge bases according to the specific data discovered.
Export a part of the data, and explore it in the Classification Module Workbench. As a result, build the Classification Module knowledge bases and decision plans by using Classification Module Workbench.
In the Content Collector and Classification Module integration, use the new Classification Module knowledge base and decision plan that is tailored toward your compliance project business needs. The knowledge base and decision plan are for content that will be organized, enhanced with metadata, managed under Records Management according to the compliance policies of your company, and ready for eDiscovery.
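As a rough aid to understanding what duplicate document detection flags, the following sketch groups files by a content hash, which finds exact duplicates only. The product’s detection also covers nearly identical content and works on the indexed text rather than raw files, so treat this as a simplified illustration under those assumptions, not a replacement for the built-in feature. The file share path is an example.

import hashlib
import os
from collections import defaultdict

def find_exact_duplicates(root_dir):
    """Group files under root_dir by SHA-256 content hash. Groups with more than
    one member are exact duplicates and natural decommissioning candidates."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            by_hash[digest.hexdigest()].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

if __name__ == "__main__":
    for digest, paths in find_exact_duplicates(r"\\fileserver\share").items():
        print(digest[:12], len(paths), "copies:", paths)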
12.6 Summary
Information chaos creates many challenges, from expensive lawsuits to serious difficulties in managing your business efficiently. Content Assessment assists you in discovering valuable business information that is buried beneath irrelevant, obsolete, and duplicate content. It also helps you preserve and prepare your content for efficient eDiscovery.