Chapter 6. Text miner application: Views

Chapter 5, “Text miner application: Basic features” on page 143, provides information about the basic features of the text mining application, with a specific focus on the search and discovery features. This chapter focuses on the text miner application views, their features, and their functions.

Specifically, this chapter includes the following sections:

•Views

•Documents view

•Facets view

•Time Series view

•Trends view

•Deviations view

•Facet Pairs view

•Connections view

If you are already familiar with the user interface of the text miner application, including its views and search and discovery features, proceed to Chapter 7, “Performing content analysis” on page 279.

6.1 Views

The text miner application provides the following views to assist in text mining. These views (shown in Figure 6-1) are generated dynamically based on your query and facet selections.

Documents view Shows a list of documents that match your query.

Facets view Shows a list of keywords for a selected facet.

Time Series view Shows the frequency change over time.

Trends view Shows sharp and unexpected increases in frequency over time.

Deviations view Shows the deviation of keywords for a given time period.

Facet Pairs view Shows the correlation of keywords from two selected facets.

Connections view Shows the correlation of keywords from two selected facets in a graphical way.

Dashboard view Shows a configured dashboard layout with one or more graphs or tables in a single view.

Figure 6-1 All views available in the text miner application

For information about the general user interface of the text miner application, see 5.1, “Overview of the text miner application” on page 144. For information about the search and discovery features of the text miner application, see 5.2, “Search and discovery features” on page 152.

6.2 Documents view

The Documents view shows a list of documents that match your query or facet selections. You use this view when you want to see the contents of an individual document. By default, ten results are displayed per page in the Documents view. You can click the page buttons to navigate between results pages or to move to a specific page.

Each entry in the document list contains the following information:

•A dynamic summary of the document based on your query terms

•The source and date of the document

•A thumbnail of the document (if configured)

•A document link displayed as the title, which, when clicked, shows the original document from its crawled source

Figure 6-2 shows the Documents view with the sample documents. The default search condition is *:*.

Figure 6-2 Documents view showing the results based on the default search condition

6.2.1 Understanding the Documents view

In the Documents view, you can see the following information:

•The total number of documents that match your query. In the example shown in Figure 6-2 on page 220, 852 documents are returned.

•The query that is used to produce the result. The query is hidden when the search box is hidden.

•Selected fields such as Source, Date, Title, and Thumbnails. The fields displayed as columns are based on your configuration settings on the Results Columns tab under Preferences.

•The detail of each document when you click the Show detailed properties icon.

•The document content when you click the link created in the Title field.

6.2.2 Viewing the document contents and facets

When you click the source icon, the Document Analysis window opens on top of the Documents view. This window contains details about your document. It shows the individual field values for the document and all annotations made to the document during text analysis. In the Document Analysis window (Figure 6-3), the Analytics Facet is listed in the left pane, and the Metadata Facet is shown in the right pane.

Figure 6-3 Document Analysis window

When you select the Analytics Facet, the corresponding keywords are highlighted in green in the structured and unstructured fields of your document where the values occur. For example, we select the Product analytics facet called “milk chocolate”, as shown in Figure 6-3 on page 221. As a result, the keyword “milk chocolate” is highlighted in green within the Metadata Facet pane. This function helps you to understand the data that determined the facet.

6.2.3 When to use the Documents view

The Documents view is useful when you want to see the details of the individual document after investigation in other views, such as the Trends view, or after searching the collection. By using the Preferences configuration, you can control how the search result is displayed in the Documents view.

6.3 Facets view

The Facets view shows a list of keywords that are displayed for a selected facet. The frequency count and correlation value accompany each keyword. This view is useful for seeing the keywords that make up a given facet in your data.

In the Facets view, frequency and correlation are shown as a bar chart. Each column is described as follows:

Keyword The entity that is associated with the selected facet. It can be a word, a pattern of text, or a fielded value.

Frequency Indicates the number of documents found that contain the given keyword.

Correlation Indicates how the keyword is interrelated to the documents that are matched by your query.

You can sort the output by frequency or correlation.

Default sort order in the Facets view: By default, the keywords in the Facets view are sorted by frequency in descending order. You can modify the default value by using the Preferences window.

Important: Even if you select correlation as the default sort order, the list of keywords that is displayed is chosen by frequency. Therefore, if you leave the default of 100 keywords, you get only the 100 most frequent keywords. This means that sorting by correlation does not include values that are more highly correlated but not in the top 100 by frequency. This method of sorting is by design. The rationale is that, if the frequency of the keyword is low, the result is not of interest.
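The effect described in the Important note can be illustrated with a minimal Python sketch. The keyword statistics, the `displayed_keywords` function, and the limit of 4 are all hypothetical and chosen only to keep the example small; the product's own implementation is not shown in this book.

```python
# Hypothetical keyword statistics: (keyword, frequency, correlation).
# The values are invented for illustration only.
keywords = [
    ("orange juice", 86, 1.1),
    ("apple juice", 49, 0.9),
    ("milk chocolate", 40, 1.4),
    ("pine juice", 12, 2.0),
    ("lemon tea", 3, 7.5),  # rare, but highly correlated
]


def displayed_keywords(stats, limit, sort_by="frequency"):
    """Choose the `limit` most frequent keywords, then order them for display."""
    # Step 1: the candidate list is always chosen by frequency.
    top = sorted(stats, key=lambda k: k[1], reverse=True)[:limit]
    # Step 2: the chosen sort order is applied only to those candidates.
    column = 1 if sort_by == "frequency" else 2
    return sorted(top, key=lambda k: k[column], reverse=True)


shown = [name for name, _, _ in displayed_keywords(keywords, 4, "correlation")]
print(shown)
```

In this sketch, "lemon tea" has the highest correlation but never appears, because it is not among the most frequent keywords that are selected in the first step.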

Figure 6-4 shows the view when you select the Product facet from the Facet Navigation pane. You can select any facet that you configured.

Figure 6-4 Facets view with the Product facet selected

You can limit the scope of your analysis to one or more keywords by selecting them using their corresponding check boxes and adding them to the query. When you select a keyword, the Boolean search operator (AND, OR, and AND NOT) icons are highlighted and become active, as shown in Figure 6-5.

Figure 6-5 Facets view with the search operators highlighted when a keyword is selected

Search operators: See 5.2.1, “Limiting the scope of your analysis using facets” on page 153, for details about the search operators.

After you click a specific Boolean search operator, the view is updated with the new search results and the result counts. At any time, you can go back to the Documents view to see the content of the documents that match the given query.
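As a rough mental model, the three Boolean operators behave like set operations on the documents that match the current query. The following Python sketch uses hypothetical document IDs and is not how the text miner application builds its queries internally.

```python
# Hypothetical document IDs; they are not taken from the sample collection.
current_result = {1, 2, 3, 4, 5, 6}  # documents that match the current query
keyword_docs = {2, 4, 6, 8}          # documents that contain the selected keyword

# AND: keep only documents that also contain the keyword.
narrowed = current_result & keyword_docs
# OR: add the documents that contain the keyword.
widened = current_result | keyword_docs
# AND NOT: drop the documents that contain the keyword.
excluded = current_result - keyword_docs

print(sorted(narrowed), sorted(widened), sorted(excluded))
```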

6.3.1 Understanding the Facets view

To effectively use the Facets view, you must understand the difference between frequency and correlation, which are described in 1.3, “Important concepts and terminology” on page 9. As a review, the frequency value counts the number of documents that contain the keyword. The correlation value measures how unique that high frequency is as compared to other documents that match your query.

Although the frequency value is useful, it might not always be as revealing as the correlation value. For example, high frequency counts for a particular model of car can be attributed to the overall popularity of the car: More cars of that model are sold than any other model. A high correlation value means that something is unusual, and further investigation and analysis are required.

When your query is set to the initial wildcard expression *:*, all correlation values are set to 1.0. The reason is that all of the documents in the collection are being compared to themselves, resulting in a correlation value of 1.0. After you add a keyword to your query or enter a search expression, the correlation values are recalculated and reflect the degree of uniqueness of this keyword to the other documents in the corpus that match the query.
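To see why a query that matches every document yields 1.0, consider a simplified stand-in for the correlation value: the density of a keyword in the query result divided by its density in the whole collection. This ratio is an assumption made for illustration; the statistic that the product actually computes might differ. The figures 852 and 49 come from the sample collection used in this chapter, and the narrower query (100 hits, 30 with the keyword) is hypothetical.

```python
def correlation_ratio(hits_with_keyword, hits_total,
                      corpus_with_keyword, corpus_total):
    """Keyword density in the query result divided by its density in the corpus.

    A simplified, illustrative measure; not the product's documented formula.
    """
    density_in_result = hits_with_keyword / hits_total
    density_in_corpus = corpus_with_keyword / corpus_total
    return density_in_result / density_in_corpus


# The query *:* returns the whole collection, so both densities are identical.
all_docs = correlation_ratio(49, 852, 49, 852)

# Hypothetical narrower query: 100 hits, 30 of which contain the keyword.
narrowed = correlation_ratio(30, 100, 49, 852)

print(all_docs, round(narrowed, 2))
```

With this measure, a value above 1.0 means that the keyword is denser in the query result than in the collection as a whole.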

6.3.2 Using the Facets view

You use the Facets view when you want to see a list of keywords that are associated with a given facet. By default, the Part of Speech (POS) facet and Phrase Constituent facet are defined and populated automatically by IBM Content Analytics. Consequently, you can see various verbs, nouns, adjectives, adverbs, and phrases that are used in your text. You can use these words as keywords for dictionary and pattern rules. You can also use these words as facets to help you drill down further on a set of documents for analysis.

6.4 Time Series view

The Time Series view shows the change in frequency over time. Correlation and deviation values are not presented in this view. This view is primarily used to analyze frequency and to select a range of documents for analysis for a given time period.

The Time Series view always shows the frequency distribution for the documents that match your query for a given period of time. For example, consider a situation where you select vanilla ice cream for the Product facet and add the keyword with the AND operator. In this case, you can see how the vanilla ice cream documents are distributed across the selected time scale (year, month, day) along with their computed frequencies.

In another example, the Time Series view in Figure 6-6 shows the frequency distribution when you select the Product facet and select Month for Time scale.

Figure 6-6 Time Series view showing results sorted by month

6.4.1 Features in the Time Series view

With the flexibility of the Time Series view, you can change the time scale (year, month, and day), use the Zoom in and Zoom out features, and focus on specific date ranges by using the Date facet. This section provides details about each of the features of this view.

Changing the time scale

You can change the time scale of the graph when you analyze the data. From the drop-down menu, you can select the year, month, day, month of year, day of month, and day of week.

Each time scale calculates the sum of all documents for that particular time unit. For example, if you select month as the time scale, you see a bar for each month in the range of months as bounded by the documents that match your query. If your search result set spans two years, 24 months are shown.

If you select month of year, you see 12 bars; if you select day of month, you see 31 bars; and if you select day of week, you see 7 bars. For each result, you see the sum of all documents that fall on that particular date increment. For example, if you select the day of week time scale, you see seven bars, with the first bar representing the total of all documents in the result set that fall on Sunday. This feature conveniently shows the days of the week, month, or months of the year in which the most documents (or events) occur.
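The difference between a linear time scale (such as month) and a cyclic one (such as month of year or day of week) can be sketched in Python. The document dates and the `bucket` function are hypothetical; the sketch only illustrates how documents might be grouped for each time scale.

```python
from collections import Counter
from datetime import date

# Hypothetical document dates, invented for illustration.
doc_dates = [
    date(2008, 11, 2), date(2008, 11, 3), date(2008, 12, 7),
    date(2009, 1, 4), date(2009, 11, 15),
]


def bucket(dates, scale):
    """Count documents per time unit for the chosen time scale."""
    if scale == "month":
        # Linear scale: one bar per calendar month in the result range.
        return Counter(d.strftime("%Y-%m") for d in dates)
    if scale == "month of year":
        # Cyclic scale: at most 12 bars, summed across all years.
        return Counter(d.month for d in dates)
    if scale == "day of week":
        # isoweekday() gives Monday=1 ... Sunday=7; "% 7" maps Sunday to 0
        # so that Sunday is the first bar, as described in the text.
        return Counter(d.isoweekday() % 7 for d in dates)
    raise ValueError(f"unsupported time scale: {scale}")


print(bucket(doc_dates, "month"))
print(bucket(doc_dates, "month of year"))
print(bucket(doc_dates, "day of week"))
```

Notice that November 2008 and November 2009 are separate bars on the month scale but are summed into a single bar on the month of year scale.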

Zooming in and out

The Zoom in and Zoom out feature is especially useful when the bar chart becomes too busy for you to distinguish the exact values. As shown in Figure 6-7, you can select an area that you want to look into by dragging over that area. After you select the area, click the Zoom in icon.

Figure 6-7 Time Series view showing the selected area to zoom in

Figure 6-8 shows the result of zooming in. When you click the Zoom out icon (highlighted in Figure 6-8), the view goes back to the original graph (Figure 6-7 on page 228).

Figure 6-8 Time Series view showing the results of zooming in on the selected area

Changing the Date facet

When you configure the Date facet to contain multiple date fields, you can select a different field for the Date facet to analyze the data in the Time Series, Trends, or Deviations views. When you change the field for the Date facet, the time scale for the graph is automatically updated to use the new date field.

For example, after the Time Series view is displayed based on the reported date of the document, you might want to see the same data in the Time Series view based on the date the incident occurred to give you another analysis perspective. Figure 6-9 shows selecting the Date facet value in the Time Series view.

Figure 6-9 Selecting the Date facet in the Time Series view

Configuring multiple date facets: See “Optional: Configuring the date facet” on page 113 to configure the date facet.

6.4.2 Understanding the Time Series view

The Time Series view shows the distribution of documents that match your query over a period of time. The y-axis shows the number of documents (frequency), and the x-axis shows the time scale that you selected.

When you hover the mouse cursor over a particular bar in the chart, a pop-up window opens that details the data for that unit, namely the frequency and specific date value.

6.4.3 Using the Time Series view

The Time Series view helps you to see how the frequency of your documents changes over time and on specific months or days. The distribution always represents the documents that match your current query. By building separate queries for different combinations of facets (or search expressions), you can start to compare their frequencies of occurrence. This view also helps you to identify a frequency trend such as sudden increases for the selected facet.

Frequency counts alone, while useful, might not be that revealing, as discussed earlier. The Trends and Deviations views take time into account and are better suited to spotting anomalies in your data. The Time Series view is more useful for selecting a specific date or a range of dates to limit the scope of your analysis.

6.5 Trends view

The Trends view shows sharp and unexpected increases in frequency of a facet over time. The Trends view is similar to the Time Series view in that it shows the frequency distribution of documents as a bar graph across a given time frame. You can change the time scale to year, month, or day. It also provides the same zoom in and zoom out functionality as the Time Series view. However, the Trends view has significant differences in helping you to gain additional insight from your data.

First, the Trends view requires the selection of a facet from the Facet Navigation pane on the left side. You do not need to add the facet to your query. After you select a facet, the Trends view shows a list of individual bar graphs, one for each keyword of the selected facet. Each individual bar graph behaves similarly to the Time Series graph, but also highlights trends in your data that deviate from the normal distribution.

You can use this view to analyze future predictions of sharp increases in frequency. Figure 6-10 shows the Trends view when you select the Product facet and default date facet with the month time scale.

Figure 6-10 Trends view with the Product facet, sorted by high frequency and month

6.5.1 Features in the Trends view

The following features are the same as those features described in “Features in the Time Series view” on page 227:

•Changing the time scale

•Zooming in and out

•Changing the Date facet

The time scale options do not include the ability to select the month of the year, day of the month, or day of the week. The Trends view also includes the features in the following sections.

Changing the Charts per page indicator

You can change the number of charts per page by sliding the bar as highlighted in Figure 6-11. This feature is helpful when you want to view and compare multiple charts at a time on a page or focus on only one chart. The size of the chart varies depending on the number of charts viewed per page.

Figure 6-11 Trends view showing the Charts per page indicator

Showing selected charts or showing all charts

Each individual bar chart comes with its own selection check box. The Show selected charts icon (highlighted in Figure 6-12 on page 234), when clicked, reduces the view of charts to only those charts that are selected. For example, you might have a total of 48 charts, and eight charts are shown per page on six pages. You are only interested in six charts scattered across various pages. When you select the charts that you are interested in by using their check boxes and click the Show selected charts icon, only the six selected charts are shown on a single page for comparison.

You can revert to the original chart view by clicking the Show all charts icon (also highlighted in Figure 6-12).

Figure 6-12 Trends view showing the Show selected charts and Show all charts icons

Combining selected charts or showing separate charts

You can combine multiple selected charts into one chart by clicking the Combine selected charts icon (highlighted in Figure 6-13). This function aggregates all the charts into a single chart with each keyword given a different color.

You can revert to the original chart view by clicking the Show separate charts icon (highlighted in Figure 6-13).

Figure 6-13 Trends view showing the Combine selected charts and Show separate charts icons

Filtering the result by keyword

When you type a keyword in the Filter field box, the charts whose keywords contain the input filter are displayed in the view. The view is updated dynamically, so that you can see the filter result immediately.

6.5.2 Sort criteria

You can select from the following sort criteria to assist in the investigation of your data:

•Highest frequency (default)

This criterion lists, in descending order, those graphs with the highest frequency counts. The keyword graph at the top of the list contains the time period with the greatest frequency count. This method offers a quick way to see which keywords have the most occurrences. Remember that high frequency counts are not always informative; they are often simply expected. For example, when looking at the sales of snow shovels, you expect a high number to be sold during the winter months.

•Highest index

This criterion lists, in descending order, those graphs with the highest deviations from their expected averages within the current time frame selected. This method offers a quick way to see which keywords are operating out of the norm the most.

For example, you might have a facet that tracks the frequency of car part failures. The highest index criterion lists, at the top of the list, the car part with the highest deviation (index) from its expected trending average, followed by the other car parts in descending order of their deviations. The highest index can occur anywhere in the time period selected, as opposed to the latest index, which focuses only on the last time increment in the series.

•Latest index

This criterion is similar to the highest index criterion but only uses the latest time unit (year, month, or day) to determine which keywords have the highest index. This method is a quick way to see which keywords most recently are operating out of the norm.

•Name

This criterion alphabetically sorts the graphs by name in either ascending or descending order. This method is a quick way to find particular keywords that you are interested in.
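The four sort criteria can be summarized in a short Python sketch. The per-keyword series and the `sort_charts` function are hypothetical; each series holds one (frequency, index) pair per time unit, oldest first, with invented values.

```python
# Hypothetical data: one (frequency, index) pair per time unit, oldest first.
charts = {
    "apple juice": [(5, 0.4), (12, 1.2), (6, 0.8)],
    "pine juice": [(1, 0.2), (9, 3.5), (2, 0.3)],
    "orange juice": [(4, 0.5), (4, 0.6), (8, 2.1)],
}


def highest_frequency(series):
    # Greatest frequency count found in any time unit.
    return max(freq for freq, _ in series)


def highest_index(series):
    # Greatest index found anywhere in the selected time period.
    return max(index for _, index in series)


def latest_index(series):
    # Index of the last time unit only.
    return series[-1][1]


def sort_charts(charts, criterion, descending=False):
    """Return keyword names ordered by the chosen sort criterion."""
    if criterion == "name":
        return sorted(charts, reverse=descending)
    score = {
        "highest frequency": highest_frequency,
        "highest index": highest_index,
        "latest index": latest_index,
    }[criterion]
    # The score-based criteria always list the largest value first.
    return sorted(charts, key=lambda name: score(charts[name]), reverse=True)


for criterion in ("highest frequency", "highest index", "latest index", "name"):
    print(criterion, sort_charts(charts, criterion))
```

In this sample, pine juice leads the highest index ordering because of its spike in the middle of the series, but it is last in the latest index ordering because its most recent index is low.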

6.5.3 Understanding the Trends view

In the Trends view (Figure 6-14), the frequency of the selected time period is shown as a bar chart. It is scaled accordingly from 0 to the maximum frequency count along the vertical y-axis on the left side.

The y-axis on the right side shows a scale for the increase indicator. The increase indicator measures the ratio of the frequency for a given time interval to the expected average frequency, which is calculated from the changes in the frequencies of the past time intervals. This expected change in frequency is estimated by using a modified Poisson distribution. The increase indicator is shown as a blue line graph in the chart, as shown in Figure 6-14.

A bar is shown in color (highlighted) when the frequency is higher than expected by more than a reasonable margin. This result means that the actual value is greater than the estimated value by a certain amount. Figure 6-14 shows a brighter orange color for December 2008. As you can see from the blue line in the graph, the increase indicator shows a sudden jump, which means that you might want to conduct additional investigation and analysis for that time period.

Figure 6-14 Trends view: Pine juice with frequency count and increase indicator

Calculating the index in the view

How is the increase indicator that is shown as a blue line in the chart calculated? To calculate the index (that is, the increase indicator), you must calculate the average global frequency, which is the average frequency of all searched documents over the given time period. You must also calculate the average keyword frequency over the same period.

In simple terms, to calculate an average, you add all the frequencies together and divide by the number of dates in the series. Because this calculation is too simplistic, you also want to account for the decay factor. Decay means that you want the past to become decreasingly relevant the more distant it gets. That is, each frequency that is calculated is weighted according to a decay constant. It is not a matter of just adding all the frequencies together, with each value contributing equally to the average calculation. The decay factor is applied to each frequency first. Thus, if the decay value is 0.85 (the default setting for the increase indicator in the text miner application), the frequency of the (n-4)th date contributes only about half as much to the calculation as the frequency of the nth date (0.85 raised to the fourth power is approximately 0.52). Both the global and the keyword time series average frequencies are weighted this way.
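The decay weighting can be sketched in Python as follows. The `decayed_average` function is hypothetical, and dividing by the sum of the weights is an assumption made so that the result stays on the same scale as the input frequencies; the text states only that each frequency is weighted by the decay constant.

```python
def decayed_average(frequencies, decay=0.85):
    """Average in which older values count less.

    The newest value (last in the list) has weight 1, and each step back
    in time multiplies the weight by `decay`. Normalizing by the sum of
    the weights is an assumption of this sketch.
    """
    n = len(frequencies)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * f for w, f in zip(weights, frequencies)) / sum(weights)


# Weight of the value four time units before the newest one.
weight_four_steps_back = 0.85 ** 4

# A recent spike pulls the decayed average above the simple average.
series = [0, 0, 0, 0, 10]
simple = sum(series) / len(series)
decayed = decayed_average(series)

print(round(weight_four_steps_back, 3), simple, round(decayed, 2))
```

Because the weight four steps back is roughly 0.52, a spike in the most recent interval influences the average about twice as much as the same spike four intervals earlier.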

With the weighted average frequency time series, you can accurately estimate the frequency count for a future date. This estimate assumes that the time series is constant, and it factors in the variation within all searched documents by using the weighted global average frequency. With this estimate, you can calculate the increase index, that is, the degree to which the frequency of this keyword has increased or will increase for a particular date.

Given this description of how the expected frequency index is calculated, it is obvious why the first four time intervals of all your graphs show no change and are flat with no highlighted colors showing. The algorithm needs the first four intervals to start the calculation of the Poisson distribution.

6.5.4 When to use the Trends view

With the Trends view, you can detect sharp and unexpected increases in the frequency of a given keyword. Usually a sharp increase indicates that you need to do more investigation. If the increase indicator index is higher than the threshold, the bar chart is in color, so that it is easier for you to discover the anomaly.

6.6 Deviations view

The Deviations view shows the deviation of keywords for a given time period. The Deviations view is similar to the Trends view in that it requires the selection of a facet and shows the corresponding individual graphs for each keyword. The controls across the top function the same as in the Trends view, with one exception: three additional selections are available from the time scale pull-down menu, namely the month of year, day of month, and day of week. These additional selections provide greater insight into cyclic changes in your data.

The greatest difference between the two views is what the graphs are trying to convey when certain bars are highlighted, indicating something of interest. In particular, the Trends view alerts you when a keyword is trending up or down by an unexpected amount, and the expected amount is calculated based on the past history of frequency changes. This view is more focused on the trending of frequency counts over time.

The Deviations view is focused on how much the frequency of a given keyword deviates from the expected average for the given time period (not previous periods). The expected average takes into account all the averages of the other frequency counts for the given time period. The Deviations view is useful for identifying patterns that occur cyclically and alerts you when those cyclic patterns have an unexpected change. You can use this view to show seasonal patterns in your data or patterns that occur on a monthly or weekly basis.

Figure 6-15 shows the Deviations view when you select the Product facet with Time scale set to Month, the Date facet set to the default of date, and Sort set to High frequency.

Figure 6-15 Deviations view showing sorting by high frequency and month

6.6.1 Features in the Deviations view

The following features are functionally the same as the features for the Trends view:

•Changing the Charts per page indicator

•Showing selected charts or showing all charts

•Combining selected charts or showing separate charts

•Zooming in and out

•Changing the Date facet

•Sort criteria

– High frequency

– High index

– Latest index

– Name (ascending)

– Name (descending)

See 6.5, “Trends view” on page 231, for a detailed description of their functionality. Notice that the time scale feature in the Deviations view is identical to that of the Time Series view. As in the Time Series view, this time scale includes options for selecting the month of year, day of month, and day of week. You select these options when you want to see seasonal changes or monthly and weekly changes.

6.6.2 Understanding the Deviations view

In the Deviations view (Figure 6-16 on page 241), the frequency of the selected time period is shown as a bar chart and measured at the y-axis on the left side. The deviation index score is measured at the y-axis on the right side. The deviation index score is the standardized residual. It is referred to as index in the chart.

The deviation index score indicates how the actual value deviates from the expected value for a given time frame. The blue line in the chart shows the deviation index scores. The bar chart is displayed in color if the deviation index score is higher than the threshold, which indicates that the actual value is greater than the expected value.

For example, we select the Product facet and filter with the keyword “apple,” choose High index for Sort, and select Month for Time scale, as shown in Figure 6-16.

Figure 6-16 Deviations view showing the frequency count and deviation index

In the graph, the bar for 2008-11 (November 2008) for apple juice (bottle) is highlighted in orange, which indicates that the index is relatively high. Also, notice the yellow highlighted bars for 2008-12 for apple juice (bottle) and, in the graph below it, for apple juice in the months 2008-05, 2008-06, and 2008-11.

Also notice that the bar for apple juice (bottle) in 2008-11 is highlighted in orange with a frequency of 4. This frequency is less than the frequency of 7 for apple juice in the same month, 2008-11, which is highlighted in yellow. This result occurs because the deviation index score for apple juice (bottle) is higher than for apple juice and warrants the stronger orange color.

How is the deviation index score calculated? It is calculated by using the frequency counts of each keyword in the given time period and the total number of frequency counts during the given time period.

If you go back to the Time Series view, select the Product facet, and select Month as the time scale, you see the following results:

•The total number of documents (all keywords) for 2008-11 is 78 from the Time series view.

•The total frequency count (for the entire time period) for apple juice (bottle) is 12 from the Deviations view.

•The total frequency count (for the entire time period) for apple juice is 49 from the Deviations view.

•The frequency of 2008-11 for apple juice (bottle) is 4 from the Deviations view.

•The frequency of 2008-11 for apple juice is 7 from the Deviations view.

The expected value for apple juice (bottle) and apple juice is calculated based on those values. It indicates which value is expected for the frequency of the keyword statistically. That is, the expected value multiplies the actual frequency of the selected keyword by the ratio of the selected keyword frequency in the entire frequency.

In this case, the expected value of 2008-11 for apple juice (bottle) is 1.0. (For 12 months, the total frequency count is 12, and therefore, the monthly count is about 1.0.) The expected value of 2008-11 for apple juice is 4.5. (For 12 months, the total frequency count is 49. Therefore, the monthly count is a little more than 4.)

The actual frequency value of 2008-11 for apple juice (bottle) is 4, which is greater than the expected value of the keyword “apple juice (bottle),” which is 1.0. Its delta ratio is greater than the one for apple juice. Notice that we do not compare the delta of the actual frequency and the expected value. Content Analytics calculates the deviation index score itself based on these values.
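The figures in this example can be checked with a short Python sketch. It rests on two assumptions that this chapter does not confirm: that the collection contains 852 documents (the count reported for the default query in 6.2.1), and that the standardized residual takes the common form of the actual value minus the expected value, divided by the square root of the expected value. The function names are hypothetical, and the product might compute its index differently.

```python
from math import sqrt

TOTAL_DOCS = 852       # assumption: document count for the default query *:*
DOCS_IN_2008_11 = 78   # total documents for 2008-11 (Time Series view)


def expected_frequency(keyword_total, docs_in_period, total_docs=TOTAL_DOCS):
    """Documents in the period multiplied by the keyword's overall share."""
    return docs_in_period * keyword_total / total_docs


def standardized_residual(actual, expected):
    """Common textbook form; assumed here, not confirmed for the product."""
    return (actual - expected) / sqrt(expected)


bottle_expected = expected_frequency(12, DOCS_IN_2008_11)
juice_expected = expected_frequency(49, DOCS_IN_2008_11)

bottle_index = standardized_residual(4, bottle_expected)
juice_index = standardized_residual(7, juice_expected)

print(round(bottle_expected, 1), round(juice_expected, 1))
print(round(bottle_index, 2), round(juice_index, 2))
```

Under these assumptions, the expected values come out at about 1.1 and 4.5, which is close to the 1.0 and 4.5 quoted above, and the index for apple juice (bottle) is clearly higher than the index for apple juice even though its frequency is lower.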

The bar chart is in color based on the deviation index score so that you can easily determine which keyword in the selected facet is worth further investigation.

6.6.3 Using the Deviations view

The Deviations view is helpful when you want to see the deviation of the keyword for the selected facet within the given time period, such as month and day of the month. For example, you might want to see if the characteristics between Monday and Wednesday have any noticeable change when you look at the Product facet with day of the week selected as the time scale.

You can compare the deviation within the selected facet (that is, the selected aspect). You can also see whether anything is noticeable in that aspect with the given time scale, compared to the other keywords that are found within the facet. In an earlier example, the deviation index score of 2008-11 for apple juice (bottle) is greater than the one for apple juice when you look at the data with the month time scale. Therefore, you might want to drill down into the documents related to apple juice (bottle) and investigate why its deviation is noticeable compared to the other products found in the Product facet in 2008-11.

6.7 Facet Pairs view

The Facet Pairs view (Figure 6-17 on page 244) shows the correlation of keywords from two selected facets. In this view, you select two facets from the Facet Navigation pane to see the correlation of these facets.

After you select two facets to analyze, you can also choose from the following three alternative displays of the facet pair comparison:

•Table view

•Grid view

•Bird’s eye view

Figure 6-17 Facet Pairs view showing the Verb facet and the Product facet in Table view format

6.7.1 Table view

As shown in Figure 6-17, the Table view shows the two selected facets in a table style. By default, it is sorted by frequency. In this view, it is important to focus on which pair has the highest correlation value.

In Figure 6-17, we select the Product facet for Rows and the Verb facet for Columns and sort the result by correlation, not by frequency. As a result, the number of documents that contain both the keyword “orange juice” and the keyword “leak” is 67, and the correlation of those selected keywords is 4.2. The Table view is most useful when you sort by correlation.

The Table view is the default view when you initially select two facets to compare. You can change the default Facet Pairs view to another view format in the Preference window.

6.7.2 Grid view

You can see the correlation for the selected facets by using the Grid view. In the example shown Figure 6-18 on page 246, we select the Product facet for Rows and Verb facet for Columns. The cell that is the intersection of orange juice and leak is highlighted in orange, which indicates that the correlation value is greater than the threshold when compared to the other correlation values. That is, the leak issue has a high correlation with the orange juice product, or the orange juice product has a high correlation with the leak issue.

You might notice that two values found in the orange cell (67 and 4.2) are the same as those shown in the Table view in Figure 6-17 on page 244. The first row in the cell is the frequency (the number of documents) that contains both the selected keywords, and the second row in the cell is the correlation value.

You also see numbers such as 86 under the keyword “orange juice” and 123 under the keyword “leak.” These values are the frequency of each keyword, that is, the number of documents returned by the selected keyword.

In this view, you can see the comparison values in table form by row and column, one facet for each dimension. The Grid view can calculate 100 x 100 cells of the table data, based on the highest frequency. Consequently, the top 100 most frequent keywords in one selected facet are displayed as rows. Also the top 100 most frequent occurring keywords in the other selected facet are displayed as columns.
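The top-frequency selection described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the product's implementation: the document representation (a dictionary that maps a facet name to a set of keywords) and the function names top_keywords and build_grid are hypothetical.

```python
from collections import Counter

def top_keywords(docs, facet, limit=100):
    """Return up to `limit` (keyword, frequency) pairs, most frequent first."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.get(facet, set()))
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]

def build_grid(docs, row_facet, col_facet, limit=100):
    """Count co-occurring documents for the top keywords of two facets."""
    rows = [k for k, _ in top_keywords(docs, row_facet, limit)]
    cols = [k for k, _ in top_keywords(docs, col_facet, limit)]
    grid = {(r, c): 0 for r in rows for c in cols}
    for doc in docs:
        for r in doc.get(row_facet, set()):
            for c in doc.get(col_facet, set()):
                if (r, c) in grid:
                    grid[(r, c)] += 1
    return rows, cols, grid

# Hypothetical sample documents (not taken from the sample collection).
docs = [
    {"Product": {"orange juice"}, "Verb": {"leak"}},
    {"Product": {"orange juice"}, "Verb": {"leak", "drink"}},
    {"Product": {"apple juice"}, "Verb": {"drink"}},
]
rows, cols, grid = build_grid(docs, "Product", "Verb", limit=100)
```

Keywords that fall outside the limit never enter the grid, which is why a facet with more than 100 keywords is only partially represented in this view.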

What if either of the selected facets has more than 100 keywords? The Facet Pairs view is constructed based on the assumption that the users want to see the highest frequency first because they usually contain the most interesting correlations. If you must oversee all of the data, instead consider using deep inspection, which is explained in 10.7, “Deep inspection” on page 431. Or, you can confirm the keyword connection by using the Connections view as explained in 6.8, “Connections view” on page 250.

Figure 6-18 Facet Pairs view showing the Verb facet and Product facet in the Grid view format

6.7.3 Bird’s eye view

By default, in the Grid view, you can only view a 15 x 15 celled area of the 100 x 100 table at a time. The use of the 15 x 15 viewing area is implemented for performance reasons to keep the number of required calculations low. With the Bird’s eye view, you can select other areas of the table that you might want to see that are not in the default 15 x 15 viewing area.

As shown in Figure 6-19, the current 15 x 15 viewing area is displayed as a blue box in the upper left corner of the table. The dimensions of the table within the 100 x 100 limit are displayed as white cells. To select a different viewing area, move your cursor to where you want to start in the table, click and drag the blue box, and then click the table or the grid view to see the area.

Notice that the values in the individual cells (keywords, frequency, and correlation values) are displayed in a pop-up box as you hover your cursor over the cell. Also notice how the colors of highly correlated cells are maintained in the bird’s eye view to provide a convenient way to quickly locate areas of interest in the 100 x 100 table.
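The relationship between the 15 x 15 viewing area and the full table can be pictured as a window that slides over the rows and columns. The following Python sketch is illustrative only; the function names and the clamping behavior at the table edges are assumptions, not documented product behavior.

```python
def clamp_viewport(top, left, total_rows, total_cols, size=15):
    """Keep a size x size window inside the table bounds."""
    max_top = max(0, total_rows - size)
    max_left = max(0, total_cols - size)
    top = min(max(0, top), max_top)
    left = min(max(0, left), max_left)
    return top, left

def visible_cells(rows, cols, top, left, size=15):
    """Return the row and column keywords that fall inside the window."""
    top, left = clamp_viewport(top, left, len(rows), len(cols), size)
    return rows[top:top + size], cols[left:left + size]

# A hypothetical 100 x 100 table with placeholder keyword names.
rows = [f"r{i}" for i in range(100)]
cols = [f"c{i}" for i in range(100)]

# Dragging the box near the bottom edge still yields a full 15 x 15 area.
shown_rows, shown_cols = visible_cells(rows, cols, top=90, left=40)
```

Only the cells inside the window need their values rendered in the Grid view, which is consistent with the performance reason given for the 15 x 15 limit.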

Figure 6-19 Facet Pairs view: Verb and Product facets in the bird’s eye view format

6.7.4 Understanding the Facet Pairs view with correlation values

With the Facet Pairs view, you can identify a high correlation of keywords from the selected facets. Content Analytics requires two sets of search results to calculate a correlation. Accordingly, you select two facets that represent the two search result sets of the document set.

This section explains how the correlation value is calculated in Figure 6-19 on page 247 by using the Product facet as the row and Verb facet as the column.

Calculating the correlation value

To calculate the correlation value, first you must know the following information:

•The total number of occurrences of “orange juice” is 86.

•The total number of occurrences of “leak” is 123.

•The entire number of documents in the corpus is 852.

Therefore, about 14% of the documents (123/852) contain the keyword “leak.” This value is referred to as the density of the keyword in a given document set.

Next, you must look at the intersection cell of the keywords “orange juice” and “leak.” You see that the number of documents that include both the keywords “orange juice” and “leak” is 67. Thus the density of the keyword “leak” in the document set for the keyword “orange juice” is 78% (that is, 67/86).

When you calculate the correlation value, you must consider the ratio of these two density values as the correlation value:

•The density of the given set of documents that includes the specific keyword

•The density of the entire set of the documents in the whole collection

That is, you are interested in the ratio of the following items:

•The density of the keyword “leak” in the document set for the keyword “orange juice” (67/86 = 78%)

•The density of the keyword “leak” in the entire document set (123/852 = 14%)

In this example, the correlation value of orange juice and leak is calculated as 5.4, which is the ratio of the unrounded densities 67/86 (about 78%) and 123/852 (about 14%).
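The arithmetic above can be checked with a few lines of Python. This sketch computes only the uncorrected ratio of densities from the four counts given in the text; the function and parameter names are hypothetical.

```python
def density(matching_docs, total_docs):
    """Fraction of a document set that contains a keyword."""
    return matching_docs / total_docs

def raw_correlation(both, row_count, col_count, corpus_size):
    """Ratio of the density in the subset to the density in the corpus."""
    in_subset = density(both, row_count)        # "leak" within "orange juice"
    in_corpus = density(col_count, corpus_size)  # "leak" within all documents
    return in_subset / in_corpus

value = raw_correlation(both=67, row_count=86, col_count=123, corpus_size=852)
print(round(value, 1))  # 5.4
```

The same value results if the roles of the two keywords are swapped, because the ratio simplifies to (67 x 852) / (86 x 123). This matches the observation that either keyword can be described as highly correlated with the other.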

The correlation value calculated in this manner is not reliable, especially when the number of documents that include both keywords is relatively small. For example, suppose that the number of documents that include keyword B is 2, while the number of documents that include keyword C is 100. Then, consider the number of documents that include keyword A in both document sets. The density is 50% for both document sets. Which value is more reliable?

•The number of documents that includes keyword A and keyword B is 1 (50% of the document set that includes keyword B).

•The number of documents that includes keyword A and keyword C is 50 (50% of the document set that includes keyword C).

In this case, you must consider the first case (50% calculated from 1/2, the documents that include keyword A and keyword B) to be less reliable. Content Analytics takes such situations into account by applying a reliability correction that uses statistical interval estimation to make the calculated correlation more reliable. (Usually it makes the calculated correlation value smaller to some degree.)

With reliability correction, the earlier example correlation value of 5.4 becomes 4.2, as shown in Figure 6-17 on page 244 in the Table view. The correlation value 4.2 is higher than the normal threshold.

Interval estimation: The topic of interval estimation is outside the scope of this book.

Correlation value: In this example, the correlation value that is calculated from the given figure is much higher than the one that is displayed in the Table view. This result is normal because Content Analytics adjusts the correlation value to a smaller value to some degree by reliability correction. Remember that the data distribution is a sample data set and is not a real case. If the document set is relatively small and not reliable, the correlation value is reduced to be a more reliable value.
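To see why a density of 1/2 deserves less trust than a density of 50/100, it can help to compare interval estimates for the two. The text does not say which interval estimate Content Analytics uses, so the following Python sketch substitutes the Wilson score interval purely as an illustration. It is not the product's reliability correction, and it is not expected to reproduce the displayed value of 4.2.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        return 0.0, 1.0
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

def conservative_correlation(both, row_count, col_count, corpus_size, z=1.96):
    """Assumed correction: lower bound of the subset density divided by
    the upper bound of the corpus density."""
    low_subset, _ = wilson_interval(both, row_count, z)
    _, high_corpus = wilson_interval(col_count, corpus_size, z)
    return low_subset / high_corpus

# Both observed densities are 50%, but the lower bounds differ widely.
small_low, _ = wilson_interval(1, 2)
large_low, _ = wilson_interval(50, 100)

corrected = conservative_correlation(67, 86, 123, 852)
```

With only two documents, the lower bound falls far below 50%, whereas with 100 documents it stays close to 50%. Any correction built on such bounds shrinks correlations that rest on few documents more than those that rest on many, which is the qualitative behavior described above.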

6.7.5 Using the Facet Pairs view

The Facet Pairs view is useful when you want to compare facets of your collection and have Content Analytics show you how highly correlated they are to each other. Content Analytics highlights intersecting cells when two keywords are highly correlated. The Bird’s eye view is used to review the entire table to quickly identify these highly correlated cells. You use the Grid view to focus on that specific area of intersecting cells.

After you discover the highly correlated keyword pairs, you can go back to the Documents view and look at the textual data (content of the document) and determine if there are any possible trends or insight there.

Remember that the Facet Pairs view only concentrates on the top most frequently occurring keywords. If you need to consider all of your data, use deep inspection as explained in 10.7, “Deep inspection” on page 431.

6.8 Connections view

The Connections view shows a graphical view of the relationship between keywords or subfacets within selected facet pairs. This view is another representation of the correlation of keywords within the selected facet pairs. The keyword is represented by a node, and the link between the nodes shows the correlation value between the two keywords.

Figure 6-20 shows an example of the Connections view when you select the Product facet and the Verb facet.

Figure 6-20 Connections view when selecting Product facet and Verb facet

Representation in the Connections view: Depending on the browser window size or your operation, the representation in the Connections view can vary even though the same facet pairs are selected.

In the Connections view, you can determine the highly correlated keyword pairs by focusing on the size of the node, link color, and link length:

•The Node shows the keyword or subfacet in the selected facet pairs.

– Node size represents the frequency of the keyword or subfacet. The larger node size represents a higher frequency count in the entire document corpus.

– Node color indicates the selected facet where the keyword is located. Our example has two node colors: light blue and dark blue. The keywords “leak”, “use,” and “find” belong to the Verb facet (light blue), while the keywords “orange juice”, “apple juice,” and “chocola” belong to the Product facet (dark blue).

– When you move the mouse pointer over a particular node, the facet name, keyword, and frequency are displayed, as shown in Figure 6-21.

Figure 6-21 Tooltip showing the facet name, keyword, and frequency of the node

•Link color shows the rank of the correlation index. The link in red has a higher correlation value than a yellow link color.

•Link length reflects how tightly the two nodes are correlated. The higher the correlation between two nodes, the shorter the link.

You can modify the rendering behavior of the Connections view from the Connections tab in the Preferences window. The following settings are among those that are configurable:

•Do not allow nodes to overlap

•Link length corresponds to correlation values

•Node size corresponds to frequency values

Configuration in Preferences: From the Connections tab in Preferences, you can also select the following attributes:

•Number of results to show for Facet1 per page

•Number of results to show for Facet2 per page

•Show the keywords or Sub facets by default for Facet1

•Show the keywords or Sub facets by default for Facet2

The Number of results to show for Facet1 or Facet2 is the number of facets used for the analysis. By default, 50 is set for both facets, which does not mean that 50 nodes are displayed in the search results. Fewer nodes are displayed in the search result if fewer nodes are involved in the correlation. Increasing the value can affect the performance of showing the Connections view user interface.

6.8.1 Features in the Connections view

The Connections view has the following features:

•Creating a window capture and saving it as an image file

•Resuming or pausing rendering

•Zoom in and Zoom out

•Zoom to fit the viewing area

•The AND operator

•Filter by correlation

•Node labels

•Highlight mode

Creating a window capture and saving it as an image file

You can create a window capture when you want to save a specific connection representation by clicking the Camera icon (highlighted in Figure 6-22). Then a window opens where you can select the location to save the image file and select the image file format.

Figure 6-22 Create a window capture and save as image file features

Resuming or pausing rendering

When you select two facets, the Node connections are animated, which takes a while to complete. You can resume or pause the animation in the middle of rendering when you click the Resume rendering or Pause rendering icons in Figure 6-23.

Figure 6-23 Features in the Connections view

Zoom in and Zoom out

Sometimes the Connections view becomes busy when many nodes are to be displayed. You can zoom in and zoom out for specific node connections.

Zoom to fit the viewing area

After you change the browser window size, the Connections view is redrawn if you select the Zoom to fit the viewing area feature. Content Analytics redraws the Connections view to fit the new size of the viewing area.

The AND operator

Similar to other views, you can perform the AND operator search to narrow down the scope. The AND operator is helpful in narrowing down the data before interacting with the other views. First, select a specific keyword, and then click the AND operator icon.

Filter by correlation

Sometimes the Connections view becomes busy because, by default, all keywords that have a correlation value greater than 2.0 are displayed. You can set the filter by correlation value to a higher number by using the slide bar. Only keyword pairs that have a correlation value greater than the value that you set are displayed in the Connections view.

For example, when you want to view the keywords that have a correlation value greater than 5.0, set the filter to 5.0. As a result, fewer nodes are displayed in the Connections view (Figure 6-24) than before the filter change (Figure 6-23 on page 253).
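The effect of the filter can be sketched as a simple threshold over links. In this Python illustration, each link is a hypothetical (keyword, keyword, correlation) tuple. Only the value 4.2 for orange juice and leak comes from the earlier example; the other correlation values are invented for the sketch.

```python
def filter_links(links, threshold=2.0):
    """Keep links above the threshold and the nodes they still connect."""
    kept = [link for link in links if link[2] > threshold]
    nodes = {keyword for a, b, _ in kept for keyword in (a, b)}
    return kept, nodes

links = [
    ("orange juice", "leak", 4.2),   # value from the Table view example
    ("apple juice", "drink", 2.6),   # invented value
    ("apple juice", "leak", 2.1),    # invented value
]

default_links, default_nodes = filter_links(links)        # default of 2.0
strict_links, strict_nodes = filter_links(links, 4.0)     # slider moved up
```

Raising the threshold removes links first; a node disappears only when none of its links remain, which is why fewer nodes are displayed after the filter change.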

Figure 6-24 Connections view filtered by a correlation value of 5.0

Node labels

You can select how the node label is displayed. The options are Complete, Truncated, or None. By default, the Truncated option is selected. To display the full name of the keyword in the node, select the Complete option.

For example, the keyword “chocola” and the keyword “dirty” are displayed in the Connections view, as shown in Figure 6-24. If you set the Node label field to Complete, the keyword “chocola” is changed to “chocolate cookie”, as shown in Figure 6-25.

Figure 6-25 Selecting Node label as Complete in the Connections view

Highlight mode

Changing the highlight mode is useful when you want to focus on specific keywords in the Connections view. By default, the No Highlighting option is selected.

Suppose you focus on the keyword “smell.” To see the direct connection to the keyword, select Direct links only for the Highlight mode field. In Figure 6-26, the direct connection to the keyword in the view is now highlighted, and other connections are grayed out.

Figure 6-26 Selecting Direct links only for Highlight mode to view the node “smell”

In addition, when you select All Links for the highlight mode field, all the links that are connected to the keywords are highlighted, as shown in Figure 6-27. In our example, the connection between the keyword “have” and the keyword “taste” is now highlighted.
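The difference between the two highlight modes can be sketched as two selections over the same set of links. The following Python code reflects one reading of the description, and that reading is an assumption: Direct links only keeps the links that touch the selected node, and All Links also keeps the links that touch any of its direct neighbors. The function names and the sample links are illustrative.

```python
def direct_links(links, node):
    """Links that have the selected node at one end."""
    return {frozenset(link) for link in links if node in link}

def all_links(links, node):
    """Links that touch the selected node or any of its direct neighbors."""
    neighbors = {keyword for link in links if node in link for keyword in link}
    neighbors.add(node)
    return {frozenset(link) for link in links if neighbors & set(link)}

links = [
    ("smell", "taste"),
    ("smell", "have"),
    ("have", "taste"),
    ("leak", "orange juice"),
]

direct = direct_links(links, "smell")
extended = all_links(links, "smell")
```

Under this reading, the link between “have” and “taste” is part of the second selection but not the first, which matches the example in the text.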

Figure 6-27 Selecting All links for Highlight mode to view the node “smell”

6.8.2 Understanding the Connections view

The Connections view helps you to identify the highly correlated keywords based on the selected facets. As explained in 6.7, “Facet Pairs view” on page 244, Content Analytics requires two sets of search results to calculate the correlation value.

This section helps you to interpret the results of the Connections view and understand how the Connections view is created. In the example in this section, two facets are selected: Product and Verb.

Understanding the values that are displayed

This section explains the meaning of the values that are shown in the Connections view. In Figure 6-28, the keyword “leak” has a higher frequency compared to the other keywords shown in the figure. The node size represents the number of occurrences for the keyword in the result set.

The keywords “orange juice” and “apple juice” have the same blue color, while the keywords “leak” and “drink” have the same light blue color. The color indicates which facet the keywords represent. Thus, you can conclude that the keywords “orange juice” and “apple juice” belong to the same facet (Product). The keywords “leak” and “drink” belong to the same facet (Verb), but the facet is different from the Product facet.

Figure 6-28 Connections view example

When you see the connection between the nodes, you notice that the connection between the keywords “orange juice” and “leak” is an orange color. It is also shorter as compared to the other connections (between the keywords “leak” and “apple juice” or between the keywords “apple juice” and “drink”). Thus, the correlation value between the keywords “orange juice” and “leak” is higher than others. You might want to investigate this relationship further.

How the Connections view is created

As described in the previous section, Content Analytics requires two sets of search results to calculate the correlation value. Content Analytics uses the same calculation mechanism in the Connections view as described in 6.7, “Facet Pairs view” on page 244. However, it considers all possible combinations of the selected facet pairs.

For example, consider calculating the correlation value between the nodes “orange juice” and “leak” and between the nodes “apple juice” and “leak”, as shown in Figure 6-28. You can see the same correlation value between these keywords when you select the Product facet and the Verb facet in the Facet Pairs view, as shown in Figure 6-29 on page 259. Likewise, the correlation value between the nodes “apple juice” and “drink” is calculated with the same facet pairs, Product facet and Verb facet.

Figure 6-29 Grid view in the Facet Pairs view when selecting Product and Verb

However, the connections sometimes have a different correlation value than the Facet Pairs view. The difference in values occurs when keywords from the same facet are compared against each other, such as “smell” and “taste”, “smell” and “have”, and “have” and “taste” from the Verb facet, as shown in Figure 6-30.

Figure 6-30 Connections view in the same Verb facet

How is the Connections view created in this case? Content Analytics uses the same calculation as the Facet Pairs view, but considers all three combinations for the selected facet pairs. In this case, Content Analytics uses the facet pairs, but selects the Verb facet for both the vertical facet and the horizontal facet, as shown in Figure 6-31.

Figure 6-31 Grid view in the Facet Pairs view when selecting Verb and Verb

The remaining combination of the facet pairs in this example relates to selecting the Product facet for both the vertical facet and the horizontal facet. However, as shown in Figure 6-32, these correlation values are not higher than the threshold (2.0 by default). As a result, the connections between these keywords from the Product facet are not displayed.

Figure 6-32 Grid view in the Facet Pairs view when selecting Product for the horizontal and vertical facet

The Connections view shows the correlation value automatically. You do not need to open the Grid view in the Facet Pairs view and select a different facet each time. You can see how the keywords are correlated with each other immediately so that you can concentrate on highly correlated keywords.

6.8.3 When to use the Connections view

With the Connections view, you can see the keyword cluster based on the correlation. In addition, you can find the hints or “hidden connections” between keywords because the Connections view shows the correlation value for the selected facet pairs and all possible variations of the selected facet pairs.

For example, consider the case when you select the Category facet and the Adjective facet in the Connections view for the sample collection used in this chapter. There is a high correlation between the “strange” keyword and the “Taste/smell” keyword. A high correlation between the “sour” keyword and the “Taste/smell” keyword is displayed. These same high correlation keyword pairs are shown in the Facet Pairs view. However, in the Connections view, you also see the correlation between the “strange” keyword and “sour” keyword. In this case, the three keywords “Taste/smell”, “strange,” and “sour” are connected. Therefore, you can easily see that the user reports a problem when the product taste or smell is sour.

Based on this information, you might drill down into the documents to understand the situation further. When you individually look at the keyword pairs of “sour” and “Taste/smell” or “strange” and “Taste/smell,” you might not think of the user’s report of a “sour” taste as something unusual. However, in the Connections view, you can easily see the connection through the visual representation of the keyword connections.

You can see the same results in the Facet Pairs view as described earlier. However, you cannot see the correlation of keywords at one time when you use the Facet Pairs view. Sometimes you cannot see the keyword pairs with the default configuration in the Grid view (such as “apple juice” and “drink”). To determine this information using the Facet Pairs view, you must use the Bird’s eye view, which involves extra steps. However, you can find the highly correlated keywords in the Connections view without changing windows.

The Connections view shows the “connection” based on the correlation value between all keywords found in the selected facets, and it is easy to filter the result by correlation value. You might see the trends of keyword clusters. As in the Facet Pairs view, after you discover the highly correlated keyword pairs, you can open the Documents view to further investigate the content of the document.

6.9 Dashboard view

The Dashboard view shows various predefined charts and tables in a single text mining view. With this view, you can visualize various aspects of the data to quickly interpret, analyze, and further investigate. In addition, you can save the images in the Bitmap, PNG, or JPEG format so that you can easily share the data with other people for collaboration purposes.

The administrator can customize the predefined Dashboard layouts. Then, the user can select the Dashboard layouts for viewing, saving them as images and analyzing them further. By default, you can use two preconfigured layouts. These default layouts include Layout 1 and Layout 2. The administrator can add additional layouts and customize them based on business requirements. Figure 6-33 shows an example of the default layout, Layout 2, as it is displayed in the Dashboard.

Figure 6-33 Default Layout 2 in the Dashboard view

6.9.1 Configuring the Dashboard

In this scenario, a new Dashboard layout is configured that contains four separate charts or tables to display data from the collection that was created in 4.3.2, “Creating a text analytics collection” on page 90. The new Dashboard layout includes a bar chart, facet table, pie chart, and column chart. In this scenario, we set up each chart to contain separate data.

Creating a layout

The Dashboard layout consists of one or more horizontal or vertical containers. First, you must set up how you want the containers organized in the layout and add panels to each of them. In this scenario, we create a layout that consists of a chart or table in each cell of the layout containing two columns and two rows.

To create a layout, follow these steps:

1. In the administration console, click the Analytics Customizer tab to open the Analytics Customizer application.

2. Click the Dashboard tab.

3. On the Dashboard page (Figure 6-34), complete the following actions:

a. Select the collection that you want to associate with the new layout in the Collection name field. In this scenario, for Collection name, select Sample Text Analytics Collection.

b. For Layout, select New.

c. For Title, type Problem Report as the name for the layout.

Figure 6-34 Adding two horizontal containers to the new Problem Report Dashboard layout

The layout consists of containers. Each container needs to contain a panel. If you do not add a panel to each container, you cannot save the new layout.

d. For Number of containers for the horizontal field, select 2.

e. Click Add horizontal container. You now have two containers of equal size, as shown in Figure 6-35.

f. Click the leftmost horizontal container that you just created to select it, and select 2 for the Number of containers field for the vertical container (Figure 6-35).

Figure 6-35 Adding two vertical containers to the left container

4. Click Add vertical container. You now have three containers. For this scenario, we want to show four charts. Therefore, you need to add one more container to the layout.

5. Click the rightmost horizontal container to select it, and select 2 for the Number of containers for the vertical container, as shown in Figure 6-36.

Figure 6-36 Adding two vertical containers to the right container to create a four container layout

6. Click Add vertical container. You now have four containers of equal size, as shown in Figure 6-37.

Figure 6-37 Adding a panel to the upper left container of the new layout

7. Click the upper-left container, and click Add Panel.

8. Repeat step 7 for each container so that each of the four containers contains a panel. This action adds a panel to each container with the default settings for a bar chart.
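The layout that results from these steps can be pictured as a small tree of containers in which every leaf must hold a panel. The following Python sketch models that structure and the save rule stated earlier. The class and function names are hypothetical and do not correspond to any Analytics Customizer API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Container:
    orientation: Optional[str] = None  # how the children are arranged
    children: List["Container"] = field(default_factory=list)
    panel: Optional[str] = None        # panel type, set on leaf containers

    def split(self, orientation, count):
        """Divide this container into `count` equal child containers."""
        self.orientation = orientation
        self.children = [Container() for _ in range(count)]
        return self.children

    def leaves(self):
        """Return the containers that have no children."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

def can_save(root):
    """A layout can be saved only if every leaf container has a panel."""
    return all(leaf.panel is not None for leaf in root.leaves())

# Two horizontal containers, each divided into two vertical containers.
root = Container()
left, right = root.split("horizontal", 2)
left.split("vertical", 2)
right.split("vertical", 2)

saveable_before = can_save(root)
for leaf in root.leaves():
    leaf.panel = "Bar chart"  # a new panel defaults to a bar chart
saveable_after = can_save(root)
```

Before the panels are added, the layout has four empty leaf containers and cannot be saved. After each leaf receives a panel, the check passes.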

The following sections explain how you can edit the panel to create your specific charts and tables.

Creating a bar chart

In this section, you configure the new layout to contain a bar chart located in the upper-left panel of the new layout view. You set the data to be displayed in the bar chart to be the frequency of the top category facet values. For more information about the Dashboard configuration options, see the following address:

http://www-01.ibm.com/support/docview.wss?uid=swg21420024

To create the bar chart, follow these steps:

1. Click the upper-left panel to select it and click Edit (Figure 6-38).

Figure 6-38 Dashboard layout panel to configure the bar chart

2. In the Panel properties window (Figure 6-39), complete the following steps:

a. For the Title field, type Top 5 Frequent Categories.

b. For the Type field, select Bar chart.

c. For the Facet ID field, click Select and click Category.

d. Click OK.

Figure 6-39 The bar chart panel properties

You do not see the results of the bar chart until all containers are configured. For an example of the completed bar chart, see the upper-left container in Figure 6-43 on page 275.

Creating a facet table

In this section, you configure the new layout to include a facet table located in the upper-right panel of the layout view. You set the data to be displayed in the facet table as the top correlated verbs.

To create a facet table, follow these steps:

1. Click the upper-right panel to select it and click Edit (Figure 6-40).

Figure 6-40 Dashboard layout panel to configure the facet table

2. Add the field values as shown in Table 6-1. Keep the default values for all other fields.

Table 6-1 Panel properties for the top 5 correlated verb facets

Field	Value
Title	Top 5 Correlated Verbs
Type	Facet Table
Facet ID	Select Part of Speech → Verb
Show data type	Correlation or index
Sort type	Correlation

3. Click OK.

You do not see the results of the facet table until all containers are configured. For an example of the completed facet table, see the upper-right container in Figure 6-43 on page 275.

Creating a pie chart

In this section, you configure the new layout to contain a pie chart located in the lower left panel of the layout view. You set the data to be displayed in the pie chart as the correlated values of the Product facet.

To create a pie chart, follow these steps:

1. Click the lower left panel to select it and click Edit (Figure 6-41).

Figure 6-41 Dashboard layout panel to configure for the pie chart

2. Add the field values as shown in Table 6-2 and keep the default values for all other fields.

Table 6-2 Panel properties for the top 5 correlated products pie chart

Field	Value
Title	Top 5 Products by Correlation
Type	Pie Chart
Facet ID	Select Product
Show data type	Correlation or index
Sort type	Correlation

3. Click OK.

You do not see the results of the pie chart until all containers are configured. For an example of the completed pie chart, see the lower-left container in Figure 6-43 on page 275.

Creating a column chart

You configure the fourth panel to contain a column chart located in the lower right panel of the layout view. You set the data to be displayed in the column chart to the frequent values of the subcategory facet for documents that contain the term “juice”. The data set is narrowed to documents that contain the term “juice.” Then the top 50 most frequent subcategory values are analyzed, and the top three subcategories are displayed in the chart.

To create a column chart, follow these steps:

1. Click the lower right panel to select it and click Edit (Figure 6-42).

Figure 6-42 Dashboard layout panel to configure for the column chart

2. Add the field values as shown in Table 6-3 and keep the default values for all other fields.

Table 6-3 Panel properties for the top three frequent subcategories column

Field	Value
Title	Top 3 Frequent Subcategories Containing Juice
Type	Column Chart
Facet ID	Select Subcategory
Count to display	3
Explicit query to analyze	juice

3. Click OK.

You do not see the results of the column chart until all containers are configured. For an example of the completed column chart, see the lower-right container in Figure 6-43 on page 275.
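The way the column chart narrows and ranks its data can be sketched in Python. The panel dictionary mirrors the field values in Table 6-3. Everything else is an assumption made for illustration: the document structure, the subcategory names, the chart_data function, and the use of substring matching as a stand-in for the real query engine.

```python
from collections import Counter

column_chart_panel = {
    "Title": "Top 3 Frequent Subcategories Containing Juice",
    "Type": "Column Chart",
    "Facet ID": "Subcategory",
    "Count to display": 3,
    "Explicit query to analyze": "juice",
}

def chart_data(docs, panel, analyze_limit=50):
    """Narrow to matching documents, rank facet values, keep the top few."""
    term = panel["Explicit query to analyze"].lower()
    matching = [d for d in docs if term in d["text"].lower()]
    counts = Counter(d[panel["Facet ID"]] for d in matching)
    analyzed = counts.most_common(analyze_limit)  # top 50 values are analyzed
    return analyzed[: panel["Count to display"]]  # top 3 are displayed

# Hypothetical documents with invented subcategory names.
docs = [
    {"text": "Orange juice bottle leaked", "Subcategory": "Leak"},
    {"text": "Apple juice tasted sour", "Subcategory": "Taste"},
    {"text": "Juice carton leaked in the bag", "Subcategory": "Leak"},
    {"text": "Chocolate cookie was broken", "Subcategory": "Damage"},
]
bars = chart_data(docs, column_chart_panel)
```

The document about the chocolate cookie does not contain the term and therefore does not contribute to any bar in the chart.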

Saving the Dashboard layout

Now that you configured the new Dashboard layout, you need to save it:

1. Click Save.

2. Click Close to close the Dashboard configuration window.

3. Click Save Changes. Otherwise, the changes might not fully save to the text miner application.

4. Click Exit to exit the Analytics Customizer application.

5. In the message window that indicates that the window will close, click OK.

6.9.2 Viewing the Dashboard

After an administrator sets up a new dashboard layout, a user can view it and work with it through the text miner application:

1. Open the text miner application.

2. For this scenario, use the Sample Text Analytics Collection. If this collection is not displayed at the top of the text miner application, click the change link, select the Sample Text Analytics Collection, and click Save.

3. Click the Dashboard tab. In the Layout File field, select the Problem Report layout.

The layout that you created in 6.9.1, “Configuring the Dashboard” on page 263, is now shown in the window. The four charts and tables contain the analytic data of the collection, as shown in Figure 6-43. If a data value is listed in more than one chart or table, the color associated with that data value is the same across the multiple charts and tables.

Figure 6-43 Problem Report layout in the Dashboard view

6.9.3 Working with the Dashboard

The Dashboard provides additional useful functionality. For example, you can further narrow down the data set, display correlation and frequency values for a data point, enlarge a table or chart, and set user preferences.

Narrowing the search results

With the charts and tables in the Dashboard layout, you can further narrow down your data set to focus on an area of interest. In this scenario, you click the Package/container bar in the “Top 5 Frequent Categories” chart. As a result, the following syntax is automatically added to the query statement:

/"keyword$.category"/"Package / container"

Notice that all the charts and tables have changed, except for the column chart. Also, only the documents that contain a category equal to Package/container are now displayed. Because the column chart has the term “juice” set as the explicit query to analyze, it is not modified by any additional query statements that a user includes. If you want the user’s query statement to be appended to the query defined for the chart, set the query in the Additional query to analyze field when editing the panel in the dashboard layout.
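The difference between the two query settings can be sketched in Python. The facet_query function reproduces the syntax shown above. The effective_terms function and its list-of-terms representation are hypothetical; they model only the described behavior, not how the product actually combines queries.

```python
def facet_query(facet_path, value):
    """Format a facet selection in the syntax shown in the query area."""
    return f'/"{facet_path}"/"{value}"'

def effective_terms(user_terms, explicit_query=None, additional_query=None):
    """Return the query terms that a panel analyzes.

    An explicit query replaces the user's query statements, whereas an
    additional query is combined with them.
    """
    if explicit_query is not None:
        return [explicit_query]
    terms = list(user_terms)
    if additional_query is not None:
        terms.append(additional_query)
    return terms

clicked = facet_query("keyword$.category", "Package / container")

# The column chart ignores the click because it has an explicit query.
column_chart_terms = effective_terms([clicked], explicit_query="juice")

# With the additional query setting, the click narrows the chart as well.
combined_terms = effective_terms([clicked], additional_query="juice")

# A panel with neither setting follows the user's query statements.
bar_chart_terms = effective_terms([clicked])
```

This is why, in the scenario, the bar chart, facet table, and pie chart change after the click while the column chart stays the same.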

To view the query syntax area, click the Expand this Area icon at the top of the window. Figure 6-44 on page 277 shows the addition to the query syntax and the updated charts and tables.

Frequency and correlation display

When you move the mouse pointer over a data point on the chart or table, you see the frequency and correlation values for that particular data point. In this scenario (Figure 6-44 on page 277), move the mouse over the Mineral pie piece in the “Top 5 Products by Correlation” pie chart to view the frequency of 39 and correlation of 1.9.

Expanding and minimizing a chart

To see only one chart at a time in a larger view, click the expand icon in the upper-right corner of the chart or table (Figure 6-44 on page 277).

After the chart is expanded, click the minimize icon in the upper right corner of the chart or table to go back to the main layout view that shows all of the charts or tables.

Changing the chart or table size

To change the size of the charts, you move the mouse pointer over the border of the chart, and drag the frame of the chart to your desired location. With this action, you keep all the charts in the layout viewable while making a chart larger. Figure 6-44 on page 277 shows an example of changing the frame of a chart. Notice that the “Top 5 Frequent Categories” and “Top 5 Products by Correlation” charts are wider than the other charts and tables.

Dashboard preference settings

To change the default layout that is shown in the Dashboard, select Preferences → Dashboard. For the Sample Text Analytics Collection, select Problem Report to be the default layout. Now the Problem Report layout is displayed every time you open the Dashboard view for the Sample Text Analytics Collection.

Figure 6-44 Problem Report layout with results for the Package/container category

6.9.4 Saving Dashboard charts as images

With the Dashboard, you can save all of the charts and tables in the selected layout as an image, or you can save an individual expanded chart or table. You can use the Bitmap, PNG, and JPEG formats to save the images. You save an image by clicking the image icon, as shown in Figure 6-45, and selecting your desired image format. After the image is saved, you can share it with coworkers for further collaboration.

Figure 6-45 Saving the Dashboard charts by clicking the image icon
