Chapter 3. Data Exploration

In this chapter, we will go through the main KNIME visualization functions (except reporting) and other techniques for exploring your data. These are helpful during preprocessing, but you can also use them to check results visually or to see how well the computed models fit the test/validation data. The topics covered in this chapter are as follows:

  • Statistics
  • Distance matrix
  • Visual properties
  • KNIME views and HiLiting
  • JFreeChart nodes
  • Some third-party visualization options
  • Tips with HiLiting
  • Visualizing models

Computing statistics

When you want to explore your data, it is usually a good idea to compute some statistics about it first, so that you can spot obviously wrong data (for example, if a value that should be positive appears as a negative minimum, it is suspicious).

Most of these nodes require that the data to be analyzed contain no missing (NaN) values. You can remove them with the value modification techniques presented in the previous chapter, or by filtering the rows, which was also discussed there.

The minimum and maximum values can be checked in the port view's Spec Columns tab. This can already be used to spot certain kinds of problems.

For statistics within groups, we have the good old GroupBy node, which allows you to aggregate using the functions described on the Description tab of its configuration dialog.

When you do not need grouping, you can use the Statistics node, which is easier to configure. Just select the columns, the number of values that should be present in the view, and the number of common/rare values that should be enumerated. You might find that the median is not computed in the results; in this case, you should check the Calculate median values (computationally expensive) checkbox. The following statistics are shown in the view (for the numeric columns):

  • Minimum
  • Maximum
  • Mean
  • Std deviation
  • Variance
  • Overall sum
  • No. missings
  • Median
  • Row count

You also get the number of missing values and the most common and rarest values for the selected nominal (and also numeric) columns, with their number of occurrences.

The statistics table, which is the first output port, contains the same content as the view for the numeric columns. The second output port (the occurrences table) gives a table with the number of occurrences of each numeric and nominal value, in decreasing order of frequency (including the missing values).

Using the output tables, you can create conditions or further aggregation operations. For example, you can turn a column's mean and standard deviation into flow variables, create conditions from them using the Java Edit Variable node, and then keep only the rows within a certain range around the mean using the row filtering/splitting nodes (or use the Java Snippet Row Filter node directly with the flow variables), as sketched below.
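
The following expression is a minimal sketch of what such a filter could look like inside a Java Snippet Row Filter node. It assumes a numeric column named value and two double flow variables named mean and stddev created beforehand; these names are placeholders for this example, and the column and flow variable references follow the Java snippet nodes' $column$ and $${Dvariable}$$ syntax:

    // Keep only the rows whose value lies within two standard deviations of the mean.
    // $value$ is the (hypothetical) numeric column; $${Dmean}$$ and $${Dstddev}$$
    // are double flow variables created from the Statistics node's output.
    return Math.abs($value$ - $${Dmean}$$) <= 2 * $${Dstddev}$$;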

The Value Counter node acts in a manner similar to the Statistics node's second output, but in this case, only a single column is used. So, no missing values will appear, the count column is not sorted, and the values from the original column will appear as row IDs. In this form, they are better suited for visualization. Also, because this node supports HiLiting, you can select the original rows based on the frequency values.
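
Conceptually, the node simply counts how many times each value occurs; a standalone Java sketch of that counting (with made-up values, and with the value playing the role of the row ID) might look like this:

    // Count the occurrences of each value in a single (string) column.
    import java.util.LinkedHashMap;
    import java.util.Map;

    public final class ValueCounterSketch {
        public static void main(String[] args) {
            String[] column = {"red", "blue", "red", "green", "red", "blue"}; // made-up values
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (String value : column) {
                counts.merge(value, 1, Integer::sum);
            }
            // The value acts as the row ID; the count is the single output column.
            counts.forEach((value, count) -> System.out.println(value + " -> " + count));
        }
    }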

When you want a similar (frequency) report for two columns, with an optional weight column, to create crosstabs, you should use the Crosstab node. In the view of the node, you get the crosstab values in the usual form. You can specify which parts (Frequency, Expected, Deviation, Percent, Row Percent, Column Percent, or Cell Chi-Square) should be visible. (The row and column totals are always visible, and if there are too many rows or columns, you can keep only the first few.)

There is another table in the view, beneath the frequencies. It is the summary of the chi-square statistics (the degrees of freedom (DF), the chi-square value (Value), and the probability (Prob) of no association between the values, that is, a p-value), and also the Fisher test's probability when both columns contain exactly two values.
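
To make these numbers concrete, the following standalone Java sketch shows how a chi-square statistic and its degrees of freedom can be derived from a small contingency table. It is a simplified illustration with made-up frequencies, not the node's implementation:

    // Simplified chi-square computation for a contingency table.
    public final class ChiSquareSketch {
        public static void main(String[] args) {
            double[][] observed = {{30, 10}, {20, 40}}; // made-up cell frequencies
            int rows = observed.length, cols = observed[0].length;
            double total = 0;
            double[] rowSum = new double[rows], colSum = new double[cols];
            for (int i = 0; i < rows; i++) {
                for (int j = 0; j < cols; j++) {
                    rowSum[i] += observed[i][j];
                    colSum[j] += observed[i][j];
                    total += observed[i][j];
                }
            }
            double chiSquare = 0;
            for (int i = 0; i < rows; i++) {
                for (int j = 0; j < cols; j++) {
                    double expected = rowSum[i] * colSum[j] / total;
                    double dev = observed[i][j] - expected;
                    chiSquare += dev * dev / expected; // the cell chi-square contribution
                }
            }
            int df = (rows - 1) * (cols - 1); // degrees of freedom
            System.out.println("Chi-Square = " + chiSquare + ", DF = " + df);
        }
    }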

The Crosstab node's first output port contains values similar to the view's main table, but in a different form: the two columns' values are in two columns, while the statistics (Frequency, Expected, Deviation, Percent, Row Percent, Column Percent, Total Row Count, Total Column Count, Total Count, and Cell Chi-Square) are in further columns. You can transform it to the usual crosstab form (keeping a single statistic) using the Pivoting node (select one of the columns as the group column, the other as the pivot, and the statistic of interest as the aggregation option). You can check the workflow from the crosstab.zip file available on this book's website.

The second output table of the Crosstab node contains the statistics just like the second part of the view, but in this case it is in a single row, even if both columns contain two values (the Fisher test's p-value is in the last column).

When you want to create a correlation matrix, you should use the Linear Correlation node. It computes the correlation for the numeric-numeric and nominal-nominal column pairs, and a model is also created for further processing. You can use this information to reduce the number of columns with the help of the Correlation Filter node.

The view of the Linear Correlation node gives an overview of the correlation values using color codes.
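
For numeric pairs, the reported coefficient is the Pearson product-moment correlation. The following standalone Java sketch illustrates that computation on two made-up arrays; it is a sketch of the formula, not the node's code:

    // Pearson correlation coefficient between two equally long numeric columns.
    public final class PearsonSketch {
        static double pearson(double[] x, double[] y) {
            int n = x.length;
            double meanX = 0, meanY = 0;
            for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
            meanX /= n;
            meanY /= n;
            double cov = 0, varX = 0, varY = 0;
            for (int i = 0; i < n; i++) {
                double dx = x[i] - meanX, dy = y[i] - meanY;
                cov += dx * dy;
                varX += dx * dx;
                varY += dy * dy;
            }
            return cov / Math.sqrt(varX * varY); // always between -1 and 1
        }

        public static void main(String[] args) {
            double[] a = {1, 2, 3, 4, 5}, b = {2, 4, 5, 4, 6}; // made-up columns
            System.out.println("correlation = " + pearson(a, b));
        }
    }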

There are three t-test computing nodes: Single sample t-test, Independent groups t-test, and Paired t-test. The Single sample t-test can be used to test whether the average of the selected columns equals a specified value or not. The t-value (t), degrees of freedom (df), p-value (2-tailed), Mean Difference, and the confidence interval of the difference are computed relative to the specified mean value (the Test value). The other output table contains some statistics about the columns, such as the computed mean, standard deviation, standard error mean, and the number of missing values in each column.

The view of Single sample t-test contains the same information as the two output tables.
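
The t-value follows the usual one-sample formula: the difference between the column mean and the Test value, divided by the standard error of the mean. The following standalone Java sketch illustrates the computation on a made-up sample (assuming no missing values; it is not the node's implementation):

    // One-sample t-test statistic against a given test value.
    public final class OneSampleTSketch {
        public static void main(String[] args) {
            double[] column = {4.8, 5.1, 5.4, 4.9, 5.2}; // made-up sample
            double testValue = 5.0;                      // the configured Test value
            int n = column.length;
            double mean = 0;
            for (double v : column) mean += v;
            mean /= n;
            double sumSq = 0;
            for (double v : column) sumSq += (v - mean) * (v - mean);
            double stdDev = Math.sqrt(sumSq / (n - 1));   // sample standard deviation
            double stdErrorMean = stdDev / Math.sqrt(n);  // standard error mean
            double t = (mean - testValue) / stdErrorMean; // t-value
            int df = n - 1;                               // degrees of freedom
            System.out.println("t = " + t + ", df = " + df
                + ", mean difference = " + (mean - testValue));
        }
    }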

When you want to compare the means of two measurements of the same population (or at least of populations that are not independent), you can use the Paired t-test node. The view and the resulting tables contain the same statistics as those of the Single sample t-test node, but in this case the mean difference is replaced with the standard deviation and the standard error mean values, both in the view and in the first output table. The configuration options allow you to select multiple pairs of numeric columns.

For two-sample t-tests, you should use the Independent groups t-test node. It expects the two groups to be defined by a column; the rows are grouped by that column's values. You can select the column that contains the class for grouping and the values/labels of the two groups within that column. The averages of the selected columns will be compared, and the t-tests will be computed both with the equal variance assumption and without it (first output table). The Levene test is also computed to help decide whether equal variances can be assumed (second output table).

The descriptive statistics table is augmented with the number of rows that belong to neither group (Ignored Count (Group Column)).
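
The two variants in the first output table correspond to the pooled-variance t-test (equal variances assumed) and the Welch t-test (equal variances not assumed). The following standalone Java sketch contrasts the two statistics on made-up groups; it only illustrates the formulas:

    // Two-sample t statistics: pooled variance versus Welch.
    public final class TwoSampleTSketch {
        public static void main(String[] args) {
            double[] g1 = {5.1, 4.9, 5.6, 5.0};      // made-up group 1
            double[] g2 = {4.2, 4.5, 4.1, 4.8, 4.4}; // made-up group 2
            double m1 = mean(g1), m2 = mean(g2);
            double v1 = variance(g1, m1), v2 = variance(g2, m2);
            int n1 = g1.length, n2 = g2.length;

            // Equal variances assumed: pooled variance estimate.
            double pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2);
            double tPooled = (m1 - m2) / Math.sqrt(pooled * (1.0 / n1 + 1.0 / n2));

            // Equal variances not assumed (Welch).
            double tWelch = (m1 - m2) / Math.sqrt(v1 / n1 + v2 / n2);

            System.out.println("pooled t = " + tPooled + ", Welch t = " + tWelch);
        }

        static double mean(double[] xs) {
            double s = 0;
            for (double x : xs) s += x;
            return s / xs.length;
        }

        static double variance(double[] xs, double m) {
            double s = 0;
            for (double x : xs) s += (x - m) * (x - m);
            return s / (xs.length - 1); // sample variance
        }
    }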

The last node for hypothesis testing is One-way ANOVA. It allows you to compare the means within groups defined by the values of a single column, just like the Independent groups t-test node does; however, it supports more than two groups.
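
The reported F statistic is the ratio of the between-group and within-group mean squares. The following standalone Java sketch shows that computation on made-up groups; again, it is only an illustration of the formula:

    // One-way ANOVA F statistic for groups defined by a nominal column.
    public final class AnovaSketch {
        public static void main(String[] args) {
            double[][] groups = {
                {5.1, 4.9, 5.6, 5.0},  // made-up group A
                {4.2, 4.5, 4.1, 4.8},  // made-up group B
                {6.0, 5.8, 6.3, 5.9}   // made-up group C
            };
            int k = groups.length, n = 0;
            double grandSum = 0;
            for (double[] g : groups) {
                n += g.length;
                for (double v : g) grandSum += v;
            }
            double grandMean = grandSum / n;
            double ssBetween = 0, ssWithin = 0;
            for (double[] g : groups) {
                double m = 0;
                for (double v : g) m += v;
                m /= g.length;
                ssBetween += g.length * (m - grandMean) * (m - grandMean);
                for (double v : g) ssWithin += (v - m) * (v - m);
            }
            int dfBetween = k - 1, dfWithin = n - k;
            double f = (ssBetween / dfBetween) / (ssWithin / dfWithin);
            System.out.println("F = " + f + ", df = (" + dfBetween + ", " + dfWithin + ")");
        }
    }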

Finally, when you need robust statistics, you can use the Conditional Box Plot node. It gives you the minimum and maximum values, the median, Q1, Q3, and the whisker values (they can be the same as the minimum/maximum; otherwise they lie at most 1.5 times the interquartile range (Q3 - Q1) below Q1 or above Q3).
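
The whisker rule can be illustrated with a small standalone Java sketch that derives the quartiles and whisker values from a sorted column. The quantile computation below uses simple linear interpolation, which may differ slightly from the node's convention; the values are made up:

    // Robust statistics of a numeric column: median, quartiles, and whisker values.
    import java.util.Arrays;

    public final class BoxPlotSketch {
        public static void main(String[] args) {
            double[] column = {1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 9.5}; // made-up values
            Arrays.sort(column);
            double q1 = quantile(column, 0.25);
            double median = quantile(column, 0.50);
            double q3 = quantile(column, 0.75);
            double iqr = q3 - q1; // interquartile range
            // Whiskers: the most extreme values within 1.5 * IQR of the quartiles.
            double lowerLimit = q1 - 1.5 * iqr, upperLimit = q3 + 1.5 * iqr;
            double lowerWhisker = column[0], upperWhisker = column[column.length - 1];
            for (double v : column) {
                if (v >= lowerLimit) { lowerWhisker = v; break; }
            }
            for (int i = column.length - 1; i >= 0; i--) {
                if (column[i] <= upperLimit) { upperWhisker = column[i]; break; }
            }
            System.out.println("median=" + median + " Q1=" + q1 + " Q3=" + q3
                + " whiskers=[" + lowerWhisker + ", " + upperWhisker + "]");
        }

        // Simple linear-interpolation quantile on a sorted array.
        static double quantile(double[] sorted, double p) {
            double pos = p * (sorted.length - 1);
            int lo = (int) Math.floor(pos), hi = (int) Math.ceil(pos);
            return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
        }
    }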
