Basic KNIME views

The main views of KNIME give you several ways to explore data. These nodes cannot generate images for downstream nodes, but they give a good overview of the data, and you can save what a view shows using its File menu.

Some of the nodes come in two flavors: interactive and normal. With the interactive flavor, you can modify certain parameters in the view itself without reconfiguring (and re-executing) the node. The interactive versions are better suited for data exploration, but the normal ones make it easier to repeat the same check with new data.

The Box plots

The Box Plot node has no configuration, but gives robust statistics (minimum, smallest, lower quartile, median, upper quartile, largest, and maximum) for numeric columns. You might wonder about the difference between the minimum and the smallest values, or between the largest and the maximum values. The smallest is the maximum of the minimal value and Q1 - 1.5 × IQR, where Q1 is the lower quartile and IQR is the interquartile range (the upper quartile minus the lower quartile). The largest is computed analogously from the upper quartile.
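The following is a minimal Python sketch of these statistics, not KNIME's own implementation. It assumes the common 1.5 × IQR whisker rule for the smallest and largest values, and it assumes the "inclusive" quantile method of Python's statistics module; the function name box_plot_stats is made up for illustration.

```python
from statistics import quantiles


def box_plot_stats(values):
    """Return box plot statistics for a list of at least two numbers."""
    data = sorted(values)
    # Assumption: "inclusive" quartiles; KNIME may interpolate differently.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return {
        "minimum": data[0],
        # Smallest and largest are the extreme values inside the fences.
        "smallest": min(v for v in data if v >= lower_fence),
        "lower quartile": q1,
        "median": median,
        "upper quartile": q3,
        "largest": max(v for v in data if v <= upper_fence),
        "maximum": data[-1],
    }


stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(stats["largest"], stats["maximum"])
```

In this example, the value 100 lies outside the upper fence, so the largest value and the maximum differ; that gap is what marks 100 as an outlier in the diagram.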

The view gives a box-and-whisker diagram, which is useful to find outliers. The Column Selection tab allows you to focus only on certain columns. The Normalize option on the Appearance tab will rescale the box-and-whisker diagrams to have the same length on the screen between the minimum and maximum values.

The Conditional Box Plot node's view is quite similar to the Box Plot view, although in this case the diagram is not split by columns, but by the values of a preselected nominal column. Each box represents the values of a single numeric column for one of those nominal values. You can also select whether the missing values should be visible or not.

The node view controls are really similar to the Box Plot's. However, in this case, the Column Selection tab does not refer to the columns from the table, but to the columns on the diagram; you can select the class values that should be visible.

Hierarchical clustering

You can visualize the result of hierarchical clustering with the Hierarchical Cluster View node; however, it is worth summarizing the steps that lead to a cluster model you can show. First, you have to specify the distance between the rows using one of the options we described in the Distance matrix section.

In the Hierarchical Clustering (DistMatrix) node's configuration, the main option you have to select is the Linkage Type, which defines how the distance between the clusters should be measured:

  • Single: It measures the minimal distance between the cluster points
  • Average: It measures the average of the distances between the points of the clusters
  • Complete: It measures the maximal distance between the cluster points

If the table has multiple distance matrix columns, you can also select which one to use.
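The three linkage types listed above can be sketched in a few lines of Python. This is only an illustration of the definitions, not the node's implementation; the function linkage_distance and its parameters are hypothetical names.

```python
from itertools import product


def linkage_distance(cluster_a, cluster_b, dist, linkage="single"):
    """Distance between two clusters under the given linkage type."""
    # All pairwise distances between points of the two clusters.
    pairwise = [dist(a, b) for a, b in product(cluster_a, cluster_b)]
    if linkage == "single":
        return min(pairwise)
    if linkage == "complete":
        return max(pairwise)
    if linkage == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage type: {linkage}")


def absolute_difference(a, b):
    return abs(a - b)


left, right = [0, 1], [4, 6]
for kind in ("single", "average", "complete"):
    print(kind, linkage_distance(left, right, absolute_difference, kind))
```

Single linkage tends to chain clusters together through their closest points, while complete linkage favors compact clusters; average linkage sits between the two.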

Histograms

The difference between Histogram and Histogram (interactive) is minimal in the configurations (the non-interactive version allows you to specify the number of bins at configuration time). The common configuration options are the Binning column, Aggregation column, and the No. of rows to display. With the Binning column option, you can define how the main bins should be created; the column can be either nominal or numeric. The color information splits each bar into sections, and the aggregation columns are shown as separate, adjacent bars.

The possible aggregation options are: Average, Sum, Row Count, and Row Count (w/o missing values). When you have multiple aggregation columns selected, the plain Row Count (which includes the missing values) is not an informative or recommended choice, because it gives the same value for every column.
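As a rough sketch of what these aggregation options compute for a nominal binning column, consider the Python function below. It assumes that Sum and Average skip missing values (represented here as None); that treatment, and the name aggregate_bins, are assumptions for illustration rather than documented KNIME behavior.

```python
from collections import defaultdict


def aggregate_bins(rows, binning_column, aggregation_column):
    """Group rows by a nominal column and aggregate one numeric column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[binning_column]].append(row.get(aggregation_column))

    result = {}
    for bin_value, values in groups.items():
        # Assumption: missing values (None) are ignored by Sum and Average.
        present = [v for v in values if v is not None]
        result[bin_value] = {
            "Row Count": len(values),
            "Row Count (w/o missing values)": len(present),
            "Sum": sum(present),
            "Average": sum(present) / len(present) if present else None,
        }
    return result


rows = [
    {"type": "a", "x": 1.0},
    {"type": "a", "x": None},
    {"type": "a", "x": 3.0},
    {"type": "b", "x": 10.0},
]
print(aggregate_bins(rows, "type", "x"))
```

Note that Row Count depends only on the binning column, which is why it looks identical for every aggregation column you add.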

On the Visualization settings tab, you can further customize the view by enabling or disabling the outlines and grid lines, or by changing the orientation, the bar width, or the labels.

The Details tab gives the information about the selected bars, such as the average, sum, count for each column, and colors. (You can select the monochrome part of a bar too.)

Interactive Table

The interactive table looks like a plain port view; however, it gives further options, such as the HiLiting support and the optional color information (in the port view, it is not optional). You can also save the content to a CSV file (Output | Write CSV), adjust the default row height and column width (View | Row Height... and Column Width...), and find certain values (Navigation | Find, Ctrl + F).

The options for sorting by columns (Ctrl + click, or the menu from the regular click) and reordering (dragging) them are also available in this view, and you can select the preferred renderers for them. However, you cannot check the metadata information (column stats and the properties).

The Lift chart

The Lift Chart node is useful when you want to evaluate the fit of a model for a binominal class. In the configuration dialog, you specify the column holding the training label and the value that was learned. You also have to specify the column with the probabilities of the learned label and the width of the bins (as a percentage; you will get 100 divided by that value points). The view has two parts, Lift Chart and Cumulative Chart, each with its own settings for colors, line widths, and dot sizes (with visibilities).

The Lift Chart part also contains the cumulative lift, but you can hide it if you do not want it.
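To make the numbers behind such a chart concrete, here is a hedged Python sketch of one common way to compute lift and cumulative lift per bin: rank the rows by predicted probability, cut them into bins of the given width, and compare the positive rate in each bin with the overall positive rate. The function lift_points is hypothetical, and the node's exact computation may differ.

```python
def lift_points(labels, probabilities, positive, bin_width_percent=10):
    """Return (lift per bin, cumulative lift per bin) for ranked rows."""
    # Rank rows from the highest to the lowest predicted probability.
    ranked = [
        label
        for _, label in sorted(
            zip(probabilities, labels), key=lambda pair: pair[0], reverse=True
        )
    ]
    n = len(ranked)
    base_rate = sum(1 for label in ranked if label == positive) / n
    if base_rate == 0:
        raise ValueError("no rows with the positive label")

    n_bins = round(100 / bin_width_percent)
    lifts, cumulative = [], []
    hits_so_far = 0
    for i in range(n_bins):
        start, end = i * n // n_bins, (i + 1) * n // n_bins
        chunk = ranked[start:end]
        hits = sum(1 for label in chunk if label == positive)
        hits_so_far += hits
        lifts.append((hits / len(chunk)) / base_rate if chunk else 0.0)
        cumulative.append((hits_so_far / end) / base_rate if end else 0.0)
    return lifts, cumulative


labels = ["y", "y", "y", "n", "y", "n", "n", "n", "n", "n"]
probabilities = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
print(lift_points(labels, probabilities, "y", bin_width_percent=50))
```

A lift above 1 in the first bins means that the model ranks the positive rows ahead of the rest; the cumulative lift always ends at 1 once all rows are included.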

Lines

The Line Plot and the Parallel Coordinates views are similar, but they show the data transposed with respect to each other. The Parallel Coordinates view puts the selected columns on the x axis, and each row is drawn as a line running horizontally across them, colored by the color properties. In Line Plot, the rows are on the x axis and each (numeric) column is drawn as a line in a user-defined color.

The missing values are handled differently; in Line Plot, you can try to interpolate them, while in Parallel Coordinates, you can either omit or show them or their rows.

Line Plot is more suited for equidistant data, such as time series; for other data it might give misleading results, because the distances between the rows are always the same. The Parallel Coordinates view is better suited to find connections between the values of different columns, because you are not tied to a fixed column order: you can change the order of the columns within the view using the extra mouse mode, Transformation. The view also offers a neat option to use curves instead of straight lines, so you can create clear figures that show correlations intuitively.
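The text does not say how Line Plot interpolates, so the following Python sketch simply assumes linear interpolation between the nearest known neighbors; both that choice and the name interpolate_missing are assumptions. Missing values are represented as None, and gaps at the start or end of the series are left untouched.

```python
def interpolate_missing(values):
    """Fill interior None gaps by linear interpolation between neighbors."""
    result = list(values)
    known = [i for i, value in enumerate(result) if value is not None]
    # Walk over each pair of consecutive known positions and fill the gap.
    for left, right in zip(known, known[1:]):
        step = (result[right] - result[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = result[left] + step * (i - left)
    return result


print(interpolate_missing([1.0, None, None, 4.0, None]))
```

Interpolated points are guesses, so a line that passes through them can suggest a trend that the data does not actually contain.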

Pie charts

The Pie Chart and the Pie Chart (interactive) nodes have the same configuration options, although for the latter, the configuration only sets defaults that you can override in the view. These options include the binning column, the aggregation column, and the aggregation function.

With Ctrl + click, you can select multiple sections of the pie. HiLiting works in this view, and the Details tab contains statistical information for each selected section, split by the colors within the section. When the binning is not consistent with the color property, no coloring is applied to the sections unless you select them (and enable the Color selected section option).

On the Visualization settings tab, you can specify whether the section representing the missing values should be visible, whether to show the outline, whether to explode the selection, and whether the aggregated value or percentage should be visible (for selected, all, or no sections). The size of the diagram can also be adjusted on this tab.

The Scatter plots

The Scatter Matrix and the Scatter Plot nodes are quite similar. The Scatter Matrix node is a generalization of the latter. It allows you to check the scatter plots for different columns side-by-side.

A scatter plot can use all the visual properties (size, shape, and color), so you can visualize up to five different columns' values on a 2D plot.

Neither node has many configuration options: you can set the maximum number of rows and the maximum number of distinct nominal values in a column.

With Scatter Plot, you can only select two columns, one for the x axis and one for the y axis, but you can set the ranges for them. With Scatter Matrix, you can select multiple columns, and when you are in the Transformation mouse mode, you can rearrange the rows/columns, but you cannot change their ranges.

Both views support jittering when one of the columns is nominal (the Appearance tab, Jitter slider). In that case, the values in the other dimension get some random noise, so the number of points at a position can be estimated more easily. If you want precise positions, consider adding transparency to the color of the points instead, so that overlaps become visible.
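Jittering itself is a simple idea, sketched below in Python under the assumption of uniform random noise within a fixed amount; the function jitter and its parameters are hypothetical and only illustrate the principle, not what the Jitter slider does internally.

```python
import random


def jitter(positions, amount, seed=None):
    """Shift each position by uniform random noise in [-amount, amount]."""
    rng = random.Random(seed)  # a fixed seed makes the plot reproducible
    return [p + rng.uniform(-amount, amount) for p in positions]


# Five points that would otherwise be drawn on top of each other.
overlapping = [2.0, 2.0, 2.0, 2.0, 2.0]
print(jitter(overlapping, amount=0.2, seed=42))
```

The noise only changes where the points are drawn, so the amount should stay small compared with the spacing between the nominal categories.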

The Linear Regression (Learner) and the Polynomial Regression (Learner) nodes also provide scatter plot views, although these show the model as a line. It can be useful to see the regression visually, even though these views do not specify which slice of the fitted function is shown out of the many possible slices parallel to the selected column.

Spark Line Appender

The Spark Line Appender node does not have a view, but it generates a column with an SVG image of a line plot of the selected numeric columns for each row. This can be useful to find interesting patterns. The images are hard to see at the initial row height, and resizing the rows one by one is not much fun (although you can avoid that by holding the Shift key while you resize the height of a row). It is therefore recommended to use Interactive Table, because in that view you can set the row height from the menu.
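To give an idea of what such an SVG cell might contain, here is a small Python sketch that turns the numeric values of one row into an SVG polyline. The function sparkline_svg, the default image size, and the styling are all assumptions for illustration; the markup that KNIME generates is not described in the text and will differ.

```python
def sparkline_svg(values, width=100, height=20):
    """Build a minimal SVG line plot for one row of numeric values."""
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid division by zero for constant rows
    x_step = width / (len(values) - 1) if len(values) > 1 else 0.0
    # SVG's y axis points down, so larger values get smaller y coordinates.
    points = " ".join(
        f"{i * x_step:.1f},{height - (v - low) / span * height:.1f}"
        for i, v in enumerate(values)
    )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">'
        f'<polyline fill="none" stroke="black" points="{points}"/></svg>'
    )


print(sparkline_svg([0, 5, 10]))
```

Because each row is scaled to its own minimum and maximum in this sketch, it shows the shape of a row rather than its absolute level.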

Radar Plot Appender

The Radar Plot Appender node works much like the previous node, although it has more configuration options. You can set several colors for the SVG cell, as well as the ranges and the branches (columns) of the radar plot. The resulting table has a slightly larger predefined row height, but using an Interactive Table view might still be a good idea.

The Scorer views

The ROC (Receiver Operating Characteristic) Curve and Enrichment Plotter nodes give options to evaluate a certain model's performance visually. Because the views are not very interactive, you have to specify every parameter upfront in the configuration dialog.

In the ROC Curve configuration, you have to select the binominal Class column and the label (Positive class value) to which the probabilities belong. This way, you will be able to compare different kinds of models or models with different parameters. The node also provides the areas beneath the ROC curve in the result table.
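The curve and the area beneath it can be sketched in Python as follows. This is a textbook construction (sweep the threshold over the sorted probabilities, then apply the trapezoidal rule), not the node's source code, and the function names are made up for the example.

```python
def roc_curve_points(labels, scores, positive):
    """Return ROC points as (false positive rate, true positive rate)."""
    pairs = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    positives = sum(1 for _, label in pairs if label == positive)
    negatives = len(pairs) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both class values must be present")

    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Rows with tied scores cross the threshold together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == positive:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as sorted (x, y) points."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


labels = ["p", "p", "n", "p", "n", "n"]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(area_under_curve(roc_curve_points(labels, scores, "p")))
```

An area of 1 corresponds to a model that ranks every positive row above every negative one, while 0.5 is what random ranking would give on average.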

The Enrichment Plotter node helps you decide where to set the cut-off point to select the hits. The node description gives a more detailed guide on how to use it.
