In this case, we will compute ranks (based on a certain order) within groups. This is a much easier task, but it can be very useful if you want to select outliers without the prior knowledge needed to define cut-off points. It can also be useful for summarizing historical data (for example, finding the three or five top hits that led the sales list the longest in different genres). The task gets even simpler when we do not need the rank, just the extreme values. Certain algorithms, however, can use the rank values for better predictions, because we humans are biased toward the best options. For example, suppose that in a 100-minute race the difference between the first and the fifth drivers is one minute; that is, it amounts to one percent. It is quite a small difference, although the differences in prize money and fame are much larger.
The example workflow is in the GroupRanks.zip file.
First, we generate some sample data with the Data Generator node, just like before. Then, inside the Rank custom meta node, we loop through the groups defined by the Cluster Membership column using the Group Loop Start node.
Within each group, we sort the data by the Universe_0_0 column in ascending order (using the other numeric columns to break ties) with the Sorter node.
The Java Snippet node just uses the built-in ROWINDEX value to calculate the result (the index + 1, so that ranking starts from one).
In the Loop End node, we disabled the generation of the iteration column, because we do not need it; the Cluster Membership column already identifies the groups with a nice label.
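For readers who prefer to see the logic outside KNIME, the loop above can be sketched in plain Python. This is only an illustration of the same idea, not the workflow itself: the function name rank_within_groups, the second column name Universe_0_1, and all the sample values are made up for the example.

```python
from collections import defaultdict
from operator import itemgetter


def rank_within_groups(rows, group_key, sort_keys):
    """Return copies of the rows with a 1-based 'rank' added per group."""
    # Counterpart of Group Loop Start: collect the rows of each group.
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    ranked = []
    for members in groups.values():
        # Counterpart of Sorter: ascending by the first key,
        # the remaining keys only break ties.
        members_sorted = sorted(members, key=itemgetter(*sort_keys))
        # Counterpart of the Java Snippet: zero-based index + 1.
        for index, row in enumerate(members_sorted):
            ranked.append({**row, "rank": index + 1})
    return ranked


# Hypothetical sample data; the values are invented for illustration.
rows = [
    {"Cluster Membership": "Cluster_0", "Universe_0_0": 0.42, "Universe_0_1": 0.10},
    {"Cluster Membership": "Cluster_1", "Universe_0_0": 0.91, "Universe_0_1": 0.55},
    {"Cluster Membership": "Cluster_0", "Universe_0_0": 0.17, "Universe_0_1": 0.80},
    {"Cluster Membership": "Cluster_0", "Universe_0_0": 0.42, "Universe_0_1": 0.05},
    {"Cluster Membership": "Cluster_1", "Universe_0_0": 0.30, "Universe_0_1": 0.20},
]

ranked = rank_within_groups(
    rows, "Cluster Membership", ["Universe_0_0", "Universe_0_1"]
)
for row in ranked:
    print(row["Cluster Membership"], row["Universe_0_0"], row["rank"])
```

Because the rank is simply the position after sorting, tied rows still receive distinct ranks; the tie-breaking columns decide their order.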
That's it. This is really easy.
Sometimes the rows that are outliers in multiple dimensions can be explained by a covariance between the columns. However, rows that are outliers in only a few dimensions might point to a measurement error in those columns.
With the ranks in the columns, you can now perform the checks you find worth executing.