Case study – ranks within groups

In this case study, we compute ranks (based on a certain order) within groups. This is a much easier task, but it can be very useful if you want to select outliers without the prior knowledge needed to define cut-off points. It can also be useful for summarizing historical data (for example, finding the three or five top hits that led the sales list the longest in different genres). The task is even simpler when we do not need the ranks, just the extreme values. Still, certain algorithms can use the rank values for better predictions, because we humans are biased toward the best options. For example, suppose that in a 100-minute race the difference between the first and the fifth driver is one minute, that is, one percent. It is quite a small difference, although the differences in prizes and fame are much larger.
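The simpler variant mentioned above, where only the extreme values per group are needed, can be sketched outside KNIME in a few lines of plain Python. This is only an illustration under stated assumptions: the `genre` and `weeks_at_top` field names, the sample numbers, and the `extremes_per_group` helper are all hypothetical and are not part of the example workflow.

```python
import heapq
from collections import defaultdict


def extremes_per_group(rows, group_key, value_key, k=3):
    """Return the k smallest and k largest values of value_key for each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {
        label: {
            # No full sort and no rank column: only the k extremes are kept.
            "smallest": heapq.nsmallest(k, values),
            "largest": heapq.nlargest(k, values),
        }
        for label, values in groups.items()
    }


# Hypothetical data: weeks spent at the top of the sales list, per genre.
sales = [
    {"genre": "rock", "weeks_at_top": 12},
    {"genre": "rock", "weeks_at_top": 3},
    {"genre": "jazz", "weeks_at_top": 5},
    {"genre": "rock", "weeks_at_top": 7},
    {"genre": "jazz", "weeks_at_top": 9},
]
extremes = extremes_per_group(sales, "genre", "weeks_at_top", k=2)
```

Here `extremes["rock"]["largest"]` holds the two longest runs in the rock group. When the position of every row matters, not just the extremes, a full per-group ranking is needed instead.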

The example workflow is in the GroupRanks.zip file.

First, we generate some sample data with the Data Generator node, just like before. Then, inside the custom Rank meta node, we loop through the groups defined by the Cluster Membership column using the Group Loop Start node.

Within each group, we sort the data by the Universe_0_0 column in ascending order (using the other numeric columns to break ties) with the Sorter node.

The Java Snippet node just uses ROWINDEX to calculate the result (the index + 1, so that the ranking starts from one).

In the Loop End node, we disabled the generation of the iteration column, because it is not interesting for us; the Cluster Membership column already identifies the groups with a nice label.
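For readers who want to check the logic outside KNIME, the same steps (split by group, sort within each group, use the row index + 1 as the rank) can be sketched in plain Python. This is not the workflow itself, only a rough equivalent: the `cluster` key, the `Universe_0_1` tie-breaker column, the sample values, and the `rank_within_groups` helper are assumptions made for illustration.

```python
from collections import defaultdict


def rank_within_groups(rows, group_key, sort_keys):
    """Return copies of the rows with a 1-based 'rank' field computed per group."""
    # Split the rows by group label (the role of the group loop).
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    ranked = []
    for members in groups.values():
        # Sort ascending by the first key; the remaining keys break ties.
        members.sort(key=lambda r: tuple(r[k] for k in sort_keys))
        # The zero-based position within the group, plus one, is the rank.
        for index, row in enumerate(members):
            ranked.append({**row, "rank": index + 1})
    return ranked


# Hypothetical sample rows; only Universe_0_0 is named in the text.
rows = [
    {"cluster": "Cluster_0", "Universe_0_0": 0.42, "Universe_0_1": 0.10},
    {"cluster": "Cluster_1", "Universe_0_0": 0.91, "Universe_0_1": 0.55},
    {"cluster": "Cluster_0", "Universe_0_0": 0.17, "Universe_0_1": 0.80},
    {"cluster": "Cluster_1", "Universe_0_0": 0.91, "Universe_0_1": 0.30},
]
ranked = rank_within_groups(rows, "cluster", ["Universe_0_0", "Universe_0_1"])
```

The two `Cluster_1` rows tie on `Universe_0_0`, so the second sort key decides their order, which mirrors the tie-breaking role of the other numeric columns in the Sorter node.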

That's it. This is really easy.

Tip

Exercise

Modify the example to give ranks from the opposite direction too. How would you do that without re-sorting the subtable? Could you do it in such a way that the ranks with small absolute values mark the extreme rows, while the larger ones mark the usual rows? For example: 1, 2, 3, 4, -4, -3, -2, -1.

Sometimes rows that are outliers in multiple dimensions can be explained by a covariance between the columns. However, rows that are outliers in only a few dimensions might indicate a measurement error in those columns.

Tip

Exercise

Compute the ranks for a user-defined list of numeric columns in both directions to find outliers with this method.

With the ranks in the columns, you can now perform the checks you find worth executing.
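One possible check, sketched here in plain Python rather than as KNIME nodes, is to count the columns in which each row has an extreme rank. That separates rows that are extreme in many dimensions from rows that are extreme in only a few. The rank values, the row IDs, the column names other than Universe_0_0, and the `count_extreme_columns` helper are hypothetical; the sketch assumes the ranks have already been computed as 1-based ascending ranks within a group of known size.

```python
def count_extreme_columns(rank_row, group_size, k=1):
    """Count the columns in which a row ranks among the k lowest or k highest."""
    return sum(
        1
        for rank in rank_row.values()
        if rank <= k or rank > group_size - k
    )


# Hypothetical ranks (1-based, ascending) of three rows in a group of 10 rows.
rank_rows = {
    "Row0": {"Universe_0_0": 1, "Universe_0_1": 10, "Universe_1_0": 1},
    "Row1": {"Universe_0_0": 5, "Universe_0_1": 10, "Universe_1_0": 4},
    "Row2": {"Universe_0_0": 6, "Universe_0_1": 3, "Universe_1_0": 7},
}
extreme_counts = {
    row_id: count_extreme_columns(ranks, group_size=10)
    for row_id, ranks in rank_rows.items()
}
```

In this made-up data, `Row0` is extreme in every column, while `Row1` is extreme in a single column only, which is the kind of row that might deserve a closer look for a measurement error.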
