Clustering

Tableau 10 introduces the ability to quickly perform clustering analysis in your visualizations. This allows you to find groups, or clusters, of individual data points that are similar based on any number of variables of your choosing. This can be useful in many different industries and fields of study, as in the following examples:

Marketing may find it useful to determine groups of customers related to each other based on spending amounts, frequency of purchases, or times and days of orders.
Patient care directors in hospitals may benefit from understanding groups of patients related to each other based on diagnoses, medication, length of stay, and number of readmissions.
Immunologists may search for related strains of bacteria based on drug resistance or genetic markers.
Renewable energy consultants may want to pinpoint clusters of windmills based on energy production and then correlate that with geographic location.

Tableau uses a standard k-means clustering algorithm that will yield consistent results every time the view is rendered. Tableau will automatically assign the number of clusters (k), but you have the option of adjusting the value as well as assigning any number of variables.
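To make the idea concrete, here is a minimal k-means sketch in plain Python. It is not Tableau's implementation: the function name `kmeans`, the sample points, and the seeding rule (the first k points serve as the initial centers) are assumptions chosen only so that the example is deterministic, mirroring the consistent results described above.

```python
from math import dist

def kmeans(points, k, max_iter=100):
    """Partition points into k clusters; returns (labels, centers)."""
    # Assumption: seed with the first k points so every run is identical.
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centers

# Hypothetical two-variable data: three points near (1, 1), three near (9, 9).
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
labels, centers = kmeans(points, k=2)
print(labels)
```

Because nothing in the sketch is random, calling `kmeans` again on the same points returns the same labels, which is the behavior you want when a view is re-rendered.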

As we consider clustering, we'll turn once again to the real estate data to see whether we can find groupings of related houses on the market and then determine whether there's any geographic pattern based on the clusters we find.

Although you can add clusters to any visualization, we'll start with a scatterplot, because it already allows us to see the relationship between two variables. That will give us some insight into how clustering works, and then we can add additional variables to see how the clusters are redefined.

Beginning with the basic scatterplot of Address by Size and Price, switch to the Analytics pane and drag Cluster to the view:

When you drop Cluster onto the view, Tableau will generate a new Clusters field (automatically placed on Color here) and will display a Clusters window containing the fields used as Variables and an option to change the Number of Clusters. The Variables will contain the measures already in the view by default:

The Variables are all of the factors that the clustering algorithm uses to determine related data points. The Number of Clusters determines into how many groups the data is partitioned. In the preceding view, you'll observe three clusters of houses:

Those with low price and smaller size
Those in the middle
Those with high price and large size

Because the two variables used for the clusters are the same as those used for the scatterplot, it's relatively easy to see the boundaries of the clusters (you can imagine a couple of diagonal lines partitioning the data).

You can drag and drop nearly any field into and out of the Variables section (from the data pane or the view) to add and remove variables. The clusters will automatically update as you do so. Experiment by adding Bedrooms to the Variables list and observe that there's now some overlap between Cluster 1 and Cluster 2, because some larger homes only have two or three bedrooms while some smaller homes might have four or five. The number of bedrooms now helps define the clusters. Remove Bedrooms and note that the clusters are immediately updated again.
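The effect of adding a variable can be sketched outside Tableau as well. The following example is an illustration under stated assumptions, not Tableau's code: the house rows are made up, the helper names `min_max_scale` and `kmeans_labels` are hypothetical, each variable is rescaled to the 0 to 1 range so that price does not dominate, and the first k rows seed the clusters.

```python
from math import dist

def min_max_scale(rows):
    """Rescale every column to the 0..1 range so no variable dominates."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [tuple((v - lo) / s for v, lo, s in zip(r, lows, spans)) for r in rows]

def kmeans_labels(points, k, max_iter=100):
    """Plain k-means; assumption: the first k points are the initial centers."""
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

# Hypothetical houses: (size in sq ft, price in thousands, bedrooms).
houses = [
    (1000, 150, 2),
    (1200, 180, 5),  # small home with many bedrooms
    (3000, 450, 2),  # large home with few bedrooms
    (3200, 500, 5),
    (1100, 160, 2),
    (3100, 480, 5),
]

# Cluster on Size and Price only, then on Size, Price, and Bedrooms.
two_vars = kmeans_labels(min_max_scale([h[:2] for h in houses]), k=2)
three_vars = kmeans_labels(min_max_scale(houses), k=2)
print(two_vars)
print(three_vars)
```

With these made-up rows, the second house (small, but with five bedrooms) sits with the other small homes when only size and price are used, and moves to a different cluster once bedrooms is included. Plotted on size and price alone, the two clusters would therefore overlap.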

Once you have meaningful clusters, you can materialize the clusters as groups in the data source by dragging them from the view and dropping them into the data pane:

The cluster group will not be recalculated at render time. To recalculate it, use the drop-down menu of the field in the data pane and select Refit. Using a cluster group allows you to accomplish a lot, including the following:

Cluster groups can be used across multiple visualizations and can be used in actions in dashboards.
Cluster groups can be edited and individual members moved between groups if desired.
Cluster group names can be aliased, allowing more descriptive names than Cluster 1 and Cluster 2.
Cluster groups can be used in calculated fields, while clusters can't.
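One way to picture a cluster group is as a saved lookup table from each address to a cluster label, plus a table of aliases. The sketch below is only an analogy in Python, not how Tableau stores groups. The addresses and the name `display_name` are hypothetical, and the alias "High Price & Size" for the third cluster is an assumption.

```python
# Hypothetical snapshot of cluster assignments, frozen when the group is created.
cluster_group = {
    "12 Oak St": "Cluster 1",
    "48 Elm Ave": "Cluster 2",
    "7 Pine Rd": "Cluster 3",
    "90 Lake Dr": "Cluster 2",
}

# Aliases give each group a more descriptive display name.
aliases = {
    "Cluster 1": "Low Price & Size",
    "Cluster 2": "Mid-Range",
    "Cluster 3": "High Price & Size",  # assumed alias for illustration
}

# A member can be moved by hand; nothing recomputes the assignment.
cluster_group["90 Lake Dr"] = "Cluster 3"

def display_name(address):
    """Return the aliased group name for an address."""
    return aliases[cluster_group[address]]

# A calculated-field-style flag built on top of the group.
is_high_end = {
    addr: display_name(addr) == "High Price & Size" for addr in cluster_group
}
print(is_high_end)
```

Because the mapping is stored rather than recomputed, manual edits and aliases persist, and other logic (such as the flag above) can refer to the group by name.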

In the following example, a map of the properties has been color-coded by the Address (clusters) group created from the previous view, which helps us see whether there's any geographic pattern to the clusters based on price and size. Although the clusters could have been created directly in this visualization, using the group provides some of the advantages just mentioned:

In the view here, each original cluster is now a group that has been aliased to give a better description of the cluster. To edit aliases, you can use the drop-down menu for the group field in the data pane or, alternatively, right-click the item in the color legend.

There are many options for editing how maps appear. You can adjust the layers that are shown on maps to provide additional context for the data you are plotting. From the top menu, select Maps | Map Layers. The layer options will appear in the left-hand sidebar. The preceding map has had streets, highways, county borders, and zip code borders enabled to give each address a more precise context of location. The layers that are available to select will depend on the zoom level of the map.

In looking at the previous view, you do indeed find neighborhoods that are almost exclusively the Low Price & Size (Cluster 1) and others that are almost exclusively Mid-Range (Cluster 2). Consider how a real-estate investor might use such a visualization to look for a good buy of a low-priced house in a mid-range neighborhood.
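The investor's question can also be expressed as a small tabulation. The following sketch assumes each house carries a neighborhood label in addition to its cluster group. The rows, neighborhood names, and variable names are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (address, neighborhood, cluster group alias).
houses = [
    ("12 Oak St", "Northside", "Mid-Range"),
    ("14 Oak St", "Northside", "Mid-Range"),
    ("16 Oak St", "Northside", "Low Price & Size"),
    ("18 Oak St", "Northside", "Mid-Range"),
    ("3 Mill Ln", "Riverside", "Low Price & Size"),
    ("5 Mill Ln", "Riverside", "Low Price & Size"),
    ("7 Mill Ln", "Riverside", "Low Price & Size"),
]

# Count how many houses of each group fall in each neighborhood.
by_neighborhood = defaultdict(Counter)
for _, hood, group in houses:
    by_neighborhood[hood][group] += 1

# The dominant group is the most common one in the neighborhood.
dominant = {
    hood: counts.most_common(1)[0][0] for hood, counts in by_neighborhood.items()
}

# Candidate buys: low-priced houses in a mostly mid-range neighborhood.
candidates = [
    addr
    for addr, hood, group in houses
    if group == "Low Price & Size" and dominant[hood] == "Mid-Range"
]
print(candidates)  # prints ['16 Oak St']
```

The map makes the same pattern visible at a glance; the tabulation simply states the rule the eye is applying.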
