The Relationship Between Tools and Tasks: Clustering Respondents

DATA PROFESSIONALS ARE NOT A homogeneous group; there are various types of roles in the space. While it is easier, and more common, to classify roles based on titles, clustering based on tools and tasks is a more rigorous way to define the key divisions between respondents of the survey. Every respondent is assigned to one of four clusters based on their tools and tasks.1

The four clusters were not evenly populated: their shares of the survey sample were 29%, 31%, 23%, and 17%, respectively. They can be described as shown on the right.

A selection of tool and task percentages are described in the sections that follow, and the full profiles of tool/task percentages are found in Appendix A.

Operating Systems

In our three previous Data Science Salary Survey reports, the clearest division in tool clusters separated one group of open source, usually GUI-less tools, from another consisting of proprietary software, largely developed by Microsoft. Common tools in the open source group have been Linux, Python, Spark, Hadoop, and Java, and common tools in the Microsoft/closed source group include Windows, Excel, Visual Basic, and MS SQL Server. This same division appears when we cluster respondents, and is clearest when we look at the usage of operating systems:

A set of tasks also emphasize the division between the first two and last two clusters. The following percentages represent respondents who indicated major engagement in these tasks:

For all of the above tasks, the top two percentages belonged to clusters 3 and 4, and both were much higher than either percentage for clusters 1 and 2.

Python, Matplotlib, Scikit-Learn

Another set of tools that exposes the primary split between clusters 1/2 and 3/4 is Python and two of its popular packages, Matplotlib (for visualization) and Scikit-Learn (for machine learning):

Survey respondents assigned to clusters 3 and 4 tend to use Python much more than those assigned to 1 and 2, and the relative difference (as a ratio) grows when we look at the two packages: cluster 3 and 4 respondents are 8–10 times as likely to use them as cluster 1 and 2 respondents. There is a difference between clusters 3 and 4 as well, albeit a more minor one: cluster 3 has a higher Python usage rate, while a larger share of cluster 4 respondents don’t use Python or these packages. It turns out that these are the only tools whose highest usage rate is among cluster 3 respondents.2 Most other tools that are used much more frequently by clusters 3 and 4 than by 1 and 2 are also used more frequently by cluster 4 than by cluster 3.
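To make this kind of comparison concrete, here is a minimal sketch of computing per-cluster usage rates and the 3/4-to-1/2 ratio, assuming a hypothetical respondent-level table with a cluster label and binary tool indicators; the rows, values, and column names below are illustrative, not the survey's actual responses:

```python
import pandas as pd

# Hypothetical respondent-level data: one row per respondent, with an assigned
# cluster label and 0/1 indicators for whether each tool was used.
responses = pd.DataFrame({
    "cluster":      [1, 1, 2, 2, 3, 3, 4, 4],
    "python":       [1, 0, 1, 0, 1, 1, 1, 1],
    "matplotlib":   [0, 0, 1, 0, 1, 1, 1, 0],
    "scikit_learn": [0, 0, 0, 1, 1, 0, 1, 1],
})

# Usage rate per cluster: the share of respondents in each cluster using each tool.
rates = responses.groupby("cluster").mean()
print(rates)

# Relative difference between the cluster groups, expressed as a ratio:
# mean usage rate in clusters 3/4 divided by mean usage rate in clusters 1/2.
ratio = rates.loc[[3, 4]].mean() / rates.loc[[1, 2]].mean()
print(ratio)
```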

Cluster 4 rates for two tasks also stand out:

Cluster 4, it seems, fits the description of an “open source data engineer” much better than cluster 3, which heads in that direction but not nearly to the same extent. It’s not rare for cluster 3 respondents to have used these tools (86% of them used at least one), but on average they only used about 2.2. In comparison, respondents in cluster 4 used an average of 5.3 tools. The fact that ETL and data management are much more important in cluster 4 than in cluster 3 implies that while both might represent data science, cluster 3 tends toward the analyst’s side of the field, and cluster 4 tends toward the engineering or architecture side.

As for the other two clusters, differences between clusters 1 and 2 become apparent once we look at the rest of the aforementioned proprietary tool set. Cluster 2 respondents tended to use these much more frequently.

For most of the tools shown below, cluster 1 has the second-highest usage rate, but it lags significantly behind cluster 2. Cluster 1 respondents tended to use fewer tools in general: just under 8 on average, compared to 10, 13, and 21 for the other three clusters, respectively.

Tasks Without Coding

There are also some tasks that are undertaken by cluster 2 respondents significantly more frequently than those in other clusters:

The first two tasks are functions of an analyst, and are fairly common among cluster 3 and 4 respondents as well. Crucially, none of these tasks depend on being able to code (at least, not as much as the four tasks above that are closely associated with clusters 3 and 4). The low percentages for cluster 1 shed some light on the nature of this cluster: most respondents in the sample whose primary function is not that of a data scientist, analyst, or manager seem to be grouped there. This includes programmers who aren’t deep in the space (e.g., Java programmers who only use a few data tools). There are analysts and data scientists in cluster 1, but they tend to have small tool sets, and it is the combination of non-participation in many data tasks and non-use of data tools that binds cluster 1 together.

Some of the proprietary tools listed above are used by respondents in cluster 4 about as much as those in cluster 1, most notably SQL Server. In other words, they begin to violate the primary cluster 1/2 vs. 3/4 split. A few other tools and tasks take this pattern even further, or simply don’t show large usage differences between clusters:

Tableau, Oracle, Teradata, and Oracle BI usage is higher in clusters 2 and 4, lower in clusters 1 and 3. The same is true for SQL, but like Excel and R, it’s exceptional in its wide usage across all four clusters. In fact, SQL and Excel are the only two tools (or tasks) that are used by over half of the respondents in each cluster. R is not used as much by cluster 1, but usage among the other three clusters is about the same: 67%–69%. Data cleaning and basic exploratory analysis are similarly high for clusters 2, 3, and 4, and much lower for cluster 1. These tasks and tools cut across the cluster boundaries, and don’t seem to have much correlation with the more salient tool/task differences.

Managerial and Business Strategy Tasks

Perhaps even more illustrative of the connection between clusters 2 and 4 are the managerial/business strategy tasks. The implication is that respondents in clusters 2 and 4 tend to be more senior, which turns out to be true, but only to an extent. In terms of years of experience, clusters 1, 2, and 4 are about the same (8–9 years on average), while for cluster 3 the average is much smaller: only 4.4 years; a similar difference exists for age.

Despite representing the least experienced cohort, cluster 3 isn’t the lowest paid; that distinction goes to cluster 1, with a median salary of $72K. At $84K, cluster 3 is still lower than cluster 2 ($88K), but cluster 4 salaries tended to be far higher than either, with a median of $112K. Cluster 4 respondents tend to use a far greater number of tools than respondents in the other clusters, and many of the tools they commonly use are ones that had positive coefficients in the regression model.
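The regression model itself is described elsewhere in the report; as a rough, hypothetical sketch of how salary can be related to tool usage with such a model, the snippet below fits an ordinary least squares regression on made-up data. The column names, salaries, and resulting coefficients are purely illustrative and are not the survey's actual features or results:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical respondent-level data: salary plus binary tool-usage indicators.
df = pd.DataFrame({
    "salary": [72000, 88000, 84000, 112000, 95000, 70000],
    "spark":  [0, 0, 0, 1, 1, 0],
    "hadoop": [0, 0, 0, 1, 0, 0],
    "excel":  [1, 1, 1, 0, 1, 1],
})

X = df.drop(columns="salary")
y = df["salary"]

model = LinearRegression().fit(X, y)

# A positive coefficient suggests the tool is associated with higher salaries,
# holding the other indicators fixed.
coefficients = pd.Series(model.coef_, index=X.columns).sort_values(ascending=False)
print(coefficients)
```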

1 We tried a variety of clustering algorithms with various numbers of clusters, and the two best performing models came from KMeans, with two and four clusters. The partition in the 2-cluster model is more or less preserved in the 4-cluster model, so we will use the latter, keeping in mind that there is a primary split between the first two and last two clusters.
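As a rough illustration of that kind of model selection, the following sketch clusters a hypothetical binary respondent-by-feature matrix with k-means for several values of k and scores each partition with a silhouette coefficient. The data is randomly generated, and the silhouette criterion is an assumption standing in for whatever evaluation the report actually used:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical binary respondent-by-feature matrix (tools and tasks as 0/1 columns).
rng = np.random.default_rng(0)
features = rng.integers(0, 2, size=(200, 40))

# Fit k-means for several cluster counts and score each partition.
for k in (2, 3, 4, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    score = silhouette_score(features, km.labels_)
    print(f"k={k}: silhouette={score:.3f}")

# Cluster assignments and sizes for the chosen 4-cluster model.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(labels))
```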

2 Excluding tools that didn’t have a significant difference between the top two percentages: Mac OS X, ggplot, Vertica, and Stata.
