Building a good team is always difficult. What makes it even more difficult for big data and data science are:
The skill shortage for big data and data science has lasted several years. I’ve heard in-house recruiters complain about it, and I’ve seen it highlighted as a major theme at industry events. Service providers and consultants have seen the mismatch in supply and demand, and many are rebranding existing staff to sell services that they are, in fact, unable to deliver.
In this chapter, I’ll cover key roles related to big data and data science, as well as considerations for hiring or outsourcing those roles. To start, consider the mystically titled role of ‘data scientist’.
This relatively new job title has enveloped a dozen traditional job titles and taken on a life of its own. Whereas accountants may be ‘chartered’, physicians may be ‘licensed’ and even first-aid workers are ‘certified’, anyone can call themselves a ‘data scientist’.
The term ‘scientist’ has traditionally referred to creative individuals who apply any available tool to observe and interpret the world around them. The term ‘engineer’ would then refer to someone trained in a specific application. With the changes in available data sources and methodologies, such as the application of AI to unstructured big data stores, we found ourselves needing to move beyond our predefined (engineering) analytic methods, such as statistics and numerical optimization, and creatively apply a wide range of tools to a wide range of data sources: tools such as neural networks, support vector machines, hidden Markov models, calculus-based optimization, linear and integer programming, network flow optimization, statistics, and additional methods that have proven useful within the broad fields of data mining and artificial intelligence. We apply these methods to any data we can find, not only the familiar data within corporate files but also web logs, email records, machine sensor data, video images and social media data. Thus, the term ‘science’ became more appropriate for a field where practitioners were creatively moving beyond traditional methodologies.
Today we use the term ‘data scientist’ to encompass not only those experts who are creatively expanding the use of data but also anyone who ten years ago might have been called a statistician, a marketing analyst or a financial analyst. We have created a term so rich in meaning that it has become almost meaningless.
The Harvard Business Review wrote in 2012 that data scientists had ‘the sexiest job of the twenty-first century’.73 Glassdoor, a career portal, listed data scientist as the best job in America for both 2016 and 2017.74 It is thus not surprising that recent years have seen a flood of semi-qualified job candidates entering the field, further muddying the recruitment waters. Data from the job portal Indeed.com shows a levelling out of data science positions over the past few years (Figure 10.1), while the number of job candidates for such positions grew steadily (Figure 10.2), which is not to say that the number of qualified candidates has increased. This surge in job seekers emphasizes the importance of properly screening candidates.
Figure 10.2 Percentage of candidates searching for ‘Data Scientist’ positions.75
Despite its inherent vagueness, you’ll want to include the term ‘data scientist’ in your analytic role descriptions for purposes of keyword search, followed by concrete details of what you really need in the candidate. You may see this job title connected to any of the specific roles I’ll describe in the next section. Internally, focus your recruitment efforts on the specific competencies you need, rather than on the term ‘data scientist’.
Let’s look now at the specific job roles you’ll want to fill for your big data and data science initiatives.
If you’re not utilizing Infrastructure as a Service or Platform as a Service offerings, you’ll need staff to get your specialized computer systems up and running, particularly the distributed computing clusters. Some common position titles related to these functions are ‘systems engineers’, ‘site ops’, and ‘dev ops’.
Preparing data for analysis is often more time-consuming than doing the analysis itself. You’ll need to Extract the data from source, Transform/clean the data, and Load it into tables optimized for retrieval and analysis (the ETL process). Specialized software will help, but you’ll still waste a lot of time and see huge performance losses if you don’t have specially trained people for this task.
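To make the three ETL steps concrete, here is a minimal sketch in Python. It assumes a tiny, made-up CSV export and an in-memory SQLite database; the table name, column names and cleaning rules are purely illustrative and not taken from any particular system.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a source system.
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,
3,carol,7.25
"""


def extract(text):
    """Extract: read raw rows from the source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize customer names and drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), row["customer"].strip().lower(), float(amount))
        )
    return cleaned


def load(rows, conn):
    """Load: write cleaned rows into an indexed table for retrieval."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_orders_customer ON orders (customer)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(f"Loaded revenue total: {total}")
```

Even in this toy version, most of the code deals with cleaning and loading rather than analysis, which hints at why the task benefits from specialists once data volumes and source systems multiply.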
Your data engineers should:
The role of data engineer has become very difficult to fill in some geographies, but if you don’t get specialist data engineers, others in your team will waste time covering this critical but specialized task, with at best mediocre results. I’ve seen it before, and it’s not pretty.
Your most innovative projects will be done by experts using mathematics, statistics and artificial intelligence to work magic with your data. They are writing the programs that beat the world champion at Go, or recommend your next favourite movie on Netflix, or understand that now is the right time to offer the customer a 10 per cent discount on a kitchen toaster. They are forecasting your Q2 revenue and predicting the number of customers you’ll see next weekend.
The people you hire for these tasks should have a strong background in mathematics, usually a degree in maths, statistics, computer science, engineering or physics, and they should have experience writing and coding algorithms in a language such as Java, Scala, R, Python or C/C++. They should preferably be experienced in object-oriented programming. If you are developing a highly specialized algorithm, such as for image or speech recognition, you will probably want someone who has completed a PhD in that area.
There are a few key skills I look for in building the team of algorithm specialists. These skills may not all be present in one person, but they should be covered within your team.
For the algorithm specialist role, look closely at the candidates’ degrees and alma maters. Some universities are much stronger than others. Be aware that countries differ in the effort required to earn a degree. To further complicate the matter, some universities may not be top-ranked overall but are world leaders in specific fields. You may be surprised to see the computer science programme at the University of Washington ranked above the programmes at Princeton and Harvard. Finally, keep in mind that the difference between two PhD graduates from the same school can still be wide enough to drive a truck through.
Educational background and experience with well-known companies can be strong signals of candidate strength, but they should not dictate your hiring decisions.
For some roles related to algorithm development, particularly those requiring extreme innovation, we value high intelligence and creativity more than relevant experience. Several years ago, a friend interviewed at one of the world’s top hedge funds. The entire interview process consisted of five to six hours of solving brain teasers, with almost no questions related to the financial markets or even coding. This company was looking for raw creative intelligence in their algorithm developers, and they trusted that the right people could learn any relevant subject matter as needed. Although this may be a viable tactic when hiring algorithm developers, it’s not appropriate for roles such as data engineers and business analysts.
Most of the ‘data scientists’ that you hire will probably be what I would call ‘business analysts’. These analysts are business thought partners and answer basic but important data questions asked by your business units. They typically use basic technologies to gather data and then spreadsheets to analyse the data and deliver results. In other words, these guys are great with Microsoft Excel.
There are various schools of thought as to where these analysts should be positioned within the organization, with some companies grouping them in a centralized team and some embedding them within business units.
Centrally located analysts can more easily share knowledge and can be allocated on demand to the highest priority projects. Dispersed analysts can leverage the insights and quick feedback available as part of a business team. The decentralized model probably occurs more often in small to mid-sized enterprises, as it does not require executive sponsorship but is funded at department level and justified by an expressed business need for data insights.
In either case, encourage the business analysts to keep strong lines of communication among themselves, with the algorithm developers and especially with the data engineers. The business analysts will provide valuable business insights to the algorithm developers, who in turn can propose innovative solutions to front-line challenges. The data engineers should actively assist the business analysts with data extraction, or else the analysts will waste time writing sub-optimal queries.
Customer online behaviour is a very important data source. You can choose from a broad selection of mature web analytics products, but whichever tool(s) you choose should be managed by a trained specialist who keeps current on developments in web analytics and related technologies (including browser and mobile OS updates).
Your web analyst will oversee web and app tagging and make sure that online customer activity is collected effectively. Some web analytics tools can also collect data from any connected digital device, not only browsers and apps, and the web analyst can assist with this data consolidation. The web analyst will create conversion funnels and implement custom tagging, and will monitor and address any implementation problems that may arise, such as data errors related to browser updates. They will assist in merging internal data with web analytics data, which may be done within the organization’s databases or on the web analytics server.
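As a rough illustration of what a conversion funnel computes, the following Python sketch counts how many visitors reach each step of a hypothetical four-step purchase funnel. The step names and event data are invented for the example, and the sketch ignores event ordering and timestamps, which a real web analytics tool would take into account.

```python
from collections import defaultdict

# Hypothetical funnel definition; step names are illustrative only.
FUNNEL_STEPS = ["view_product", "add_to_cart", "checkout", "purchase"]

# Made-up (visitor_id, event) pairs, as might be produced by page tagging.
EVENTS = [
    ("v1", "view_product"), ("v1", "add_to_cart"),
    ("v1", "checkout"), ("v1", "purchase"),
    ("v2", "view_product"), ("v2", "add_to_cart"),
    ("v3", "view_product"),
    ("v4", "view_product"), ("v4", "add_to_cart"), ("v4", "checkout"),
]


def funnel_counts(events, steps):
    """Count visitors who triggered each step and every step before it."""
    seen = defaultdict(set)
    for visitor, event in events:
        seen[visitor].add(event)
    counts = []
    for i in range(len(steps)):
        required = set(steps[: i + 1])
        counts.append(sum(1 for evs in seen.values() if required <= evs))
    return counts


counts = funnel_counts(EVENTS, FUNNEL_STEPS)
for step, count in zip(FUNNEL_STEPS, counts):
    share = count / counts[0]  # assumes at least one visitor entered the funnel
    print(f"{step}: {count} visitors ({share:.0%} of entrants)")
```

The drop-off between consecutive steps is what the analyst typically monitors, since a sudden change can signal either a real customer issue or a tagging problem.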
Your web analyst will also be an expert in extracting data, creating segments, and constructing reports using available APIs and interfaces. For this reason, this person may be actively involved with A/B testing, data warehousing, marketing analysis, customer segmentation, etc.
You’ll benefit greatly if you hire or train staff skilled at creating top-notch graphs and tables. This requires a mixture of art and science and should be done by people who excel in, for example:
Stephen Few has written multiple books covering best practices for data visualization.60, 76–79
On a technical level, the reporting specialists should be comfortable writing database queries to extract data from source systems, and they should be trained on your BI tool(s).
Leadership is key to the success of your analytics programme. In the Capgemini survey referenced previously, close to half the organizations were already engaged in organizational restructuring to exploit data opportunities, and a third were appointing senior big data roles, recognizing that data opportunities spanned the breadth of their businesses.
My clients sometimes ask me to help scope requirements for and recruit analytics leadership. This ‘lead data scientist’ role is typically opened by the company for one of two reasons:
I’ve conducted several hundred interviews for analytics roles over the nearly 20 years that I’ve worked in financial and business analytics, and I’ve screened even more CVs. The candidates with whom I’ve spoken have come from across the world, many having completed world-class technical graduate programmes or MBA programmes at schools such as Wharton, Chicago Booth or Oxford. It’s been a real privilege to find and hire many excellent people over the years.
Filling a lead analytics role, however, is particularly challenging because of the complex requirements the candidate must satisfy.
The lead role requires a strong blend of technical, business and communication skills, which often correlate negatively. Individuals excelling technically often have proportionately less interest in mastering communication with non-technical business colleagues and may prioritize technical innovation above business value.
From an analytics perspective, the leadership role requires both familiarity with a broad range of tools and techniques and an experience-based understanding of what is involved with in-depth technical implementations. There is certainly space in an organization for specialists in areas such as statistics, deep learning, NLP, or integer programming, but for the lead role, the right candidate must have an overview of the entire analytic tool chest, allowing them to select techniques that best address business problems and to recruit specialized talent as needed.
The leader must also be familiar with relevant tooling, including database technologies, programming frameworks, development languages and prototyping tools, examples of which were given above. The technology space is already quite broad, and it continues to expand. Properly leveraging existing technologies can easily save months or years of in-house development.
Initiatives will almost certainly fail if the analytics leader cannot:
There are three phases through which I typically progress alongside a company recruiting the lead role.
Because big data roles have only existed for a few years, many external recruitment firms struggle to understand the profiles they are being asked to fill. Some third-party recruiters I’ve spoken with are not able to distinguish between a data engineer and an algorithm developer. They are not familiar enough with the rapidly changing technology landscape to match skills and experience on a CV with requirements for a posting, let alone to assist you in writing specifications that best describe your needs. They may struggle to present the role in a way that is attractive to top talent and may end up recycling old job postings, demonstrating to candidates a disconnect with modern technology.
Generalist recruitment firms compete with internal recruiters at technology companies, who are actively poaching specialist recruiters. You should rethink your traditional methods of sourcing candidates, broaden your network of third-party recruiters and make conscious efforts to help internal recruiters understand the nature of the new roles as well as the preferences and quirks of target candidates. Send your recruiters to a good data conference to get them up to speed with concepts and terminology and to help them in networking.
Instacart, an online company providing same-day grocery deliveries, was founded in Silicon Valley in 2012 by a former Amazon employee. In 2015, Forbes called it ‘the most promising company in America’. By 2017, it had grown to over 1000 employees and a market valuation of several billion dollars.
Instacart uses machine learning for several important applications, such as to decrease order fulfilment time, plan delivery routes, help customers discover relevant new products, and balance supply with demand.
In a recent interview, Jeremy Stanley, Vice President of Data Science, elaborated on analytics staffing within Instacart. Their data people are divided into two categories:
They only hire ML engineers with solid experience, but they have also trained internal software engineers to be ML engineers, a process that typically takes about one year. Although none of their business analysts have transitioned to the role of ML engineer, they estimate it would take two to three years of training to teach these business analysts the development skills necessary to write production-ready ML software.
They feel recruitment is most difficult at the top of the funnel (finding the candidates), but is helped by:
Their decentralized model pushes much of the hiring and mentoring to the data science VP, who estimates his time is evenly split between hiring, mentoring and hands-on project work.
The hiring challenge is compounded when it needs to happen at scale. You may want to staff up rapidly after you’ve established the value of an analytics effort through a proof of concept. According to a recent McKinsey survey of 700 companies, 15 per cent of operating-profit increases from analytics were linked to hiring experts at scale.80
You can fill some positions by re-allocating internal resources, particularly those positions that require only general software development skills or a general analytics background. For more specialized skill sets, particularly within AI, companies often fill staffing needs by acquiring smaller, specialized companies, particularly startups. We saw this at eBay in 2010, when eBay quickly scaled its pool of mobile developers by purchasing Critical Path Software. We see it still within AI, with Google’s acquisition of DeepMind (75 employees at the time) and Uber’s acquisition of Geometric Intelligence (15 employees). Salesforce, which is pushing its AI offering in its Einstein product, acquired key AI staff in 2016 through its acquisition of the Palo Alto-based AI startup MetaMind, with the expressed goal to ‘further automate and personalize customer support, marketing automation, and many other business processes’ and to ‘extend Salesforce’s data science capabilities by embedding deep learning within the Salesforce platform.’81
Figure 10.3 Rate at which AI companies have been acquired 2012–2017.84
GE, a company with over 10,000 software developers and architects, recently launched an IoT software platform called Predix. They grew the Predix team from 100 employees in 2013 to 1000 employees in 2015, with plans to retrain their entire global software team on the new platform.82 This rapid growth was also fuelled by acquisition. They hired the co-founder of key technology provider Nurego as Predix general manager, subsequently acquiring the entire company.83
Figure 10.3 illustrates the increasing rate at which AI companies have been acquired over the last few years.
You can bring in external resources to supplement your in-house staff, or you can outsource entire projects or services.
Outsourcing projects facilitates agile development and allows you to focus on your core strengths. In terms of agility, outsourcing allows you to quickly secure very specific expertise in technologies and data science applications. A third party may be able to start work on a project within a few days or weeks, rather than the several months sometimes needed for internal resources that would need to be re-allocated or recruited (both of which are difficult for proofs-of-concept).
Owing to their specialized experience, a small team of externals might complete a proof of concept within a few weeks, whereas an internal team without comparable experience could easily take several months and would be more likely to fail. This tremendous boost in speed allows you to quickly determine which analytic initiatives bring value and to start benefiting as soon as possible.
The daily cost of external resources may be several times higher than internal salaries, but when you consider the difference in development time, they may well be more cost-effective. When you move the technology from proof of concept to production, you will want to move the expertise in-house but will then have the business case to support the long-term investment.
Many organizations hire externals to supplement in-house staff, putting externals within their internal teams. Supplementing staff with externals serves three purposes.
Bringing in external experts may be the best way to jump-start a project or do a proof of concept.
A word of caution on outsourcing: it can be quite difficult to find high-quality data science consultants. Quality varies significantly even within the same company. Since your projects will by nature be R&D efforts, there is always a chance they will result in little or no tangible benefit, regardless of the strength of the analyst. Thus, it is especially important to maximize your odds of success by bringing in the right people. If possible, look for boutique consulting firms, where the company owners are involved in monitoring each project.
In the end, if you’ve managed to assemble a strong internal team and a reliable set of externals to call on when needed, you’ve probably done better than most of your peers.
If you are leading a smaller company or working alone, you probably won’t have the resources or the requirements for a full data team. With only a few end users, you won’t be as reliant on the skills of specialized data engineers. You also won’t have enough consumers of reports and dashboards to justify hiring a reporting specialist, and you’ll probably not have the resources to commit to a full machine learning project.
Your ‘minimum viable product’ for a data team in a small company would be to place the web analytics responsibility within your marketing team and to hire an analyst who can cover business analytics and reporting. The minimum skills for this analyst are:
Although you typically won’t launch internal machine learning projects, at this stage you can still take advantage of the pay-per-use offerings of some of the larger vendors without needing to understand how they work. Examples include the image and text recognition software of Google Cloud Vision API, Salesforce Einstein and Amazon AI.