Chapter 8. Set—preparing data, technology, and people

This chapter covers

  • Identifying potential sources of data, both inside and outside the organization
  • Assessing the quality and quantity of data
  • Assembling an effective AI team

This chapter picks up where chapter 7 left off. Now that you know how to use the Framing Canvas to create an ML-friendly vision for your project, it’s time to put together the other ingredients that you need. Bringing a project to life requires three main ingredients: the ML model, data, and people. While choosing a good ML model is a task for the technical team, it’s your job to recruit them and craft a data strategy so they can get to work. This chapter focuses on how to find and manage data, and how to recruit a team of talented people with the right skills for your project.

8.1 Data strategy

One of our goals when writing this book was to make you think critically about data and understand how engineers use it to build ML models. Because data is so crucial, developing a coherent data strategy is critical for the success of any project. In part 1, we talked about data in a way that took for granted that you had it readily available to build your models. You can probably guess that this is rarely the case; this chapter will fill in the gaps and help you understand how much data you need, where you can get it, and how to manage it.

One distinction we have to make is between the data strategy of your organization and the data strategy of your AI project. When business media or executives talk about a data strategy, they’re generally talking about the overall company-wide strategy for acquiring, storing, and using data. This strategy is driven by the company’s long-term goals and vision for the future.

This book focuses on the data strategy of your project, which is specific to a single AI initiative. Having a project-specific focus is better than a broad one for two reasons.

First, developing a data strategy even for a single project forces you to be concrete about the specific data you’ll need to collect to build your project, how much you’ll need, and where you’re going to get it. To develop an AI strategy, you may say, “We’ll build infrastructure to collect data from users’ interactions with the platform.” But to build an AI project, you need to say, “We’ll use user clicks, the time they spend on each page, and social shares, and merge that with the CRM data to recommend content.” This level of specification helps you think about what you really need to collect to build your projects. Someone has to make the decision of what exactly to collect, and if the directions you give the IT team are too broad, they’ll make that call for you, often without knowing the business context.

The second (and more important) reason to think about an AI project data strategy is that the organization-wide strategy should be informed by the needs of each AI project that you’re planning to pursue. Starting to think about a company-wide data strategy without any experience is like starting to think about your family house when you’re single. You simply don’t know what you’ll need. (Will you have zero, one, or five kids? Which location is more convenient for you and your partner? What’s your partner’s taste? What can you afford?) Just as individual AI projects help you form a more informed AI vision, building up your data strategy from the minimal set of data needed for those projects helps make sure you collect the data you actually need.

If you start from AI projects instead, building the organizational data strategy will be a simple exercise of combining the experience that you gained through the implementation of each project (as shown in figure 8.1).

Figure 8.1 The organizational data strategy combines the needs of each individual project (plus some incidentals thrown in for the future).

To build the data strategy for an AI project, you need to know where to get data and how much you need. The next two subsections dive into each topic.

8.1.1 Where do I get data?

Data has been a fundamental concept throughout this book. We’ve covered various kinds of data: structured core data, images, sound, and natural language. If you want to build an AI project, an even more important distinction will affect you: data you have, and data you don’t.

Let’s start by assuming your organization has the data you need to build your AI project. Because this data is produced and owned by your organization, we’ll call it internal data. Chapter 2 already covered the value of a specific kind of internal data that we called core business data: data with a direct impact on the top or bottom line of the organization. For our real estate platform, this can be house-price data; for Square Capital, it’s the transactions of its customers; and for Google, the consumption data of its data centers (basically, its only variable cost).

Your organization likely also produces data from business processes that may not be directly linked to your main sources of income or of costs. We’ll call this kind of data ancillary data. In our real estate example, house pictures and house reviews can be considered ancillary data: still interesting and useful, but not as much as the characteristics of the houses on sale and their selling prices.

As another more concrete example, let’s consider an e-commerce platform like Amazon, Zappos in the United States, or Zalando in Europe. In their cases, core business data would be the purchases of each customer, as they’re directly correlated to company revenues. Ancillary data may be the pages that a customer visits, the emails opened, product reviews, and so on. You can build amazing projects with ancillary data, but it’s most likely not as impactful as what you can build with core business data.

Before moving on, we want to warn you that owning a dataset doesn’t mean you can use it pain-free. You may be planning to use customer data but lack the permission to do so--whether because customers denied it or, as often happens, because your privacy policy wasn’t written to account for the use you’re planning now. Even if you do have permission, in our experience most people underestimate the time and challenges of going from “we have the data” to actually getting it onto the data scientist’s laptop, ready to be processed. Even if you’re 100% sure that your organization has the data you need for a project, our heartfelt suggestion is to account for everything that can potentially slow you down in actually using it.

In one project we worked on, the legal department of a company made it so difficult to export the data we needed that we literally had to take a plane, get into their building, plug in our laptops, encrypt the data, and bring it home on a hard disk. In another project, we knew the company had the data we needed, but we all struggled to understand what the variables in the dataset meant. The problem was that the person who designed the data collection processes had retired, and we had to spend a whole month running interviews to understand the data we were looking at.

Let’s now cover the case in which you don’t yet have the data you need for a project. You have three options:

  • Start to collect it and wait until you have it.
  • Find open data or scrape it from the internet for free.
  • Buy it from providers.

The first option is the slowest; sometimes it’s also the most expensive, but in some cases you may not have any other choice. Let’s suppose you run a brick-and-mortar clothing shop and don’t have a loyalty card system or any other way to track who buys what. If a customer walks in, you don’t track that his name is John and he just bought some Nike Air Jordan sneakers, size 10, for $129. If you haven’t stored this information, do you think any other company has it? Obviously, in this case, your only option is establishing new processes to collect this information and then waiting until you have collected enough data.

On the other hand, sometimes your projects need data that you either can’t collect or don’t want to make the effort to collect. The second option in this case is to use open data.

It’s amazing how much free data you can get from the internet. An example is the open data that governments often release, which you can freely use for your projects. For instance, you can use income data for your country to focus your marketing on the wealthiest areas. Here are some other great places to look for open datasets:

  • kaggle.com --A website where companies or individuals can host ML competitions and upload datasets.
  • arxiv.org --A free repository of scientific papers. As researchers collect new datasets, they write and release scientific papers to present them to the scientific community and often release their datasets as well.
  • github.com --A repository of open source code.

When using third-party datasets, you should be concerned about the quality of the data and legal issues. Many of these datasets are produced on a “best effort” basis and collected as needed to support new algorithms or fields of application; often data quality is not guaranteed. Also, many publicly available datasets are released with a noncommercial license, meaning that they can’t be used for business purposes; always check that you can use an open dataset for the purposes you intend.

The third option is buying data from providers. This strategy can sometimes be extremely expensive, and can also put you in a dangerous position as you may end up depending on these providers forever. Before deciding to go this route, we suggest you first spend enough time evaluating the associated long-term costs and determining whether multiple providers exist to buy it from. If there’s just one, you need to find a strategy to protect yourself in case they go out of business or their business strategy changes and they stop selling you their data.

In both cases, we encourage you to think critically about datasets that come from outside your organization. By definition, if some data is openly available--for free or for money--it means that everyone else can get it too. Often, projects based on external data sources will have a harder time on the market because they can easily be replicated by competitors. Some of the strongest AI projects are built on a foundation of proprietary datasets that are hard for others to reproduce.

To sum up, the walls of your organization (or the virtual ones of its data center) make a major difference in the data strategy of your AI project, as shown in figure 8.2. If data sits within the boundaries of your organization, you have a unique and highly valuable dataset you can use. Otherwise, you have to either get it for free or buy it; either way, take into account the liabilities that come with those options.

Figure 8.2 The digital or physical walls of your organization define the boundaries between types of data: core and ancillary data within your organization, and free or paid data outside of it.

Of course, you can combine internal and external data. Actually, this is often a good idea. An example is our real estate house-price predictor: we could have used free government data on the income of each neighborhood to improve the model. You can even use data from OpenStreetMap or from Google Maps to check the presence of services and public transportation in various neighborhoods, adding another dimension to the house-price predictor.
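To make this concrete, here’s a minimal sketch of what enriching internal listings with external data could look like in Python with pandas. The file and column names (listings.csv, census_income.csv, zip_code, median_income) are hypothetical placeholders, not details of any real project:

```python
import pandas as pd

# Internal data: one row per house listing (hypothetical file and columns).
listings = pd.read_csv("listings.csv")      # columns: zip_code, sqft, rooms, price
# External data: free government income statistics (hypothetical file).
income = pd.read_csv("census_income.csv")   # columns: zip_code, median_income

# A left join keeps every listing, even when a zip code is missing
# from the census file, so no internal data is silently dropped.
enriched = listings.merge(income, on="zip_code", how="left")

# Check coverage: what fraction of listings got no external match?
print(enriched["median_income"].isna().mean())
```

The left join is a deliberate choice: your internal data is the asset, and the external source only decorates it, so you never want the merge to throw listings away.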

For a project we built for a large organization, we used all kinds of data. The company had sales data from its stores, but we wanted to see correlations between sales performance and demographics, so we gathered free census data. The census data that was freely available was OK most of the time, but for some cities we needed a more fine-grained picture of the population. We then turned to specialized external providers and integrated their data as well. Notice that whereas sales data is updated each day, demographics change much more slowly, so it wasn’t a problem to rely on external sources.

Wherever you’re getting your data, a piece of information can make or break your project: labels. Remember that if you’re training a supervised learning algorithm, the computer needs to learn to produce a number or class (label) based on other numbers (features). Most of the time, you can work around missing features, but you may be in big trouble if you don’t have labels. In other words, when it comes to data, labels are way more important than features. Let’s use the example of the home-price predictor. Our label is the price a house is sold for, and our features can be the square footage, number of rooms, presence of a garden, and so forth.

Let’s assume that when you designed the interface for your real estate website, you didn’t think to add a field that lets users specify whether their home has a garden. Therefore, you don’t have that information in your database and can’t use it to build the model. However, you still have other relevant features included in the house listing form, including the square footage, number of rooms, and location. Even without the “garden” field, you’re likely to still be able to build a decently accurate model.

On the other hand, if you forgot to ask users about the sale price, you’d be completely lost. Without labels, there’s no way to build a supervised learning model.
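To see the distinction in code, here’s a toy sketch with made-up numbers. The feature columns are what the model learns from; the price column is the label it learns to predict. Delete that one column and there’s nothing left for a supervised model to do:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: each row is a sold house (made-up values).
homes = pd.DataFrame({
    "sqft":  [850, 1200, 1500, 2000],               # feature
    "rooms": [2, 3, 3, 4],                          # feature
    "price": [190_000, 260_000, 310_000, 420_000],  # label
})

X = homes[["sqft", "rooms"]]   # features: what the model learns from
y = homes["price"]             # label: what the model learns to predict

model = LinearRegression().fit(X, y)
new_home = pd.DataFrame({"sqft": [1100], "rooms": [3]})
print(model.predict(new_home))  # predicted price for an unseen house
```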

Labels can be collected in three ways:

  • Naturally
  • By hacking
  • By paying

Natural labels are generated by your business processes. For instance, if your real estate platform asks clients to input their home’s sale price as they delete their listing, you’ll naturally get the label. Google was already saving data on its data centers’ energy performance simply by running them. Amazon stores everything you buy in a database. All this information is stored to make the business run, and can be used as labels if needed.

Sometimes, labels aren’t as easy to get, but you can still find clever hacks to get them. An example is what Amazon does with product reviews. When you write a review about your love for your new vacuum cleaner, you also add a star rating (say, from 1 to 5). The score can be used as a label for a sentiment analysis system. Basically, you’re giving Amazon both the input (the text review) and the label (the star score) it can use to build sentiment analysis technology, for free. Another example is Facebook, which in the early days asked users to tag friends in pictures by clicking their faces. Facebook could have simply asked you to write who’s in the picture, but by clicking a face, you give Facebook an approximate label for image recognition algorithms. Finally, you’ve probably been prompted, while registering for a new internet service, with a tedious task like finding cars in images to prove you’re human. This service is called Google reCAPTCHA, and by now you’ve probably guessed what it’s used for: you’re giving the company labels for its ML algorithms, for free.

In some cases, your only option is to pay people to label examples. A common pattern for labeling data is to use a crowdsourcing platform such as Amazon Mechanical Turk, which gives you on-demand, paid-by-the-minute access to a temporary workforce distributed across the globe. With Mechanical Turk, the only thing you can count on is that contractors will have an internet connection: because they’re usually untrained, you have to prepare training materials and a labeling interface the worker can use to choose the correct labels for the training examples. Figure 8.3 shows an example of two labeling interfaces for collecting labels for image classification and object localization tasks.

Figure 8.3 Labeling interfaces for image recognition and object localization. Choices can be made using keyboard shortcuts to increase data entry speed.
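To get a feel for how simple such a tool can be, here’s a minimal console-based sketch of a labeling loop, assuming a hypothetical folder of unlabeled JPEG files. Real labeling interfaces add image display and keyboard shortcuts, but the core loop is the same: show an example, record a label:

```python
import csv
from pathlib import Path

CLASSES = {"1": "cat", "2": "dog"}                 # hypothetical task
images = sorted(Path("unlabeled").glob("*.jpg"))   # hypothetical folder

with open("labels.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for path in images:
        key = input(f"{path.name} -> press 1 for cat, 2 for dog, Enter to skip: ")
        label = CLASSES.get(key.strip())
        if label:                       # any other key skips the image
            writer.writerow([path.name, label])
```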

In general, crowdsourcing platforms are good for labeling tasks that don’t need much training. If you’re working on a project that requires high-level human reasoning (say, finding cancer cells on microscope scans), you would be better off putting together your own highly skilled workforce.

Table 8.1 summarizes the cost and time required to collect labels for the three labeling strategies.

Table 8.1 Cost and time requirements for the three labeling strategies

| Labeling strategy | Cost | Time required |
| --- | --- | --- |
| Natural (free) labels | Zero--you are collecting these labels already. | Zero--you already have them. |
| Hacked labels | Low and fixed--you just need to set up new data collection processes. | Depends on your traffic. |
| Paid labels | High and variable--you pay based on the number of labels you want. | Depends on the time needed to label an example and the number of labelers you have. |

Once you have a clear view of where to get the data you need, the next step is figuring out how much you need.

8.1.2 How much data do I need?

In our experience as consultants, we’ve often seen people falling into the trap of big data. It’s reassuring to think that having a lot of data is a silver bullet that unlocks amazing opportunities. But as you’ll see in the coming sections, data quality is often more important than data quantity.

The amount of data that you need to build an AI product strongly depends on the product itself. So many variables are at play that giving precise rules like “you’ll need data from 10,523 customers to build a 93% accurate churn prediction model” is just not possible in practice. What we can give you are guidelines that can help develop your intuition about the data requirements for the most common types of problems you’ll find in the business world. We think it makes sense to divide our presentation according to the type of data your project will be dealing with, just as we did in the first part of the book.

Let’s first talk about projects that require structured data, such as predicting home prices (chapter 2) or customer churn (chapter 3). You need to consider three factors:

  • Whether the target (the thing you want to predict) is a number (regression) or a choice (classification)
  • For classification problems, the number of classes you’re interested in
  • The number of features that affect the target

Let’s start with features. Remember that structured data is the kind of data you’re looking at when you open an Excel sheet, organized into rows and columns. A good way to think about data requirements is to picture your dataset on a single screen: you’d want it to look tall and thin, as shown in figure 8.4. You want many more rows than columns. This is because rows indicate examples, while columns indicate features that the model has to learn from. Intuitively, the more information a model has to learn from (the more features), the more examples it needs to see in order to grasp how the features influence the target (which means more rows).

If you have too few examples and too many features, some of the columns might even be useless or misleading! Take, for example, a home-price dataset like the one in chapter 2. Adding a column with the zodiac sign of the seller is unlikely to improve the accuracy of price predictions. However, ML models can’t draw common-sense conclusions a priori: they need to figure them out from the data. As long as you have enough rows (examples), most families of models are indeed able to do so, and will rightfully “ignore” the zodiac sign column. However, if you have too little training data, the model will still try its best to estimate how the zodiac sign affects the price, reaching numerically correct but misleading conclusions.

As an extreme example, imagine that the only $1 million home in the dataset was sold by a Gemini. Surely, this doesn’t mean that price predictions for homes sold by Geminis should be higher than those for homes sold by Aries. Most models will be able to avoid this mistake if the dataset contains many million-dollar villas (a dataset with a lot of examples), because sellers will have many different zodiac signs and their effect can be correctly estimated. However, if you don’t have many examples, the model could “think” that the zodiac sign is the driver of house value.
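You can reproduce this failure mode in a few lines of Python. The sketch below fabricates a tiny dataset in which exactly one seller is a Gemini and, by coincidence, owns the only million-dollar home; with so few rows, a linear model gives the zodiac column an enormous weight:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 12                                   # tiny dataset: one sale per zodiac sign
sqft = rng.uniform(800, 3000, size=n)
price = 200 * sqft + rng.normal(0, 10_000, size=n)  # price mostly driven by size

gemini = np.zeros(n)
gemini[0] = 1                            # the only Gemini seller...
price[0] = 1_000_000                     # ...happens to sell the priciest home

X = np.column_stack([sqft, gemini])
model = LinearRegression().fit(X, price)
print(model.coef_)  # the "gemini" coefficient absorbs hundreds of thousands of dollars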

Figure 8.4 The dataset on the left, which has many examples and few features (tall and thin), is a good dataset for ML. The one on the right, which has many features and few examples (short and wide), is not a good dataset for ML.

Let’s now go into the specifics of classification and regression problems. In classification problems, you’re trying to predict which of two or more classes an example belongs to. The first classification examples in this book were as follows:

  • The loan eligibility algorithm of Square in chapter 2, which assigned customers to one of two classes: eligible for a loan or not eligible
  • The churn prediction model of chapter 3, which labeled customers as either about to abandon a service or not

Assuming you have a modest number of features (let’s say 10), you should budget at least 1,000 examples for each class that you have in your problem. For example, in a customer churn model that has only two classes (loyal versus nonloyal), you might want to plan for at least 2,000 examples. Intuitively, the more classes the model has to deal with, the more examples it will need to see in order to learn how to distinguish all the classes.
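If you like turning rules of thumb into code, here’s a back-of-the-envelope helper encoding the heuristic above: roughly 1,000 examples per class at about 10 features. The proportional scaling for wider datasets is our own assumption; treat the result as a planning floor, not a guarantee of accuracy:

```python
def rough_min_examples(n_classes: int, n_features: int) -> int:
    """Heuristic floor: ~1,000 examples per class at ~10 features,
    scaled up proportionally for wider datasets (an assumption,
    not a law). Not a guarantee of model accuracy."""
    per_class = 1_000 * max(1.0, n_features / 10)
    return int(n_classes * per_class)

print(rough_min_examples(n_classes=2, n_features=10))   # churn model: 2,000
print(rough_min_examples(n_classes=5, n_features=30))   # 15,000
```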

It’s harder to give similar rules of thumb for regression models because they can model more-complex scenarios and phenomena. Many regression models are based on time-series data, a special kind of data that describes how measurements or numbers evolve over time. An example is Google’s data center case study from chapter 2 that collected environmental measurements throughout the day. Another example you may be more familiar with is financial data (for example, the price of stocks). You can see what time-series data looks like in figure 8.5.

Figure 8.5 Time-series data looks like a stream of measurements taken over time (for example, of temperatures). Values to the right of the plot are newer than those on the left.

In time-series data, the number of data points you collect is not as important as the time span over which you collect them. The reason is pretty intuitive: suppose you are Jim Gao, the Google data center engineer from chapter 2, and you’re collecting one data point per second from your data center’s air-conditioning system. If you collect data from January until March, you’d have almost 8 million data points (3 months × 30 days per month × 24 hours per day × 60 minutes per hour × 60 seconds per minute). With so many points, you may think you have a great dataset, and indeed your models may be very accurate . . . for a few days.

But what happens as summer gets closer and temperatures rise? The model doesn’t know how the data center behaves during the hottest months of the year because it’s been trained using only winter data.
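One practical consequence: when you validate a time-series model, split the data by time instead of at random, so that seasonal blind spots show up during validation rather than in production. Here’s a minimal pandas sketch, assuming a hypothetical datacenter.csv file with a timestamp column:

```python
import pandas as pd

# Hypothetical file: one sensor reading per second, with a timestamp column.
readings = pd.read_csv("datacenter.csv", parse_dates=["timestamp"])
readings = readings.sort_values("timestamp")

# Train on the earliest 80% of readings, validate on the latest 20%.
# A random split would leak future (e.g., summer) data into training
# and hide the seasonal gap described above.
split = int(len(readings) * 0.8)
train, valid = readings.iloc[:split], readings.iloc[split:]
print(train["timestamp"].max(), "->", valid["timestamp"].min())
```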

Turning to AI for media, the simplest image classification tasks (think cats versus dogs) need roughly a few hundred examples for each class (the famous ImageNet dataset has roughly 700-800 images per class). If you choose classes that are more similar to each other (like breeds of dogs), the number of required samples shoots up into the thousands, because the model needs more examples to pick up the subtler differences. If you take advantage of transfer learning, you start your training from an existing model trained on huge datasets like ImageNet (as explained in chapter 4), and you can get good performance in the low hundreds of images (or even tens for very simple tasks).
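As a concrete illustration of the transfer learning setup, here’s a minimal PyTorch sketch (one reasonable choice of library among several, assuming a recent torchvision version): load a network pretrained on ImageNet, freeze its feature extractor, and train only a small new classification head on your images:

```python
import torch.nn as nn
from torchvision import models

# Load a ResNet-18 with weights pretrained on ImageNet.
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the pretrained layers: they already encode generic visual features.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with a fresh head for our task (say, 2 classes).
num_classes = 2
model.fc = nn.Linear(model.fc.in_features, num_classes)

# An optimizer over model.fc.parameters() then trains just the new head,
# which is why a few hundred images per class can be enough.
```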

Giving similar guidelines for natural language applications is more difficult, simply because the landscape of tasks is more varied. For simpler tasks like sentiment analysis or topic classification, 300-500 examples for each class might be enough, assuming you’re using transfer learning and word embeddings.
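For a sense of what such a baseline involves, here’s a minimal scikit-learn sketch. Note that it uses a TF-IDF bag-of-words instead of the embeddings mentioned above, an even simpler approach that is often a sensible first attempt:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data; a real project would use 300-500 examples per class.
texts = [
    "loved it, works great",
    "absolutely terrible, broke after a day",
    "fantastic value for the money",
    "complete waste of money",
]
labels = ["positive", "negative", "positive", "negative"]

# TF-IDF turns each review into word counts; logistic regression classifies them.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["great product, loved it"]))
```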

Now that you have guidelines about how much data you should be thinking about collecting, you should understand that not all data points have the same importance when training models. Adding especially bad examples might even backfire and reduce the overall accuracy of the model. This is the subject of the next section.

8.2 Data quality

Much of the discussion of AI in popular media suggests that AI-based decisions are inherently more rational than human ones because they’re based on math and data. In other words, it insists that AI is data driven and thus immune to human biases and prejudice. The three stories in this section aim to convince you otherwise.

Machine learning is all about finding patterns in data, and training on biased data will cause the model to make biased decisions. What do we mean by biased data? You already saw an example in chapter 4, where we told you about researchers collecting a dataset with pictures of huskies and wolves, and setting out to build a model to classify the two. At first glance, the model was doing great, but the researchers discovered it did so only because it was relying on the fact that all pictures of huskies had snow in the background, and those of wolves didn’t.

It turns out that even standard datasets are affected by the same problem: neural networks like to hallucinate sheep when presented with pictures of grasslands with no sheep present, as shown in figure 8.6.

Figure 8.6 Neural networks tend to associate pictures of green hills with sheep, because that’s where most of them appear in the training dataset. (Source: https://aiweirdness.com/post/171451900302 .)

Likewise, if you show a neural network an image of a kid holding a lamb, it will most likely classify the animal as a dog. This happens because the training datasets are biased: sheep appear only in green fields, and most animals held in a human’s lap are dogs. Because ML models are mostly pattern-matching machines on steroids, a model may well learn that a “sheep” is just a tiny speck of white on a mountain landscape, and a “dog” is anything with four legs close to a human.

The underlying problem is that the training set is incomplete: it’s missing important combinations of factors that would help the model in its (autonomous) quest to find the most important characteristics in the image. For instance, the problem would be solved by adding images of people holding other animals, and images of sheep in unusual places (on a beach, in a house, on a boat, and so forth). In this way, the model will learn to separate the sheep from its context and correctly “understand” what a sheep really is.

Let’s move on to a subtler and more insidious case of data bias. The goal of this project was to create a model that could automatically screen patients for skin cancer by looking at photographs of their bodies. It turns out that when dermatologists take pictures of sick patients, they often place a ruler alongside the cancer for scale. This means that most photos of sick patients also had a ruler in the frame. Needless to say, the model learned to pick up on the presence of the ruler, as that’s much easier to recognize than small patches of darker skin. Lacking any ability to recognize that the ruler was a confusing hint, the model was completely useless on future patients.

We like this case because it proves how seemingly insignificant details can affect the quality of the training dataset, and thus of the resulting model. In other words, today’s AI models have such a shallow understanding of the world that their mistakes can be devastatingly dumb.

In the case of cancer patients and rulers, at least an experienced engineer would have been able to find and pinpoint the problem. However, some sources of bias are harder to spot. Say that all pictures of sick patients were taken in a room with fluorescent lighting, and all those of the healthy subjects in a room with LED lights. The slight color difference in the pictures is hard to pick up with the naked eye but is an attractive hint for any neural network. As a result, you’re left with a potential disaster: you think your model has learned to spot cancer, but it has actually learned to recognize LED lighting.

The common theme of these first two stories is that training data has a direct impact on the behavior of your model. So much for AI-based impartiality and data-driven rationality! In both cases, the problem was a lack of contrarian examples in the dataset (say, sheep walking on asphalt, kids cuddling little tigers, and skin cancer under different light conditions). However, data quality can also suffer because of deeper incorrect assumptions. Let’s see how in the final example.

Figure 8.7 During World War II, Allied planes returning home after a mission were often heavily hit on the wings and fuselage.

During World War II, both the Allies and the Axis powers were constantly looking for ways to improve their fighter and bomber planes. One of the most important goals was to make aircraft lighter so they could fly farther. Because planes need engines and fuel tanks to fly, cutting weight meant removing armor that could protect the vehicle from enemy fire, potentially saving the pilot’s life. The engineers wondered which areas of the plane were the most vulnerable and thus needed heavier armor.

The military started examining the frames of damaged planes and produced images like figure 8.7. It seemed that the fuselage and wings suffered the worst damage. This means that the airplane designers should reinforce the armor in those hard-hit areas, right?

Counterintuitively, this is exactly the wrong thing to do! We’re looking only at the planes that made it back home, and not at those shot down by the enemy. All this data can teach us is that, contrary to what we thought in the beginning, planes can sustain heavy fire on the fuselage without losing the ability to fly. In fact, designers ought to put the heaviest armor on the engines, even if they found just a few bullet holes there, because planes shot in the engines were the ones that crashed.

Compared to the previous two examples, in which data was simply unbalanced or incomplete, this story teaches us that data collection can be plain misleading and lead us down a completely incorrect path. This phenomenon is commonly referred to as survivorship bias, and you’ll find it applies outside AI too. In general, anytime you’re selecting subgroups of items or people, you should be on the lookout for survivorship bias effects that will thwart your efforts. Say you’re constantly surprised by how many 20-year-old cars you see driving around, and you’re inclined to think that newer models would never last that long, as cars today seem to break down all the time. Survivorship bias suggests you should also count the old cars already sitting in a junkyard after breaking down years ago. You won’t see any of those driving around, and thus the numbers will be skewed.

These three examples have shown you ways in which your data collection efforts might lead you to miss the forest for the trees and end up with ineffective models. Don’t despair, though; these are some of the most active areas of research today. In the meantime, the best antidote is to follow a deep and wide approach. Deep refers to efficiently collecting the bulk of the data that you need to build an accurate model. Wide is about complementing the deep dataset with a (likely smaller) set of unusual and tricky examples that you can use to double-check the results.

Remember: the main challenge is that we have limited tools to analyze how AI models make their decisions. Just as scientists didn’t know much about bacteria before the invention of the microscope, ML engineers have to guess what the model is doing. Because the only way to learn how a model will react to a specific input is to run it, models must be tested and validated on realistic data. As a practical example, self-driving car companies collect data throughout all four seasons of the year and at all times of day to make sure their object detection models can reliably detect pedestrians, no matter whether they’re dressed in shorts or coats. In other words, make sure that you’re mixing things up a bit and snapping those pictures of huskies at the beach.

Now that we have talked about the main aspects of a data strategy, the remainder of the chapter deals with a less technical, albeit equally challenging, type of resource: humans.

8.3 Recruiting an AI team

This section guides you through the process of recruiting a team with the appropriate mix of talents and experience needed for the implementation of an AI project. It’s helpful to talk about three main categories of skills:

  • Software engineering --Extracting data from various sources, integrating third-party components and AI, and managing the infrastructure
  • Machine learning and data science --Choosing the right algorithm for a specific problem, tuning it, and evaluating accuracy
  • Advanced mathematics --Developing cutting-edge deep learning algorithms

You may be tempted to find a person who has all of these skills, a figure often referred to as a unicorn: it basically doesn’t exist. However, you can find different people (or teams of people) who excel at one or more of these skills.

In this book, when talking about the technical implementation of AI projects, we often refer to the mythological figure of the data scientist. The Harvard Business Review defined it in 2012 as “The Sexiest Job of the 21st Century”: every company is looking for one, but no one can describe exactly what it is.

The problem is that in many people’s minds, a data scientist is a jack-of-all-trades: someone who can solve any kind of problem that involves data, ranging from simple analysis to the design of complex ML models. On top of that, we often want the ideal data scientist to show business acumen too. The expectations of what a data scientist can do are so high in theory that they’re often unmet in practice.

We recommend that you look for a data scientist in the following cases:

  • You are not yet sure about the direction that your organization will take. Having someone with a broad skill set can help you be flexible until you have a clearer direction.
  • You don’t have extremely complex needs on the technical side. If all you need is someone who can analyze medium-sized datasets (up to a few gigabytes) to make business decisions and build some ML models, people who call themselves data scientists are usually a good fit.

If you’re specifically trying to build an AI team that can tackle more-complex AI projects and also deploy them in the real world, you probably want to start recruiting people who specialize in different skills. We can identify three major categories of talent, each with its own skill set:

  • Software engineer
  • Machine learning engineer
  • Researcher

Figure 8.8 helps you visualize how each of these roles has a different composition of the three core AI skills: software engineering (SW), machine learning/data science (ML), and advanced math.

Let’s see how each of these roles fits into the data/model/infrastructure breakdown of AI projects. Software engineers work mostly on the infrastructure side of the project, setting up the systems that fetch data (from internal or external sources) and integrating the ML model into your existing products or services.

There’s also a special role referred to as a data engineer. This person is usually hired when you’re dealing with amounts of data so massive (many terabytes) that even computing the simple average of a database column can be a challenge. Because you’d probably use a data engineer for the infrastructure of your project but not for its ML models, we’ll consider them a special kind of software engineer dealing with extremely large datasets.

Figure 8.8 The typical software engineer, ML engineer, and researcher have different levels of skills across software engineering (SW), machine learning (ML), and math.

A machine learning engineer has specialized knowledge of AI and machine learning. They know the contents of part 1 of this book by heart (and much more), and their job starts with cleaning and analyzing the data and ends with choosing the most appropriate model for your AI project. They also take care of writing the code to train the model and evaluating its performance. The outcome of the ML team’s work is often a trained model, together with a description of its accuracy and of the data that was used for training. ML engineers handle performance improvements to the model and will guide your efforts in collecting more or better data. Although some ML engineers also know how to deploy models effectively, that’s usually not their favorite activity. Of the three roles, ML engineers are the most similar to our earlier description of data scientists.

Finally, the researcher is the most academic role of the three. They can use their in-depth research experience in a specific niche to find solutions to novel problems that are not well explored in industry yet. Of the three components of the project, the researcher sits firmly in the model camp, where they can use their knowledge of the state of the art to build AI algorithms that push forward the frontiers of what’s possible. Most researchers are great at building models, but lack software engineering skills and are usually not interested in deploying their creations or figuring out all the bugs.

The main difference between a researcher and an ML engineer is that the former is a scientist, while the latter is an engineer. The distinction might seem arbitrary, but they have very different skill sets. If your project involves a problem or algorithm that we have explained in this book, you want an ML engineer: all the research work has already been done for you, and you just need somebody who understands it and can apply it to your specific problem. If you’re venturing outside the realm of industry-standard problems and data sources, you want a researcher on hand who has a deeper understanding of the theory and can come up with innovative approaches to novel problems.

That being said, the skill sets of these three kinds of people definitely overlap and might even be interchangeable in some cases. You can look at the three roles as different kinds of chefs. If you want to eat a basic dish, probably any chef can pull it off. At higher levels, chefs start to specialize in specific cuisines or disciplines, and in a Michelin three-star restaurant you may find a pastry chef, a sauce chef, a fish chef, and so on.

Let’s see who could have built some of the projects we designed in this book. The price predictor is a pretty standard ML project (there are also many examples online), so you may go with a data scientist or machine learning engineer. The same goes for the churn prediction and upselling classifier projects we talked about in chapter 3.

Chapter 4 covered media data, which is a bit trickier. If many users are on your platform at the same time, you may need a software engineer to make sure your infrastructure is stable and can withstand the load of hundreds of images that need to be processed simultaneously. The algorithm itself is a simple classifier, and probably any skilled ML engineer can pull it off.

In chapter 5, we looked at text data and saw project ideas with different levels of difficulty. For the simplest projects, like the text classifier or sentiment analysis, you could go ahead with an ML engineer; but if you want to push the boundaries of the state of the art and try building the super-powerful brokerbot, you may have to hire a researcher as well.

In chapter 6, we talked about recommender systems. Simple models can be built by a single ML engineer, whereas particularly complex models may call for a researcher to join the team. You may also need your recommender system to be extremely fast and stable: this is the case when many users are on your service at the same time and speed is crucial. An example is Facebook’s feed, which needs to recommend content almost instantly as you--and millions of other people--are scrolling. In this case, software engineers become extremely important, as they are the ones who can build an infrastructure that can withstand such loads.

To get a complete overview of these roles, see table 8.2, which highlights which types of tasks are most suited to each talent category.

Table 8.2 Software engineers, ML engineers, and AI researchers do their best work when the task matches their experience and background.

|  | Software engineer | ML engineer | AI researcher |
| --- | --- | --- | --- |
| Usual background | Computer science | Computer science, statistics | Statistics, math, physics |
| Tasks | Implementing the application, integrating it with the existing infrastructure, building the database | Building and testing ML algorithms, developing a data strategy, monitoring the outcomes | Developing new algorithms, researching new solutions |
| Project fit | Complex IT infrastructure, very high performance requirements (for example, a self-driving car, or a high-traffic service like Facebook), huge datasets (data engineers) | Business problems solvable with state-of-the-art ML, like most of the examples discussed in this book | Complex problems for which the needed technology isn’t developed yet (for example, complex NLP tasks) |

As always, keep in mind that the optimal skill set of the team will change as the project evolves over time, and even as different components of the project are completed. Factors like these will drive your decision to employ expert consultants or hire full-time team members. We advise against making the common mistake of hiring a hot-shot AI researcher without having a solid base of software engineering in place, as the new hire would likely be frustrated by the lack of proper infrastructure to move the work into production and would eventually leave. A common strategy is to start hiring for the more generalist roles first, so they gain experience with the broad context of the organization. It’s easier to rely on consultants for niche knowledge like AI modeling, as their skills are more broadly transferable across multiple organizations (after all, image classification models are the same no matter whether you’re trying to classify cucumbers or pedestrians).

Lastly, do not underestimate the importance of seniority. Although some tasks like data cleaning can be performed by junior hires, AI projects can get messy fast, and without the supervision of more senior people, you risk ending up with unmaintainable code. Saving money now can lead to wasting even more money and time down the road.

This chapter introduced the main ingredients of any successful AI project: a data collection strategy, the software infrastructure and integration with existing processes, and a well-skilled team. We have also warned you about many of the pitfalls troubling companies that are still inexperienced with AI. The next chapter provides the recipe for pulling these ingredients together into an AI project.

Summary

  • The technical implementation of AI projects has three main components: data, model, and talent.
  • A coherent data strategy takes into account the relative strengths and weaknesses of proprietary and external data.
  • Besides quantity, other primary concerns about collecting data and labels are types of bias, coverage, and consistency.
  • A good AI team includes people with different backgrounds and skills: software engineers, machine learning engineers (or data scientists), and researchers.