Chapter 5. What is an ML pipeline, and how does it affect an AI project?

This chapter covers

  • Understanding an ML pipeline
  • Understanding why an ML pipeline ossifies and how to address that
  • Understanding the evolution of ML or AI algorithms in larger systems
  • Balancing attention between business questions, data, and AI algorithms

In previous chapters, you learned how to select your initial AI project and how to tie technology metrics to business results. Now it’s time to understand how to guide the development of the software architecture of the AI project. This chapter teaches you to recognize the respects in which an AI system behaves differently from other software systems. To implement an AI project effectively, it’s important to understand the technical artifacts and the life cycle of an AI project. I’ll start by explaining the most important artifact of the AI project—the ML pipeline.

The ML pipeline describes how data flows through the system, what high-level transformations are performed on it, which ML and AI algorithms are applied, and how the results are presented to the user of your AI system. Without a focus on the pipeline’s architecture, your project will be saddled with whatever ML pipeline emerges from early proof-of-concept (POC) decisions. That’s not good, because an ML pipeline quickly ossifies and becomes difficult and costly to radically change.

This chapter shows you how to determine early in your project if the ML pipeline can support your business goals. Finally, this chapter shows the role that AI methods play in the ML pipeline and in which order to make decisions about business, AI methods, data, and technical infrastructure.

5.1. How is an AI project different?

An AI project is a software project that’s subject to the same considerations as other software projects. In addition to being an ordinary software project, an AI project always has an artifact called an ML pipeline. Furthermore, an AI project links with the React part of the Sense/Analyze/React loop: it can directly control the system (in the context of a fully automated system) or report the results of the analysis (if a human is responsible for the reaction).

This section explains the concept of the ML pipeline, how the software architecture of an AI project is different from the software architecture of other projects, and how to manage the construction of an AI project’s architecture.

5.1.1. The ML pipeline in AI projects

The ML pipeline is a software artifact that addresses several considerations in your AI system. It describes how data flows through the system, the high-level transformations performed on that data, the ML and AI algorithms applied, and how results are presented to the user. The ML pipeline is essential to your project, and your team must get its design right. To help you do that, this section presents the concept and an example of an ML pipeline.

Your AI software system consists of more than just AI algorithms. It must account for many functions:

  • AI algorithms operate on the data. That data needs to be stored somewhere, which may require a technical infrastructure such as a big data framework.
  • That data can be dirty (meaning there could be various errors and irregularities in it), so it needs to be cleaned.
  • Data relevant to your AI system often resides in multiple different data sources, such as a data lake or various databases, so you need to bring together and combine data from those sources.
  • If you’re expecting a person to fulfill the React part of the Sense/Analyze/React loop, then the results of an AI analysis need to be presented to the user of your AI system.
  • If the React part of an AI system is automated, then you must ensure that the AI system follows some business and safety rules that are specific to your domain.

To demonstrate, let’s construct an ML pipeline used in the context of a factory line. That factory line includes many machines that have consumable supplies, such as oil and various parts. The machines are automated and can sense the current level of consumables. You want to order consumables when they’re needed. Figure 5.1 shows an associated ML pipeline, simplified for a more straightforward explanation of the concept.

Figure 5.1. Example of an ML pipeline for a factory line. This ML pipeline oversees the automated ordering of consumable supplies from suppliers.

Let’s deconstruct what’s shown in figure 5.1. Note that the boxes in figure 5.1 could represent interactions both between separate software systems (for example, invoking other programs and computer systems) and between parts of a single program. The deconstruction looks like this (a minimal code sketch of the flow follows the list):

  • Data about the current level of supplies in the machine is obtained from sensors in the device itself (box 1). The system predicts when a machine will run out of supplies and reorders new supplies. To do that, it needs to know both the current level of the supplies and the historical trend of how fast the machine uses them.
  • Data about historical supply usage is stored in the enterprise data warehouse (EDW), while box 1 provides the current level of supplies. An extract/transform/load (ETL) process makes the historical data from the EDW available to the AI system (box 2).
  • Sensor data about the current level of consumables needs to be combined with the data from the EDW (box 3).
  • Data usually isn’t in the exact format that the ML algorithm needs. For example, it might have empty fields, errors, or different formats than the algorithm expects. The ETL process typically won’t completely address all of these problems. As a result, you need to do some cleanup of both the EDW and sensor data (box 4).
  • Now you need to answer the question, “Based on the current levels of supply and historical usage, when would I need to have new supplies ordered?” To answer that, you’d need to predict the usage of the remaining supplies. You apply an AI algorithm to the levels of supply to forecast the expected time when additional supplies would be needed (box 5).
  • Based on the forecast, you query various suppliers for their current level and cost of supplies (box 6).
  • Because not all of the suppliers are equal when it comes to quality, the system chooses the best supplier (box 8) based on the historical quality metrics (box 7) and the current cost and inventory available from the suppliers (box 6).
  • The outcome from the previous step determines the automated order that will be placed with the supplier (box 9).
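To make the flow concrete, here’s a minimal sketch of the figure 5.1 pipeline expressed as composed Python functions. Everything here is illustrative and heavily condensed (the names, the data values, and the stand-in forecast are all assumptions, not a real API); in production, each stage would typically be a separate job or service.

  def read_sensor_levels(machine_id):              # box 1
      return {"machine": machine_id, "oil_pct": 22.0}

  def extract_history_from_edw(machine_id):        # box 2 (ETL from the EDW)
      return [{"day": d, "oil_used_pct": 1.9} for d in range(30)]

  def merge_and_clean(levels, history):            # boxes 3 and 4
      usage = [h["oil_used_pct"] for h in history if h["oil_used_pct"] > 0]
      return levels["oil_pct"], usage

  def forecast_days_left(level, usage):            # box 5 (the ML step)
      daily = sum(usage) / len(usage)              # stand-in for a real model
      return level / daily

  def choose_supplier():                           # boxes 6, 7, and 8, condensed
      offers = [("acme", 90.0, 0.97), ("globex", 80.0, 0.88)]  # name, price, quality
      return max(offers, key=lambda o: o[2] / o[1])

  def place_order(supplier, days_left):            # box 9
      print(f"Order from {supplier[0]}; supplies needed in {days_left:.1f} days")

  level, usage = merge_and_clean(read_sensor_levels("m-17"),
                                 extract_history_from_edw("m-17"))
  place_order(choose_supplier(), forecast_days_left(level, usage))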
Real-world ML pipelines are complex

Real-world, enterprise-strength ML pipelines are more complicated and expensive to build. The ML pipeline in figure 5.1 is simplified to highlight only the most critical parts of a typical pipeline used for the automated ordering of supplies.

For production, you’d need to think about additional details in the ML pipeline, so the pipeline could be much more complicated. Even the individual steps in each of the boxes in figure 5.1 are typically more complicated. The process of extracting data might include various data quality checks, ELT/ETL processes,[a] and data cleansing operations. It’s also quite possible that some of the source data wouldn’t be in the optimal format and would need to be transformed.

a

ETL stands for extract/transform/load and ELT for extract/load/transform. Those processes describe the movement of data from the source systems it resides in to the system responsible for analytics; the difference is whether the data is transformed before or after it’s loaded into the target system.

In any extensive database of historical data, you’ll find data that’s simply wrong (such as NULL fields that aren’t supposed to be NULL and other incorrect values). Remember that you’re often dealing with data that’s been accumulated over many years, through many changes to the system and many bugs that made it into production.
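To make this concrete, here’s a small sketch of the kind of cleanup box 4 implies, using pandas (an assumed choice; any data-wrangling library would do). The column names and values are made up for illustration.

  import pandas as pd

  df = pd.DataFrame({
      "machine_id":   ["m-1", "m-2", None, "m-4"],
      "oil_used_pct": [1.9,   -3.0,  2.1,  None],   # -3.0 is "simply wrong" data
  })

  df = df.dropna(subset=["machine_id"])         # rows missing their key are useless
  valid = df["oil_used_pct"].between(0, 100)    # the NaN and -3.0 readings fail this
  df.loc[~valid, "oil_used_pct"] = df.loc[valid, "oil_used_pct"].median()
  print(df)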

There’s no such thing as a universal ML pipeline that would work for every project and every organization. An ML pipeline is customized for each organization and the problem being solved. The structure of the pipeline can be different even when two different businesses are solving the same problem!

Returning to our example: even if two organizations in the same business are solving the same automated supply-ordering problem, their ML pipelines won’t be the same. The reason is that the two organizations might be using different machines as part of their factory lines.

In figure 5.1, the assumption was that data about the historical usage of supplies was stored in the EDW. However, if you have a different factory line, the machines on that line may store historical data about supply usage themselves. You might also be able to assume that such historical data is correct and doesn’t have any significant errors that need cleaning. The ML pipeline for such a factory line would be different; it would bypass the EDW and send data directly from the machines to the predictive algorithm (box 5 in figure 5.1). Figure 5.2 shows this situation.

Figure 5.2. Modification of the ML pipeline from figure 5.1 for a situation in which supply levels are stored in the machine itself. You don’t need to worry about the EDW or about merging data, and you can assume that the historical data is correct, so there’s no need to clean the data.

Note that some stages in the pipeline of figure 5.2 play the same function as corresponding stages in figure 5.1. For easier reference, I’ve numbered such stages identically.

Although pipelines are often customized, ML pipelines used to address similar problems frequently share a common structure. An ML pipeline might represent years of accumulated wisdom from your company (or the whole community) and be structured accordingly. For example, natural language processing (NLP) often uses the same general form of ML pipeline [84].

Can we be more formal with an architectural description?

To help readers who are accustomed to the typical architectural presentation done in the industry today, I’m using a simple box-and-arrow style of architectural diagram, as opposed to more formal architectural presentation methods (such as the 4+1 architectural view model [85,86]). If you’re an experienced software architect and are wondering where the diagrams in figures 5.1 and 5.2 fit into the broader architecture of the system, they’re drawn in the context of the development view of the system.

5.1.2. Challenges the AI system shares with a traditional software system

As you’ve seen, the ML pipeline serves many functions, and it’s easy to see why your software engineering team would be interested in its structure. This section discusses structural attributes of the ML pipeline that you should care about even if you’re a business user.

The ML pipeline isn’t just a codification of technical decisions; it’s also a codification of the business decisions you make. In figure 5.1, the ML pipeline automatically communicates with suppliers. Someone had to decide which suppliers to use, and contracts with those suppliers needed to be signed. You also needed to work with the suppliers to get access to their systems and APIs so that you could place automated orders. All of those were business decisions that are now reflected in your ML pipeline.

An ML pipeline is costly to develop, not just because of the cost of developing software, but also because of the associated costs you incur to obtain the data the pipeline will use and the related contracts you need to sign with your business partners.

However, don’t forget that an ML pipeline is a part of the larger software system. All the rules that apply to the development of typical software systems still apply. You need competent engineers and architects, and you need proper software development processes. Because an AI project could fail in the same way as any other software system, use the knowledge you already have about managing software projects to help prevent such a failure.

5.1.3. Challenges amplified in AI projects

In some cases, introducing AI can amplify considerations that already exist in traditional software systems. Although those considerations aren’t related to AI, they often manifest differently in the context of AI projects. This section provides some examples of those considerations.

Security is important for a software project, but it’s even more crucial in the context of an AI project. An AI project can have a lot of data about your users, and a breach of that data can be much more impactful.

AI systems integrated with physical devices warrant additional considerations. AI might need more information than a traditional system would. For example, it’s unusual for a website to require access to a video stream of your workspace. A security system using AI, however, might benefit from a video stream. In such a system, you must account not only for the costs and behavior of the ML pipeline but for the costs and behavior of the larger system.

Note

Imagine that you’re selling AI security devices for the home market. Security problems on the device might allow a hacker to get access to the video and audio of your customer’s home. Security is a crucial component of AI systems!

When constructing an AI system that controls a physical device, the reactions of that system (in the context of the Sense/Analyze/React pattern) must be safe. If your ML pipeline controls the operating parameters of a physical engine, it’s typical that one of its steps would be responsible for keeping those parameters within a safe operating range, as the sketch below illustrates.
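Here is a minimal sketch of such a safety-rule step. The names and limits are hypothetical stand-ins; in a real system, the bounds come from the device’s specification, never from the ML model.

  SAFE_RPM_RANGE = (600, 4200)   # assumed limits, taken from the device spec

  def apply_safety_rules(recommended_rpm: float) -> float:
      """Clamp whatever the model recommends into the safe operating range."""
      low, high = SAFE_RPM_RANGE
      return max(low, min(high, recommended_rpm))

  print(apply_safety_rules(5100.0))  # the model asked for 5100; the device gets 4200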

5.1.4. Ossification of the ML pipeline

Besides the challenges that AI projects share with other software projects, they also have problems of their own. One such problem is the cost of maintaining the ML pipeline. One of the biggest contributors to the cost of an AI project is that the ML pipeline quickly becomes difficult and costly to radically change.

AI software is costly to maintain. There’s an inherent tangling between the data and the algorithms used that is more prevalent than in any other type of software project. To address this, if you’re a software architect, I highly recommend that you read the article by D. Sculley et al., “Machine Learning: The High Interest Credit Card of Technical Debt” [87], for details about common pitfalls and advice on how to avoid them.

Tip

As a business leader, make sure you have not only the best data scientists you can get, but also the best software architects. They’re equally important.

Building the ML pipeline itself also involves costs and challenges. Construction of the pipeline is an early step in an AI project, and data scientists make an early version of the ML pipeline during the initial POC. You may be tempted to adopt that ML pipeline for your production system, but you should be careful when selecting the pipeline you’ll use for production. Once the ML pipeline emerges, as mentioned, it rapidly becomes difficult and costly to radically change—it ossifies. The more your project progresses, the more difficult and costly it will be to change the structure of your ML pipeline.

Tip

The very nature of the ML pipeline is that it starts ossifying the moment you begin implementing it. You can’t prevent ossification. The best you can do is engineer an ML pipeline that solves your business problem. What’s costly is being caught by surprise with the ossification of the wrong ML pipeline.

On a high level, ossification happens for both technical and organizational reasons. On the technical side, the ML pipeline is a complex piece of software that can have many steps. Each of those steps might require specialized data engineering skills, ranging from big data to cloud computing. On the organizational side, ML pipelines, more than most other software artifacts an organization builds, require agreements across departments and even new contracts with external vendors. For example, let’s look at the ML pipeline from figure 5.1.

You can see that in steps 6, 7, and 9, the pipeline interacts with software systems that your suppliers control. How do you get access to those systems? You need agreements with the EDW team and with the various suppliers whose automated ordering APIs you’d use. You need to talk with multiple departments and sign a ton of contracts, and you’ll have the legal department on speed dial. The end structure of the ML pipeline reflects the result of many negotiations.

It isn’t just interaction with external organizations that gets you. Forces inside your own organization can also accelerate ossification. For example, you might have to negotiate with other departments to gain access to data. In a larger organization, even within your department and project, you might see some form of Conway’s Law [88,89] acting on you.

Note

Conway’s Law states that any organization that designs systems produces a design whose structure is a copy of the organization’s communication structure. If you have five teams, don’t be surprised if you finish with a five-stage ML pipeline.

Once an ML pipeline is defined, it’s common (and unavoidable) that various parts of it will be entrusted to different people or different departments. It’s still rare to find people who have a strong familiarity with both AI and all the pieces of the data engineering needed in the pipeline. Departments will focus on their own area of expertise, not on the ML pipeline as a whole. The overall result is an increased rate of ossification; as time progresses, it becomes more difficult to make radical changes to the structure of the pipeline.

Once defined, an ML pipeline immediately starts accumulating technical debt (in the form of the code that implements it, which is subject to the issues detailed in the article by Sculley et al. [87]) and becomes subject to human inertia. The agreements people make are the results of significant negotiation, and such agreements are often more resistant to change than the software itself.

Note

ML pipeline ossification is not limited to large organizations. Even in smaller organizations, if you hire someone specifically because they know Apache Spark [14], it’s a safe bet that they’d want to continue using Spark.

An ML pipeline is more visible to the rest of an organization than other software the organization builds, so changes in an ML pipeline are more noticeable to the rest of the organization. This naturally makes the management of an AI team more reluctant to change the ML pipeline than it would be for other pieces of software. Once the ML pipeline is defined, your AI team (and the broader organization) becomes significantly more resistant to changing the pipeline’s structure. Organizational resistance would be even higher toward the idea of completely replacing the ML pipeline with a new and different one, and you’d need good reasons to persuade the rest of the organization that a complete replacement was warranted.

Warning

An ML pipeline rapidly ossifies. That’s a foreseeable result and the very nature of an ML pipeline. Ossification is unfortunate and can’t be prevented. Therefore, it’s important that you ensure early in ML pipeline development that it’s the right pipeline to solve your business problem.

Because of these characteristics of pipelines, without proper planning, it’s easy to end up with a pipeline that’s not only inadequate for your needs but also difficult and costly to change.

An ML pipeline could span the whole community

Whole communities and subfields of ML are formed around the pipelines that emerge early, showing that ossification of the ML pipeline can affect not only a single organization, but a whole community. An example is the NLP community, which often uses a relatively standardized form of the pipeline. Historical efforts of the speech recognition community also have led to a standardized pipeline [90]. In some situations, whole AI communities might be facing the possibility that the current standard pipeline needs to be changed. For example, the advance of deep learning caused a need for significant changes in the traditional speech recognition pipeline [90].

The ML pipeline’s ossification is compounded if data science and data engineering sit in separate groups in the organization. In that situation, data scientists focus on getting the best results and then throw their AI methods over the wall to the data engineering group. The data engineering group, in general, doesn’t have the knowledge or the mandate to modify the AI methods used, so they implement them as-is. Engineering decisions are made by a cacophony of people working in various teams, and none of these decision makers is likely to have the ambition to own the whole ML pipeline. The result is a compromise and a lack of ownership. Figure 5.3 illustrates what may happen when you put such an ML pipeline together.

Figure 5.3. An ML pipeline requires data engineering and data science expertise to construct. Specialists in one area might not know another area well, making mismatched elements of an ML pipeline more likely.

5.1.5. Example of ossification of an ML pipeline

Let’s now discuss the problem of ossification of the ML pipeline in more detail. For that, let’s take a hypothetical example of how ossification occurs in real ML pipelines used in business. In this example, you’re trying to solve a clear business problem with a simple ML pipeline, and suddenly you must worry about a lot of complicated technologies, each of them requiring its own specialists. The causes of ossification I show in this example are ones I’ve seen (and you may have seen too) in several projects in today’s industry.

Note

The ossification of an ML pipeline manifests as having to deal with complex technologies, AI algorithms, processes, organizational agreements, and contract details to implement the pipeline.

All the choices I present in this section are reasonable for an ML pipeline like the one in figure 5.4, and every technology chosen is widely used in the industry today. The technologies I discuss will be familiar to many data scientists and data engineers. The point here is not what those technologies are but what ossification is.[1]

Suppose you’re running a small team that wants to build an AI-enhanced home security system. The system would look at video from a home and recognize when a person at the home didn’t look like any of the family members. You’d also install an application on the mobile phones of the family members and use geolocation to determine if they were at home. To avoid false alarms when a family invited guests, the AI alarms would be enabled only when none of the family members were at home.

1

Don’t worry if you’re business focused and not familiar with some of the technologies mentioned—I used specific technologies only to make the example concrete for the technologists. What matters for the business user is that even a simple AI project would find itself quickly dependent on a lot of specialized (and expensive) technologies and skillsets.

Figure 5.4. Example of an ML pipeline that’s used for an AI security system prototype.

Initially, everything looks easy, and the POC proceeds well. There’s a video feed coming from the camera, there’s geolocation from the mobile phones, and the video stream is enabled when none of the occupants are at home. You ask the owners to take a couple of selfies with their phones so the system can recognize their faces. Finally, a good data scientist on the team quickly builds the prototype. If we examine the prototype, we see an early version of the ML pipeline, as shown in figure 5.4.

The initial ML pipeline looks simple when you’re brainstorming the project. Then real-world problems hit you. You aren’t able to find people with a strong background in storing large amounts of video data, but at least you have software architects who are experienced with big data systems and streaming solutions. Your software architects use their previous experience to suggest an Apache Spark [14] based solution on the AWS [11] cloud. The system uses the same ML pipeline that the data scientist used during the POC.

Your data engineering team doesn’t know much about video recognition, and your data science team doesn’t know much about engineering for such a large data volume and video streams from many users, so they make a compromise: the data engineering team will take care of the streaming and will invoke a neural network library provided by the data scientists. The data scientists need to train the neural network on a broad set of images. You learn that they’ll also need a training cluster and that they’ve used TensorFlow [91].

But what about the training data that you’d use to teach a neural network to recognize images? You initially use a public image dataset called ImageNet [92], but you discover that you need an additional and different dataset with lots of photos of pets so you can better track pet behavior. The good news is that you can mine images of pets from the local humane society’s adoption pages, or at least you think so before an attorney from the legal department walks into the project meeting. Two weeks later, you meet with representatives from the local humane society, and they allow you to use their data. You also have a written contract, and it’s signed.

Your new best friend from the legal department is also immensely helpful in advising you about the limitations of the privacy policy. You need to secure and anonymize pictures and video streams. You can’t store facial images of minors, even with their consent. If you accidentally save any photos of the owners (which happens at the moment they open the door if their phone is off, so the system hasn’t yet noticed that they’re back home), you must delete them. You’re about six weeks into the project. All of these decisions might be the right decisions, but you’ve already made many choices that will be difficult to change later.

Let’s look at all the decisions that have already ossified in your pipeline:

  • You’re using the same ML pipeline that your data scientist used for the small POC. Is it the best pipeline for a security system rolled out to 100,000 people?
  • Your pipeline is using Apache Spark [14], and you’ve hired an Apache Spark specialist.
  • You’ve also hired Amazon Web Services (AWS) [11] specialists, but they don’t know anything about the competing cloud services, such as Microsoft Azure [13] or Google Cloud Platform (GCP) [12].
  • Your neural network is implemented using TensorFlow [91].
  • You have a contract with the humane society to use training pictures of the pets.
  • Your data scientist spent a month understanding the API that the engineering team wants to use, because it has data engineering specifics the data scientist must account for.
  • You also have an End User License Agreement (EULA) with the user that specifies what your ML pipeline can and can’t do.
  • You needed a mobile application, so now you have iOS [93] and Android [94] developers on staff.

A couple of weeks pass, and the data engineers come to you. They’ve found additional dependencies in the system—the data scientist was training AI models on AWS using GPU instances! They didn’t notice the dependency on AWS and GPU instances before because the data scientist’s deliverable was an already trained AI library. Only when the data engineers looked closely did they find that the part of the system used for training AI models had a dependency on GPU instances. At that moment, you remember that the data scientist did mention it before during a conversation with you and the project manager, but that part didn’t seem particularly important at the time, so it fell through the cracks.

Let’s summarize the ossifications that have crept into your ML pipeline so far. You’re working with the local humane society and using AWS, Spark, TensorFlow, and a specific API library. You’re using GPU instances on AWS. You’ve added specialists in AWS, Spark, iOS, and Android to your team. Your ML pipeline structure is the same one your data scientist used in the quick POC prototype. You have three different legal contracts that you must follow in everything your software does, with a nagging suspicion that the next time you see your new best friend, the lawyer, there will be a draft of yet another contract. That’s what you know about; a few other decisions have probably escaped your initial attention, and you’ll learn about those down the road.

That’s how an ML pipeline ossifies in the real world. I placed this example in the context of a small initial effort in a big company. A startup doesn’t fare much better, and many startups might not have an attorney to consult about the specifics of the system.

Tip

Coincidentally, two definitions of software architecture you’ll often hear are that it’s the set of design decisions that must be made early in the project, and that it’s a shared understanding of the system’s design [95]. Because an ML pipeline is difficult to change, you need to think about the ML pipeline as one of the primary artifacts to emerge from your software architecture.

An ML pipeline ossifies because it isn’t embodied just in software; it also shapes department structures and even business partnerships. The best software engineering practices in the world don’t guarantee that ossification won’t happen—they merely postpone it. Poor software engineering practices, however, are an excellent way to accelerate ML pipeline ossification.

5.1.6. How to address ossification of the ML pipeline

As you’ve seen, it’s easy for the ML pipeline to ossify. That ossification is a natural result of how pipelines are constructed, and it will catch you by surprise unless the team takes specific steps to manage it. How should you manage your ML pipeline development?

Vital to managing the ossification of the ML pipeline is recognizing that it will happen no matter what. Ossification of the ML pipeline isn’t an error by the engineering team; it’s just the nature of the beast. And once ossification has occurred, the cost of changing the pipeline will be high.

Tip

The key isn’t to prevent ossification of the ML pipeline. The key is to make sure that you don’t ossify the wrong structure of the pipeline.

While ossification is unavoidable, with the proper planning and oversight, you can ensure you have the right pipeline so it won’t need to be replaced soon. Allowing the pipeline to emerge haphazardly is always an error—pipelines must be designed. Designing the pipeline means you need to make sure that you have the right ML pipeline and that the implementation is technically sound.

You also need the right software architecture

The ML pipeline is one of the most important artifacts of your AI project’s software architecture. It’s therefore a good practice to perform a dry run of your software architecture. The dry run consists of testing your proposed architecture with use cases before you write any code. The goal is to ensure that the architecture correctly covers the use cases it’s supposed to address and that you understand the tradeoffs that you’ve made in choosing that architecture.

While the larger topic of software architecture is outside the scope of this book, if you’re a software architect yourself and are interested in how you can perform a dry run, I recommend that you check out the Architecture Tradeoff Analysis Method (ATAM) described in Software Architecture in Practice [96]. (You may also want to see Wikipedia [97] for a quick summary.)

One part of ATAM consists of using a set of use cases to perform a dry run of the architecture. During that dry run, you can discover various tradeoffs based on a use case and then discuss how the architecture would serve that particular use case. You could think about such a process as a check of the whole software architecture. As a side effect, I’ve found that a dry run is useful for uncovering software engineering problems you might encounter in the data engineering parts of the ML pipeline.

While a full-blown ATAM analysis is typically more appropriate for organizations that develop a full software architecture in the initial stages of the project, the idea of an informal dry run of the architecture is applicable in the Agile environment and in the context of small companies too.

Pipeline ossification means that you should treat the pipeline like a building’s foundation. Once you start pouring concrete, there’s a limit on how long you have to intervene, and you don’t get the chance to change your mind a day later. When you’re setting up the foundation for a large building, you plan what needs to be done before the concrete arrives.

Consequently, you should plan your ML pipeline early and analyze whether it’s capable of supporting your business goals. The ML pipeline isn’t an artifact you can just let emerge to see what happens; because of ossification, that’s a costly approach. Make a concentrated effort to design the ML pipeline before you begin actual construction. Designing an ML pipeline up front doesn’t mean that you can’t use Agile methods on your project. It means that if you’re using Agile methods, then what Poppendieck and Poppendieck [98] call the “last responsible moment to make a decision” arrives early for your ML pipeline.

Frameworks are emerging around ML pipelines

On the technical-soundness side of implementing an ML pipeline, things are rapidly evolving, and a set of frameworks is already emerging to standardize the way you code an ML pipeline. I won’t take a position on which one is best, if for no other reason than that I expect this area to continue evolving rapidly. For examples of technical frameworks that are a first step in formalizing the way we build ML pipelines, see caret [99], Spark’s ML Pipelines [100], and TensorFlow Extended [101].
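As an illustration of what these frameworks provide, here’s a minimal sketch using Spark’s ML Pipelines API [100], along the lines of the classic tokenizer-plus-logistic-regression example from the Spark documentation. The toy data is made up; the point isn’t the specific stages but that the pipeline becomes an explicit, first-class object whose stages can be listed, inspected, and reconfigured.

  from pyspark.sql import SparkSession
  from pyspark.ml import Pipeline
  from pyspark.ml.feature import Tokenizer, HashingTF
  from pyspark.ml.classification import LogisticRegression

  spark = SparkSession.builder.getOrCreate()
  training = spark.createDataFrame(
      [("machine overheating again", 1.0), ("routine maintenance done", 0.0)],
      ["text", "label"])

  tokenizer = Tokenizer(inputCol="text", outputCol="words")      # stage 1
  features = HashingTF(inputCol="words", outputCol="features")   # stage 2
  classifier = LogisticRegression(maxIter=10)                    # stage 3

  # The pipeline is an explicit object: an ordered list of stages.
  model = Pipeline(stages=[tokenizer, features, classifier]).fit(training)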

You could argue that the frameworks we have now resolve only some of the technical problems associated with the use of an ML pipeline. The current frameworks also vary in their orientation toward prototyping versus development or production. However, the ML pipeline is already recognized as an essential architectural artifact, and technical frameworks centered around it will continue to evolve.

Allowing a pipeline to emerge from the POC without checking whether it’s the right pipeline for your business problem means that you don’t know if it can solve that problem. That’s a great way to become the proud owner of the wrong (and ossified) ML pipeline.

It’s possible to analyze an ML pipeline early in the project life cycle and estimate its ability to achieve your business goal. Even better, such analysis typically lets you release a first version to the market faster than you could if you didn’t think about the pipeline at all.

Tip

As a manager, you need to focus the attention of your engineering groups on the ML pipeline and on understanding the results of its analysis. That will help you know how well suited your pipeline is to supporting your business goals.

The design and analysis of an ML pipeline is the most crucial step in properly engineering the entire system. You start by documenting the pipeline you intend to use and then analyze the pipeline’s suitability for your business purpose. Chapter 6 shows you how to do that, but before you discover how to analyze a pipeline, let’s see why you need to analyze it.

5.2. Why we need to analyze the ML pipeline

The ML pipeline defines a user’s perception of the AI system as a whole. To improve the results of the ML pipeline, the most important rule to remember is that the system as a whole matters more than the sum of its parts. The system behaves differently than its parts: it’s possible to have an excellent system built from pedestrian (but well-matched) stages in the ML pipeline. Conversely, it’s also possible to have a weak system even if the implementations of the individual stages are quite good.

In this section, I’ll show you how an ML pipeline behaves when you modify a stage of it. The results of the modification depend on the exact ML pipeline you’re using, so, to be concrete, I’ll analyze a straightforward ML pipeline, as shown in figure 5.5.

Figure 5.5. A simple pipeline with just three stages, of which only the first two stages affect the end result.

In the figure 5.5 ML pipeline, you collect the data, apply the ML algorithm, and present the output to the user. This is one of the simplest ML pipelines you’re likely to encounter in practice, but even modifications of individual stages to this pipeline have educational value.

Suppose you’re not happy with the results of your ML pipeline. What do you do if you want to improve the results? You can invest in data (better data or collecting more data) or in a better AI algorithm. How do you know where to invest? As you suspect, the answer would be the dreaded “It depends.”

Note

There’s no universal rule about what works best in all cases to improve the results of the ML pipeline. In some cases, it’s better to invest in data. In other cases, it’s better to invest in a better algorithm.

The following sections present concrete situations where simple pipelines like the one in figure 5.5 would benefit from improvements in either the data or the algorithm.

5.2.1. Algorithm improvement: MNIST example

Suppose the year is 2012, and frameworks that will make deep learning simple to apply have yet to emerge. You have a pipeline like the one in figure 5.5, with an ML algorithm for vision recognition, namely, recognizing handwritten digits on an image. You also have data that’s dirty (5% of the data is wrong) because of various errors and bugs in your ingestion pipeline. What should you improve, the data or the algorithm?

To get a feel for how easy it would be to develop (and then improve) the ML algorithm in the pipeline, I’ll show you an example. Let’s see what the ML community as a whole (all academics and quite a few industry practitioners) has achieved on probably the most widely used dataset in computer vision today. That dataset is the Modified National Institute of Standards and Technology (MNIST) dataset [102], which consists of 60,000 images of handwritten digits from 0 to 9.[2] The MNIST dataset has often been used to benchmark computer vision algorithms.

2

I’ve used an MNIST dataset as an example of the relationship of data and algorithm improvement in my previous work [103 p56–59]. In this section, I expand that argument in the context of ML pipelines.

The AI community has tracked the accuracy of the various computer vision algorithms on the MNIST dataset. According to LeCun et al. [102] and Benenson [104], algorithm improvements by the community between 1998 and 2013 resulted in the accuracy of digit recognition improving by only 2.19%—the error rate declined from 2.4% [102] to 0.21% [104].

Although the 2.19% better accuracy the community achieved on MNIST is a significant improvement for computer vision algorithms, how relevant is it to us? Remember that, in our use case, 5% of our data is wrong. Moreover, the improvement in vision algorithms came at a significant cost: the algorithms that achieved it were the result of the best efforts of the entire ML community, and they were still difficult to implement in 2012. Meanwhile, the initial 1998 algorithms, the ones with roughly 2.19% worse accuracy, are something a good intern could develop in a few days.

Warning

If you’re already using the best AI algorithm from a well-established academic field, improving on it to do better than the rest of the community would be a heroic achievement, not something a team new to AI should attempt.

Achieving a 2.19% improvement in digit recognition can be valuable in a business sense; it might represent the difference between the post office being able to sort most of the mail automatically and having an employee look at numerous envelopes. However, if your input data is wrong, how wrong could your total result be? Far more than the 2.19% you just gained—for the bad records, it could be 100% inaccurate.

So, what should you do if you’re overseeing a team in 2012 that has 5% dirty data and is working on computer vision? Such a decision is easy. An average industry team would have a much easier time improving the data quality by fixing a few problems in the data processing pipeline than figuring out how to improve vision recognition for 1% of the data. In this case, investing in data cleaning would be better than trying to develop a better algorithm.

Does this mean that you should always start by cleaning your data to get the best data you can? No, not at all. Let’s see some counterexamples.

5.2.2. Further examples of improving the ML pipeline

Sometimes you’re better off cleaning the data; other times, using a better AI algorithm is the answer. Yet other times, you need to improve both the algorithm and the data. There’s no one-size-fits-all rule that you can use for every ML pipeline.

Sometimes, your ML pipeline will produce poor results even if the input data is perfect. Suppose you’re trying to understand how proficient someone is at constructing a persuasive argument in the text they’re writing. The ML algorithm used in the pipeline in figure 5.5 will classify arguments as strong or weak. Today, assessing the strength of an argument is a difficult problem. Let’s assume you built a prototype and found that your first algorithm can correctly distinguish whether an argument is strong or weak in only 66.7% of cases. The last stage of the ML pipeline would then emit an error in one-third of cases, even with perfect data! Here, it’s the algorithm, not the data, that needs the investment.

There’s also the case in which you should improve both the algorithm and the data. Such is the case if the algorithm’s error rate is 10% (on whatever technical metric is appropriate) and 5% of its input data is also wrong.

No single rule applicable to all ML pipelines tells you whether it’s better to clean the data or improve the AI algorithm. Of course, you shouldn’t be surprised if engineers advocate for investment in an area they understand. It’s often the case that data engineers are biased toward spending more effort on cleaning data, and data scientists prefer to improve algorithms.

5.2.3. You must analyze the ML pipeline!

The best strategy for improving an ML pipeline depends on the exact pipeline you’re using. You must analyze the pipeline and find what works in your case. You must look at it as a whole system. Here are the questions you need to answer (and the subjects of chapters 6 and 7):

  • Chapter 6: Am I using the right pipeline?
  • Chapter 7: What area of my existing ML pipeline should I invest in?

If you don’t focus your team on thinking about the whole system, they might end up concentrating only on its individual parts. Leading a project without a focus on the system can be costly. You should organize the systems engineering process to get the ML pipeline right the first time, and you should plan on not changing it often. It’s much more common to change the choice of AI method used in a stage of the ML pipeline. A flexible ML pipeline is one that allows for the natural replacement of individual methods after you’ve gotten the data flow right.

The pipeline should be analyzed using the systems engineering methods I describe in chapter 6. The goal of the analysis is to quickly understand whether the pipeline can support your business goals and how to direct work on the various elements of the pipeline so you can reach those goals quickly and cost-effectively.

Team dynamic matters!

Leadership requires understanding the incentives you place your team under. Suppose there’s no systematic process for deciding where to invest, and team members aren’t comfortable talking about areas they don’t know. If you ask your team where to invest, what would you expect to happen? I’d expect an answer in the form of “invest in my area”: data engineers would recommend that you invest in the data, and data scientists would recommend that you invest in better algorithms. Are people on your team comfortable discussing something with which they aren’t familiar?

People who understand the whole system are rare on AI project teams today. It’s much more common that you’d have smart people that understand parts of the system well. It’s your job as a leader to understand the social dynamics this causes.

Unless team members feel that it’s safe to ask questions about parts of the system they don’t understand, they might opt to talk only about the individual parts they know. Are you running a team in which people are comfortable being wrong in front of their peers?

5.3. What’s the role of AI methods?

If what matters for the end perception of the result is the entire system, and if the ML pipeline as a whole matters more than any individual AI algorithm, what’s the role of advanced AI algorithms? AI algorithms have an active role, but the best way to think about them is as pieces of a larger system in which every part (including the AI algorithms themselves) is likely to be improved over time.

Today, conversations between AI engineers are dominated by discussions about AI methods. It’s common to see data scientists talking for a long time about which methods and algorithms are the best for a given application. It’s not uncommon even for larger communities to have this discussion. Different schools of thought in the AI community disagree with each other about which methods are the most promising in the long run.[3]

3

References [105] and [35] provide an overview of the history of AI, including various schools of thought that have emerged in the AI community.

Currently, deep learning [7,8,106] has a huge mindshare in the AI community. However, a decade ago, the methods that became what we know today as deep learning faced much skepticism, even in academic circles; methods such as Support Vector Machines (SVMs) [107,108] were much more popular at the time. Even insiders of the AI community are often surprised by what comes next. The methods you’re using today can evolve significantly in just a few years.

It’s reasonable to assume that this trend will continue—the best method today might not be the best method tomorrow. With all the attention, resources, and investment pouring into research on new methods, the AI field is growing tremendously. Methods are subject to continuous evolution, and we’re finding newer and better ones every day. While AI methods are undeniably mathematical in nature, for managing a data science project, it’s better to think of them as engineering modules than as math.

Data science isn’t primarily an application of mathematics

If you’re a data scientist, I’d like to expand on the point that AI methods are more engineering modules than math. To acquire a deep understanding of AI methods, you need strong mathematical foundations. However, just because methods are formulated in a way that requires an understanding of the math doesn’t mean that all you need to do is use mathematical transformations to get an answer to your business problem.

In business and industry, it’s highly unlikely you’ll encounter a situation in which there’s mathematical proof that any method is the correct and definitive way to address your business problem. If for no other reason, it’s rare that a real-world industry problem would perfectly match all assumptions of the method. As a result, any practical data science work requires much experimentation. You need to try multiple methods and approaches to determine the best way to address your problem. For that, you also benefit from having a flexible and easy-to-change ML pipeline.

Contrary to the name, data science is not a science. It’s an engineering discipline. You have to experiment to find what the best method is for your problem. While some methods work better in some domains than others, you already know that there’s never a definitive method that’s always the best across all possible datasets [67].

The right way to think about AI methods is as pluggable modules applied at a particular place in the ML pipeline. Each method has characteristics (and, for that matter, a limited lifespan) that make it a good match for a part of the ML pipeline today. You should apply methods in a way that allows you to replace them with different and better methods as they become available, and you should organize the pipeline so that plugging in new methods is simple.

Figure 5.6 illustrates this changing role of methods. A pipeline’s structure is a choice that affects the whole organization and, as such, is a focus of the entire team.

Figure 5.6. The role of methods in the construction of an ML pipeline. The methods that implement a step in the pipeline often change during the lifespan of the pipeline.

The choice of which AI methods to use depends on technical considerations and on what the methods can provide. In figure 5.6, which AI method to choose for each step in the pipeline is a technical decision. For example, if a specific step in the pipeline must predict a time series value, data scientists might elect to use familiar algorithms such as autoregressive integrated moving average (ARIMA) [109] or long short-term memory (LSTM) [110] for that prediction. Whatever they choose, it’s an engineering/technology decision. The sketch below shows what such pluggability can look like in code.
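The following is a minimal sketch under the assumption of a small hand-rolled interface (not any particular library’s API): the pipeline stage depends only on the interface, so the naive placeholder method can later be swapped for an ARIMA- or LSTM-backed implementation without touching the pipeline’s structure.

  from abc import ABC, abstractmethod

  class Forecaster(ABC):
      """The only thing the pipeline stage is allowed to depend on."""
      @abstractmethod
      def fit(self, series: list[float]) -> "Forecaster": ...
      @abstractmethod
      def predict(self, steps: int) -> list[float]: ...

  class NaiveMeanForecaster(Forecaster):
      """Placeholder method; an ARIMA- or LSTM-backed class implementing
      the same interface could replace it without touching the pipeline."""
      def fit(self, series):
          self.mean = sum(series) / len(series)
          return self
      def predict(self, steps):
          return [self.mean] * steps

  def forecast_stage(history: list[float], forecaster: Forecaster) -> list[float]:
      # The pipeline stage never names a concrete method.
      return forecaster.fit(history).predict(steps=7)

  print(forecast_stage([2.0, 1.8, 2.1], NaiveMeanForecaster()))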

The best approach to the engineering of the AI system is to implement methods in a way that acknowledges these characteristics and allows for replacing any method you implement today with better methods tomorrow. The rules of software engineering (including, but not limited to, maintainability and extensibility) apply to AI systems too.

Tip

While ML pipelines are subject to ossification, AI methods are subject to rapid evolution. Your systems engineering process should be designed to acknowledge and manage this characteristic. While the pipeline calls for deep up-front analysis, you should implement AI methods in such a way that it’s easy to change the AI method you’re using in each step of the ML pipeline. Agile methods typically shine in the implementation of the elements of the pipeline.

In addition to organizing your software to make it easy to change AI methods, choose frameworks that make it easy to develop quick prototypes. The AI community continuously develops frameworks focused on ease of use. Examples of this approach are the mlr package [111] and the caret package in the R programming language [99,112], which provide easy-to-use implementations of many traditional ML algorithms. Another example is the Keras library for deep learning [7,8,113].
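To illustrate the ease-of-use point, here’s a sketch of a baseline recognizer for the MNIST digits from section 5.2.1, written with Keras. The layer sizes and the single training epoch are arbitrary choices for illustration, not recommendations.

  from tensorflow import keras

  (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
  x_train, x_test = x_train / 255.0, x_test / 255.0   # scale pixels to [0, 1]

  model = keras.Sequential([
      keras.layers.Flatten(input_shape=(28, 28)),      # 28x28 image -> vector
      keras.layers.Dense(128, activation="relu"),
      keras.layers.Dense(10, activation="softmax"),    # one output per digit
  ])
  model.compile(optimizer="adam",
                loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])
  model.fit(x_train, y_train, epochs=1)
  print(model.evaluate(x_test, y_test))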

What’s of interest to the broader team is the question, “Under the assumption that the methods behave the way the specialists describe, how would the whole pipeline be affected?” Chapters 6 and 7 show you how to analyze the behavior of the larger pipeline.

5.4. Balancing data, AI methods, and infrastructure

I cautioned you in chapter 1 not to overfocus on infrastructure. Overfocusing means that you don’t have the right balance of attention among data, methods, and infrastructure. Let’s see what the right balance looks like.

Balance is based on the order in which you answer questions about your project. This section provides the right order to think about data, AI methods, and the infrastructure you’d use. Figure 5.7 depicts how you should manage an analytics project to balance the business problem definition, AI, and data infrastructure.

Figure 5.7. The relationship of the business question, methods, data, and infrastructure. The order in which you make decisions is business first, then AI (methods and data), then infrastructure.

In figure 5.7, defining the business question (box 1) and the metrics used to evaluate it (box 2) must come first; otherwise, you’ll waste time searching for answers to the wrong question. Second, you must define the threshold value of the business metric at which you’ll have a minimum viable product (MVP) [28] (box 3). Failure to define a threshold is a sure sign that you need to further develop your business case. The threshold value of this metric gives you a milestone in defining the MVP. Note that boxes 1, 2, and 3 are the responsibility of the business team.

Once you’ve satisfied boxes 1–3, find out what data you have (box 4), what data you need and can acquire (box 5), and which AI methods you’ll use (box 6). These decisions are in the domain of the data science team.

Wherever you have a choice among multiple infrastructural components, decide on the components that support the data and AI methods you’ve chosen. You should define the infrastructure to support the intended analytical approach (box 7).

The process in figure 5.7 is iterative, with the steps in boxes 4, 5, 6, and 7 repeated. When possible, you should do the initial iterations as a POC, with a smaller dataset as a starting point.

Tip

When starting AI projects, you should first understand which business problems you’re solving, which AI methods and data you should use, and how to link business with technology. Choose infrastructure last.

Once you know you have a viable solution, you can scale it to the larger dataset. A combination of the organization’s operational practices and data sizes determines the final infrastructure to use.

One exception, when you might elect to choose infrastructure first, is when there’s a tangible link between the business case and the infrastructure solution. Another is when you’re working in a company so large that it expects to run so many AI projects that standardizing on a single infrastructure stack is worthwhile.

Note that you could consider the process shown in figure 5.7 an extension of the traditional cross-industry standard process for data mining (CRISP-DM) [43], which was defined in the mid-1990s to systematize the approach to data analytics. The CLUE process helps you properly sequence the decisions about business considerations, data, AI methods, and infrastructure.

5.5. Exercises

The following exercises will help you better understand the concepts introduced in this chapter. Here’s where teamwork comes into its own: an analysis of the ML pipeline is both a technical skill and a business skill, so it’s time to find an engineer with whom you’d work in the future and do some of these exercises together. Time spent on that relationship will come in handy on a real project.

Question 1:

Construct an ML pipeline for this AI project: the project takes feedback from your customers and analyzes it. If a customer appears unhappy, an alert is issued so that you can contact the customer and try to appease them before they decide to leave. (The part of AI that determines whether a customer is happy is technically called sentiment analysis.) You already have an AI software library that performs sentiment analysis. The data is in your customer support system, which is a web application.

Question 2:

Suppose you implement the ML pipeline from the previous example in your organization. Which departments would be responsible for the implementation of which parts of the pipeline?

Question 3:

What business metric would you use to measure the success of the ML pipeline from question 1?

Question 4:

What is the history of the coordination between departments from question 2 in past projects that they’ve participated in? Were projects on which those teams worked successful?

Note

The next two questions (5 and 6) are targeted toward data scientists. You can skip them if you don’t have data science expertise.

Question 5:

As a part of the installation of an AI security product, you’re offering a 30-day, money-back guarantee. Your customers have taken a survey about their satisfaction with the product, which they completed as soon as the product was installed. You’re interested in predicting if your customers would return the product. During discussions, the team has mentioned that this problem could be solved using either an SVM, a decision tree, logistic regression, or a deep learning-based classification. Should you use deep learning? After all, it’s an exceedingly popular technology, has a substantial mindshare, and could solve the problem. Or should you use one of the other suggested options?

Question 6:

You answered question 5 using an algorithm of your choice. Suppose the algorithm you chose didn’t provide a good enough prediction of a customer returning the product. Should you use a better ML algorithm? Is it now time to use the latest and greatest from the field of deep learning?

Summary

  • Every AI project uses some form of the ML pipeline. That pipeline starts by collecting data and finishes by presenting results. The ML pipeline is one of the primary determinants of how the user will perceive the system as a whole.
  • The ML pipeline starts to ossify the moment it’s constructed. Choosing the wrong ML pipeline, or, worse, letting it directly emerge from the POC, is a costly mistake.
  • The success of the AI project is based on the entire system, as opposed to the individual ML algorithms used.
  • The user sees the output of the whole system, not the output of the individual AI algorithm. Think about the ML pipeline as a critical architectural artifact and about AI algorithms as important (but evolving) pieces of that pipeline.
  • The order in which you make decisions on an AI project should be business first, AI algorithms and data second, and infrastructure last.