Chapter 5. Machine Learning (ML)

AI relies on large volumes of historical data and advanced ML methods to generate insights. In the last chapter, we focused on data’s importance to your organization’s success when implementing AI. If data is the fuel that powers your AI applications, then ML is its engine.

The previous two chapters discussed how to coordinate AI analytic development with DataOps and key stakeholders, such as data owners and risk managers. This section discusses AI development itself and how its tooling and processes can evolve toward a more sustainable, scalable, and coordinated capability.

When considering how ML fits within the wider AI domain, it’s important to first understand the data analytics spectrum. Figure 5-1 illustrates the differentiation between conventional statistical analysis (i.e., forensic analytics, which examines past and present phenomena) and anticipatory analytics (i.e., forward-looking analysis).

Figure 5-1. Complexity versus explainability for analyses

Each analytic approach is associated with specific challenges, with more advanced approaches providing the decision support and cognitive analysis typically associated with robust AI capabilities. A key distinction is the interplay between complexity and explainability. In most cases, forensic analysis is far less complex and more explainable than anticipatory analytics, which is rapidly becoming more complex and far more difficult to explain. That being said, industry continues to invest heavily in advancing the capabilities needed to explain these complex models.

In practice, ML often serves as the primary mechanism driving AI capabilities, leveraging data at an enterprise-wide scale to build and validate complex models. Managing complexity and explainability, as discussed in Chapters 2 and 3, is a key design consideration with ML (that guides decisions like training algorithm, parameters, etc.).

Then again, many of the most popular AI applications we are beginning to see in our daily lives—smart assistants, real-time object tracking, and self-driving cars, to name a few—are powered by ML, neural networks, and deep learning capabilities that are more difficult to explain and manage.

In this section, we introduce the concept of an ML model and distinct ML frameworks to address specific data structures and analytic objectives. This section isn’t intended to be a functional “how-to” for ML, but a solid working knowledge will help you better lead AI adoption within your organization.

What Is an ML Model?

Business analytics, often referred to simply as “analytics,” are business rules encoded into software by an analyst or developer to convert data into useful insights that inform decision making. By comparison, ML models use learning algorithms to generate business rules from prepared training data.

For supervised and reinforcement learning techniques, these learned rules are encoded in a special file known as a model. Because they are designed to be written and interpreted by machines, these models are much harder to read (explain) than business analytics.

With ML analytics, unlike conventional analytics, the act of feeding user data into the analytic to produce insights is known as inference. Updating ML analytics is accomplished with some combination of the original training data, new/updated data, and the trained model.

Traditionally, organizations treat ML models as black boxes where only their inputs and outputs require explanation. For all the reasons described in Chapter 3, mature organizations are abandoning this black box approach in favor of explainable outcomes.1 From a leadership standpoint, this shift toward explainability will continue to change how results are documented in the short term and overall design processes in the long term.

ML models are also unique in that similar models and data can produce radically different results. Identical source code is compiled into identical applications, but identical training data produces a variety of models, depending on training parameters. Since it is difficult to predict the best configuration in advance, it is often best practice to train with many different configurations and then compare performance for each of the results during training.

This is especially true for more sophisticated ML approaches, such as neural networks, and accounts for part of these techniques’ reputation for large computational and data storage overhead. Additionally, even the smallest changes to training data can produce meaningful changes to the end result.

These trends result in the potential for hundreds or even thousands of individual ML model versions. Only a small number will be operational, but many (or all, depending on your goals) are worth preserving to provide additional insight into the training process. You should choose how much data and the number of models you’d like to retain based on how explainable the analysis must be and how tightly you can control the operational and training data inputs.
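
To make this concrete, here is a minimal sketch, assuming scikit-learn and an illustrative synthetic dataset, of training the same algorithm under several configurations and retaining the score of every candidate model for comparison:

# Train one algorithm under several configurations and compare the candidates;
# the dataset and parameter grid are purely illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [4, 8, None],
}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

# Every configuration produces a distinct model; keep the scores for comparison.
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params, round(score, 3))
print("Best configuration:", search.best_params_)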

As with any other form of software, it is important for ML models to be documented, tested, and version-controlled to help the team maintain knowledge of their structure and functionality. Unlike conventional analytics, it’s not possible to break up, document, or independently test a model’s internal structure. Instead, we must rely on surrounding documentation, testing, and metadata. While not entirely unique to ML analytics, metadata plays a much larger role in describing a model’s behavior than it does with normal (or classic) software. MLOps (discussed in “Machine Learning Operations (MLOps)”) can help with this documentation, testing, and version control.

DataOps provides the data pipelines to develop and train ML models. MLOps then provides an integrated, automated, and documented CI/CD process for incorporating these models into larger software applications built using DevSecOps. This approach ensures that the resulting applications that use integrated models are secure and that data scientists or software engineers can rapidly update the software and models as required. These combined frameworks allow data science practitioners to continuously refine models and their overall approach to analytics.

ML Methodologies

Much like selecting the right tool for a given job, it’s important to understand the algorithms supporting our analytic efforts. Particularly in the case of ML, we use these underlying processes to characterize and define each unique ML approach. By understanding these techniques, we can address any assumptions and identify appropriate performance evaluation metrics. Understanding these techniques also enables us to consider a model’s ability to generalize (i.e., ability to apply each model to sets of unfamiliar data) when putting models into a production environment, and to better monitor and improve our overall design choices.

Supervised learning is perhaps the most often discussed example of ML. With this technique, algorithms are used to analyze numeric or labeled data (e.g., input-output pairs, tagged images, or other closely associated information). Common applications include regression models used for predicting continuous numeric values (e.g., stock prices) and classification models used for predicting categorical class labels (e.g., whether a file is malicious) from operational data features.

Supervised ML breaks data into inputs (called “features”) and outputs (known as “labels” for classification and “targets” for regression) and attempts to predict the latter from the former. Supervised learning’s key benefit is that it enables us to focus the analytic model on a specific task for high accuracy, precision, and recall.2 This makes supervised learning the most practical technique for most operational use cases.

Supervised learning’s operational success has resulted in additional research and development into explainability and sustainability, currently far in advance of other ML techniques. The key distinction between supervised learning and other ML approaches centers on the optimization of the model through the process of training its weights (i.e., learned parameters), as depicted in Figure 5-2.

Figure 5-2. Supervised ML process

With supervised learning, labeled input data is fed into the algorithm to train the model, where it refines the algorithm’s “weighting” by testing against specified criteria. The process used to refine a model is rooted in statistical cross-validation, where data is divided into two random subsets—one to train the model, the other to test its performance. Although the specifics get incredibly technical very quickly, the bottom line is this: once sufficiently trained, we can refine and improve deployed models by reengineering training data features and retraining the model as necessary while monitoring its behavior throughout the ML life cycle.
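
To make this concrete, a minimal supervised learning sketch in Python (assuming scikit-learn and one of its bundled example datasets) splits the labeled data into training and test subsets, fits a classifier on the former, and checks its performance against the latter:

# Split labeled data, fit a classifier, and evaluate on the held-out subset.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)               # features and labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)                  # hold out data for testing

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)                               # learn weights from training data

predictions = model.predict(X_test)                       # inference on unseen data
print("Held-out accuracy:", accuracy_score(y_test, predictions))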

A second ML technique, unsupervised learning, trains a model directly from patterns present in data rather than the input-output approach found in other types of ML. The key advantage of unsupervised algorithms is their capacity for finding patterns not known to the developer beforehand, presenting new avenues of exploration. This approach’s downside is that these algorithms generally find a high volume of patterns in production data, with no focused way of knowing which patterns are valuable to a particular use case apart from abstract data-fit metadata. Additionally, unsupervised learning approaches are inherently less explainable than supervised models, as they usually leverage training data lacking a specified classification or input-output relationship between features. This high-level process is depicted in Figure 5-3.

Figure 5-3. Unsupervised ML process

While unsupervised learning techniques are certainly used as stand-alone approaches to modeling (e.g., anomaly detection and data clustering), we regularly leverage these techniques in tandem with supervised processes to simplify the problem space or gain additional insight into our data through exploratory analysis.
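
For illustration, a minimal unsupervised sketch (assuming scikit-learn and synthetic data) clusters unlabeled records and falls back on an abstract data-fit metric, rather than labels, to judge the result:

# Cluster unlabeled data and inspect a data-fit metric; the data is synthetic.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=7)   # labels are discarded

kmeans = KMeans(n_clusters=4, n_init=10, random_state=7)
cluster_ids = kmeans.fit_predict(X)                 # patterns found without labels

# With no labels, we rely on data-fit metrics and human review to judge value.
print("Silhouette score:", round(silhouette_score(X, cluster_ids), 3))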

A third technique is reinforcement learning, which provides a novel approach to training models within a specified environment. While there are similarities with supervised learning (e.g., optimization around a relevant penalty or reward metric), reinforcement learning takes a distinct approach to model refinement. Approaches such as Q-learning allow so-called “agents” (i.e., software bots) to interact directly with their environment (physically, virtually, or both) to refine their behavior with the goal of iterating toward a specified objective. The underlying algorithms, including support vector machines (SVM) and neural networks (defined later in this section), generate a model to dictate the agent’s behavior and continue to update based on increased experience within the environment.

Additionally, this technique is considered to be more explainable than a purely unsupervised approach. During a reinforcement learning process, we incrementally feed training data into a model, evaluate each result with a reward function, and then feed the reward back to the model, where it is combined with subsequent training data to continually refine the model’s parameters.

As shown in Figure 5-4, the reinforcement learning process begins with users establishing goals, which become the primary objectives for the ML algorithm. When we initialize this process, the agent is completely ignorant of the rules of the experiment, necessitating multiple training runs before it can improve its performance. As the agent’s model continues to update based on reward feedback from the environment, it continually refines its performance, learning from previous successful approaches while deemphasizing less useful tactics. This approach initially gained prominence through its application to games like chess, Go, and StarCraft, where, after significant training, the models were able to outperform expert human players.

Figure 5-4. Reinforcement ML process

The primary issue with reinforcement learning is that not all use cases allow for good reward functions, and reward functions may be more difficult to explain than training data for a given use case. Despite its more obscure processing, reinforcement learning provides spectacular results when employed in use cases that provide appropriate and measurable reward functions, such as with robotics and video games. Another benefit of reinforcement learning is that it can produce structured training data to support model development for applications where training data does not currently exist.
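
As a simple illustration, the following sketch implements tabular Q-learning on a hypothetical one-dimensional "corridor" environment; the environment, reward, and hyperparameters are invented for the example:

# Tabular Q-learning on a toy corridor: the agent starts ignorant, collects
# rewards, and incrementally refines its action-value estimates.
import numpy as np

n_states, goal = 6, 5            # states 0..5; reaching state 5 yields the reward
actions = [-1, +1]               # move left or right
alpha, gamma, epsilon = 0.1, 0.9, 0.2
Q = np.zeros((n_states, len(actions)))
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    while state != goal:
        # Explore occasionally, otherwise exploit the current value estimates.
        a = rng.integers(len(actions)) if rng.random() < epsilon else int(Q[state].argmax())
        next_state = min(max(state + actions[a], 0), n_states - 1)
        reward = 1.0 if next_state == goal else 0.0
        # Q-learning update: move the estimate toward reward + discounted future value.
        Q[state, a] += alpha * (reward + gamma * Q[next_state].max() - Q[state, a])
        state = next_state

print("Learned policy (0 = left, 1 = right):", Q.argmax(axis=1))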

Neural networks are an ML implementation that uses layered network architectures (i.e., neurons and weights) to perform a specific function, passing inputs through the network’s layers to calculate an output of interest. While neural networks also leverage statistical methods to evaluate a given input, they are unique in that their inner processes are largely nonlinear abstractions of conventional calculations. A neural network contains a collection of parameters, known as weights and biases, trained to optimize its outputs based on given criteria. While neural networks continue to grow in complexity and layers of abstraction, in every instance the network’s architecture determines its function.

On their own, neural networks do not specify the method used to train them. They are capable of following the supervised, unsupervised, and reinforcement learning paradigms interchangeably.

For example, the convolutional neural network in Figure 5-5 can ingest machine-readable images, training the algorithm to assess latent patterns and relationships among encoded pixels, and then classify the image with a given degree of certainty.

Figure 5-5. Neural network example
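
For example, a minimal convolutional network sketch (written in PyTorch purely for illustration; the layer sizes are arbitrary) shows how the arrangement of layers defines what the model computes:

# A small convolutional neural network: the architecture determines the function.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # learnable filters
            nn.ReLU(),
            nn.MaxPool2d(2),                              # downsample 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(start_dim=1))    # class scores

# Inputs pass through the layers to produce an output of interest (class scores).
scores = SmallCNN()(torch.randn(1, 1, 28, 28))
print(scores.shape)   # torch.Size([1, 10])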

Developing a neural network’s structure is a difficult and expensive design process. To minimize this expense in time and effort, organizations often tailor existing neural networks for a specific application by starting with the structural information and regenerating weighting information from different training data. This process, called transfer learning, is the most approachable way for many organizations to use pretrained neural networks and leverage the extensive training previously executed for similar applications.
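
A minimal transfer learning sketch, assuming PyTorch and a recent torchvision (0.13 or later for the weights argument, with the pretrained weights downloaded on first use), reuses a pretrained ResNet and regenerates only the final layer's weights for a hypothetical five-class task:

# Reuse a pretrained network's structure and weights; retrain only a new head.
import torch.nn as nn
from torchvision import models

num_classes = 5                                    # illustrative; your own categories
model = models.resnet18(weights="IMAGENET1K_V1")   # pretrained structure + weights

for param in model.parameters():                   # freeze the pretrained layers
    param.requires_grad = False

# Replace the final layer; only these weights are regenerated from new data.
model.fc = nn.Linear(model.fc.in_features, num_classes)
# The model can now be trained as usual on a much smaller task-specific dataset.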

As previously mentioned, models are updated in one of several ways, depending on the learning technique. In some use cases, it is not desirable to reload all of the training data initially used to create a model every time that model needs an update. In this situation, an online learning technique can be used to incrementally add new training data to the model without having to retrain from scratch.
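
A minimal online learning sketch (assuming scikit-learn, whose SGDClassifier supports incremental updates via partial_fit) might look like the following; the arriving data batches are synthetic:

# Incrementally update a model with new batches instead of retraining from scratch.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])                          # must be declared on the first call
rng = np.random.default_rng(0)

for batch in range(10):                             # e.g., data arriving over time
    X_new = rng.normal(size=(100, 4))
    y_new = (X_new[:, 0] + X_new[:, 1] > 0).astype(int)
    model.partial_fit(X_new, y_new, classes=classes)  # update without a full retrain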

An even more extreme version of this is distributed/federated learning. Recent developments in so-called “edge” technologies, such as widely used mobile phones and other IoT devices, enable ML development using distributed—or remote—architectures.

Distributed learning with edge devices occurs in parallel: training happens in multiple online locations at the same time, and the results are then merged to combine their overall knowledge. Distributed learning’s key advantage is that models can perform inference and training on low-powered edge devices while still combining significant amounts of information globally into a single model. The disadvantage of edge learning is that the learning process becomes more dynamic, complex, and difficult to curate: combining distributed updates into a valid global model can create scenarios that are extremely difficult to predict, detect, and correct in real time.

Examining the architecture outlined in Figure 5-6, distributed learning’s first step involves pushing a pretrained “pseudo model” to edge devices from a central hub. Once on the edge device, this initial model is refined by available data at the edge, performing inference and training updates on the local device. After these edge-based models are sufficiently trained and updated, we aggregate them at a central hub, versioning the models for eventual redeployment to edge devices. The central hub orchestrates these iterative improvements by using an MLOps approach to monitor model performance.

Figure 5-6. Distributed learning example
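
The aggregation step at the central hub can be sketched very simply. The following illustration (plain NumPy, with invented weights and sample counts) averages edge model parameters in proportion to how much data each device saw, in the spirit of federated averaging:

# Hub-side aggregation: average model weights returned by edge devices.
import numpy as np

def federated_average(edge_weights, edge_sample_counts):
    """Weight each edge model's parameters by how much data it trained on."""
    total = sum(edge_sample_counts)
    return sum(w * (n / total) for w, n in zip(edge_weights, edge_sample_counts))

# Illustrative values: three edge devices, each returning a parameter vector.
edge_weights = [np.array([0.9, 1.1]), np.array([1.0, 1.0]), np.array([1.2, 0.8])]
edge_sample_counts = [200, 500, 300]

global_weights = federated_average(edge_weights, edge_sample_counts)
print(global_weights)     # the new global model to version and push back to the edge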

ML Advanced Topics

Having discussed conventional ML modeling approaches and their considerations, we should pause a moment to examine emerging frontiers. These specialized applications leverage complex algorithms that combine and further abstract ML approaches. Each of the following approaches enables more mature AI capability. However, as with all techniques, we recommend you choose cautiously when considering the wider implications of interpretability and explainability. Due to the complex architecture within these techniques, we are generally unable to “open the black box” of these models, often forcing us to verify their reliability in development and production environments.3

One of the most impactful developments in ML over the last several decades is deep learning, a collection of methods leveraging multilayered “deep” neural networks to perform highly specialized tasks. Use cases include image classification, machine translation, and speech recognition. In practice, deep learning takes advantage of multiple ML methodologies (e.g., applying unsupervised approaches to transform data as the input for neural networks) to engineer features and improve model performance. These complex models enable AI by leveraging large quantities of data to train and refine the parameters of deep neural networks, whose specific architecture enables performance that can exceed human capabilities. As with any ML approach, the underlying structure of the algorithm dictates the functional benefit of its specialized task, underscoring the need to consider responsible implementation, resource requirements, and interpretability of complex deep learning models.

Natural language processing is a broad field containing both manual approaches and AI applications to analyze natural language data. Enabling machines to parse textual data is a critical component of natural language processing, without which we could not ingest data necessary to train the model. Traditionally, challenges are rooted in the subjectivity of human language and difficulty achieving semantic generalization, particularly when working with jargon or translating between structurally dissimilar languages.4

Fortunately, recent research5 provides novel approaches to encapsulating language, necessitating more robust benchmarking metrics as performance improves. Current natural language processing applications span document classification, sentiment analysis, topic modeling, neural machine translation, speech recognition, and more. As this field continues to mature, you can anticipate increasingly novel applications to promote automated language understanding.
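
As one small illustration of document classification, a minimal sketch (assuming scikit-learn and a tiny invented corpus) converts text into TF-IDF features and trains a classifier on top of them:

# Convert text to numeric features (TF-IDF) and classify; the corpus is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

documents = [
    "refund my order immediately",        # complaint
    "great service, very happy",          # praise
    "the package arrived broken",         # complaint
    "thank you for the quick delivery",   # praise
]
labels = ["complaint", "praise", "complaint", "praise"]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(documents, labels)

print(classifier.predict(["my item was damaged in shipping"]))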

ML Life Cycle

Operationalizing ML consists of four primary phases: business analysis, model development, model vetting, and model operation (e.g., monitoring your model in production). Executing these components with common approaches, practices, and understanding allows an organization to scale ML applications while reducing the time to market for the business value of AI systems. To achieve this common understanding, it helps to decompose each component into its respective activities and discuss the methodologies, approaches, and opportunities presented by each, as illustrated in Figure 5-7.

Figure 5-7. ML life cycle

Business Analysis

Business units are increasingly investing in or identifying ML techniques as areas for potential growth. These business units are looking to leverage data and discover patterns to drive efficiency and enable better-informed decision making. A successful machine learning project analyzes business objectives to reveal significant factors influencing outcomes for machine learning deliverables. This step is critical to ensure your machine learning projects produce the right answers, ones that directly address the business problem you originally set out to solve.

You should collaborate closely with subject matter experts and end users to better understand how they perform their activities in the current environment. Interviews and consultation performed at this stage will help you to better understand the business problem that needs to be addressed, the goal to be met, any applicable constraints or dependencies, and the desired outcome.

A project plan should then be developed that defines how machine learning will address business goals. It’s important for all stakeholders to agree on the criteria that specify how results should be measured and what constitutes success from a business perspective. Assumptions should be documented to help guide the subsequent model development process during the next phase. All steps should be defined, including project details about scope, duration, required resources, inputs, outputs, and dependencies.

Model Development

Developing ML for business use requires incremental steps (outlined in Figure 5-8). Because this development process involves scoping the problem, conducting experiments, reviewing results, and refining the approach, it’s difficult to depict in a straightforward fashion. While each of the five steps may occur at various times, their contribution to the overall process remains consistent.

Figure 5-8. Model development steps

Once the project objectives are set and the business problem is understood, an exploratory analysis is necessary to identify and collect all relevant data and assess their ability to meet the intended business purpose. This process is designed to create a comprehensive view of the available data and use analytic techniques to explore important relationships in the data. Exploratory analysis also serves to identify deficiencies in the data that may prevent the model from meeting the intended business purpose. Data exploration analyzes the data we have available and also informs the type of data needed to develop a useful model.

The data preparation stage transforms data to be used with ML algorithms. You’ll want to check for issues such as missing information, duplicate/invalid data, or imbalanced data.

This stage also includes techniques to help prepare data, including:

• Removing duplicate/invalid records from the dataset
• Addressing class imbalance (when the total number of one class of data [positive] is far less than the total number of another class of data [negative])
• Enrichment using additional data sources
• Selecting data features and executing feature engineering (to include combining multiple features for additional training benefits)

This process is designed to help determine the required input variables when developing a predictive model. Your subject matter experts are invaluable during this phase of model development.
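
A minimal data preparation sketch (using pandas, with an invented toy dataset) might cover several of these steps:

# Deduplicate, impute missing values, check class balance, and engineer a feature.
import pandas as pd

df = pd.DataFrame({                      # illustrative raw data
    "height_cm": [170, 170, None, 182, 165],
    "weight_kg": [70, 70, 80, 95, None],
    "label":     [0, 0, 1, 1, 0],
})

df = df.drop_duplicates()                            # remove duplicate records
df = df.fillna(df.median(numeric_only=True))         # impute missing values

print(df["label"].value_counts(normalize=True))      # check for class imbalance

# Feature engineering: combine existing features into a potentially useful one.
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2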

Model training begins once data is prepared. In this step, training and validation data is used to compute the model parameters that optimize performance. Prior to initializing the model training process, it is necessary to define the model to be trained and its corresponding configurations. A general process for model training can typically be used across different algorithms.

For most types of ML, the general training procedure begins by randomly initializing the model and its associated parameters. This model is then trained using data to fine-tune it, generating and comparing predictions for each entry in the training dataset. Using these predictions, the model training computes error and other performance metrics between model prediction and ground truth for each entry (e.g., case, record, image). The computed error, often referred to as “loss,” is then used to adjust model parameters to optimize for desired metrics. The training script will continue to loop through the aforementioned steps until it achieves the desired performance or until the maximum number of training steps has been reached.
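
A minimal sketch of that generic loop, using plain NumPy and a synthetic linear regression problem for illustration, might look like this:

# Random initialization, prediction, loss computation, and parameter updates.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # training features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)   # ground truth targets

w = rng.normal(size=3)                             # randomly initialize the model
learning_rate, max_steps, target_loss = 0.1, 1_000, 0.02

for step in range(max_steps):
    predictions = X @ w                            # predict for each entry
    errors = predictions - y
    loss = np.mean(errors ** 2)                    # compute the "loss"
    if loss < target_loss:                         # stop at the desired performance...
        break
    gradient = 2 * X.T @ errors / len(y)
    w -= learning_rate * gradient                  # ...or keep adjusting parameters

print(f"steps={step}, loss={loss:.4f}, learned weights={np.round(w, 2)}")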

The next step, model versioning, is crucial to ensure a workflow is explainable and can be replicated. In this step, models are versioned based on the data used for training, selected specifications (e.g., learning rate, batch size, number of epochs),6 and the resulting output. An identifier should be assigned to each model to provide context. Versioning enables your team to track model lineage and pedigree, which will improve results as your project progresses.

Here are some items we recommend tracking during your development efforts: software used to train and run the model; development environment; model architecture; training data used; model parameters; and the artifacts collected during training, such as reports and log files. This process should be standardized and conducted for each model training event with parameters stored in a common repository.
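
One minimal, library-free way to sketch this is to write a metadata record for each training event to a common repository; the file names and fields below are illustrative, and dedicated experiment-tracking tools provide richer versions of the same idea:

# Record what produced each model (data, configuration, environment, metrics)
# under a unique identifier appended to a shared registry file.
import hashlib, json, platform, time

def register_model_version(training_data_path, config, metrics, registry_file):
    fingerprint = training_data_path + json.dumps(config, sort_keys=True) + str(time.time())
    record = {
        "model_id": hashlib.sha256(fingerprint.encode()).hexdigest()[:12],  # identifier
        "training_data": training_data_path,
        "config": config,                            # learning rate, batch size, epochs...
        "metrics": metrics,                          # results observed during training
        "environment": platform.python_version(),    # software used to train the model
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    with open(registry_file, "a") as f:              # append to the common repository
        f.write(json.dumps(record) + "\n")
    return record["model_id"]

model_id = register_model_version(
    "data/train_v3.csv",                             # illustrative paths and values
    {"learning_rate": 0.01, "batch_size": 32, "epochs": 10},
    {"validation_accuracy": 0.91},
    "model_registry.jsonl")
print("Registered model", model_id)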

The final step in development, model validation, is vetting a model’s training by using a validation dataset to determine if any additional performance adjustments are necessary. It’s important that this validation dataset not include any data used in the model training stage to ensure validation metrics are fully independent from the training procedure.

In the validation stage, the model you’ve created is used to make predictions for every entry in the test dataset. Prediction error and other model performance metrics are calculated to ensure observed model performance meets user-defined requirements before deploying the model.
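
A minimal validation sketch, assuming scikit-learn and placeholder predictions standing in for a real held-out dataset, simply scores the model with standard classification metrics:

# Score predictions on data never seen during training; values are placeholders.
from sklearn.metrics import classification_report

y_validation = [1, 0, 1, 1, 0, 0, 1, 0]   # ground truth from the held-out set
predictions  = [1, 0, 1, 0, 0, 0, 1, 1]   # model output for the same entries

print(classification_report(y_validation, predictions))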

Model Vetting

The next step in the ML life cycle is to vet your model, as shown in Figure 5-9. Vetting a trained ML model is a crucial, yet underappreciated, step in the path to operationalizing machine learning. Rigorous model vetting enables the local data science team to discover costly bugs and correct them while still in a low-risk environment. Once vetting is executed locally, it is then replicated in the production environment. Deploying properly working models will build organizational trust, not only in the deployed system but in AI projects as a whole.

Figure 5-9. Model vetting steps

Trained models that meet initial performance requirements are deployed for testing to a production-like environment for a rigorous test and evaluation stage. This step involves conducting computational and performance testing to ensure the model meets requirements. For the sake of consistency, deploying a model to the test environment should parallel the steps taken to deploy a model to an actual production environment. Your development team should work closely with subject matter experts and functional users to ensure the test environment mirrors actual production.

Similar to model development stages, it is important to verify results. This step ensures model performance is in line with expectations and has not suffered from performance degradation. It’s important to note that, based on our experience, most preproduction model issues are discovered during this step. As noted previously, it’s crucial that the data sent to the model during this stage be representative of the data the model will analyze while in a production environment.

Part of the results verification process is exploring how the model handles malformed or otherwise improper data. The model should be able to handle improper data and ensure that important data is not lost, an appropriate error notification is sent to the software development team, and the model remains functional after encountering the error.
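
A minimal sketch of this kind of guard (the schema, file names, and checks are invented for illustration) validates each record before prediction, quarantines anything malformed, and notifies the team without crashing:

# Validate input, preserve bad records, notify the team, and stay functional.
import json, logging

logging.basicConfig(level=logging.WARNING)
REQUIRED_FEATURES = ["age", "income"]               # illustrative schema

def safe_predict(model, record, quarantine_file="bad_records.jsonl"):
    missing = [f for f in REQUIRED_FEATURES if f not in record]
    bad_types = [f for f in REQUIRED_FEATURES
                 if f in record and not isinstance(record[f], (int, float))]
    if missing or bad_types:
        with open(quarantine_file, "a") as f:        # important data is not lost
            f.write(json.dumps(record) + "\n")
        logging.warning("Malformed input (missing=%s, bad_types=%s)",
                        missing, bad_types)          # notify the development team
        return None                                  # the model remains functional
    return model.predict([[record[f] for f in REQUIRED_FEATURES]])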

Result tracking is critical to ensure your model is effectively reviewed and audited. The ability to store and review historical model output allows subject matter experts and functional users to conduct periodic reviews, ensuring the model is making acceptable and appropriate predictions.

The final step, staging for deployment, prepares the model to be moved to the production environment. Once the model is deployed to the production environment, the model is considered “live” and will be available to all authorized end users and systems. Data scientists, software engineers, and all appropriate experts should be consulted to ensure the model is deployed in a manner that matches expectations in the production environment.

Model Operation

The ultimate goal of operationalizing the ML workflow is the successful deployment and smooth operation of a well-functioning model. The ideal model provides end users with predictions when requested in a timely manner, performs automatic monitoring of model performance and usage, and notifies people of relevant model activity through reporting and alert mechanisms. In short, a successful deployment means that your ML application performs as expected and alerts you when your attention is needed. Model operations (shown in Figure 5-10) will allow you to meet these needs.

Figure 5-10. Model operation steps

The first step in operationalizing a model is deploying the model to the production environment and verifying that the model has been successfully transferred and no issues exist with running the model in the intended environment. The deployment verification is typically done by closely monitoring the model immediately after deployment to ensure it’s receiving, analyzing, and returning appropriate data. This step also involves verifying input data formatting is correct and analyzing output for correctness to ensure that no model performance degradation has occurred. These tests are best performed on the production hosting environment by the software engineering team immediately after deployment.

The operation step involves executing the model when required. During this step, the software engineering team must ensure all aspects of the deployed model are working as intended. Any system issues should be fixed in order of importance to ensure the model continues to operate as intended. This process typically requires regularly scheduled automated tests of the system, validating model responsiveness, ensuring models are retrained when necessary, and verifying that models are resilient against ingesting malformed or improper data. It is supported by system alerts that request human inspection when necessary and a support request system that allows end users to submit help tickets.

Smooth operation of a deployed model requires robust monitoring capabilities. A deployed monitoring system should be able to track model input and output, capture any warning or error message produced by the software environment, compute anomalous model activity (e.g., drift detection analysis, adversarial AI attacks detection), and forward all relevant information to the system’s reporting and alerting interfaces.
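
As one illustration, a minimal drift check (assuming SciPy, with synthetic reference and live samples) compares a feature's production distribution against its training distribution using a two-sample Kolmogorov-Smirnov test:

# Compare a live feature distribution against its training distribution.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)   # reference data
live_feature = rng.normal(loc=0.4, scale=1.0, size=1_000)       # shifted in production

statistic, p_value = ks_2samp(training_feature, live_feature)
if p_value < 0.01:                       # the threshold is a policy choice
    print(f"Possible data drift detected (KS={statistic:.3f}, p={p_value:.1e})")
    # ...forward this finding to the reporting and alerting interfaces.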

Effective monitoring of a deployed system requires several component systems to operate together seamlessly. This is best done by automating as much of the system as possible. Given the speed and scale at which many AI systems operate, it is insufficient to rely solely on human monitoring and analysis to capture all necessary details in a timely fashion. As such, automated monitoring tasks should be defined for each desired monitoring activity and set to run as needed. The results of these automated tasks are reported and, when deemed necessary, sent as alerts.

Reporting allows for human inspection of the deployed model. The speed and scale at which many AI systems operate may make it impractical for a human to review model performance in real time. However, automated reporting at regular time intervals will enable users to verify the system is working as intended. It is also possible for these reports to be used by other automated systems to monitor the deployed model. These reports are examined using visually engaging formats such as a dashboard, spreadsheet, or document to provide user insights.

Alerting modules are responsible for creating real-time notifications about the deployed system’s status. These notifications are meant to either signal people for immediate system review or to create micro-reports to be included in automated system logs. Events that require alert notifications are initially flagged by the model monitoring module. These flagged events are received by the alerting module, where the alerts are drafted based on prewritten notification templates. These notifications are then sent out via a messaging service to the recipient. Based on requirements, notifications can be sent as text messages, emails, Slack messages, app push notifications, or automated phone calls.

Machine Learning Operations (MLOps)

Early in the history of software development, innovations in tooling, automation, and process optimization were focused almost exclusively on efficient delivery of new functionality. DevOps, and later DevSecOps, established frameworks that introduced similar innovations to balance functionality against priorities like automation, reliability, distributed development, and security.

As reviewed in this ML section, there are many steps to the design, training, testing, deployment, and monitoring of ML models. In many cases, these steps are executed manually. However, this greatly reduces the ability to scale the model development process. MLOps7 provides a potential solution to this problem through integration and automation to allow for increased efficiencies in the development and training of ML models.

MLOps is similar to DevSecOps, with some unique differences (e.g., use of production data for model training, rapid iterations of models, accounting for bias, need for continuous monitoring) that you must consider. Organizations today are primarily prioritizing the development of AI models and not the process used to develop those models. Through the integration of these manual processes using the rapidly expanding open source and commercial solutions, MLOps will allow an organization to scale their AI development pipelines to meet future demand and operate at the enterprise level.

Scalable Training Infrastructure

AI model training currently requires larger training datasets (e.g., streaming, IoT), an increasing number of modeling approaches (e.g., supervised, unsupervised, reinforcement learning), and more complex models (e.g., deep learning) composed of larger numbers of training parameters. With this growth, we see an increasing need for more robust and scalable training infrastructures. This infrastructure requires additional automation to evaluate the range of required models while collecting training results for comparisons across model versions and approaches.

MLOps combines more traditional DevSecOps techniques like continuous build systems, distributed processing, and “As a Service” (AaS) infrastructure with ML-specific considerations like model bias, model training, hardware acceleration (e.g., distributed processing, GPUs), model optimization infrastructure (see “Model Optimization Infrastructure”), and model monitoring. Through this process, models are developed more quickly and follow standard processes with detailed documentation of the results.

Model Optimization Infrastructure

In the early days of ML, models were typically trained individually and infrequently (weekly, monthly) due to hardware constraints and immature optimization techniques.

Today, we have the operational requirements and the computational resources to train models more frequently (daily, hourly, continuously) and in large batches (10s to 100s) using scalable training infrastructures, modern optimization techniques, and scalable computing infrastructure. In addition, teams are increasingly deploying AI models to a range of edge-hosting environments. If models are to operate effectively in these edge environments, you must optimize them to operate on those lower capacity hardware configurations.
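
As a small illustration of such optimization, a sketch using PyTorch's dynamic quantization (the stand-in model below is arbitrary) converts linear-layer weights to 8-bit integers so the model is smaller and faster on CPU-bound edge hardware:

# Optimize a trained model for lower-capacity hardware via dynamic quantization.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))  # stand-in model

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)          # 8-bit weights for linear layers

print(quantized)  # the quantized copy is what gets deployed to the edge target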

MLOps can help developers by integrating with scalable training infrastructure and coordinating the accumulation, storage, and comparison of trained models. Additionally, centralized storage and analysis of models can provide valuable insights into model development over time, providing an approach to evaluate model and data drift. This standard process also supports optimization of trained models based on the target hosting environment to ensure models can effectively operate in production. In the longer term, these model optimization repositories may also play an important role in explainability considerations, given the metadata collected as part of this process.

Model Deployment Infrastructure

For many AI teams, deploying a completed model to operational infrastructure for use by its intended end user can represent a significant technical and process challenge. This is because AI teams often lack the in-depth software development, system administration, and network engineering knowledge to deploy such systems.

AI development teams can benefit from defining standardized, automated deployment workflows for their ML models using standard automated deployment applications (e.g., Jenkins, Ansible). These pipelines can be developed with direct contributions from subject matter experts in software development and system administration as part of the integrated AI development team. This approach ensures that deployments meet organizational requirements, minimizes the burden placed on the AI development team, and decreases the time needed to execute initial and retraining deployments.

O’Reilly has additional resources you can explore to learn more about MLOps, including Introducing MLOps and Practical MLOps.

Operationalizing ML capabilities within your organization requires practitioners with very specific skill sets. Table 5-1 lists some of these roles alongside elements to keep in mind as you staff your AI team.

Table 5-1. Key technical ML roles
Key player | Elements to keep in mind as you staff your AI team
Data scientist/AI developer
  • Data scientists tend to be default actors during AI development. This is less of a concern during the pilot stage but can become problematic when a system enters operations. One reason to create and staff specialized positions like data engineers and ethical/policy advisers is to allow data scientists to remain focused on development.

  • Most data scientists are involved in both manual and AI-based analytics, although some data scientists prefer to specialize in AI. Specific subdisciplines like neural networks often call for specialized experience.

Subject matter experts
  • Due to the rarity and novelty of AI development skills, most organizations are forced to look externally for talent. This leads to a situation where employees understand AI development but lack familiarity with the organization and its data, processes, and challenges.

  • While a quality data scientist can eventually reverse engineer this information through observation, documentation, code, and data, it is far more cost-effective to involve a subject matter expert.

  • While subject matter experts deliver much of their value in informing the functionality of the AI system, they are also a source of valuable information and feedback on operational considerations such as security and policy.

End users
  • Every AI development system exists to address a use case. Typically, that boils down to a critical decision in an important business process. The person making that decision is the end user.

  • All software systems should pursue feedback from their end users to inform future functionality. AI adds a layer to that relationship by making it possible for structured feedback from the end user to become part of the training data that writes the business logic.

  • End users are often the first and last line of defense when a system decides to misbehave. Helping end users develop intuition around a model’s effective business logic is important to making them more effective in this role.

1 See the report “In Gartner’s 2020 Hype Cycle, Explainable AI Leaps Ahead” on the dramatic increase in demand for Explainable AI.

2 For a definition of accuracy, precision, and recall, see Hands-On Machine Learning for Cybersecurity (Packt).

3 While there have been recent developments to help expose the interworking of these black box models to improve explainability, a significant challenge of explainability remains when considering some of these more complex deep learning models.

4 “Daily Archives”, SAS Institute, May 6, 2015.

5 “3 NLP Trends Prime for Improvement in the New Year”, Inside Big Data, Jan 3, 2021.

6 For more information on these specifications, see this TowardsDataScience article.

7 In this report, we are defining “MLOps” as “machine learning operations.” However, we recognize that there is an emerging definition of MLOps to mean “model operations.” When we’re discussing MLOps in this report, we’re focusing on the operationalization of machine learning, whereas model ops may focus on a broader universe. For additional reading on this topic, see Gartner’s definition of ModelOps or the Forbes article “ModelOps Is Just The Beginning Of Enterprise AI”.
