The past chapters in this book have introduced data analysis methods, feature extraction techniques, and traditional machine learning and deep learning techniques. We have conducted multiple experiments on numeric, textual, and visual data and learned how to analyze and improve model performance.
In this chapter, we’re going to discuss strategies for planning data science and artificial intelligence projects, tools for persisting models, and hosting models as microservices that can be used in evolving applications.
Data Science Life Cycle
The process usually begins with a focus on defining the business or research objectives and coming up with artifacts that properly define the problem we are trying to solve. This leads to a clear understanding of the data that will be required, which then expands to an analysis of data sources, the technical expertise and cost required to obtain the data, and an evaluation of the data in terms of how well it will support the business objective. Once the data has been obtained, we might need to clean, preprocess, and, in some cases, combine multiple data sources to enrich the quality of the data.
The next step in the process is model creation. Based on the business objectives and technological constraints, we decide what kinds of solutions might be applicable to the problem. We often begin with simple experiments using basic feature engineering and out-of-the-box solutions and then proceed to more thorough model development. Based on the type of data, the chosen solution, and the availability of computational power, development and training can take hours to days. This is closely tied to thorough evaluation and tuning.
This life cycle is not a rigid structure but describes the process at a high level. The aim of such processes is to provide a standard set of steps, along with details about the information required for each step and the deliverables and documentation that are produced. One such highly popular framework is CRISP-DM.
CRISP-DM Process
Figure 16-3 shows how the process model is designed at four levels of abstraction. At the top level, the phases define several generic tasks that are meant to be well-defined, complete, and stable; at the next level, these generic tasks are carried out as specialized tasks. For example, a generic task called Collect User Data may require specialized tasks like (1) export the users table from the database, (2) find user locations using an external service, and (3) download data from the user’s LinkedIn profile using the API. The fourth level covers the actual implementation of the specialized tasks, including a record of the actions, decisions, and results of each task that is performed.
There are six phases of the CRISP-DM model. The following sections describe each one.
Phase 1: Business Understanding
Before diving deeper into the project, the first step is to understand the end goal of the project from the stakeholders’ point of view. There might be conflicting objectives, which, if not analyzed at this level, may lead to unnecessary rework costs. By the end of this phase, we will have a clear set of business objectives and business success criteria. We also conduct an analysis of resource availability and risk while assessing the situation. After this, we define the goals of the project from a technical data mining perspective and produce a project plan.
Phase 2: Data Understanding
This phase involves tasks for collecting initial data. Most projects require data from multiple sources that need to be integrated – this can be covered either in this phase or the next. The important part here is to create an initial data collection report that explains how the data was acquired and what problems were encountered. This phase also covers exploring and describing the data, along with verifying data quality. Any potential data quality issues must be addressed.
Phase 3: Data Preparation
The data preparation phase assumes that the initial data has been obtained and studied and that potential risks have been planned for. The end goal of this phase is to produce ready-to-use datasets that will be used for modelling or analysis. An additional artifact will describe the datasets.
As a part of this phase, select the datasets – and for each dataset, document the reasons for inclusion and exclusion. This is followed by data cleaning, in which the data quality is improved. This may involve transformation, deriving more attributes, or enriching the datasets. After cleaning, transformation, and integration, the data is formatted to make it simpler to load in future stages.
Phase 4: Modelling
Modelling is the phase in which you build and assess various models based on the different modelling and machine learning techniques we have studied so far. In the first step, the modelling technique to be used is selected. There will be different instances of this task for the different modelling methods or algorithms that you wish to explore and evaluate. You will generate a test design, build the model, assess it thoroughly, and evaluate how closely the model fits the technical needs of the system.
Phase 5: Evaluation
The evaluation phase looks broadly at which model meets the business needs. The tasks involved in this phase test the models in a real application and assess the results generated. After this comes the review process task, in which we do a thorough review of the data mining engagement to determine whether there is any important factor or task that should have been covered. Finally, we determine the next steps: whether the models require further tuning or can move on to deployment. At the end of this phase, we have documented the quality of the models and a list of possible actions that should be taken next.
Phase 6: Deployment
The final phase, deployment, brings the work done so far into actual use. This phase varies widely based on business needs, organizational policies, and engineering requirements. It begins with planning the deployment, which involves developing a deployment plan that contains the deployment strategy. We also need a thorough monitoring and maintenance plan to avoid issues after the end-to-end project has been launched. Finally, the project team documents a summary of the project and conducts a project review to discuss and document what went well, what could have been better, and how to improve in the future.
In practice, most organizations use these phases as guidelines and create their own processes based on their budgets, governance requirements, and needs. Many small-scale teams do not follow these steps and get caught in a long loop of iteration after iteration of development and improvement, unable to avoid pitfalls that could have been planned for and handled had these processes been studied.
In the next part of this chapter, we will study the technical aspects of development and deployment of data science and AI projects.
How ML Applications Are Served
In larger applications, these servers are hosted on the cloud, often through Docker, for easy deployment. The practice of deploying, monitoring, and maintaining machine learning models for AI applications is being formalized into well-structured concepts in the form of MLOps.
In the next few pages, we will take a small project that will be eventually hosted as an ML application.
Learning with an Example
In this mini-project, we will build a sentiment analysis tool using PyTorch, with the aim of experimenting with the model architecture to achieve relatively good performance, saving the parameters, and hosting the model using Flask.
The first attempt toward sentiment analysis was the General Inquirer system, published in 1961. The typical task in sentiment analysis is text polarity classification, where the classes of interest are positive and negative, sometimes with a neutral class. With advancements in computational capabilities, machine learning algorithms, and later deep learning, sentiment analysis has become much more accurate and prevalent in a lot of situations.
Defining the Problem
Sentiment analysis is a vast field that covers the problem of identifying emotions, opinions, moods, and attitudes. There are also many names and slightly different tasks, for example, sentiment analysis, opinion mining, opinion extraction, sentiment mining, subjectivity analysis, affect analysis, emotion analysis, review mining, etc.
In this problem, we will build a model for classifying whether a movie review sentence is positive, negative, or neutral. In traditional machine learning approaches, feature engineering would be the primary task. A feature vector is a representation of the actual content (document, tweet, etc.) that the classification algorithm takes as input. A feature, beyond being a generic attribute, is much easier to understand in the context of a problem: it is a characteristic that might help when solving that problem.
In a deep learning solution, we can use either word embeddings or sequences of characters. But first, we have to obtain the data.
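To make the idea of a feature vector concrete, here is a minimal sketch of a bag-of-words representation. The tiny hand-picked vocabulary and the sample sentence are purely illustrative, not part of the book’s dataset.

```python
# A minimal sketch of traditional feature engineering: a bag-of-words
# feature vector over a tiny, hand-picked vocabulary.

VOCABULARY = ["good", "great", "bad", "boring", "plot", "acting"]

def bag_of_words(sentence, vocabulary=VOCABULARY):
    """Count how often each vocabulary word occurs in the sentence."""
    tokens = sentence.lower().split()
    return [tokens.count(word) for word in vocabulary]

# One count per vocabulary word, in vocabulary order.
features = bag_of_words("The acting was good and the plot was good")
```

A classifier never sees the raw text, only fixed-length vectors like this one.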
In some cases, you will collect the data through your database logs or hire a data gathering team – or, as in our case, get lucky and stumble upon a freely available dataset. A 50,000-item movie review dataset1 was gathered and prepared by Stanford and published in 2011.
Data
You can download the data from their webpage, though the solutions we’re going to explain here will work equally well with other datasets, including ones from product reviews or social media text. The download is a compressed tar file, which after decompression expands into two folders, namely, test and train, along with some additional files containing information about the dataset. An alternate, preprocessed copy of the dataset is available on Kaggle,2 shared by Lakshmipathi N.
The dataset contains 50,000 reviews, each of which is marked as positive or negative. This gives an indication about the last layer of the neural network structure – all we need is a single node with sigmoid activation function. If there were more than two classes, say, positive, negative, or neutral, we would create three nodes, each representing a sentiment class label. The node with the highest value would indicate the predicted result.
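The two output-layer choices described above can be sketched as follows (assuming PyTorch; the hidden size of 64 is an arbitrary illustration, not a value from the project):

```python
# Contrast the two output-layer designs: a single sigmoid node for
# binary sentiment vs. one node per class for a three-class setup.
import torch
import torch.nn as nn

hidden = torch.randn(1, 64)                  # a fake hidden representation

binary_head = nn.Linear(64, 1)               # single output node...
score = torch.sigmoid(binary_head(hidden))   # ...squashed into (0, 1)

three_class_head = nn.Linear(64, 3)          # one node per sentiment class
logits = three_class_head(hidden)
predicted_class = logits.argmax(dim=1)       # node with the highest value wins
```

In the binary case, a score above 0.5 is read as positive; in the multi-class case, the index of the largest output selects the label.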
The output has been truncated here for brevity. You can see the labelled sentiment for this review using sample_row['sentiment'], which is positive for this sample.
We know that most models require the data to be converted to a particular format. In our RNN-based model, we will need to convert the data into a sequence of numbers – where each number represents a word in the vocabulary.
In preprocessing stage, we will need to (1) convert all the words to lowercase, (2) tokenize and clean the string, (3) remove stop words, and (4) based on our knowledge of words in the training corpus, prepare a dictionary of words and convert all the words to numbers based on the dictionary.
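The four preprocessing steps above can be sketched as follows. To keep the sketch self-contained, it uses a tiny hand-written stop-word list and a two-sentence corpus; in the project itself, the stop words come from NLTK and the dictionary is built from the review training set.

```python
# Sketch of the preprocessing pipeline: lowercase, tokenize/clean,
# remove stop words, and map words to numbers via a vocabulary dict.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of"}  # illustrative subset

def tokenize(text):
    text = text.lower()                      # (1) lowercase
    text = re.sub(r"[^a-z\s]", " ", text)    # (2) strip punctuation and digits
    return [t for t in text.split() if t not in STOP_WORDS]  # (3) stop words

# (4) build a word -> number dictionary from the training corpus
corpus = ["The movie was great!", "The plot was boring."]
counts = Counter(word for doc in corpus for word in tokenize(doc))
vocab = {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}  # 0 reserved for padding

def to_sequence(text):
    """Convert a sentence into its sequence of vocabulary indices."""
    return [vocab[w] for w in tokenize(text) if w in vocab]
```

Reserving index 0 for padding will matter later, when all sequences are forced to a fixed length.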
This means the list of stopwords from NLTK is not yet available, and we can download it using nltk.download(). This is required only once in your Python environment. For more details, you can refer to the NLTK3 documentation.
Alternatively, you can construct a list of stopwords and add the logic to remove the words present in the stopwords list.
Before proceeding further, we should verify if the objects are in the right shape and size.
vocab should be a dictionary of length 2000 (the number we limit the vocabulary size to). X_train, y_train, X_test, and y_test should be numpy.ndarray objects with sizes matching the split of the original dataset.
Looking at these numbers, the reviews appear quite long. In general, however, we’ll have a lot of short reviews and very few that are very long. It is reasonable to truncate the review sequence length to 200 words – of course losing information in reviews longer than 200 words, but we assume those will be quite rare and should not have much impact on the performance of the model.
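Fixing every review to 200 tokens can be sketched as follows: longer sequences are truncated, and shorter ones are left-padded with zeros (the index reserved for padding). The sample sequences are illustrative.

```python
# Pad or truncate variable-length index sequences to a fixed length.
import numpy as np

SEQ_LEN = 200

def pad_or_truncate(sequences, seq_len=SEQ_LEN):
    out = np.zeros((len(sequences), seq_len), dtype=np.int64)
    for i, seq in enumerate(sequences):
        seq = seq[:seq_len]                 # truncate long reviews
        out[i, seq_len - len(seq):] = seq   # left-pad short reviews with 0
    return out

# One short review and one 300-word review.
padded = pad_or_truncate([[5, 2, 9], list(range(1, 301))])
```

Left-padding keeps the meaningful words nearest the end of the sequence, which is where the recurrent network's final hidden state is taken.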
Preparing the Model
The model is simple: an input layer, followed by an LSTM layer, followed by a one-unit output layer with sigmoid activation. We will use a dropout layer for basic regularization to avoid overfitting.
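One way to write this architecture in PyTorch is sketched below. The class name, embedding dimension, hidden size, and dropout rate are illustrative choices, not values fixed by the text; the vocabulary size of 2001 assumes the 2000-word vocabulary plus a padding index.

```python
# Sketch of the described architecture: input (embedding) layer, LSTM,
# dropout for regularization, and a one-unit sigmoid output layer.
import torch
import torch.nn as nn

class SentimentNet(nn.Module):
    def __init__(self, vocab_size=2001, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):                    # x: (batch, seq_len) word indices
        embedded = self.embedding(x)
        _, (hidden, _) = self.lstm(embedded) # final hidden state of the LSTM
        out = self.fc(self.dropout(hidden[-1]))
        return torch.sigmoid(out).squeeze(1) # one probability per review

model = SentimentNet()
probs = model(torch.randint(0, 2001, (4, 200)))  # 4 fake reviews, length 200
```

The sigmoid output lands in (0, 1), so it pairs naturally with the binary cross-entropy loss used next.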
Let’s define the training loop. We will use the binary cross-entropy loss function, which is a good choice for simple binary classification problems. We will keep the learning rate at 0.01, and the optimizer is the Adam optimization algorithm.
The training loop can be made to run for a large number of epochs. We will keep track of accuracy and loss over each epoch to see how the performance improves over multiple iterations of training.
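A training loop of this shape is sketched below. To keep it fast and self-contained it overfits a single random batch and stands in a one-layer sigmoid model for the LSTM network; the loss, optimizer, and learning rate match the text.

```python
# Sketch of the training loop: BCE loss, Adam optimizer, lr = 0.01,
# tracking loss and accuracy per epoch.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(200, 1), nn.Sigmoid())  # stand-in model
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

X = torch.randn(32, 200)                  # fake feature batch
y = torch.randint(0, 2, (32, 1)).float()  # fake 0/1 sentiment labels

losses = []
for epoch in range(20):
    optimizer.zero_grad()
    preds = model(X)
    loss = criterion(preds, y)
    loss.backward()
    optimizer.step()
    # track accuracy and loss per epoch, as described in the text
    accuracy = ((preds > 0.5).float() == y).float().mean().item()
    losses.append(loss.item())
```

On the real dataset the same loop would iterate over mini-batches from a DataLoader rather than one fixed batch.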
For further improvements and tuning, you can play with the model architecture and hyperparameters and, the easiest of all, increase the number of epochs or add more labelled data.
Serializing for Future Predictions
Usually, you’d play around with modifications to the network architecture, tune factors like how you create the features (the vocabulary), and adjust other hyperparameters. Once you’ve got sufficiently high accuracy that can be relied upon in the application, you would save the model state so that you don’t have to repeat the computationally intensive training process every time you want to predict the sentiment of a sentence.
Logic to convert sentence into sequence
Vocabulary dictionary containing mapping from words to numbers
Network architecture and forward propagation computations
Remember, this only saves the model parameters. You would still need the model definition that you specified earlier in the code.
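Saving and restoring parameters can be sketched as follows. A tiny stand-in class replaces the sentiment model here so the example is self-contained; the file name is illustrative.

```python
# Sketch of persisting parameters with torch.save and load_state_dict.
# Only the learned weights go into the file; the class definition
# itself must still be present in code to restore the model.
import os
import tempfile
import torch
import torch.nn as nn

class TinyNet(nn.Module):                 # stand-in for the sentiment model class
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 1)

    def forward(self, x):
        return torch.sigmoid(self.fc(x))

model = TinyNet()
path = os.path.join(tempfile.mkdtemp(), "sentiment_model.pt")
torch.save(model.state_dict(), path)      # parameters only, not the class

restored = TinyNet()                      # definition still required
restored.load_state_dict(torch.load(path))
```

After loading, the restored model produces identical outputs to the original without any retraining.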
This will preprocess the sentence, split it into tokens, convert it into a sequence of vocabulary indices, pass the input sequence to the network, and return the value obtained in the output layer after a forward propagation pass. This returned a value of 0.6632, which denotes a positive sentiment. If the use case requires, you can add a conditional statement to return a string containing the word “positive” or “negative” instead of a number.
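The whole flow can be wrapped into one function, sketched below with a three-word vocabulary and an untrained stand-in model (the real project would use the trained LSTM and the 2000-word vocabulary built earlier):

```python
# Sketch of a predict function: words -> indices -> forward pass -> label.
import torch
import torch.nn as nn

vocab = {"great": 1, "boring": 2, "movie": 3}           # illustrative vocabulary
model = nn.Sequential(nn.EmbeddingBag(len(vocab) + 1, 8),
                      nn.Linear(8, 1), nn.Sigmoid())    # untrained stand-in

def predict_sentiment(sentence):
    # assumes at least one word of the sentence is in the vocabulary
    tokens = sentence.lower().split()
    seq = [vocab[t] for t in tokens if t in vocab]      # words -> indices
    score = model(torch.tensor([seq])).item()           # forward pass
    return "positive" if score > 0.5 else "negative"    # optional string label
```

With random weights the label is arbitrary, of course; the structure of the function is what carries over to the trained model.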
Hosting the Model
One of the most popular methods of using a trained model in a larger application is to host the model as a microservice. This means a small HTTP server will be used that can accept GET requests.
In this example, we will build a server that accepts GET requests whose data is a review. The server will read the data and respond with a sentiment label.
Hello World in Flask
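The customary Flask starting point is a minimal server with a single route, along these lines:

```python
# Minimal Flask application: one route returning a greeting.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello, World!"

# app.run(port=5000)  # uncomment to start the development server
```

Running app.run() starts Flask’s built-in development server; production deployments typically sit behind a WSGI server instead.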
The front-end application can send a request to http://server:port/getsentiment, passing the data as a reviewtext argument, and receive a JSON dictionary with sentimentscore and sentimentlabel.
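Such an endpoint can be sketched as follows. The route and argument names follow the text; the scoring function is a word-counting placeholder standing in for a forward pass through the trained model.

```python
# Sketch of the /getsentiment microservice endpoint.
from flask import Flask, jsonify, request

app = Flask(__name__)

def sentiment_score(text):
    """Placeholder for the model's forward pass (illustrative word lists)."""
    positive = {"good", "great", "excellent"}
    negative = {"bad", "boring", "awful"}
    words = text.lower().split()
    hits = sum(w in positive for w in words) - sum(w in negative for w in words)
    return 0.5 + max(-0.5, min(0.5, 0.1 * hits))

@app.route("/getsentiment", methods=["GET"])
def get_sentiment():
    review = request.args.get("reviewtext", "")
    score = sentiment_score(review)
    return jsonify(sentimentscore=score,
                   sentimentlabel="positive" if score > 0.5 else "negative")
```

In the real service, sentiment_score would load the saved parameters once at startup and run the preprocessing and forward pass described earlier for each request.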
What’s Next
The fields of machine learning, artificial intelligence, and data science have been evolving over the past decades and will continue to evolve as newer hardware technologies and algorithmic perspectives emerge.
AI is not a magic wand that will solve our unsolvable problems – but a well-structured suite of concepts, theories, and techniques that help us understand and implement solutions that help the machines learn by looking at the data that we offer them. It is important to understand the implications of potential biases and thoroughly inspect the ethical aspects of the projects and products that are the outcome of our practice. This book serves not as an end but as a handy tool to navigate the steps in your data science journey.