© Vaibhav Verdhan 2020
V. Verdhan, Supervised Learning with Python, https://doi.org/10.1007/978-1-4842-6156-9_1

1. Introduction to Supervised Learning

Vaibhav Verdhan1 
(1)
Limerick, Ireland
 

“The future belongs to those who prepare for it today.”

— Malcolm X

The future is something which always interests us. We want to know what lies ahead so we can plan for it. If we can predict the future, we can mold our business strategies, minimize our losses, and increase our profits. Prediction has always intrigued us, and you have just taken the first step toward learning how to predict the future. Congratulations, and welcome to this exciting journey!

You may have heard that data is the new oil. Data science and machine learning (ML) are harnessing this power of data to generate predictions for us. These capabilities allow us to examine trends and anomalies, gather actionable insights, and provide direction to our business decisions. This book assists in developing these capabilities. We are going to study the concepts of ML and develop pragmatic code using Python. You are going to use multiple datasets, generate insights from data, and create predictive models using Python.

By the time you finish this book, you will be well versed in the concepts of data science and ML with a focus on supervised learning. We will examine concepts of supervised learning algorithms to solve regression problems, study classification problems, and solve different real-life case studies. We will also study advanced supervised learning algorithms and deep learning concepts. The datasets used are structured as well as unstructured (text and images). The end-to-end model development and deployment process is also covered to complete the learning.

In this process, we will examine supervised learning algorithms: their nuts and bolts, the statistical and mathematical equations and processes behind them, what happens in the background, and how we use data to create the solutions. All the code is written in Python, and the code and datasets are uploaded to a GitHub repository (https://github.com/Apress/supervised-learning-w-python) for easy access. You are advised to replicate the code yourself.

Let’s start this learning journey.

What Is ML?

When we post a picture on Facebook, shop at Amazon, tweet, or watch videos on YouTube, each of these platforms is collecting data about us. At each of these interactions, we leave behind our digital footprints. The data points generated are collected and analyzed, and ML allows these giants to make logical recommendations to us. Based on the genre of videos we like, Netflix/YouTube can update our playlists; based on the links we click and the statuses we react to, Facebook can recommend posts to us; and observing the types of products we frequently purchase, Amazon can suggest our next purchase as per our pocket size! Amazing, right?

The short definition for ML is as follows: “In Machine Learning, we study statistical/mathematical algorithms to learn the patterns from the data which are then used to make predictions for the future.”

And ML is not limited to the online mediums alone. Its power has been extended to multiple domains, geographies, and use cases. We will be describing those use cases in detail in the last section of this chapter.

So, in ML, we analyze vast amounts of data and uncover the patterns in it. These patterns are then applied on real-world data to make predictions for the future. This real-world data is unseen, and the predictions will help businesses shape their respective strategies. We do not need to explicitly program computers to do these tasks; rather, the algorithms take the decisions based on historical data and statistical models.

But how does ML fit into the larger data analysis landscape? Often, we encounter terms like data analysis, data mining, ML, and artificial intelligence (AI). Data science is also a loosely used phrase with no exact definition available. It will be a good idea if these terms are explored now.

Relationship Between Data Analysis, Data Mining, ML, and AI

Data mining is a buzzword nowadays. It is used to describe the process of collecting data from large datasets, databases, and data lakes, extracting information and patterns from that data, and transforming these insights into usable structure. It involves data management, preprocessing, visualizations, and so on. But it is most often the very first step in any data analysis project.

The process of examining the data is termed data analysis. Generally, we trend the data, identify anomalies, and generate insights using tables, plots, histograms, crosstabs, and so on. Data analysis is one of the most important steps and is very powerful, since the intelligence generated is easy to comprehend, relatable, and straightforward. Often, we use Microsoft Excel and SQL for this exploratory data analysis (EDA). It also serves as an important step before creating an ML model.

There is a question quite often discussed: what is the relationship between ML, AI, and deep learning? And how does data science fit in? Figure 1-1 depicts the intersections between these fields. AI can be thought of as automated solutions which replace human-intensive tasks; AI hence reduces the cost and time consumed and improves the overall efficiency.
Figure 1-1: Relationship between AI, ML, deep learning, and data science, showing how these fields are interrelated and empower each other

Deep learning is one of the hottest trends now. Neural networks are the heart and soul of deep learning. Deep learning is a subset of AI and ML and involves developing complex mathematical models to solve business problems. Mostly we use neural networks to classify images and to analyze text, audio, and video data.

Data science lies at the juxtaposition of these various domains. It involves not only ML but also an understanding of statistics, coding expertise, and business acumen to solve business problems. A data scientist's job is to solve business problems and generate actionable insights for the business. Refer to Table 1-1 to understand the capabilities of data science and its limitations.
Table 1-1

Data Science: How Can It Help Us, Its Usages, and Limitations

With the preceding discussion, the role of ML and its relationship with other data-related fields should be clear to you. You would have realized by now that “data” plays a pivotal role in ML. Let’s explore more about data, its types and attributes.

Data, Data Types, and Data Sources

You already have some understanding of data for sure. It will be a good idea to refresh that knowledge and discuss different types of datasets generated and examples of it. Figure 1-2 illustrates the differentiation of data.
Figure 1-2: Data can be divided into structured and unstructured. Structured data is easier to work with, while deep learning is generally used for unstructured data

Data is generated in all the interactions and transactions we do. Online or offline: we generate data every day, every minute. At a bank, a retail outlet, on social media, making a mobile call: every interaction generates data.

Data comes in two flavors: structured data and unstructured data. When you make that mobile call to your friend, the telecom operator gets the data of the call like call duration, call cost, time of day, and so on. Similarly, when you make an online transaction using your bank portal, data is generated around the amount of transaction, recipient, reason of transaction, date/time, and so on. All such data points which can be represented in a row-column structure are called structured data . Most of the data used and analyzed is structured. That data is stored in databases and servers using Oracle, SQL, AWS, MySQL, and so on.

Unstructured data is the type which cannot be represented in a row-column structure, at least in its basic format. Examples of unstructured data are text data (Facebook posts, tweets, reviews, comments, etc.), images and photos (Instagram, product photos), audio files (jingles, recordings, call center calls), and videos (advertisements, YouTube posts, etc.). All of the unstructured data can be saved and analyzed though. As you would imagine, it is more difficult to analyze unstructured data than structured data. An important point to be noted is that unstructured data too has to be converted into integers so that the computers can understand it and can work on it. For example, a colored image has pixels and each pixel has RGB (red, green, blue) values ranging from 0 to 255. This means that each image can be represented in the form of matrices having integers. And hence that data can be fed to the computer for further analysis.
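To make this concrete, here is a minimal sketch (using NumPy, with a hypothetical 2×2 image) of how a colored image is simply a matrix of integers:

import numpy as np

# A hypothetical 2x2 colored image: each pixel holds three RGB values between 0 and 255
tiny_image = np.array([[[255, 0, 0], [0, 255, 0]],
                       [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)
print(tiny_image.shape)  # (2, 2, 3): height, width, and the three color channels
print(tiny_image.max())  # 255, the upper bound of a pixel value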

Note

We use techniques like natural language processing, image analysis, and neural networks like convolutional neural networks, recurrent neural networks, and so on to analyze text and image data.

A vital aspect often ignored and less discussed is data quality . Data quality determines the quality of the analysis and insights generated. Remember, garbage in, garbage out.

The attributes of a good dataset are represented in Figure 1-3. While you are approaching a problem, it is imperative that you spend a considerable amount of time ascertaining that your data is of the highest quality.
Figure 1-3: Data quality plays a vital role in development of an ML solution; a lot of time and effort are invested in improving data quality

We should ensure that data available to us conforms to the following standards:
  • Completeness of data refers to the percentage of available attributes. In real-world business, we find that many attributes are missing, or have NULL or NA values. It is advisable to ensure we source the data properly and ensure its completeness. During the data preparation phase, we treat these variables and replace them or drop them as per the requirements. For example, if you are working on retail transaction data, we have to ensure that revenue is available for all or almost all of the months.

  • Data validity is to ensure that all the key performance indicators (KPI) are captured during the data identification phase. The inputs from the business subject matter experts (SMEs) play a vital role in ensuring this. These KPIs are calculated and are verified by the SMEs. For example, while calculating the average call cost of a mobile subscriber, the SME might suggest adding/deleting few costs like spectrum cost, acquisition cost, and so on.

  • Accuracy of the data is to make sure all the data points captured are correct and no inconsistent information is in our data. It is observed that due to human error or software issues, sometimes wrong information is captured. For example, while capturing the number of customers purchasing in a retail store, weekend figures are mostly higher than weekdays. This is to be ensured during the exploratory phase.

  • Data used has to be consistent and should not vary between systems and interfaces. Often, different systems are used to represent a KPI. For example, the number of clicks on a website page might be recorded in different ways. The consistency in this KPI will ensure that correct analysis is done, and consistent insights are generated.

  • While you are saving the data in databases and tables, often the relationships between various entities and attributes are not consistent or worse may not exist. Data integrity of the system ensures that we do not face such issues. A robust data structure is required for an efficient, complete, and correct data mining process.

  • The goal of data analytics is to find trends and patterns in the data. There are seasonal variations, movements with respect to days/time and events, and so on. Sometimes it is imperative that we capture data of the last few years to measure the movement of KPIs. The timeliness of the data captured has to be representative enough to capture such variations.

Most common issues encountered in data are missing values, duplicates, junk values, outliers, and so on. You will study in detail how to resolve these issues in a logical and mathematical manner.
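As a small illustration, the following is a minimal sketch (using pandas, on a hypothetical transactions table) of how these common issues can be detected:

import numpy as np
import pandas as pd

# Hypothetical retail transactions with typical data quality issues
df = pd.DataFrame({"customer_id": [1, 2, 2, 4, 5],
                   "revenue": [100.0, np.nan, 250.0, 90.0, 10000.0]})
print(df.isnull().sum())                            # missing values per column
print(df.duplicated(subset=["customer_id"]).sum())  # duplicate customer records
print(df["revenue"].describe())                     # min/max/quartiles help spot outliers and junk values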

By now, you have understood what ML is and what the attributes of good-quality data are to ensure good analysis. But still a question is unanswered. When we have software engineering available to us, why do we still need ML? You will find the answer to this question in the following section.

How ML Differs from Software Engineering

Software engineering and ML both solve business problems. Both interact with databases, analyze and code modules, and generate outputs which are used by the business. The business domain understanding is imperative for both fields and so is the usability. On these parameters, both software engineering and ML are similar. However, the key difference lies in the execution and the approach used to solve the business challenge.

Software engineering involves writing precise code which can be executed by the processor, that is, the computer. On the other hand, ML collects historical data and understands trends in the data. Based on the trends, the ML algorithm will predict the desired output. Let us look at it with an easy example first.

Consider this: you want to automate the opening of a cola can. Using software, you would code the exact steps with precise coordinates and instructions. For that, you should know those precise details. However, using ML, you would “show” the process of opening a can to the system many times. The system will learn the process by looking at various steps or “train” itself. Next time, the system can open the can itself. Now let’s look at a real-life example.

Imagine you are working for a bank which offers credit cards. You are in the fraud detection unit and it is your job to classify a transaction as fraudulent or genuine. Of course, there are acceptance criteria like transaction amount, time of transaction, mode of transaction, city of transaction, and so on.

Let us implement a hypothetical solution using software; you might implement conditions like those depicted in Figure 1-4. Like a decision tree, a final decision can be made. Step 1: if the transaction amount is below the threshold X, then move to step 2 or else accept it. In step 2, the transaction time might be checked and the process will continue from there.
Figure 1-4: Hypothetical software engineering process for a fraud detection system. Software engineering is different from ML.

However, using ML, you will collect historical data comprising past transactions. It will contain both fraudulent and genuine transactions. You will then expose these transactions to the statistical algorithm and train it. The statistical algorithm will uncover the relationship between the attributes of a transaction and its genuine/fraudulent nature and will retain that knowledge for further usage.

Next time, when a new transaction is shown to the system, it will classify it as fraudulent or genuine based on the knowledge it has generated from the past transactions and the attributes of this new, unseen transaction. Hence, the set of rules generated by ML algorithms depends on the trends and patterns in the data and offers a higher level of flexibility.
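A minimal sketch of this idea follows, assuming scikit-learn and a hypothetical table of past transactions (logistic regression is used here purely for illustration; it is one of several algorithms that could be chosen):

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical historical transactions: attributes plus a fraud/genuine label
history = pd.DataFrame({"amount":   [20, 5000, 35, 7000, 15, 9000],
                        "hour":     [14, 2, 11, 3, 16, 1],
                        "is_fraud": [0, 1, 0, 1, 0, 1]})
model = LogisticRegression(max_iter=1000)
model.fit(history[["amount", "hour"]], history["is_fraud"])   # learn the patterns from the past

# A new, unseen transaction is classified using the learned relationship, not hand-written rules
new_transaction = pd.DataFrame({"amount": [6500], "hour": [2]})
print(model.predict(new_transaction))         # 1 = fraudulent, 0 = genuine
print(model.predict_proba(new_transaction))   # probability of each class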

Development of an ML solution is often more iterative than software engineering. Moreover, its output is not exact in the way software output is; rather, ML offers a good generalized solution. It is a fantastic solution for complex business problems and often the only solution for really complicated problems which we humans are unable to comprehend. Here ML plays a pivotal role. Its beauty lies in the fact that if the training data changes, one need not start the development process from scratch: the model can be retrained and you are good to go!

So ML is undoubtedly quite useful, right! It is time for you to understand the steps in an ML project. This will prepare you for a deeper journey into ML.

ML Projects

An ML project is like any other project. It has a business objective to be achieved, some input information, tools and teams, desired accuracy levels, and a deadline!

However, execution of an ML project is quite different. The very first step in the ML process is the same, which is defining a business objective and a measurable parameter for measuring the success criteria. Figure 1-5 shows subsequent steps in an ML project.
Figure 1-5: An ML project is like any other, with various steps and processes. Proper planning and execution are required.

The subsequent steps are:
  1. Data discovery is done to explore the various data sources available to us. Datasets might be available in a SQL server, Excel files, text or .csv files, or on a cloud server.

  2. In the data mining and calibration stage, we extract the relevant fields from all the sources. Data is properly cleaned and processed and is made ready for the next phase. New derived variables are created, and variables which do not carry much information are discarded.

  3. Then comes the exploratory data analysis (EDA) stage. Using analytical tools, general insights are generated from the data. Trends, patterns, and anomalies are the output of this stage and prove quite useful for the next stage, statistical modeling.

  4. ML modeling or statistical modeling is the actual model development phase. We will discuss this phase in detail throughout the book.

  5. After modeling, results are shared with the business team and the statistical model is deployed into the production environment.

Since the available data is seldom clean, more than 60%–70% of the project time is spent in the data mining, data discovery, cleaning, and data preparation phases.

Before starting the project, there are some anticipated challenges. In Figure 1-6, we discuss a few questions we should ask before starting an ML project.
Figure 1-6: Preparations to be made before starting an ML project. It is imperative that all the relevant questions are clear and KPIs are frozen.

We should be able to answer these questions about the data availability, data quality, data preparation, ML model prediction measurements, and so on. It is imperative to find the answers to these questions before kicking off the project; else we are risking stress for ourselves and missing deadlines at a later stage.

Now you know what ML is and the various phases in an ML project. It will be useful for you to envisage an ML model and the various steps in the process. Before going deeper, it is imperative that we brush up on some statistical and mathematical concepts. You will also agree that statistical and mathematical knowledge is required to appreciate ML.

Statistical and Mathematical Concepts for ML

Statistics and mathematics are of paramount importance for complete and concrete knowledge of ML. The mathematical and statistical algorithms used in making the predictions are based on concepts like linear algebra, matrix multiplications, concepts of geometry, vector-space diagrams, and so on. Some of these concepts you would have already studied. While studying the algorithms in subsequent chapters, we will be studying the mathematics behind the working of the algorithms in detail too.

Here are a few concepts which are quite useful and important for you to understand. These are the building blocks of data science and ML:
  • Population vs. Sample : As the name suggests, when we consider all the data points available to us, we are considering the entire population. A portion taken from the population is termed a sample. This is seen in Figure 1-7.
    Figure 1-7: Population vs. a sample from the population. A sample is a true representation of a population. Sampling should be done keeping in mind that there is no bias.

  • Parameter vs. Statistic : Parameter is a descriptive measure of the population: for example, population mean, population variance, and so on. A descriptive measure of a sample is called a statistic. For example, sample mean, sample variance, and so on.

  • Descriptive vs. Inferential Statistics : When we gather the data about a group and reach conclusions about the same group, it is termed descriptive statistics. However, if data is gathered from a sample and statistics generated are used to generate conclusions about the population from which the sample has been taken, it is called inferential statistics.

  • Numeric vs. Categorical Data : All data points which are quantitative are numeric, like height, weight, volume, revenue, percentages returns, and so on.
    • The data points which are qualitative are categorical data points: for example, gender, movie ratings, pin codes, place of birth, and so on. Categorical variables are of two types: nominal and ordinal. Nominal variables do not have a rank between distinct values, whereas ordinal variables have a rank.

    • Examples of nominal data are gender, religion, pin codes, ID number, and so on. Examples of ordinal variables are movie ratings, Fortune 50 ranking, and so on.

  • Discrete vs. Continuous Variable : Data points which are countable are discrete; otherwise data is continuous (Figure 1-8).

Figure 1-8: Discrete variables are countable, while continuous variables are measured over a continuous range (such as time)

For example, the number of defects in a batch is countable and hence discrete, whereas time between customer arrivals at a retail outlet is continuous.
  • Measures of Central Tendency and Dispersion : Mean, median, and mode are measures of central tendency, while standard deviation and variance are measures of dispersion. These measures are central to reporting the various KPIs. There are other measures too which are tracked, like totals and decile or quartile distributions. For example, while reporting the number of transactions done in a day, we will report the total number of transactions and the average number per day. We will also report the time/date movement for the KPI.
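A minimal sketch (using Python's statistics module and NumPy on a hypothetical list of daily transaction counts) of computing these descriptive measures:

import numpy as np
from statistics import mode

daily_transactions = [120, 150, 130, 150, 170, 145, 160]   # hypothetical daily transaction counts
print(np.sum(daily_transactions))                       # total
print(np.mean(daily_transactions))                      # mean
print(np.median(daily_transactions))                    # median
print(mode(daily_transactions))                         # mode
print(np.std(daily_transactions))                       # standard deviation
print(np.var(daily_transactions))                       # variance
print(np.percentile(daily_transactions, [25, 50, 75]))  # quartile distribution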

  • Poisson’s Distribution : Poisson’s distribution determines the probability of occurrence of a given number of events in a fixed interval of time or space. The assumption is that these events are independent of each other and occur at a constant mean rate.

The equation of Poisson’s distribution is as follows:
$$ P(k \text{ events in the interval}) = \frac{\lambda^k e^{-\lambda}}{k!} $$

For example: if we want to model the number of customers visiting a store between 4 PM and 5 PM or the number of transactions hitting the server between 11 PM and 4 AM, we can use Poisson’s distribution.

You can generate Poisson’s distribution using the following Python code:
import numpy as np
import matplotlib.pyplot as plt
s = np.random.poisson(5, 10000)                       # 10,000 samples with lambda = 5
count, bins, ignored = plt.hist(s, 14, density=True)  # normalized histogram of the samples
plt.show()
  • Binomial Distribution : We use the binomial distribution to model the number of successes in a sample of size “n” drawn with replacement from a population “N.” Hence, in a sequence of “n” independent trials, a Boolean (success/failure) outcome is recorded for each trial. Obviously, if the probability of success is “p,” then the probability of failure is “1 – p.”

The equation of binomial distribution is as follows:

$$ P(X = x) = \binom{n}{x} p^x {(1-p)}^{n-x} $$

The easiest example for a binomial distribution is a coin toss. Each of the coin toss events is independent from the others.

You can generate binomial distribution using the following Python code:
import numpy as np
import matplotlib.pyplot as plt
n, p = 10, .5                                         # 10 trials, probability of success 0.5
s = np.random.binomial(n, p, 1000)                    # 1,000 samples of the number of successes
count, bins, ignored = plt.hist(s, 14, density=True)  # normalized histogram of the samples
plt.show()
  • Normal or Gaussian Distribution : Normal distribution or the Gaussian distribution is the most celebrated distribution. It is the famous bell curve and occurs often in nature.

It is said that a normal distribution arises when a large number of small, random disturbances, each with its own distribution, act together.

Gaussian distribution forms the basis of the famous 68–95–99.7 rule, which states that in a normal distribution, 68.27%, 95.45%, and 99.73% of the values lie within one, two, and three standard deviations of the mean, respectively. The following Python code is used to generate a normal distribution as shown in Figure 1-9.
import numpy as np
import matplotlib.pyplot as plt
mu, sigma = 0, 0.1                                    # mean and standard deviation
s = np.random.normal(mu, sigma, 1000)                 # 1,000 samples from the normal distribution
count, bins, ignored = plt.hist(s, 30, density=True)  # normalized histogram of the samples
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins - mu)**2 / (2 * sigma**2)), linewidth=2, color="r")  # theoretical bell curve
plt.show()
Figure 1-9: The normal distribution curve is the most celebrated curve

  • Bias-Variance Trade-off : To measure how a model is performing, we calculate the error, which is the difference between the actual and predicted values. This error can arise from two sources, bias and variance, which are shown in Figure 1-10 and defined in Table 1-2.

Figure 1-10: High bias corresponds to underfitting of the model, while high variance corresponds to overfitting. Both have to be managed for a robust ML solution.

Table 1-2

Comparison Between Bias and Variance

Bias | Variance
The measure of how far off the predictions are from the actual values. | The measure of how much the predictions for a single data point vary.
Bias should be low for a good model. | Variance should be low for a good model.
High bias occurs due to wrong assumptions made during training or underfitting the model by not accounting for all the information present in the data. | High variance occurs when the model is overfitted to the training data and a general rule is not derived; it then performs badly on new datasets for predictions.

Figure 1-11: Low variance/low bias and high variance/high bias. Low variance and low bias are desirable in the final shortlisted model.

Error can be represented as follows:

Error = Bias² + Variance + Irreducible error
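As an illustration, here is a minimal sketch (assuming scikit-learn is available) that fits the same noisy data with a very simple and a very flexible model; the simple model shows high bias (underfitting), while the flexible one shows high variance (overfitting) and performs worse on unseen data:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 40)    # noisy nonlinear data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

for degree in [1, 15]:   # degree 1 underfits (high bias); degree 15 overfits (high variance)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          mean_squared_error(y_train, model.predict(X_train)),   # training error
          mean_squared_error(y_test, model.predict(X_test)))     # error on unseen test data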

  • Vector and Matrix : The datasets we use can be represented in a vector-space diagram. Hence, let’s briefly visit the definition of vector and matrix.

Vector can be defined as follows:
  a) A vector is an object which has both magnitude and direction.
  b) Vectors are an excellent tool for representing any form of data in mathematical form.
  c) A typical vector will look like this: [1, 2, 3, 4, 5].
  d) A vector in mathematical terms is represented as $$ \overrightarrow{v} $$, with an arrow at the top.
  e) Vectors can be used for both numerical and non-numerical data. Representation of unstructured data in a mathematical format is achieved through vectorization and embeddings.
Matrix can be defined as follows:
  a) A matrix is an extension of vectors.
  b) A matrix is a bunch of vectors placed on top of each other; thus, a matrix consists of numbers arranged in rows and columns.
  c) Matrices are a very easy way of representing and holding datasets to perform mathematical operations.
  d) A typical matrix will look like this:

$$ A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix} $$
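A minimal NumPy sketch of these two objects and a basic operation on them:

import numpy as np

v = np.array([1, 2, 3, 4, 5])      # a vector: five components with magnitude and direction
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])           # a matrix: vectors stacked as rows
print(v.shape, A.shape)             # (5,) and (3, 3)
print(A @ np.array([1, 0, 1]))      # matrix-vector multiplication: [ 4 10 16]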
  • Correlation and Covariance : Correlation and covariance are very important measures when we try to understand the relationships between variables.
    a) Covariance and correlation are measures of dependence between two variables.
    b) For example, for a child, as height increases, weight generally increases. In this case height and weight are positively correlated.
    c) There can be negative and zero correlation between data points.
    d) For example, an increase in absences from class may decrease grades. If the same trend can be observed over a collection of samples, then these parameters are negatively correlated.
    e) Zero correlation shows no linear dependence, but there can be nonlinear dependencies. For example, an increase in the price of rice has zero correlation with an increase/decrease in the price of mobile phones.
    f) Correlation is the scaled value of covariance.
    g) The three types of correlation (positive, negative, and no correlation) are shown in Figure 1-12.
Figure 1-12: Positive, negative, and no correlation in the data
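A minimal sketch (using pandas, on hypothetical height and weight measurements) of computing covariance and correlation:

import pandas as pd

# Hypothetical measurements for a group of children
data = pd.DataFrame({"height_cm": [100, 110, 120, 130, 140],
                     "weight_kg": [16, 19, 23, 26, 30]})
print(data.cov())    # covariance matrix: dependence in the original units
print(data.corr())   # correlation matrix: covariance scaled to the range -1 to +1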

There are still a few concepts, like measuring the accuracy of the algorithms, R², adjusted R², AIC values, concordance ratio, KS value, and so on, which are yet to be discussed. We will be discussing them in two parts: for regression problems in Chapter 2 and for classification problems in Chapter 3.

Great job! Now you have brushed up on the major statistical concepts. ML and statistics go hand in hand, so kudos!

It is now time for you to dive deep into ML concepts, and we will start with different types of algorithms. There are different types of ML algorithms: supervised learning, unsupervised learning, semi-supervised learning, self-learning, feature learning, and so on. We will examine supervised learning algorithms first.

Supervised Learning Algorithms

Supervised learning is arguably the most common usage of ML. As you know, in ML, statistical algorithms are shown historical data to learn the patterns. This process is called training the algorithm. The historical data or the training data contains both the input and output variables. It contains a set of training examples which are learned by the algorithm.

During the training phase, an algorithm will generate a relationship between output variable and input variables. The goal is to generate a mathematical equation which is capable of predicting the output variable by using input variables. Output variables are also called target variables or dependent variables, while input variables are called independent variables.

To understand this process, let us take an example. Suppose we want to predict the expected price of a house in pounds based on its attributes. House attributes will be its size in sq. m, location, number of bedrooms, number of balconies, distance from the nearest airport, and so on. And for this model we have some historical data at our disposal as shown in Table 1-3. This data is the training data used to train the model.
Table 1-3

Structure of Dataset to Predict the Price of a House

Area (sq. m) | Number of bedrooms | Number of balconies | Dist. from airport (km) | Price (mn £)
100 | 2 | 0 | 20 | 1.1
200 | 3 | 1 | 60 | 0.8
300 | 4 | 1 | 25 | 2.9
400 | 4 | 2 | 5 | 4.5
500 | 5 | 2 | 60 | 2.5

If the same data is represented in a vector-space diagram, it will look like Figure 1-13. Each row, or training example, is an array or feature vector.
Figure 1-13: Representation of price and other variables in a vector-space diagram. If we have more than one variable, it can be thought of as a multidimensional vector space.

Now we start the training process. Iteratively, we will try to reach a mathematical function and try to optimize it. The goal is always to improve its accuracy in predicting the house prices.

Basically, what we want to achieve is a function “f” for the price:

price = f (size, location, bedrooms, proximity to city center, balconies ….)

The goal of our ML model is to achieve this equation. Here, price is the target variable and rest are the independent variables. In Figure 1-14, price is our target variable or y, the rest of the attributes are independent variables or x, and the red line depicts the ML equation or the mathematical function. It is also called the line of best fit.
Figure 1-14: ML equation using regression in a vector-space diagram. This equation is the line of best fit used to make the predictions for the unseen dataset.

The sole aim of this problem is to arrive at this mathematical equation. With more training data points, better and advanced algorithms, and more rigor we constantly strive to improve the accuracy of this equation. This equation is said to be the best representation of our data points. Or using this equation, we can capture the maximum randomness present in the data.

The preceding example is solved using a supervised learning algorithm called linear regression. A different supervised algorithm like a decision tree will require a different approach for the same problem.
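A minimal sketch of this, assuming scikit-learn and using the five training examples of Table 1-3, trains a linear regression model and predicts the price of a new, unseen house:

import pandas as pd
from sklearn.linear_model import LinearRegression

# The five training examples from Table 1-3
houses = pd.DataFrame({"area_sqm":     [100, 200, 300, 400, 500],
                       "bedrooms":     [2, 3, 4, 4, 5],
                       "balconies":    [0, 1, 1, 2, 2],
                       "dist_airport": [20, 60, 25, 5, 60],
                       "price_mn":     [1.1, 0.8, 2.9, 4.5, 2.5]})
X = houses.drop(columns="price_mn")   # independent variables
y = houses["price_mn"]                # target variable
model = LinearRegression().fit(X, y)  # the line (hyperplane) of best fit

new_house = pd.DataFrame({"area_sqm": [250], "bedrooms": [3],
                          "balconies": [1], "dist_airport": [30]})
print(model.predict(new_house))       # predicted price in mn £ for the unseen house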

Hence, the definition of supervised learning algorithm will be as follows: “Supervised learning algorithms create a statistical ML model to predict the value of a target variable. The input data contains both the independent and the target variable.”

The aim of the supervised learning algorithm is to reach an optimized function which is capable of predicting the output associated with new, unseen data. This new, unseen data is not a part of the training data. The function we want to optimize is called the objective function.

Supervised learning algorithms are used to solve two kinds of problems: regression and classification. Let’s discuss them now.

Regression vs. Classification Problems

Simply put, regression is used when we want to predict the value of a continuous variable while classification is used when we want to predict the value of a categorical variable. The output of a regression problem will be a continuous value.

Hence, the house prediction example quoted previously is an example of a regression problem, since we want to predict the exact house prices. Other examples are predicting the revenue of a business in the next 30 days, how many customers will make a purchase next quarter, the number of flights landing tomorrow, and how many customers will renew their insurance policies.

On the other hand, suppose we want to predict whether a customer will churn or not, whether a credit card transaction is fraudulent or not, or whether the price will increase or not: these are all examples of binary classification problems. If we want to classify between more than two classes, it is a multiclass classification problem. For example, if we want to predict the next state of a machine (running, stopped, or paused), we will solve a multiclass classification problem. The output of a classification algorithm may be a probability score. So, if we want to decide whether an incoming transaction is fraudulent or not, the classification algorithm can generate a probability score between 0 and 1 for the transaction to be called fraudulent or genuine.

There are quite a few supervised learning algorithms:
  1. Linear regression for regression problems
  2. Logistic regression for classification problems
  3. Decision tree for both regression and classification problems
  4. Random forest for both regression and classification problems
  5. Support vector machine (SVM) for both regression and classification problems

There are plenty of other algorithms too, like k-nearest neighbor, naive Bayes, LDA, and so on. Neural networks can also be used for both classification and regression tasks. We will be studying all of these algorithms in detail throughout the book. We will be developing Python solutions too.

Tip

In general business practice, we compare accuracies of four or five algorithms and select the best ML model to implement in production.

We have examined the definition of supervised learning algorithm and some examples. We will now examine the steps in a supervised learning problem. You are advised to make yourself comfortable with these steps, as we will be following them again and again throughout the book.

Steps in a Supervised Learning Algorithm

We discussed steps in an ML project earlier. Here, we will examine the steps specifically for the supervised learning algorithms. The principles of data quality, completeness, and robustness are applicable to each of the steps of a supervised problem.

To solve a supervised learning problem, we follow the steps shown in Figure 1-15. To be noted is that it is an iterative process. Many times, we uncover a few insights that prompt us to go back a step. Or we might realize that an attribute which was earlier thought useful is no longer valid. These iterations are part and parcel of ML, and supervised learning is no different.

Step 1: When we have to solve a supervised learning algorithm, we have the target variable and the independent variables with us. The definition of the target variable is central to solving the supervised learning problem. The wrong definition for it can reverse the results.

For example: if we have to create a solution to detect whether an incoming email is spam or not, the target variable in the training data can be “spam category.” If spam category is “1,” the email is spam; if it is “0,” the email is not spam. In such a case, the model’s output will be a probability score for an incoming email to be spam or not. The higher the probability score, higher the chances of the email being spam.

In this step, once we have decided the target variable, we will also determine if it is a regression or a classification problem. If it is a classification problem, we will go one level deeper to identify if it is a binary-classification or multiclass classification problem.

At the end of step 1, we have a target variable defined and a decision on whether it is regression or a classification problem.
Figure 1-15: The broad steps followed in a supervised learning algorithm, from variable definition to model selection

Step 2: In the second step, we identify the training data for our model. In this step, the best principles regarding data quality are to be adhered to. The training data consists of both the independent variables and the target variable. This training data will be sourced from all the potential data sources. It should also be representative enough to capture variations from all the time periods to ensure completeness.

Step 3: In this step, data is prepared for statistical modeling. Often, the data is unclean and contains a lot of anomalies. We find null values, NA, NaN, duplicates, and so on in the data. In the date field we might find string values, names might contain integers, and so on, and we have to clean up all the data. In this phase, the target variable is identified, and the independent variables are also known. The independent variables can be categorical or continuous variables. Similarly, the target variable will be either continuous or categorical.

This step also involves creation of new derived variables like average revenue, maximum duration, distinct number of months, and so on.

Step 4: Exploratory analysis is the next step, where initial insights are generated from the data. Distributions of independent variables, relationships with each other, correlations, scatter plots, trends, and so on are generated. We develop a good understanding of the data. Often, many new variables are created in this step too.

Step 5: Now we perform statistical modeling. From a range of supervised learning algorithms, we start by creating a model. And then subsequent algorithms are also used. The algorithms to be used are generally decided based on the problem statement and experience. The accuracy plots for different methods are generated.

While training the algorithm, these steps are followed:
  a) The entire data can be split into a 60:20:20 ratio for train, test, and validation datasets; sometimes we split into an 80:20 ratio as train and test data (see the split sketch after this list).

  b) But if the number of raw data points (for example, images) is quite high (say 1 million), a few studies suggest having 98% train, 1% test, and 1% validation datasets.

  c) All three datasets, though, should always be randomly sampled from the original raw data with no selection bias. This is imperative: if the testing or validation datasets are not a true representation of the training data, we will not be measuring the efficacy correctly.

  d) However, there can be instances wherein a sampling bias cannot be avoided. For example, if a demand forecasting solution is being modeled, we will use data from the historical time period to train the algorithm, and the time dimension will be used while creating the training and testing datasets.

  e) The training dataset is used for training the algorithm. The independent variables act as a guiding factor and the target variable is the one we try to predict.

  f) The testing dataset is used to compare the testing accuracy. Testing/validation data is not exposed to the algorithm during the training phase.

  g) We should note that testing accuracy is much more important than training accuracy. Since the algorithm should be able to generalize well on unseen data, the emphasis is on testing accuracy.

  h) There are instances where accuracy may not be the KPI we want to maximize. For example, while creating a solution to predict whether a credit card transaction is fraudulent or not, accuracy is not the target KPI: fraudulent transactions are rare, so a model that labels every transaction as genuine would still score a very high accuracy. Measures such as precision and recall are more meaningful here.

  i) During the process of model development, we iterate through various input parameters to our model, which are called hyperparameters. Hyperparameter tuning is done to achieve the best and most stable solution.

  j) The validation dataset should be exposed to the algorithm only once, after we have finalized the network/algorithm and are done with tuning.
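A minimal sketch of such a 60:20:20 split, assuming scikit-learn and synthetic data standing in for our prepared features and target:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical prepared data: 1,000 examples with 10 independent variables and a target y
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# First carve out 40% of the data, then split that half-and-half into test and validation sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
print(len(X_train), len(X_test), len(X_val))   # 600 200 200: a 60:20:20 split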

Step 6: In this step, we compare and contrast the accuracies we have generated from the various algorithms in step 5. A final solution is the output of this step. It is followed by discussion with the SMEs and then implementation in the production environment.

These are the broad steps in supervised learning algorithms. These solutions will be developed in great detail in Chapters 2, 3, and 4.

Note

Preprocessing steps like normalization are fitted on the training data only and then applied to the test/validation datasets, to avoid data leakage.

This brings us to the end of discussion on supervised learning algorithms. It is time to focus on other types of ML algorithms, and next in the queue is unsupervised learning.

Unsupervised Learning Algorithms

We know supervised learning algorithms have a target variable which we want to predict. On the other hand, unsupervised algorithms do not have any prelabeled data and hence they look for undetected patterns in the data. This is the key difference between supervised and unsupervised algorithms.

For example, the marketing team of a retailer may want to improve customer stickiness and customer lifetime value, increase the average revenue of the customer, and improve targeting through marketing campaigns. Hence, if the customers can be grouped and segmented into similar clusters, the approach can be very effective. This problem can be solved using an unsupervised learning algorithm. Unsupervised analysis can be broadly categorized into cluster analysis and dimensionality reduction techniques like principal component analysis (PCA). Let’s discuss cluster analysis first.

Cluster Analysis

The most famous unsupervised learning application is cluster analysis. Cluster analysis groups the data based on similar patterns and common attributes visible in the data. The common patterns can be the presence or absence of those similar features. The point to be noted is that we do not have any benchmark or labeled data to guide; hence the algorithm is finding the patterns. The example discussed previously is a customer segmentation use case using cluster analysis. The customers of the retailer will have attributes like revenue generated, number of invoices, distinct products bought, online/offline ratio, number of stores visited, last transaction date, and so on. When these customers are visualized in a vector-space diagram, they will look like Figure 1-16(i). After the customers have been clustered based on their similarities, the data will look like Figure 1-16(ii).
Figure 1-16: (i) Before clustering of data; (ii) after clustering of data

There are quite a few clustering algorithms available: k-means clustering, hierarchical clustering, DBSCAN, spectral clustering, and so on. The most famous and widely used clustering algorithm is k-means clustering.
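A minimal sketch of k-means clustering, assuming scikit-learn and two hypothetical customer attributes (yearly revenue and number of invoices):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers described by yearly revenue and number of invoices
customers = np.array([[200, 3], [220, 4], [250, 5],
                      [800, 20], [850, 22], [900, 25]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(customers)
print(kmeans.labels_)            # cluster assigned to each customer
print(kmeans.cluster_centers_)   # the centroid of each customer segment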

PCA

In ML and data science, we always strive to make some sense from randomness and gather insights from haphazard data sources. Recall from the supervised learning algorithm discussion and Figure 1-14 that we represented the line of best fit, that is, the mathematical equation which is able to capture the maximum randomness present in the data. This randomness is captured using the various attributes or independent variables. But imagine if you have 80 or 100 or 500 such variables. Won’t it be a tedious task? Here PCA helps us.

Refer to Figure 1-17. The two principal components, PC1 and PC2, are orthogonal to each other and capture the maximum randomness in the data. That is PCA.
Figure 1-17: Principal components, PC1 and PC2, capture the maximum randomness in the data. PCA is quite a popular dimensionality reduction technique.

So, in PCA we describe the randomness in the data by a first principal component which captures the maximum variation. The next principal component is orthogonal to the first one and captures the maximum of the remaining variation, and so on. Hence, PCA serves as a dimensionality reduction solution for us: instead of working with all the attributes, we work with a smaller number of principal components in their place.
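A minimal sketch of PCA, assuming scikit-learn and a hypothetical dataset in which two of the five attributes are strongly correlated, reduced to two principal components:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
data = rng.normal(size=(100, 5))                               # 100 rows, 5 attributes
data[:, 1] = 2 * data[:, 0] + rng.normal(scale=0.1, size=100)  # make two attributes correlated

pca = PCA(n_components=2)
reduced = pca.fit_transform(data)       # the same 100 rows, now described by 2 components
print(reduced.shape)
print(pca.explained_variance_ratio_)    # share of total variation captured by each component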

Now, we will examine the semi-supervised type of ML algorithms.

Semi-supervised Learning Algorithms

Semi-supervised algorithms can be called a combination of supervised and unsupervised learning algorithms; or, they fall between the two. When we have a small set of labeled data and a large amount of unlabeled data, semi-supervised algorithms help us in resolving the issue.

The assumption in semi-supervised learning is that data points which belong to the same cluster or group tend to have the same label. Hence, once an unsupervised algorithm like k-means clustering has produced the output clusters, the small amount of labeled data can be used to label these clusters and further improve the quality of the solution.

Semi-supervised learning algorithms are used in use cases either where labeled data is not generated or where labeling the target variable is going to be a time-consuming and costly affair. Generative models, graph-based methods, and low-density separation are some of the methods used in semi-supervised learning algorithms.

This marks the end of discussing major types of ML algorithms. There are other families of algorithms like association rule–based market basket analysis, reinforcement learning, and so on. You are advised to explore them too.

In the next section, we will go through the available list of technical tools which help us in data management, data analysis, ML, and visualizations.

Technical Stack

Tools are an integral part of data science and ML. Tools are required for data management, data mining, analysis, and building the actual ML model.

A brief list of the various utilities, languages, and tools follows:
  • Data Engineering: Spark, Hadoop, SQL, Redshift, Kafka, Java, C++

  • Data Analysis: Excel, SQL, PostgreSQL, MySQL, NoSQL, R, Python

  • ML: SAS, R, Python, Weka, SPSS, MATLAB

  • Visualizations: Tableau, PowerBI, Qlik, COGNOS

  • Cloud Services: Microsoft Azure, Amazon Web Services, Google Cloud Platform

These tools or rather a combination of them is used for the complete project—starting from data management and data mining to ML and to visualization.

Tip

All tools are good to be used and will generate similar results. The choice is often between open source and licensed or how scalable the solution is envisioned to be.

These tools act as a building block for the project. You are advised to get some level of understanding for each of the components. It will be helpful to appreciate all the facets of a data science project.

While making a choice of an ML tool, to arrive at the best solution suite, we should consider the following parameters too:
  • Ease of Deployment: how easy it is to deploy the model in the production environment

  • Scalability: whether the solution is scalable to other products and environment

  • Maintenance and Model Refresh: ease of maintaining and refreshing the model regularly

  • Speed: speed of making the predictions; sometimes, the requirement is in real-time

  • Cost (license and man hours required): what are the license cost and efforts required

  • Support Available: what type of support is available from the vendor or community; for example, MATLAB extends dedicated support since it is a licensed product, while Python, being open source, relies on community support rather than a support system like MATLAB's

You are advised to get a decent level of understanding of at least one or two tools from each of the buckets. SQL and Microsoft Excel are ubiquitous and hence are recommended. Python is a leading tool for ML and AI. With the release of deep learning frameworks like TensorFlow and Keras, Python has generated a huge base of users. And in this book too we are going to use Python only.

We are heading towards the end of the first chapter. We will discuss the reasons for ML being highly used in businesses in the next section.

ML’s Popularity

The beauty of ML algorithms lies in their capability to solve complex problems which are otherwise quite difficult for us to solve. We humans can only visualize a few dimensions simultaneously. Using ML algorithms, not only can multiple dimensions be visualized and analyzed together, trends and anomalies can be detected easily.

Using ML we can work on complex data like images and text that are quite difficult to analyze otherwise. ML and particularly deep learning allows us to create automated solutions.

Here are some factors which are playing a vital role in making ML popular:
  1) Interest by the business: Now businesses and stakeholders have renewed interest in harnessing the power of data and implementing ML. Data science departments are set up in organizations, and there are dedicated teams leading the discussions with various processes. We have also witnessed a surge in the number of startups entering the field.

  2) Computation power: The computation power now available to us is huge as compared to a few decades back, and it is cheaply available. GPUs and TPUs are making the computations faster, and we now have repositories to store terabytes of data. Cloud-based computations are making the process faster. We now have Google Colaboratory to run our code using the excellent computation available there.

  3) Data explosion: The amount of data available to us has increased exponentially. With more social media platforms and interactions and the world going online and virtual, data is being generated across domains. More and more businesses and processes are capturing time-based attributes and creating system dynamics and virtualizations to capture the data points. We now have more structured data points stored in ERP and SAP systems. More videos are uploaded to YouTube, photos are uploaded to Facebook and Instagram, and text news flows across the globe: zettabytes of data are getting generated and are ready for analysis.

  4) Technological advancements: Now we have more sophisticated statistical algorithms at our disposal. Deep learning is pushing the limits further and further. Great emphasis and effort are now put into data engineering, and with emerging technologies we are constantly evolving and improving the efficiency and accuracy of the systems.

  5) Availability of human capital: There is an increased interest in mastering data science and ML; hence the number of data analysts and data scientists is increasing.

These are some of the factors making ML one of the most sought-after emerging technologies. And indeed it is delivering fantastic results across domains and processes. A few of the important ones are listed in the next section, where we discuss the uses of ML.

Use Cases of ML

ML is finding applications across domains and businesses. We are sharing a few use cases already implemented in the industry. This is not an exhaustive list but only a few examples:
  • Banking, financial services, and insurance sector : The BFSI sector is quite advanced in implementing ML and AI. Throughout the value chain, multiple solutions have been implemented, a few of which follow:
    • Credit card fraud detection model is used to classify credit card transactions as fraudulent or genuine. Rule-based solutions do exist, but ML models strengthen the capability further. Supervised classification algorithms can be used here.

    • Cross-sell and up-sell products: this allows banks and insurance companies to increase the product ownership by the customers. Unsupervised clustering can be used to segment the customers followed by targeted marketing campaigns.

    • Improve customer lifetime value and increase the customer retention with the business. Customer propensity models can be built using supervised classification algorithms.

    • Identify potential insurance premium defaulters using supervised classification algorithms.

  • Retail : Grocery, apparel, shoes, watches, jewelry, electronic retail, and so on are utilizing data science and ML in a number of ways. A few examples follow:
    • Customer segmentation using unsupervised learning algorithms is done to improve the customer engagement. Customer’s revenue and number of transactions can be improved by using targeted and customized marketing campaigns.

    • Demand forecasting is done for better planning using supervised regression methods. Pricing models are being developed to price the goods in an intelligent manner too.

    • Inventory optimization is done using ML, leading to improved efficiency and decrement in the overall costs.

    • Customer churn propensity models predict which customers are going to churn in the next few months. Proactive action can be taken to save these customers from churn. Supervised classification algorithms help in creating the model.

    • Supply chain optimization is done using data science and ML to optimize the inventory.

  • Telecommunication : The telecom sector is no different and is very much ahead in using data science and ML. Some use cases are as follows:
    • Customer segmentation to increase the ARPU (average revenue per user) and VLR (visitor location register) using unsupervised learning algorithms

    • Customer retention is improved using customer churn propensity models using supervised classification algorithms

    • Network optimization is done using data science and ML algorithms

    • Product recommendation models are used to recommend the next best prediction and next best offer to customers based on their usage and behavior

  • Manufacturing industry : The manufacturing industry generates a lot of data in each and every process. Using data science and ML, a few use cases follow:
    • Predictive maintenance is done using supervised learning algorithms; this avoids breakdown of the machines and proactive action can be taken

    • Demand forecasting is done for better planning and resource optimization

    • Process optimization is done to identify bottlenecks and reduce overheads

    • Identification of tools and combinations of them to generate the best possible results are predicted using supervised learning algorithms

There are multiple other domains and avenues where ML is being implemented, like aviation, energy, utilities, health care, and so on. AI is opening new capabilities too: speech-to-text conversion in call centers, object detection and tracking in surveillance, image classification to identify defective pieces in manufacturing, facial recognition for security systems and crowd management, and so on.

ML is a very powerful tool and should be used judiciously. It can help us automate a lot of processes and enhance our capabilities multifold.

With that, we are coming to the end of the first chapter. Let’s proceed to the summary now!

Summary

Data is changing our approach towards the decision-making process. More and more business decisions are now data-driven. Be it marketing, supply chain, human resources, pricing, product management—there is no business process left untouched. And data science and ML are making the journey easier.

ML and AI are rapidly changing our world. Trends are closely monitored, anomalies are detected, alarms are raised, aberrations from the normal process are witnessed, and preventive actions are being taken. Preventive actions are resulting in cost saving, optimized processes, saving of time and resources, and in some instances saving life too.

It is imperative to ensure that the data available to us is of good quality; it defines the success of the model. Similarly, tools often play a vital role in the final success or failure, and organizations are expected to be flexible and open to using newer tools.

It is also necessary that due precaution is taken while we are designing the database, conceptualizing the business problem, or finalizing the team to deliver the solution. It requires a methodical process and a deep understanding of the business domain. Business process owners and SMEs should be an integral part of the team.

In this first chapter, we have examined various types of ML, data and attributes of data quality, and ML processes. With this, you have gained significant knowledge of the data science domain. We also learned about various types of ML algorithms, their respective use cases and examples, and how they can be used for solving business problems. This chapter serves as a foundation and stepping-stone for the next chapters. In the next chapter, we are going to start focusing on supervised learning algorithms and we will be starting with regression problems. The various types of regression algorithms, the mathematics, pros and cons, and Python implementation will be discussed.

You should be able to answer these questions.

Exercise Questions

Question 1: What is ML and how is it different from software engineering?

Question 2: What are the various data types available and what are the attributes of good-quality data?

Question 3: What are the types of ML algorithms available to be used?

Question 4: What is the difference between Poisson’s, binomial, and normal distributions and what are some examples of each?

Question 5: What are the steps in a supervised learning algorithm?

Question 6: Where is ML being applied?

Question 7: What are the various tools available for data engineering, data analysis, ML, and data visualizations?

Question 8: Why is ML getting so popular and what are its distinct advantages?
