© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
P. Singh, Machine Learning with PySpark, https://doi.org/10.1007/978-1-4842-7777-5_3

3. Introduction to Machine Learning

Pramod Singh (Bangalore, Karnataka, India)
 

When we are born, we are incapable of doing almost anything. We can’t even hold our head straight for a few months, but eventually we start learning. In those early days, we all fumble, make tons of mistakes, fall down, and bang our head many times, but we slowly learn to sit, walk, speak, and finally run. As a built-in mechanism, we don’t require a lot of examples to learn about the things around us. For example, just by observing two or three dogs, we can easily learn to tell a dog from a cat. We can easily differentiate between a bicycle and a car after seeing only a few of each. Even though this seems very easy and intuitive to us as human beings, for machines it can be a herculean task.

Machine learning is the mechanism through which we try to make machines learn without explicitly programming them to do so. For example, we expose the machine to a lot of photos of cats and dogs, enough for it to learn the difference between the two and predict correctly on unseen photos. The question here might be why we need so many photos to learn the difference between cats and dogs. The challenge machines face is that they cannot learn the entire pattern, or the abstract features, from just a few images. They need enough examples (each different in some way) to learn as many features as possible before they can make the right predictions, whereas we humans have an amazing ability to draw abstractions at different levels and easily recognize objects. This example is specific to image recognition, but in other applications as well, the machine needs a good amount of data to learn from before it can predict on unseen data with reasonable accuracy.

There is no denying that the world has seen significant progress in machine learning and AI applications in the last decade or so. Compared with most other technologies, ML and AI have been breaking new ground in multiple ways. Businesses such as Amazon, Apple, Google, and Facebook are thriving on these advancements in AI and are partly responsible for them as well. The research and development wings of organizations like these are pushing the limits and making incredible progress in bringing AI to everyone. Beyond these big names, thousands of start-ups specializing in AI-based products and services have emerged, and the numbers only continue to grow as I write this chapter. The adoption of ML and AI by businesses has grown rapidly over the last decade for the following core reasons:
  1. Rise in data
  2. Increased computational efficiency
  3. Improved ML algorithms
  4. Availability of data scientists

Rise in Data

The first and most prominent reason for this trend is the massive rise in data generation over the past couple of decades. Almost every device generates data these days, as do mobile apps, online shopping behavior, and server logs. So much data is being generated that there is a huge demand for people who can process and analyze it.

Increased Computational Efficiency

At the end of the day, ML and AI come down to putting a huge set of numbers together and making sense of them. Applying ML or AI requires powerful processing systems, and all of us have witnessed computational power improve at a breakneck pace. Just consider the changes of the last decade or so: mobile devices have become far more compact while their speed has increased greatly. The improvement is not only in hardware, with GPUs and TPUs providing faster processing, but also in the availability of data processing frameworks such as Spark. The combination of better processing capabilities and in-memory computation using Spark made it possible to run many ML algorithms successfully in the last decade.

Improved ML Algorithms

Over the last few years, there has been tremendous progress in the availability of new and upgraded algorithms that have not only improved prediction accuracy but also solved multiple challenges that traditional ML faced. In the first phase, that of rule-based systems, one had to define all the rules first and then design the system within that set of rules. It became increasingly difficult to control and update the rules because the environment was too dynamic. Hence, in the second phase, traditional ML came into the picture to replace rule-based systems. The challenge with this approach was that the data scientist had to spend a lot of time hand-designing the features for the model (known as feature engineering), and there was an upper threshold on prediction accuracy that these models could not exceed no matter how much the input data grew. The third phase was the introduction of deep neural networks, in which the network figures out the most important features on its own and can also outperform other ML algorithms. Apart from these, some other approaches that have created a lot of buzz over the last few years are
  1. Meta-learning
  2. Transfer learning (nanonets)
  3. Capsule networks
  4. Deep reinforcement learning
  5. Generative adversarial networks (GANs)

Availability of Data Scientists

ML and AI form a specialized field, and multiple skills are required to work in it well. To build and apply ML models, one needs a sound knowledge of math and statistics fundamentals along with a good understanding of Machine Learning algorithms and various optimization techniques. The next important skill is being comfortable enough with coding to package your code to production grade. There is huge excitement in the job market around data scientist roles, and there are a large number of openings for data scientists everywhere, especially in regions like the USA, the UK, and India.

As mentioned previously, Machine Learning has received a lot of attention in the last few years. More and more businesses want to adopt it to maintain a competitive edge. However, very few really have the right resources and the appropriate data to implement it. In this chapter, we will cover the basic types of Machine Learning and how businesses can benefit from using it. Those who are already aware of the basic concepts and applications of Machine Learning can feel free to jump directly to the next chapter.

There are tons of definitions of Machine Learning on the Internet. However, if I tried to put it in simple terms, it would look something like this:

Machine Learning is the use of statistical techniques, and sometimes advanced algorithms, to either make predictions or learn hidden patterns within data, essentially replacing rule-based systems to make data-driven systems more powerful.

Let’s go through this definition in some more detail. Machine learning, as the name suggests, makes a machine learn, but several components come into the picture when we talk about making a machine learn.

The first one is data, which is the backbone of any sort of model. Machine Learning thrives on relevant data: the more signal present in the data, the better the predictions. Machine Learning can be applied in different domains such as finance, retail, healthcare, and manufacturing. The second part is the underlying algorithm. We choose the algorithm based on the nature of the problem, since some algorithms do a better job than others in a particular context. The last part is the hardware and software aspect. The availability of open source frameworks like Spark and TensorFlow has made Machine Learning tools easily accessible to more people. Rule-based systems worked when the scenarios were limited and all the rules could be configured manually to handle each situation. However, the ways in which fraud can happen, for example, have changed dramatically over the past few years, so creating manual rules to prevent such incidents can be very difficult. Machine Learning can be leveraged in such a scenario because the model learns from the data and adapts to new data to make correct decisions.

Let’s look at the different types of machine learning and their applications. We can categorize machine learning into four major categories as shown in Figure 3-1:
  1. Supervised Machine Learning
  2. Unsupervised Machine Learning
  3. Semi-supervised Machine Learning
  4. Reinforcement Learning

Figure 3-1. Machine Learning categories

Each of the preceding categories is used for specific purposes, and the data used differs from one category to another. At the end of the day, machine learning is about learning from data (historical or real time) and making decisions (offline or real time) based on the trained model.

Supervised Machine Learning

This is the prime category of machine learning, which drives a lot of applications and value for businesses. In supervised learning, we train our models on labeled data. Labeled means that the data comes with the correct answer or outcome for each record. Let’s take an example to illustrate supervised learning. If a financial company wants to screen customers based on their profiles before accepting their loan requests, the machine learning model would be trained on historical data that contains the profiles of past customers along with a label column indicating whether each customer defaulted on a loan. The sample data looks as given in Table 3-1.
Table 3-1. Sample customer base

| Customer ID | Age | Gender | Salary | No. of Loans | Job Type | Loan Default |
|---|---|---|---|---|---|---|
| AL23 | 32 | M | 80K | 3 | Contract | Y |
| AX43 | 45 | F | 150K | 1 | Permanent | N |
| BG76 | 51 | M | 110K | 2 | Permanent | N |

As I mentioned in the earlier edition of this book, supervised ML models learn from training data that includes a label/outcome/target column and use what they learn to make predictions on unseen data. In the preceding example, columns such as Age, Gender, and Salary are known as attributes or features, whereas the last column (Loan Default) is known as the target or label, which the model tries to predict for unseen data. One complete record with all these values is known as an observation. The model requires a sufficient number of observations to be trained before it can make predictions on similar data. In supervised learning, there needs to be at least one input feature/attribute along with the output column for the model to be trained. The machine is able to learn from the training data because some of these input features, individually or in combination, have an impact on the output column (Loan Default).
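To make the vocabulary concrete, here is a minimal sketch in plain Python (not PySpark) that holds the three rows of Table 3-1 as observations and separates the features from the label. The key names such as `salary_k` and `num_loans` are made up for illustration, and leaving Customer ID out of the features is an assumption rather than something the table prescribes.

```python
# Each dictionary is one observation (one row of Table 3-1).
# Key names are hypothetical; salary is stored in thousands.
observations = [
    {"customer_id": "AL23", "age": 32, "gender": "M", "salary_k": 80,
     "num_loans": 3, "job_type": "Contract", "loan_default": "Y"},
    {"customer_id": "AX43", "age": 45, "gender": "F", "salary_k": 150,
     "num_loans": 1, "job_type": "Permanent", "loan_default": "N"},
    {"customer_id": "BG76", "age": 51, "gender": "M", "salary_k": 110,
     "num_loans": 2, "job_type": "Permanent", "loan_default": "N"},
]

LABEL = "loan_default"      # the target column the model tries to predict
ID_COLUMN = "customer_id"   # assumption: an identifier carries no signal, so drop it

# Features: every column except the label and the identifier.
features = [
    {key: value for key, value in row.items() if key not in (LABEL, ID_COLUMN)}
    for row in observations
]
# Labels: the known outcome for each observation.
labels = [row[LABEL] for row in observations]

print(features[0])
print(labels)
```

A supervised algorithm would then be fit on the pairs of `features` and `labels`; three rows are of course far too few to train anything useful.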

Many applications use a supervised learning setting, such as predicting
  1. Whether a particular customer would buy a product
  2. Whether a visitor would click an ad
  3. Whether a person would default on a loan
  4. The expected sale price of a given property
These are some of the applications of supervised learning, and there are many more. The methodology varies based on the kind of output the model is trying to predict. If the target is categorical, the problem falls under the classification category; if the target is a numerical value, it falls under the regression category. Some of the supervised ML algorithms are
  1. Linear regression (LR)
  2. Logistic regression
  3. Support vector machines
  4. Naive Bayesian classifier
  5. Decision trees
  6. Ensemble methods
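The rule of thumb stated before this list (a categorical target means classification, a numerical target means regression) can be sketched as a small helper. The function name `task_type` is hypothetical, and the check is deliberately naive: class labels encoded as numbers such as 0/1 would be reported as regression, so treat it as an illustration rather than a reliable test.

```python
def task_type(target_values):
    """Return 'regression' for purely numeric targets, else 'classification'."""
    numeric = all(
        isinstance(value, (int, float)) and not isinstance(value, bool)
        for value in target_values
    )
    return "regression" if numeric else "classification"

# Loan Default column from Table 3-1: categorical values.
print(task_type(["Y", "N", "N"]))
# Made-up property sale prices: numerical values.
print(task_type([250000.0, 410000.0, 199500.0]))
```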

Another property of supervised learning is that the model’s performance can be evaluated. An evaluation metric suited to the type of model (classification, regression, or time series) is applied to measure performance. This is mainly done by splitting the labeled data into two sets (a train set and a validation set), training the model on the train set, and testing its performance on the validation set, since we already know the right label/outcome for the validation set. We can then change the hyperparameters (covered in later chapters) or introduce new features through feature engineering to improve the performance of the model.
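The split-train-validate loop can be sketched in a few lines of plain Python. Everything here is an illustrative assumption: the data is synthetic, the 80/20 split ratio is arbitrary, and the "model" is a simple threshold placed halfway between the two class means rather than any algorithm from the list above.

```python
import random

# Synthetic labeled data (assumption): the label is 1 when the feature is 10 or more.
data = [(x, 1 if x >= 10 else 0) for x in range(20)]

rng = random.Random(42)            # fixed seed so the split is reproducible
shuffled = data[:]
rng.shuffle(shuffled)

split = len(shuffled) * 8 // 10    # 80% train set, 20% validation set
train, valid = shuffled[:split], shuffled[split:]

def fit_threshold(rows):
    """'Train' a one-feature classifier: threshold halfway between the class means."""
    zeros = [x for x, y in rows if y == 0]
    ones = [x for x, y in rows if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

threshold = fit_threshold(train)   # the model only ever sees the train set

# Evaluate on the validation set, where the true labels are known.
predictions = [1 if x >= threshold else 0 for x, _ in valid]
accuracy = sum(p == y for p, (_, y) in zip(predictions, valid)) / len(valid)
print(f"validation accuracy: {accuracy:.2f}")
```

Accuracy is only one possible metric; a regression model would use an error measure instead.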

Unsupervised Machine Learning

This is another category of machine learning that is used heavily in business applications. It differs from supervised learning in terms of the output labels. In unsupervised learning, we build models on the same sort of data as in supervised learning, except that the dataset does not contain any label or outcome column. Essentially, we apply the model to data without any right answers. The machine tries to find hidden patterns and useful signals in the data, which can later be used for other applications. The main objective is to probe the data and uncover hidden patterns and similarity structure within the dataset, as shown in Figure 3-2. One use case is to find patterns within customer data and group the customers into different clusters. The model can also identify the attributes that distinguish any two groups. From a validation perspective, there is no measure of accuracy for unsupervised learning, because there are no correct labels to compare against. The clustering done by person A can be totally different from that of person B, depending on the parameters used to build the model. There are different types of unsupervised learning:
  1. K-means clustering
  2. Mapping of nearest neighbor

Figure 3-2. Clustering
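As a rough illustration of the customer clustering use case, the following is a minimal pure-Python sketch of K-means. The six (age, salary in thousands) points are invented, and seeding the centroids with the first k points is an assumption made to keep the run deterministic; library implementations typically use smarter initialization and would normally scale the features first.

```python
import math

def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]

def kmeans(points, k, iterations=10):
    # Assumption: use the first k points as the initial centroids.
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        labels = assign(points, centroids)
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assign(points, centroids), centroids

# Hypothetical customers as (age, salary in thousands); no label column exists.
customers = [(25, 30), (27, 35), (30, 32), (52, 120), (55, 130), (58, 125)]
labels, centroids = kmeans(customers, k=2)
print(labels)
print(centroids)
```

Note that nothing in the data says which grouping is "right"; choosing a different k or different starting centroids can produce a different clustering.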

Semi-supervised Learning

As the name suggests, semi-supervised learning lies somewhere between supervised and unsupervised learning. In fact, it uses both techniques. This type of learning is mainly relevant when we are dealing with a mixed dataset that contains both labeled and unlabeled data. Sometimes the data is completely unlabeled, and we label some part of it manually. The idea of semi-supervised learning is to use this small portion of labeled data to train a model and then use that model to label the remaining data, which can then be used for other purposes. This is also known as pseudo-labeling, as it labels the unlabeled data using the predictions made by the supervised model. As a simple example, let’s say we have a lot of images of different brands from social media, and most of them are unlabeled. Using semi-supervised learning, we can label some of these images manually and train our model on the labeled images. We then use the model’s predictions to label the remaining images, turning the unlabeled data into fully labeled data.

The next step in semi-supervised learning is to retrain the model on the entire labeled dataset. The advantage is that the model is now trained on a bigger dataset than before, which makes it more robust and better at predictions. The other advantage is that semi-supervised learning saves much of the effort and time that would otherwise go into manually labeling the data. The flip side is that it is difficult to get highly accurate pseudo-labels, because the predictions come from a model trained on only a small amount of labeled data. However, it is often still a better option than labeling all the data manually, which can be both expensive and time consuming. This is how semi-supervised learning uses both supervised and unsupervised learning to generate labeled data. Businesses that face high costs in labeling training data usually go for semi-supervised learning.
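The pseudo-labeling steps described above (train on the small labeled part, label the rest with the model’s predictions, then retrain on everything) can be sketched as follows. This is a toy under stated assumptions: each image is reduced to a single made-up numeric feature, the brand names are placeholders, and the model is a 1-nearest-neighbor lookup, for which "retraining" simply means using the enlarged labeled set.

```python
def nearest_label(x, labeled):
    """1-nearest-neighbor prediction on a single numeric feature."""
    return min(labeled, key=lambda pair: abs(pair[0] - x))[1]

# Small manually labeled portion (feature value, brand); values are invented.
labeled = [(1.0, "brand_a"), (2.0, "brand_a"), (9.0, "brand_b"), (10.0, "brand_b")]
# Larger unlabeled portion.
unlabeled = [1.5, 2.5, 8.5, 9.5]

# Step 1: the model trained on the labeled portion predicts the unlabeled data.
pseudo_labeled = [(x, nearest_label(x, labeled)) for x in unlabeled]

# Step 2: combine manual labels and pseudo-labels into one bigger training set.
combined = labeled + pseudo_labeled

# Step 3: the "retrained" model (here, the enlarged lookup set) serves new data.
prediction = nearest_label(6.5, combined)
print(pseudo_labeled)
print(prediction)
```

Any mistakes the first model makes are baked into the pseudo-labels, which is the accuracy risk the text mentions.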

Reinforcement Learning

This is the fourth kind of learning and is a little different in terms of data usage and predictions. Reinforcement learning is a big research area in itself, and an entire book could be written on it alone. The main difference is that the other kinds of learning need data, mainly historical data, to train the models, whereas reinforcement learning works on a reward system, as shown in Figure 3-3. It is primarily about decision-making: the agent takes actions that change its state in order to maximize its rewards. Let’s break this down into its individual elements.

Figure 3-3. Reinforcement learning

  1. Autonomous Agent: This is the main character in the learning process and is responsible for taking actions. In a game, the agent makes the moves needed to finish or reach the end goal.
  2. Actions: These are the possible steps the agent can take to move forward in the task. Each action has some effect on the state of the agent and can result in either a reward or a penalty. For example, in a game of tennis, the actions might be to serve, return, move left or right, etc.
  3. Reward: This is the key to making progress in reinforcement learning. The agent takes actions that can result in either a reward or a penalty. It is an instant feedback mechanism, which differentiates reinforcement learning from traditional supervised and unsupervised learning techniques.
  4. Environment: This is the territory in which the agent gets to play. The environment decides whether the actions the agent takes result in a reward or a penalty.
  5. State: The position the agent is in at any given point in time defines its state. To move forward or reach the end goal, the agent has to keep changing states in the positive direction to maximize the rewards.
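The five elements above can be mapped onto a few lines of plain Python. The corridor environment below is invented for illustration, and the learning rule is the tabular Q-learning update, which the chapter does not name; it is used here only as one common way to learn from rewards. To keep the run reproducible, the sketch sweeps over every state-action pair instead of exploring randomly, which is a simplifying assumption.

```python
# Hypothetical environment: a corridor of states 0..4, where state 4 is the goal.
GOAL = 4
ACTIONS = {"left": -1, "right": +1}   # the actions available to the agent

def step(state, action):
    """Environment: returns (next_state, reward) for the agent's action."""
    next_state = min(max(state + ACTIONS[action], 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

alpha, gamma = 0.5, 0.9   # learning rate and discount factor (illustrative values)

# Q-table: the agent's estimate of future reward for each state-action pair.
q = {state: {action: 0.0 for action in ACTIONS} for state in range(GOAL + 1)}

# Assumption: deterministic sweeps over all state-action pairs, no random exploration.
for _ in range(50):
    for state in range(GOAL):         # the goal state is terminal
        for action in ACTIONS:
            next_state, reward = step(state, action)
            target = reward + gamma * max(q[next_state].values())
            q[state][action] += alpha * (target - q[state][action])

# Greedy policy: in each state, pick the action with the highest estimated value.
policy = {state: max(q[state], key=q[state].get) for state in range(GOAL)}
print(policy)
```

Here the reward arrives only at the goal, yet the discounting lets the agent learn which direction to move from every earlier state.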

The unique thing about reinforcement learning is its immediate feedback mechanism, which drives the next behavior of the agent based on the reward system. Most applications of reinforcement learning are in navigation, robotics, and gaming. However, it can also be used to build recommender systems. Now that we have a basic understanding of Machine Learning, let’s look at applications of ML in various domains.

Industrial Application and Challenges

In the final section of this chapter, we will go through some of the real applications of ML and AI. Businesses across the globe are investing heavily in ML and AI and establishing standard procedures to leverage these capabilities and build their competitive edge. ML and AI are currently being applied in multiple areas and are providing great value for businesses. We will look at a few of the major domains where ML and AI are transforming the landscape.

Retail

One of the business verticals making incredible use of ML and AI is retail. Since the retail business generates a lot of customer data, it offers a perfect platform for applying ML and AI. The retail sector has always faced multiple challenges such as stock-outs, suboptimal pricing, limited cross-sell or upsell, and inadequate personalization. ML and AI have been able to address many of these challenges and have had an incredible impact in the retail space. Numerous retail applications powered by ML and AI have been built in the last decade, and their number continues to grow at a breakneck pace. The most prominent application is the recommender system. Online retail businesses thrive on recommender systems, as these systems can increase their revenue by a great deal. Apart from recommender systems, retail uses ML and AI for demand forecasting and stock optimization to control inventory levels and reduce costs. Dynamic pricing is another area where AI and ML are used extensively to maximize returns. Customer segmentation is also done using ML, which considers not only customers’ demographic information but also transactional data and multiple other variables before revealing the different groups within the customer base. Product categorization is done using ML as well, as it saves huge manual effort and increases the accuracy of product labeling. Route planning has also been handled by ML and AI in the last few years, as it enables businesses to fulfill orders more effectively. As a result of ML and AI applications in retail, cost savings have improved, businesses are able to make informed decisions, and overall customer satisfaction has gone up.

Healthcare

Another business vertical deeply impacted by ML and AI is healthcare. Diagnosis based on image data using ML and AI is being adopted quickly across the healthcare spectrum. The prime reason is the level of accuracy offered by ML and AI and their ability to learn from decades of past data. ML and AI algorithms are heavily used on X-rays, MRI scans, and various other images in the healthcare domain to detect anomalies. Virtual assistants and chatbots are also being deployed as part of applications to help explain lab reports. Finally, insurance verification in healthcare is also being done using ML models to avoid inconsistencies.

Finance

The finance domain has always had data, and lots of it. Of all the domains, finance has always been among the richest in data. Hence, multiple applications based on ML and AI have been built over the last decade. The most prominent one is the fraud detection system, which uses anomaly detection algorithms in the background. Other areas are portfolio management and algorithmic trading. ML and AI have the ability to scan over 100 years of past data and learn the hidden patterns in order to suggest the best calibration for a portfolio. Complex AI systems are being used to make extremely fast trading decisions to maximize gains. ML and AI are also used in risk mitigation and loan insurance underwriting. Again, recommender systems are used by various institutions to upsell and cross-sell financial products. Institutions also use ML to predict churn in the customer base in order to formulate a strategy for retaining the customers who are likely to discontinue a specific product or service. Another important use of ML and AI in the finance sector is deciding, based on the model’s predictions, whether a loan should be granted to an applicant. Apart from that, ML model predictions are being used to validate whether insurance claims are genuine or fraudulent.

Travel and Hospitality

Just like retail, the travel and hospitality domain is thriving on ML- and AI-based applications. To name a few, recommender systems, price forecasting, and virtual assistants are all ML- and AI-based applications being leveraged in the travel and hospitality verticals. From recommending the best deals to suggesting alternative travel dates, recommender systems are critical to driving customer behavior in this sector. They also recommend new travel destinations based on user preferences, which are highly tailored using ML in the background. AI is also being used to send timely alerts to customers by predicting future price movements based on various factors. Virtual assistants are now part of almost every travel website, as customers don’t want to wait for relevant information. On top of that, interactions with these virtual assistants are very humanlike, as natural language understanding is already embedded into these chatbots to a great extent, allowing them to understand simple questions and reply in a similar manner.

Media and Marketing

Every business depends more or less on marketing to get more customers, and reaching out to the right customer has always been a big challenge. Thanks to ML and AI, that problem is now better handled, as the technology can anticipate customer behavior to a great extent. ML- and AI-based applications are being used to differentiate between prospects who are more likely to buy or subscribe to an offer or product and casual candidates. They are also being used to provide fully personalized offers in order to convert or retain customers. A churn predictor is again used heavily to identify the group of consumers who are likely to discontinue using a particular product or service. Advanced customer segmentation for hypertargeting is being done using ML and AI. Finally, a lot of marketing content is now being generated using ML and AI in order to send out the best-performing content.

Manufacturing and Automobile

The manufacturing domain has not been able to escape the wave of ML and AI either. The most prominent application is predictive maintenance, as ML- and AI-based applications can help prevent potential damage by predicting the need for maintenance in advance based on earlier data. Automobile companies use telematics data to learn the driving patterns of their customers and act more promptly to help them in many ways. They also use web data to understand their customers better and personalize the experience for seamless navigation during the online journey.

Social Media

Most people, the young generation in particular, spend a great deal of time on social media without realizing that many of these applications use ML and AI. Facebook, YouTube, LinkedIn, Twitter, and other similar apps use ML heavily in providing the experience. From photo auto-tag suggestions to friend recommendations, everything is driven by ML and AI. They are also used to generate subtitles and language translations on platforms such as YouTube. Various search engines and voice assistants also rely on a good amount of ML.

Others

There are many other applications where ML and AI are used. For example, email spam filtering nowadays uses ML instead of a rule-based system. One advantage the ML approach offers over the traditional rule-based system is that it automatically updates itself based on new emails to keep distinguishing spam from legitimate mail. Another area is the oil and gas industry, where ML and AI help in analyzing underground minerals and finding alternative energy sources. They are also being used in transportation, as they can predict likely traffic conditions and alert you in advance.

Conclusion

In this chapter, we went over the fundamentals of Machine Learning and also covered the different applications of Machine Learning along with its existing challenges.
