Understanding supervised and unsupervised ML

In supervised learning, the algorithm is given a set of training examples where both the data and the target are known. It can then predict the target value for new data containing the same attributes. Supervised algorithms require human intervention and validation, for example, in photo classification and tagging.

In unsupervised learning, the algorithm is provided with massive amounts of data, and it must find patterns and relationships between the data. It can then draw inferences from datasets.

In unsupervised learning, human intervention is not required, for example, in the auto-classification of documents based on context. It addresses problems where the correct output is not available for the training examples, and the algorithm must find patterns in the data, typically using clustering.

Reinforcement learning is another category where you don't tell the algorithm what action is correct, but give it a reward or penalty after each action in a sequence, for example, learning how to play soccer.

Following are the popular ML algorithm types used for supervised learning:

  • Linear regression: Let's use the price of houses as a simple example to explain linear regression. Say we have collected a bunch of data points that represent the prices of houses and the sizes of the houses on the market, and we plot them on a two-dimensional graph. Now we try to find a line that best fits these data points and use it to predict the price of a house of a new size.
  • Logistic regression: Estimates the probability of the input belonging to one of two classes (positive or negative).
  • Neural networks: In a neural network, the ML model acts like the human brain where layers of nodes are connected to each other. Each node is one multivariate linear function, with a univariate nonlinear transformation. The neural network can represent any non-linear function and address problems that are generally hard to interpret, such as image recognition. Neural networks are expensive to train but fast to predict.
  • K-nearest neighbors: Choose a number k of neighbors. To classify a new observation, find its k nearest neighbors in the training data and assign the class label by majority vote among them. For example, with k set to five, the new observation receives the label held by the majority of its five closest training examples.
  • Support Vector Machines (SVM): SVMs are a popular approach in research, but less so in industry. An SVM maximizes the margin, the distance between the decision boundary (hyperplane) and the support vectors (the training examples closest to the boundary). SVM is not memory efficient because it stores the support vectors, which grow with the size of the training data.
  • Decision trees: In a decision tree, nodes are split based on features to have the most significant Information Gain (IG) between the parent node and its split nodes. The decision tree is easy to interpret and flexible; not many feature transformations are required.
  • Random forests and ensemble methods: Random forest is an ensemble method where multiple models are trained and their results combined, usually via majority vote or averaging. A random forest is a set of decision trees. Each tree learns from a different, randomly sampled subset of the training data, and each tree also considers only a randomly selected subset of the original features. Random forest increases diversity through this random selection of training data and features for each tree, and it reduces variance through averaging.
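The house-price example above can be sketched in a few lines of plain Python. This is a minimal illustration of one-dimensional least-squares linear regression; the sizes and prices below are made-up numbers, not data from the text:

```python
# Simple 1D linear regression via least squares (illustrative sketch).
# Sizes are in square meters, prices in thousands; both are made-up values.
sizes = [70.0, 85.0, 100.0, 120.0, 150.0]
prices = [210.0, 250.0, 295.0, 355.0, 450.0]

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# Best-fit line: slope = covariance(x, y) / variance(x);
# the intercept makes the line pass through the point of means.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) \
        / sum((x - mean_x) ** 2 for x in sizes)
intercept = mean_y - slope * mean_x

def predict(size):
    """Predict the price of a house of a new size using the fitted line."""
    return intercept + slope * size

print(round(predict(110.0), 1))
```

In practice you would use a library implementation (for example, scikit-learn's `LinearRegression`) rather than the closed-form arithmetic above, but the idea is the same: fit a line to the observed points, then use it to predict the target for a new input.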

K-means clustering uses unsupervised learning to find data patterns. K-means iteratively separates data into k clusters by minimizing the sum of distances to the center of the closest cluster. It first assigns each instance to the nearest center and then re-computes each center from assigned instances. Users must determine or provide the k number of clusters.
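The two alternating steps described above, assign each instance to its nearest center and then re-compute each center from its assigned instances, can be sketched in plain Python. This is a toy one-dimensional version with made-up data; the user still supplies k:

```python
# Minimal 1D k-means sketch (toy data; k is chosen by the user).
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)       # random initial cluster centers
    for _ in range(iters):
        # Assignment step: each point joins the nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.8, 10.0, 10.4]
print(kmeans(data, k=2))   # two centers, near 1.0 and 10.07
```

Library versions such as scikit-learn's `KMeans` work on multi-dimensional data and use smarter initialization (k-means++), but they iterate the same assignment and update steps.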

Zeppelin and Jupyter are the most common environments for data engineers doing data discovery, cleansing, enrichment, labeling, and preparation for ML model training. Spark provides the Spark ML library, containing implementations of many common high-level estimator algorithms such as regressions, PageRank, k-means, and more.

For algorithms that leverage neural networks, data scientists use frameworks such as TensorFlow, MXNet, and PyTorch, or higher-level abstractions such as Keras and Gluon. Those frameworks and common algorithms can be found in the Amazon SageMaker service, which provides a full ML model development, training, and hosting environment.

Data scientists leverage the managed Jupyter environment to do data preparation, set up a model training cluster with a few configuration settings, and start their training job. When it completes, they can one-click-deploy the model and begin serving inferences over HTTP. ML model development these days is performed almost exclusively on file-based datasets, so querying the Amazon S3 data lake directly is a natural fit for these activities.

Machine learning is a vast topic that warrants a book of its own; this section has given only a brief overview of common ML models.
