Classification

Classification is the process of identifying the category to which a new observation belongs, based on a set of observations whose category membership is known in advance. It is a technique for determining which class a dependent variable belongs to based on one or more independent variables. The output variable in a classification problem is a discrete group or category.

Some examples include credit scoring (differentiating between high-risk and low-risk applicants based on earnings and savings), medical diagnosis (predicting the risk of disease), web advertising (predicting whether a user will click on an advertisement), and more.

The ability of the classification model can be determined by using model evaluation procedures and model evaluation metrics.

Model evaluation procedures

Model evaluation procedures help you estimate how well a model will generalize beyond the sample data it was trained on:

  • Training and testing the data: The training data is used to fit the model's parameters. The testing data is a held-out dataset on which predictions are made to assess the model.

  • Train and test split: When the data is separated, most of it is typically used for training, while a smaller portion is reserved for testing.

  • K-fold cross-validation: K train and test splits are created, and the resulting scores are averaged together. The process runs k times slower than a single train and test split.
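The procedures above can be sketched with scikit-learn; the Iris dataset and logistic regression model used here are assumptions for illustration, and any labelled dataset and classifier would work the same way:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Train/test split: hold out 25% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))

# 5-fold cross-validation: five train/test splits, scores averaged.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("5-fold mean accuracy:", scores.mean())
```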

Model evaluation metrics

Model evaluation metrics are employed to quantify the performance of the model. The following concepts and algorithms can be used to measure and build a classification predictive model:

  • Confusion matrix: For a binary classifier, this is a 2 x 2 matrix, also known as an error matrix. It helps picture the performance of an algorithm (typically a supervised learning one) through measures such as classification accuracy, classification error, sensitivity, and precision. The choice of metric depends on the business objective; hence, it is necessary to identify whether false positives or false negatives should be reduced based on the requirements.
  • Logistic regression: Logistic regression is a statistical model that aids in analyzing a dataset in which one or more independent variables determine the output. The output is measured with a dichotomous variable (one involving only two possible outcomes). The aim of logistic regression is to find the best-fitting model to describe the relationship between the dichotomous dependent variable and a set of independent variables (predictors). Hence, it is also known as a predictive learning model.
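As a sketch of both ideas, a logistic regression classifier can be fitted and its 2 x 2 confusion matrix inspected with scikit-learn; the breast-cancer dataset here is an assumption chosen because it has two classes:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# 2 x 2 error matrix: rows are true classes, columns are predictions.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
print("accuracy:", accuracy_score(y_test, y_pred))
```

Whether to tune the classifier toward fewer false positives or fewer false negatives depends, as noted above, on the business objective.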

  • Naive Bayes: This works on the concept of conditional probability, as given by Bayes' theorem. Bayes' theorem calculates the conditional probability of an event based on prior knowledge that might be related to the event. This approach is widely used in face recognition, medical diagnosis, news classification, and more. The Naive Bayes classifier is based on Bayes' theorem, where the conditional probability of A given B can be calculated as follows:

P(A | B) = (P(B | A) * P(A)) / P(B)

where:
P(A | B) = conditional probability of A given B
P(B | A) = conditional probability of B given A
P(A) = probability of occurrence of event A
P(B) = probability of occurrence of event B
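A worked example of the formula, using hypothetical numbers for a medical test (the prevalence and error rates below are invented for illustration):

```python
# Hypothetical numbers: a disease affects 1% of patients (P(A)),
# a test detects it 99% of the time (P(B | A)), and it gives a
# false positive 5% of the time (P(B | not A)).
p_disease = 0.01            # P(A)
p_pos_given_disease = 0.99  # P(B | A)
p_pos_given_healthy = 0.05  # P(B | not A)

# Total probability of a positive test, P(B).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # → 0.167
```

Even with an accurate test, a positive result only implies about a 17% chance of disease here, because the prior P(A) is so small.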
  • Decision tree: A decision tree is a type of supervised learning model where the final outcome can be viewed in the form of a tree. A decision tree consists of a root node, decision nodes, and leaf nodes. A decision node has two or more branches, whereas a leaf node represents a classification or decision. The decision tree breaks the dataset down into smaller and smaller subsets, thus incrementally developing the associated tree. It is simple to understand and can easily handle both categorical and numerical data.
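A minimal sketch of a decision tree, again assuming scikit-learn and the Iris dataset; `export_text` prints the tree so the root, decision nodes, and leaves described above are visible:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Limit the depth so the printed tree stays small and readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Each inner line is a decision node (a feature test); each
# "class: ..." line is a leaf node holding the classification.
print(export_text(tree))
```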

  • Random forest algorithm: This is a supervised ML algorithm that is easy to use and provides good results, even without hyperparameter tuning. Due to its simplicity, it can be used for both regression and classification tasks. It can handle large datasets and maintains accuracy even when a portion of the data is missing. This algorithm is also considered better at classification-related tasks than at regression.
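A sketch of a random forest with default hyperparameters, illustrating the point that it performs well without tuning (scikit-learn and the Iris dataset are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# Default hyperparameters: an ensemble of 100 decision trees,
# each trained on a bootstrap sample of the training data.
forest = RandomForestClassifier(random_state=1)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```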

  • Neural network: Although we already have linear and classification algorithms, a neural network is the state-of-the-art technique for many ML problems. A neural network comprises units, namely neurons, which are arranged into layers. They are responsible for converting an input vector into some output. Each unit takes an input, applies a function to it, and passes the output to the next layer. Usually, the functions applied are nonlinear.
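The layered units and nonlinear functions can be sketched with scikit-learn's `MLPClassifier`; the digits dataset, single 32-unit hidden layer, and ReLU activation are assumptions for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# One hidden layer of 32 neurons; each applies the nonlinear ReLU
# function to its input before passing the result to the next layer.
net = MLPClassifier(hidden_layer_sizes=(32,), activation="relu",
                    max_iter=1000, random_state=0)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```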

  • Support Vector Machine (SVM) algorithm: The SVM learning algorithm is a supervised ML model. It is used for both classification and regression analysis, and its training is widely formulated as a constrained optimization problem. SVM can be made more powerful using the kernel trick (linear, radial basis function, polynomial, and sigmoid kernels). However, a limitation of the SVM approach lies in the selection of the kernel.
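A sketch comparing the four kernels named above via cross-validation, which makes the kernel-selection problem concrete (scikit-learn and the Iris dataset are assumptions; scores will vary by dataset):

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# The kernel is the main tuning decision for an SVM; compare the
# mean 5-fold cross-validation accuracy of each kernel.
results = {}
for kernel in ("linear", "rbf", "poly", "sigmoid"):
    results[kernel] = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
    print(kernel, round(results[kernel], 3))
```

On this dataset the linear and RBF kernels do well, while the sigmoid kernel performs poorly without further tuning, which is exactly why kernel choice is cited as the method's main limitation.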
