Srinivasa Manikant Upadhyayula and Kannan Venkataramanan
CRISIL Global Research & Analytics, CRISIL (A S&P Company), CRISIL House, Central Avenue, Hiranandani Business Park, Powai, Mumbai, 400 076, India
Deep learning allows building quantitative models composed of multiple processing layers that learn representations of data with multiple levels of abstraction [1]. Neural networks were originally developed to understand the basic functioning of the human brain and the central nervous system. Later, models designed to capture the working of the human nervous system were applied to financial services domains such as link analysis of payments, fraud detection in customer transactions, and detection of anomalous transactions that indicate potential money laundering.
A neural network works in the same way as neurons in the human nervous system. The fundamental premise of this learning technique is that a large number of highly organized and connected neurons work in harmony to solve a specific problem, such as pattern recognition or data classification. Neural networks are not a recent phenomenon; they predate modern computers. The field began with the work of McCulloch and Pitts [2], who created a theoretical representation of neural networks by combining a model of the human nervous system with mathematics (calculus and linear algebra). McCulloch–Pitts networks (often referred to as MP networks) represent a finite state automaton embodying the logic of propositions, with quantifiers, in the form of computer programs [3].
With the advent of parallel distributed processing in the mid-1980s, Rumelhart, McClelland, and coworkers [4] applied its concepts to neural networks. Their work signaled the dawn of applying advanced neural network techniques in the domain of medical research. Qian and Sejnowski [5] presented a novel method for predicting the secondary structure of globular proteins based on nonlinear neural network models; the average accuracy of their model on a testing set of proteins nonhomologous with the training set was 64.3%. Kneller et al. [6] applied neural networks to predict the mapping between protein sequence and secondary structure; by adding neural network units that detect periodicities in the input sequence and using tertiary structural class, they achieved an accuracy of 79% for predicting the class of all-α proteins. Rost and Sander [7] used evolutionary information contained in multiple sequence alignments as inputs to neural networks and predicted secondary structure with significant accuracy; their model demonstrated an overall accuracy of 71.6% in a multiple cross-validation test on 126 unique protein chains.
In recent years, the applications of neural networks have increased exponentially across domains. Courbariaux et al. [8] introduced a method to train binarized neural networks (BNNs), neural networks with binary weights and activations at run-time. BNNs drastically reduce memory size and accesses and replace most arithmetic operations with bit-wise operations, which is expected to substantially improve power efficiency. Silver et al. [9] developed a new approach to computer Go in which deep neural networks are trained by a novel combination of supervised learning from human expert games and reinforcement learning from games of self-play. These neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play; their program AlphaGo achieved a 99.8% winning rate against other Go programs. Esteva et al. [10] applied deep convolutional neural networks (CNNs) to classify skin lesions, trained end-to-end directly from a dataset of 129 450 clinical images using only pixels and disease labels as inputs. Their CNN achieved performance on par with all tested experts, demonstrating an artificial intelligence capable of classifying skin cancer with a level of competence comparable to dermatologists.
An artificial neural network (ANN) is a computational model based on the structure and functioning of biological neural networks; such models are built as an interconnected network of neurons with one or more hidden layers for processing data from input variables.
An ANN consists of an input layer, hidden layer(s), and an output layer. The input layer is the set of neurons that represent the independent features (or input variables). These data points are passed to the hidden layer. In a hidden layer with "n" neurons, each input to a neuron is assigned a weight, and the inputs are multiplied by their designated weights, thereby transforming the input parameters and segregating the data to obtain the desired output. The weighted combination of inputs is then passed to the activation function, which determines the processing of the different neurons in the hidden layer and passes the results to the output layer. The output layer collects these results and produces the final output (Figure 4.1).
A single neuron in the hidden layer obtains information from the set of independent variables in the input data (numbered x1 to xn). Each input variable is assigned a weight wi (where i takes values from 1 to n).
Mathematically, a single neuron is represented as

$$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + \varepsilon_0$$

In simple terms, it can be written as

$$z = \sum_{i=1}^{n} w_i x_i + \varepsilon_0$$

where xi represents the inputs from the independent variables,
wi represents the weights associated with these independent variables, and
ε0 represents the error or bias associated with the neuron in the hidden layer (Figure 4.2).
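As an illustrative sketch (not part of the original text), this weighted sum can be computed in a few lines of NumPy; the input values, weights, and bias below are hypothetical:

```python
import numpy as np

# A single hidden-layer neuron: weighted sum of inputs plus a bias term.
def neuron(x, w, bias):
    """Computes z = sum_i w_i * x_i + bias, as defined above."""
    return np.dot(w, x) + bias

x = np.array([0.5, 1.2, -0.3])   # inputs x1..x3 (hypothetical)
w = np.array([0.4, -0.6, 0.9])   # weights w1..w3 (hypothetical)
print(neuron(x, w, bias=0.1))    # 0.2 - 0.72 - 0.27 + 0.1 = -0.69
```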
For a single hidden layer ANN with "N" neurons, the neurons in the hidden layer are represented as

$$z_j = \sum_{i=1}^{n} w_{ij} x_i + \varepsilon_j, \qquad j = 1, 2, \dots, N$$

The entire single hidden layer with "N" neurons and activation function g can be represented as

$$Z_j = g(z_j) = g\left(\sum_{i=1}^{n} w_{ij} x_i + \varepsilon_j\right)$$

where xi represents the inputs from the independent variables,
wij represents the weights associated with these independent variables, and
εj represents the error or bias associated with each neuron in the hidden layer.
The output y is represented as an activation function of Z and depends on the activation function chosen for the neural network.
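As a minimal sketch (with hypothetical dimensions and random values, and assuming a sigmoid activation), the hidden-layer computation can be vectorized as follows:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_layer(x, W, b, activation=sigmoid):
    # W has shape (N, n): one row of weights w_ij per hidden neuron j;
    # b holds the per-neuron bias terms epsilon_j.
    return activation(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # n = 4 input features
W = rng.normal(size=(3, 4))     # N = 3 hidden neurons
b = rng.normal(size=3)
print(hidden_layer(x, W, b))    # three activated neuron outputs in (0, 1)
```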
An activation function is the computational logic of a neural network that takes into account both the input variables and their corresponding weights to determine the impact of the variables on the desired output and to segregate the relevant input information needed to process a particular neuron. Also known as a transfer function, the activation function determines the output for a given set of inputs.
In terms of mathematical representation, an activation function can be either a linear or a nonlinear function. For a linear activation function, the output is linear in nature and lies within the range (−∞, ∞). In simple terms, it can be represented as

$$g(z) = c\,z$$

where c is a constant.
The identity activation function, g(z) = z, is the simplest and most commonly used linear activation function in regression problems. The identity function is monotonic, and the derivative of the identity function, g′(z), is 1.
However, this kind of activation function is not suitable when the input data are complex or follow nonlinear patterns. In such cases, a nonlinear activation function is used. It is extremely useful when the data follow a parabolic or exponential curve, and it makes it easier for the model to adapt to a variety of data points for an input variable.
Nonlinear activation functions are typically chosen to be differentiable, so that their slope can be computed for gradient-based training, and monotonic. Some of the commonly used nonlinear activation functions are the sigmoid function, the tanh function, and the rectified linear unit (ReLU) function.
A sigmoid function has a characteristic "S"-shaped curve and is considered a special case of the logistic function. Mathematically, it is represented as

$$g(z) = \frac{1}{1 + e^{-z}}$$
Graphically, it is presented in Figure 4.3.
For the sigmoid activation function, the output is nonlinear and lies within the range (0, 1). Because most problems that involve neural networks are classification problems (especially binary classification in areas such as predicting default on a loan or identifying whether a transaction is fraudulent), this function is well suited to predicting the probability of the output. The function is differentiable and monotonic, and the derivative of the sigmoid function g′(z) is

$$g'(z) = g(z)\,(1 - g(z))$$
Also, the derivative of this function follows a bell-shaped curve resembling a normal distribution, a convenient form for calculating the gradients used in neural networks: the gradients for a layer can be estimated using simple subtraction and multiplication. The sigmoid is also used as an activation function to bring nonlinearity into the model. For example, Gershenfeld et al. [11] used the logistic sigmoid function as an activation function to keep the response of the neural network bounded:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
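As a minimal sketch (not the chapter's implementation), the sigmoid and its derivative translate directly into code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)       # g'(z) = g(z) * (1 - g(z))

print(sigmoid(0.0), sigmoid_derivative(0.0))  # 0.5, 0.25 (peak of the bell curve)
```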
A sigmoid-based activation function has limitations during training of neural networks. When highly negative inputs are fed to the logistic sigmoid function, the output is almost zero. This hampers the calculation of gradient parameters in feedforward neural networks: when there are large numbers of neurons in the hidden layers, the gradients can become vanishingly small and training can stall.
In such cases, an alternative to the sigmoid activation function is the hyperbolic tangent function (tanh). The points (cosh θ, sinh θ) trace one branch of an equilateral hyperbola, and the hyperbolic functions can be defined from the sides of a right-angled triangle covering the hyperbolic sector. Mathematically, the hyperbolic tangent is represented as

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
Graphically, the function is presented in Figure 4.4.
For the tanh activation function, the output curve is similar to that of the sigmoid function ("S"-shaped) but lies within the range (−1, 1). Like the sigmoid function, this function is differentiable and monotonic, and the derivative of the hyperbolic tangent function g′(z) is

$$g'(z) = 1 - \tanh^2(z)$$
Similar to the sigmoid function, the tanh activation function is applied in feedforward neural networks for classification and prediction problems.
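A corresponding sketch for tanh, relying on NumPy's built-in implementation:

```python
import numpy as np

def tanh(z):
    return np.tanh(z)

def tanh_derivative(z):
    return 1.0 - np.tanh(z) ** 2   # g'(z) = 1 - tanh^2(z)

print(tanh(0.0), tanh_derivative(0.0))  # 0.0, 1.0
```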
The ReLU function is the most widely used activation function in all forms of neural networks, including CNNs [12]. Also known as the ramp function, the ReLU function clips all negative input values to zero, while it behaves as the linear (identity) function for all positive input values.
Deep networks with ReLU as the activation function are easier to optimize and train than network models with sigmoid- or tanh-based activation functions because gradients flow easily even through multiple hidden layers with large numbers of neurons. This has made ReLU a popular activation function, with notable applications in speech recognition.
Mathematically, the rectifier in the activation function is defined as

$$f(x) = \max(0, x)$$

where x is the input to the neuron.
In simple terms, this function is represented as

$$f(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$
Graphically, the function is presented in Figure 4.5.
This function does not produce negative outputs for negative input values. Unlike the two functions above, sigmoid and tanh, both the ReLU function and its derivative are monotonic, and the output lies in the range [0, ∞) for all input values. The derivative of the ReLU function is

$$f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \end{cases}$$

(the derivative is undefined at x = 0).
However, this function maps all negative inputs to zero, so only the positive inputs contribute to training the model. The information carried by negative inputs is therefore lost, which can significantly affect the predictive power and accuracy of the model. Several variants of ReLU address this limitation, as described below.
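A minimal sketch of the ReLU and its derivative; setting the (undefined) derivative at x = 0 to zero is a common convention, assumed here:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 1 for x > 0 and 0 for x < 0; the undefined value at x = 0
    # is conventionally set to 0 here.
    return (x > 0).astype(float)

print(relu(np.array([-2.0, 3.0])))  # [0. 3.]
```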
In the first variant of ReLU, the softplus function, the rectifier is smoothed so that its derivative is the logistic function:

$$f(x) = \ln(1 + e^{x})$$

The derivative of this variant of the ReLU function is

$$f'(x) = \frac{1}{1 + e^{-x}}$$
In the next variant, the noisy ReLU, the function is represented as a ramp function with Gaussian noise α:

$$f(x) = \max(0, x + \alpha), \qquad \alpha \sim \mathcal{N}(0, \sigma(x))$$
In simple terms, this function is represented as

$$f(x) = \begin{cases} x + \alpha & \text{if } x + \alpha > 0 \\ 0 & \text{otherwise} \end{cases}$$

The derivative of this variant of the ReLU function is

$$f'(x) = \begin{cases} 1 & \text{if } x + \alpha > 0 \\ 0 & \text{otherwise} \end{cases}$$
This variant has been used in restricted Boltzmann machines for computer vision tasks.
In the leaky ReLU variant, the function is represented as a ramp function with a small, positive gradient (conventionally 0.01) for negative inputs:

$$f(x) = \begin{cases} x & \text{if } x > 0 \\ 0.01x & \text{otherwise} \end{cases}$$

The derivative of this variant of the ReLU function is

$$f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0.01 & \text{otherwise} \end{cases}$$
In the parametric ReLU (PReLU) variant, the coefficient of leakage for negative inputs is turned into a parameter a that is learned during training:

$$f(x) = \begin{cases} x & \text{if } x > 0 \\ a\,x & \text{otherwise} \end{cases}$$

In cases where a ≤ 1, the function can be written compactly as

$$f(x) = \max(x, a\,x)$$

The derivative of this variant of the ReLU function is

$$f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ a & \text{otherwise} \end{cases}$$
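The variants above can be sketched as follows; the fixed noise scale in the noisy ReLU and the 0.01 leak coefficient are illustrative assumptions:

```python
import numpy as np

def softplus(x):
    # Smooth variant; its derivative is the logistic function.
    return np.log1p(np.exp(x))

def noisy_relu(x, rng):
    # Ramp with Gaussian noise; a fixed sigma = 1 is assumed here,
    # whereas the text allows sigma to depend on x.
    return np.maximum(0.0, x + rng.normal(0.0, 1.0, size=np.shape(x)))

def leaky_relu(x, slope=0.01):
    # Small positive gradient for negative inputs.
    return np.where(x > 0, x, slope * x)

def parametric_relu(x, a):
    # The leakage coefficient a is learned during training.
    return np.where(x > 0, x, a * x)
```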
ANN is a supervised learning technique, i.e. the algorithm is trained on labeled data to identify the patterns in the input data that contribute to a given output. To obtain a specific output, the model adjusts the weights assigned to the inputs of each neuron. The higher the weight assigned to a particular input, the greater the impact that input variable has on the neuron, and this impact propagates to the neurons in subsequent layers as well. To represent inhibition, negative weights are sometimes assigned to input variables. This entire process of adjusting the weights of the neurons to obtain the desired output is known as training the neural network.
There are multiple algorithms for training a neural network model; the most commonly used is the backpropagation algorithm. This algorithm calculates the error in estimation and correspondingly adjusts the weights of each layer to obtain the desired output.
Werbos's [13] backpropagation algorithm provided a breakthrough in the field of neural networks, paving the way for ANNs. Johansson et al. [14] developed backpropagation learning for multilayer feedforward neural networks using the conjugate gradient method to improve and optimize learning rates. Chen and Jain [15] derived a robust backpropagation learning algorithm that is resistant to noise effects and capable of rejecting gross errors during the approximation process.
Yu et al. [16] proposed a general backpropagation algorithm for feedforward neural network learning with time-varying inputs; in this approach, a Lyapunov function is used to analyze the convergence of weights while the algorithm minimizes the error function. Khashman [17] proposed a modified backpropagation learning algorithm with additional weights for two emotional parameters: anxiety and confidence. The proposed neural network was applied to a facial recognition problem, and the results showed improved performance, with higher recognition rates and faster recognition times than a conventional neural network.
Sapna et al. [18] proposed a novel way of building a backpropagation algorithm based on the Levenberg–Marquardt algorithm to obtain an intelligent and efficient diabetes prediction method for assisting medical practitioners, special educators, occupational therapists, and psychologists in the better assessment of diabetes.
The backpropagation algorithm follows a gradient descent approach and applies the chain rule from calculus to compute the derivatives of compositions of two or more functions. It is closely related to the Gauss–Newton algorithm (Figure 4.6).
For training the neural network, the inputs from the input layer (represented by xi) are multiplied by weights (represented by wij, where i indexes the feature from the input layer and j indexes the neuron in the hidden layer). Each neuron is represented by a mathematical function

$$z_1 = \sum_{i=1}^{n} w_{i1} x_i + \varepsilon_1$$

Similarly, if there are "N" neurons in the hidden layer, then the neurons are represented as

$$z_j = \sum_{i=1}^{n} w_{ij} x_i + \varepsilon_j, \qquad j = 1, 2, \dots, N$$

On applying the activation function g to the neurons, the result at each neuron is represented as

$$a_j = g(z_j)$$

Now, each neuron influences the output (represented by Zk), and each is assigned a weight (represented by Wjk, where j indexes the neuron in the hidden layer and k indexes the value in the output layer). Hence, the first output from the "N" neurons, represented by Z1, is a weighted combination of the neuron results together with the error term E1:

$$Z_1 = \sum_{j=1}^{N} W_{j1}\, a_j + E_1$$

The output y is represented as an activation function of Z:

$$y = g(Z)$$

For the "k" values of the output, the activation function is represented as

$$y_k = g(Z_k) = g\left(\sum_{j=1}^{N} W_{jk}\, a_j + E_k\right)$$
As part of training the neural network and optimizing the model over all the input parameters, the predicted output of the model is compared with the actual output, and the model is optimized to minimize the difference between the two. The standard error function is the sum of squared differences between the predicted output of the model and the actual output. Mathematically, it is represented as

$$E = \sum_{k} \left( \hat{y}_k - y_k \right)^2$$

where ŷk is the predicted output and yk is the actual output.
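To make the forward and backward passes concrete, the following is a minimal sketch of a single-hidden-layer network trained by backpropagation and gradient descent on synthetic data. All data and hyperparameters are hypothetical, the constant factor from differentiating the squared error is absorbed into the learning rate, and this is not the chapter's model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)

# Synthetic toy data: 200 samples, 4 features, binary labels.
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] ** 2 > 0.5).astype(float).reshape(-1, 1)

n_in, n_hidden, n_out = 4, 8, 1
W1 = rng.normal(scale=0.5, size=(n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.5, size=(n_hidden, n_out))
b2 = np.zeros(n_out)
lr = 0.5

for epoch in range(2000):
    # Forward pass
    hidden = sigmoid(X @ W1 + b1)        # hidden-layer activations a_j
    y_hat = sigmoid(hidden @ W2 + b2)    # predicted output

    # Backward pass: chain rule with g'(z) = g(z) * (1 - g(z)),
    # minimizing E = sum((y_hat - y)^2)
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)
    delta_hidden = (delta_out @ W2.T) * hidden * (1 - hidden)

    # Gradient-descent weight updates
    W2 -= lr * hidden.T @ delta_out / len(X)
    b2 -= lr * delta_out.mean(axis=0)
    W1 -= lr * X.T @ delta_hidden / len(X)
    b1 -= lr * delta_hidden.mean(axis=0)

print("training accuracy:", ((y_hat > 0.5) == y).mean())
```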
The backpropagation algorithm effectively solves pattern identification and data classification problems through multiple training iterations that update the calculated weights at each node. The constraints on model training and optimization depend on the minimization of the error function E and the appropriate choice of activation functions.
Training and optimizing are the two important aspects of building a neural network model for solving classification and prediction problems. In order to achieve the best-fit and most optimal model, multiple parameters have to be tuned and optimized. Some of the most important parameters are the number of hidden layers, the activation function used in the hidden layer, the learning rate for gradient descent (for the backpropagation algorithm), the momentum for gradient descent, the number of epochs, and the output function. Applying the right combination of optimized parameters to obtain a best-fit model is a delicate process and requires repeated model training, modifying each parameter to achieve the least error and maximum accuracy.
As part of building an ANN model, the developer (or modeler) has to assess the model's variable selection criteria and transformation process to ensure a strong relationship between the transformed predictor variables and the dependent variable. As part of the variable selection criteria, the developer needs to:
In order to evaluate the predictive power of all the selected input variables, the following two statistical tests need to be performed:
| Statistical test | Test description | Points of consideration |
| --- | --- | --- |
| Weight of evidence (WoE) | Measures the strength of a set of categories across different values of the predictor variable to separate "good" and "bad" outcomes | High negative or positive values indicate strong variable predictive power |
| Information value (IV) | Assesses the overall power of a variable in separating "good" and "bad" outcomes by summing the product of WoE and the difference of the "good" and "bad" shares across all categories within the variable | Higher IV levels indicate a stronger relationship between the variable and the good/bad odds ratio; can be used to compare predictive power among competing variables |
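A minimal pandas sketch of the WoE and IV calculations described above; the column names are hypothetical, and placing the "good" distribution in the numerator is one common convention:

```python
import numpy as np
import pandas as pd

def woe_iv(df, feature, target):
    """WoE per category and total IV for a categorical (or binned)
    feature against a binary target (1 = 'bad', 0 = 'good').
    Assumes every category contains both outcomes."""
    grouped = df.groupby(feature)[target].agg(bad="sum", total="count")
    grouped["good"] = grouped["total"] - grouped["bad"]
    dist_good = grouped["good"] / grouped["good"].sum()
    dist_bad = grouped["bad"] / grouped["bad"].sum()
    woe = np.log(dist_good / dist_bad)
    iv = ((dist_good - dist_bad) * woe).sum()
    return woe, iv
```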
In order to test and review the model building parameter settings and their impact on the model performance, the following parameters need to be considered:
These parameter settings are essential to regulate the complexity of the ANN model, as they govern the computational effort, the number of variables involved in model training, and the weights assigned to each variable for obtaining the best-fit model.
The following tests can be used to formally assess the performance and accuracy of an ANN model.
| Statistical test | Test description | Points of consideration |
| --- | --- | --- |
| Traditional performance metrics (from the confusion matrix): accuracy, precision, F-measure, sensitivity, and specificity | The confusion matrix compares the predictions of the model with the actual values | The higher the key performance metrics, the better the model performance |
| Receiver operating characteristic (ROC) | The ROC curve displays the trade-off between sensitivity and specificity; the area under the ROC curve is a measure of discriminatory power | The closer the curve is to the left border and then the top border, the more accurate the model; the closer the curve is to the diagonal, the less accurate the model |
| Somers' D | Calculated from the difference between the number of concordant and discordant pairs | A value close to zero indicates a random model, while a value close to 1 indicates high discriminatory power |
| Kolmogorov–Smirnov (KS) | The KS test is used to test for differences between distribution functions | Higher values of the KS statistic correspond to a higher level of discriminatory power |
| Error attribution analysis | Discrepancies between predicted and actual values show the magnitude of error | The lower the difference between predicted and actual values, the lower the error in the model |
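A sketch of how these tests can be computed with scikit-learn and SciPy; the inputs are assumed to be NumPy arrays of actual labels and predicted default probabilities, and Somers' D is obtained from the identity D = 2·AUC − 1 for a binary outcome:

```python
from scipy.stats import ks_2samp
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

def evaluate(y_true, y_prob, threshold=0.5):
    y_pred = (y_prob >= threshold).astype(int)
    auc = roc_auc_score(y_true, y_prob)
    # KS statistic: separation between the score distributions of the classes
    ks = ks_2samp(y_prob[y_true == 1], y_prob[y_true == 0]).statistic
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "sensitivity": recall_score(y_true, y_pred),
        "f_measure": f1_score(y_true, y_pred),
        "auc": auc,
        "somers_d": 2 * auc - 1,
        "ks": ks,
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }
```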
We have developed an ANN model based on synthetic data that closely represents customer credit transaction data, to predict potential defaulters on credit card payments.
The current dataset is synthetic data prepared for prototype development. Despite very high performance on the validation dataset, the model may be overfitting when identifying defaulters. Therefore, to reflect human error in the existing system and to guard against model overperformance, we created two new datasets by randomly swapping potential defaulters (PDs) with genuine customers (GCs) at rates of 35% and 65%, as sketched below. This process injects randomness into the dataset and reduces the problem of overfitting to satisfactory levels. If the model's performance deteriorates significantly, we can conclude that the model has captured random noise in the dataset, necessitating model redevelopment. This process therefore indirectly assesses the sensitivity of the ANN algorithm.
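A minimal sketch of this label-swapping perturbation, under our own assumption that "swapping" means flipping an equal number of labels in each direction:

```python
import numpy as np

def swap_labels(y, frac, rng):
    """Flip `frac` of the defaulter labels (1) to genuine (0) and the
    same number of genuine labels to defaulter, injecting noise."""
    y = y.copy()
    defaulters = np.flatnonzero(y == 1)
    genuine = np.flatnonzero(y == 0)
    k = int(frac * len(defaulters))
    y[rng.choice(defaulters, size=k, replace=False)] = 0
    y[rng.choice(genuine, size=k, replace=False)] = 1
    return y

# e.g. the 35% and 65% swapped datasets described above
# y_35 = swap_labels(y, 0.35, np.random.default_rng(1))
# y_65 = swap_labels(y, 0.65, np.random.default_rng(2))
```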
As part of data sampling and feature engineering, we aggregated the credit card transaction data at the customer level to determine customer-level behavior: we computed the customer-level average transaction amount, maximum transaction amount, and number of transactions executed for each mode of transfer. We also created a new feature based on the average and standard deviation of the durations between any two consecutive transactions, irrespective of the mode of transfer. The final dataset contains 38 746 observations and 27 variables, which can be used for training the model. Although actual data may contain more variables than the synthetic dataset and exhibit more variability, the synthetic dataset was carefully constructed so that it closely represents the actual data.
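A pandas sketch of this aggregation; the file name and column names (customer_id, mode, amount, timestamp) are assumptions for illustration, not the actual schema:

```python
import pandas as pd

tx = pd.read_csv("transactions.csv", parse_dates=["timestamp"])

# Average amount, maximum amount, and transaction count per transfer mode
per_mode = (tx.groupby(["customer_id", "mode"])["amount"]
              .agg(avg_amount="mean", max_amount="max", n_tx="count")
              .unstack("mode"))
per_mode.columns = [f"{stat}_{mode}" for stat, mode in per_mode.columns]

# Mean and standard deviation of gaps between consecutive transactions
tx = tx.sort_values(["customer_id", "timestamp"])
gaps = tx.groupby("customer_id")["timestamp"].diff().dt.total_seconds()
gap_stats = gaps.groupby(tx["customer_id"]).agg(avg_gap="mean", std_gap="std")

features = per_mode.join(gap_stats)
```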
For variable selection, stepwise logistic regression was used: all 27 variables were initially included in the logistic regression model, and bidirectional stepwise selection based on the Akaike information criterion (AIC) was then applied to determine the best combination of independent variables for estimating the dependent variable. The final model contained 13 variables.
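The stepwise search can be approximated by a greedy selection on AIC; the following statsmodels sketch implements the forward half of such a procedure and is a simplified stand-in, not the exact routine we ran:

```python
import statsmodels.api as sm

def forward_select_by_aic(X, y):
    """Greedily add the predictor that most reduces the AIC of a
    logistic regression until no addition improves it."""
    selected, remaining = [], list(X.columns)
    best_aic = float("inf")
    improved = True
    while improved and remaining:
        improved = False
        for col in list(remaining):
            model = sm.Logit(y, sm.add_constant(X[selected + [col]])).fit(disp=0)
            if model.aic < best_aic:
                best_aic, best_col = model.aic, col
                improved = True
        if improved:
            selected.append(best_col)
            remaining.remove(best_col)
    return selected, best_aic
```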
We used the variables that were significant for initially training the ANN model. After multiple iterations, including variable addition and deletion, the final model variables were identified. After rigorous tuning based on random trials and judicious tweaking, the final set of parameters for the best-fit ANN model was identified. For example, the identified model parameters include:
For comparison, we tested the performance of this ANN against a standard logistic regression model for predicting defaulters. We found that both the ANN and logistic regression yielded similar area under the curve (AUC) values, and other performance measures also showed comparable results:
However, on an actual dataset, an ANN and other forms of neural network algorithms are likely to capture nonlinearity in the data better than logistic regression. The objective of the ANN modeling is to identify potential areas of improvement in identifying potential defaulters in the credit card domain using neural networks. Although we used a synthetic dataset, we tested various assumptions and patterns of the transaction dataset and developed a generalized ANN-with-backpropagation model. This model can be customized to cater to the requirements of specific scenarios in the credit risk and fraud detection domains.