Genmod procedure

Let's start the model building process with the Genmod procedure. We will cover some aspects of data cleaning and transformation in more detail as we evaluate the initial results from modeling. The data has been rolled up to an account level, where each customer has only one row per account, representing the customer's current position. Some new variables have been created to capture past customer behavior. One of the new variables that we will use to assess the model fit is the length of the customer relationship.

The Genmod procedure fits generalized linear models, not just traditional linear models. Readers may be aware that Proc logistic, a widely used procedure primarily for modeling binary outcomes, also fits a generalized linear model. Proc Genmod is more general than Proc logistic, as it allows the response probability distribution to be any member of the exponential family of distributions. As a start, we will try to generate, from the Genmod procedure, results similar to what we would obtain using Proc logistic:

Proc Genmod data=model_latest_record descending; 
Class collateral_type customer_type; 
Model dflt = utilisation ltv collateral_type borrowing_portfolio_ratio postcode_index customer_type arrears/ dist=binomial; 
Output out=preds p=pred l=lower u=upper; 
Run; 
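
For reference, a comparable Proc logistic call would look roughly like this (a sketch reusing the dataset and variable names from the Genmod call above; the param=glm option is added so that the class effects are parameterized the same way as in Genmod, since Proc logistic defaults to effect coding):

Proc logistic data=model_latest_record descending;
Class collateral_type customer_type / param=glm; /* match Genmod's GLM parameterization of class effects */
Model dflt = utilisation ltv collateral_type borrowing_portfolio_ratio postcode_index customer_type arrears;
Run;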

In Figure 3.3, the log of the model run shows that we have achieved model convergence, but the p-values we get for the chi-square tests and most of the other metrics look odd. Something is wrong with the model:

Figure 3.3: Output showing issues with the model

The root of the problem can be traced immediately by looking at the response profile in Figure 3.4. In the original, non-aggregated data, we observed 20 instances of default. In the data aggregated to the latest record per customer, we are left with only five defaults. We have lost a lot of default information by incorporating only the latest records:

Figure 3.4: Model information showing lower defaults than expected

One of the key aspects of data preparation for modeling is to capture the information in the best way possible when transforming or aggregating it. The defaults could have happened at any time between 2009 and 2017. By picking up only the latest records, we ended up discarding crucial default information whenever the default did not occur in the customer's most recent record.

In the preceding model, we specified the distribution as binomial. The assumption is that the profiles of defaulted and non-defaulted customers are quite different. Let's test this assumption, after ensuring that the right number of defaults is represented in the modeling dataset. In the data preparation step for the model rerun, the data was first bifurcated into defaulted and non-defaulted customers. For the defaulted customers, no transformation was carried out, and the records were incorporated into the modeling dataset; if a customer had defaulted twice, both default records were included. For the non-defaulted customers, the latest available record was treated as representative of the customer.
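
A minimal sketch of this bifurcation is shown below, assuming the full history sits in a dataset called mortgage_history with an account identifier account_id and an observation_date (hypothetical names; the actual steps are in the downloadable data cleaning code):

/* Defaulted customers: keep every default record as-is (a re-default gives two rows) */
Proc sql;
Create table default_records as
Select * from mortgage_history where dflt = 1;
Create table ever_defaulted as
Select distinct account_id from mortgage_history where dflt = 1
Order by account_id;
Quit;

/* Non-defaulted customers: keep only the latest available record per account */
Proc sort data=mortgage_history out=history_sorted;
By account_id observation_date;
Run;

Data latest_records;
Merge history_sorted ever_defaulted(in=has_default);
By account_id;
If last.account_id and not has_default;
Run;

/* Stack the two parts into the modeling dataset */
Data model_rerun;
Set default_records latest_records;
Run;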

Additionally, a length of relationship variable was calculated at a customer level. If the customer had any arrears in their transaction history, the customer was assigned a value of 1 in the newly created arrears_flag variable; this identifies arrears at any point in time. Some customers were not recorded as being in arrears at the point of default, which was deemed to be a data quality issue. The collateral type variable was judged to be of dubious quality, as the collateral for the borrowing should mostly be the property being mortgaged, rather than the populated values such as cash, guarantees, and stocks and shares. The customer type variable was also removed; its accuracy was doubtful, since it was based on free text entered by the mortgage advisor rather than on the customer application verification process.
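
The two behavioral variables can be derived per account with a summary query along these lines (a sketch; the arrears variable name, the month-based length calculation, and the dataset names carry over from the hypothetical example above):

Proc sql;
Create table behaviour_vars as
Select account_id,
       max(arrears > 0) as arrears_flag, /* 1 if in arrears at any point in the history */
       intck('month', min(observation_date), max(observation_date)) as relationship_length
From mortgage_history
Group by account_id;
Quit;

These derived variables would then be merged back onto the modeling dataset by account_id.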

A validation dataset was also created. Ten randomly selected records (five from defaulted and five from non-defaulted customers) were held out of the training dataset; this was specified in the data using the weight statement. Since the validation records are selected by random sampling, each time the data cleaning code is run, we will get a different model output.
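
One way to draw such a sample is with Proc surveyselect, stratified by the default flag (a sketch reusing the hypothetical model_rerun dataset from above; the downloadable data cleaning code may instead sample at the customer level, so that both records of a re-defaulted customer move into the validation set together):

/* Draw 5 records from each default stratum; no seed is set, so each run gives a different sample */
Proc sort data=model_rerun out=model_sorted;
By dflt;
Run;

Proc surveyselect data=model_sorted out=model_validation method=srs n=5 outall;
Strata dflt;
Run;

/* validation_sample is used as the weight: 0 excludes validation rows from estimation */
Data model_validation;
Set model_validation;
validation_sample = (Selected = 0);
Run;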

In Figure 3.5, the model information states that there are 14 observations with defaults. We have a total of 20 observations with defaults, and we selected only five defaulted and five non-defaulted customers for validation. However, a customer with a re-default is part of the validation dataset; even though only five customers make up the defaulted part of the validation set, that one customer contributes two defaults.

Hence, we have six defaults in the validation dataset, and we are left with only 14 defaults (not 15) in the training dataset:

Proc Genmod data=model_validation descending; 
Weight validation_sample; 
Model dflt = utilisation ltv borrowing_portfolio_ratio postcode_index arrears_flag  
relationship_length/ dist=binomial; 
Output out=preds(where=(validation_sample=0)) p=pred l=lower u=upper; 
Run; 

Figure 3.5: Model information for rerun model

In Figure 3.6, only one variable, arrears_flag, is significant at a p-value of <0.05. At the <0.10 significance level, the percentage of mortgage utilized is also significant. Remember that the significance changes whenever a different validation dataset is created, since the validation sample is drawn at random by the Proc surveyselect step in the data cleaning code (please download the data cleaning code for further information):

Figure 3.6: Model parameter estimates after removing collateral and customer type

In Figure 3.7, we have listed the defaulted customers selected for validation:

Figure 3.7: Defaulted customers in validation dataset

In Figure 3.8, we have listed the non-defaulted customers selected for validation:

Figure 3.8: Non-defaulted customers in validation dataset

In Figure 3.9, we can see that the means of the variables in the defaulted (dflt=1) and non-defaulted (dflt=0) groups are different for all variables, and the standard deviations are also quite different; hence, a binomial distribution was chosen for the model. In percentage terms, the means of LTV and arrears differ the most between the two groups. The defaulted customers hold highly leveraged positions, and most tend to have been in arrears at some point in their transaction history:

Figure 3.9: Variance between defaulted and non-defaulted groups
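
A comparison like the one in Figure 3.9 can be produced with a simple summary by default status (a sketch, reusing the variable names from the rerun model):

Proc means data=model_validation mean std;
Class dflt;
Var utilisation ltv borrowing_portfolio_ratio postcode_index arrears_flag relationship_length;
Run;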

Figure 3.10 showcases the predicted values for the validation dataset. The predicted value, in this case, can be interpreted as the probability of default. The defaulted customers have an observed default value of 1, and for all but one instance of default, the predicted probability is at least 0.90. The non-defaulted customers have been assigned very low predicted probabilities of default. It seems that the model is able to predict accurately, to a large degree:

Figure 3.10: Validation dataset prediction
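
A listing like the one in Figure 3.10 can be produced from the preds dataset created by the Output statement, since its where= option kept only the validation rows (a sketch):

Proc print data=preds;
Var dflt pred lower upper;
Run;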

Let's fit an ROC curve for the predicted values of the validation dataset. The Genmod procedure doesn't support ROC curves, so let's utilize the logistic procedure without fitting a model; the predicted values from Genmod are passed to the procedure to generate the ROC curve:

Proc logistic data=preds descending; 
Model dflt = / nofit;  
Roc "Genmod model" pred=pred; 
Run; 

Figure 3.11 contains the ROC curve. But what is the ROC curve, and what is its use?

Figure 3.11: ROC curve

ROC is an acronym for receiver operating characteristic. The ROC curve is used as a discriminatory test, to see how well the model separates the outcomes. Remember that our aim is to build a behavioral PD model, and as part of that, we need to predict the PD. We have a validation dataset, and we have just observed that we had success in predicting accurately. But what scenarios might we encounter when predicting? Let's look at the scenarios of an application scorecard, where some level of PD is used as a cut-off point to reject applications:

                      Account performance
                      Good                                         Bad
Insufficient score    Rejected application but good performance    Rejected and bad performance
                      (false negative)                             (true negative)
Sufficient score      Accepted and good performance                Accepted and bad performance
                      (true positive)                              (false positive)

Figure 3.12: Application decision and outcome

The true positive and true negative scenarios from Figure 3.12 are acceptable outcomes of implementing an application scorecard. The false negative and false positive scenarios aren't acceptable, but they are commonly experienced in practice. The false negative scenario is less costly than the false positive scenario: in the false negative scenario, there is only an opportunity cost, whereas in the false positive scenario, the whole lending may have to be written off.

In an ROC curve, the true positive rate is plotted on the y-axis and the false positive rate on the x-axis. The y-axis is also known as the sensitivity, and the x-axis as 1 - specificity. A perfect scenario, with all true positives in a prediction, will have an ROC curve like the one shown in Figure 3.13. Somers' D can range from -1 to 1; our model has a value of 0.93, which, in simplistic terms, means that the model has good predictive ability. A perfectly predictive model will have a Somers' D value of 1:

Figure 3.13: True positive perfect prediction

The area under the curve (AUC), in this case, is at its maximum (a value of 1). The AUC is useful for comparing models, although an AUC of 1 is rarely achieved in practice. In general, banks would assign the following red, amber, and green (RAG) statuses based on the AUC:

AUC          RAG
0.9 - 1      Green
0.7 - 0.9    Amber
<0.7         Red

Figure 3.14: AUC and RAG status

Based on the ROC curve in Figure 3.11, the model would get a RAG status of green. However, this is just one of the metrics by which a model can be compared. Another useful metric is the Gini coefficient, or the Somers' D value.
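
For a binary outcome such as default, the Gini coefficient and Somers' D are the same quantity, and both relate to the AUC as Gini = 2 x AUC - 1. With the reported Somers' D of 0.93, the implied AUC is (0.93 + 1) / 2 = 0.965, which falls in the green band of Figure 3.14.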

With the limited data at hand, we have built a model with only one procedure so far, and most of the variables aren't significant. Let's build a PD model using Proc logistic and observe the output.
