Credit Risk Modelling with Lazy Predict and SHAP

Predict the likelihood of loan defaults and get insights with SHAP values.

Jiro Ishida
5 min read · Mar 14, 2021
Photo by Dmitry Demidko on Unsplash

Credit default risk is a measurement that looks at the probability that a loan amount will not be paid back. Using the credit risk data set from Kaggle, we can build a machine learning model to predict the likelihood of an individual not being able to pay back their loan.

I’ve used the Lazy Predict package to test different models on the data and the SHAP package to get feature-importance insights into what drove the predictions. This article borrows code from this article for the initial data treatment, so make sure to give it a look if you’re unsure about certain aspects of the code.

The full code for this article can be found here.

Initial Data Exploration

The initial data exploration followed a pretty standard process. This meant getting a feel for what type of data was in each column, the class balance for what we need to predict (how many defaults vs. non-defaults), and checking for NA values and outliers.
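The exploration steps above can be sketched as below. The toy frame stands in for the Kaggle credit risk data set (the column names follow the real data, but the values are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the Kaggle credit risk data set.
df = pd.DataFrame({
    "person_age": [22, 35, 144, 41],            # 144 is a deliberate outlier
    "person_income": [35000, 60000, 12000, 48000],
    "person_emp_length": [2.0, 10.0, None, 5.0],  # one NA value
    "loan_status": [0, 0, 1, 1],                # 1 = default, 0 = non-default
})

print(df.dtypes)                                      # data type per column
print(df["loan_status"].value_counts(normalize=True))  # class balance
print(df.isna().sum())                                # NA counts per column
print(df.describe())                                  # ranges flag outliers
```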

Data Transformation

The initial exploration led to a few changes being made to the data. The first step was getting rid of outliers as shown in the original article.

# Remove Outliers
df = df[df["person_age"] <= 100]
df = df[df["person_emp_length"] <= 100]
df = df[df["person_income"] <= 4000000]

Then one hot encoding was done to change categorical values to numerical values with drop_first=True to avoid the dummy variable trap.

#One hot encoding of categorical variables
df = pd.get_dummies(data=df,
                    columns=['person_home_ownership', 'loan_intent',
                             'loan_grade', 'cb_person_default_on_file'],
                    drop_first=True)

The initial data exploration also showed that there was a class imbalance, with around 78% of rows in the “loan_status” column being non-defaults. This meant using balanced accuracy as the scoring metric to assess our models, since it takes into account the recall on each class. Simply put, if we used plain accuracy as a metric, the score would be inflated by the skewed class distribution and would not be an accurate reflection of how well the model predicts each class.
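A tiny illustration of why plain accuracy misleads here: with the 78/22 split from the data set, a degenerate "model" that always predicts non-default still looks decent on accuracy, while balanced accuracy exposes it.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# 78 non-defaults (0) and 22 defaults (1), mirroring the class split.
y_true = [0] * 78 + [1] * 22
# A "model" that blindly predicts non-default for everyone.
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))           # 0.78 — looks decent
print(balanced_accuracy_score(y_true, y_pred))  # 0.5 — no better than chance
```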

#Percentage of non-default cases
data_0 = df[df.loan_status == 0].loan_status.count() / df.loan_status.count()
data_0
# 0.7833892148644873

#Percentage of default cases
data_1 = df[df.loan_status == 1].loan_status.count() / df.loan_status.count()
data_1
# 0.2166107851355127

After this, train and test data sets were created to test out models with Lazy Predict.

#Train and test split
Y = df['loan_status']
X = df.drop('loan_status',axis=1)
x_train, x_test, y_train, y_test = train_test_split(X, Y, random_state=0, test_size=.20)

Model Selection

To help choose a classification model for the prediction, a package called Lazy Predict was used to generate models and assess the scores. It is a quick way to generate models without any parameter tuning. With two lines of code you can generate multiple models with different scores for each model.

clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(x_train, x_test, y_train, y_test)

I chose the top-performing model, XGBClassifier, as the model for the prediction.

Model Creation

After this stage I created an individual model with the XGBoost library and got similar scores to the Lazy Classifier package.

accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Accuracy: 93.84%

balanced_accuracy = balanced_accuracy_score(y_test, predictions)
print("Balanced_Accuracy: %.2f%%" % (balanced_accuracy * 100.0))
# Balanced_Accuracy: 86.78%

Next, we can use this model to figure out key features.

Feature Importance with SHAP

To get further insight into the created model, the SHAP package was used to assess which features were driving the predictions.

The package uses a game-theoretic approach to explain the output of any machine learning model. SHAP quantifies how much each feature contributes to a model's prediction. A rough analogy would be quantifying how much an individual basketball player contributes to the score in a game.

We can use this tool to help uncover why our predictions have been created from the machine learning model with the inputs from our dataset.

We pass in our model and generate SHAP values for the test data set with the code below.

# load JS visualization code to notebook
shap.initjs()

model = xgb_model

# explain the model's predictions using SHAP
# (same syntax works for LightGBM, CatBoost, scikit-learn, transformers, Spark, etc.)
explainer = shap.Explainer(model)
shap_values = explainer.shap_values(x_test)

After this, we can generate a summary plot to get an idea of what features were important for the model.

shap.summary_plot(shap_values, x_test)

This plot ranks the features from most important to least important. High feature values are shown in red and low feature values in blue. Points to the right indicate a positive SHAP value, pushing the prediction toward default, while points to the left push the prediction away from default.

In this example, a person’s income has been identified as the most important feature in the model. Individuals with a low income (indicated with blue) are more likely to default on their loan. This makes sense, as someone with a lower income would be less likely to pay off a loan than someone with a higher income.

The second feature identified is loan_int_rate, the interest rate on the loan. The plot shows that higher interest rates increase the likelihood of a person defaulting on a loan.

We can also plot dependence plots for these features to get even more insight from these SHAP values.

The above plot shows a fairly linear relationship between the interest rate and its SHAP values.

The plot above shows that the loan amount as a percentage of income produces a sharp increase in SHAP value at around 0.3 of loan_percent_income.

Main takeaways:

person_income, loan_int_rate and loan_percent_income are the top 3 features that contribute to predicting the likelihood of someone defaulting on their loan.

person_income - A lower income leads to an increased likelihood of someone defaulting.

loan_int_rate - A higher interest rate leads to an increased likelihood of someone defaulting.

loan_percent_income - A higher loan as a percentage of income leads to an increased likelihood of someone defaulting. The dependence plot also shows a ratio of around 0.3 (a loan of roughly 30% of income) being a key tipping point at which the likelihood increases significantly.

For more advanced use of the SHAP library this article goes through another example.

Conclusion

Packages such as Lazy Predict and SHAP help you create models and find insights quickly. This article is a quick intro for people new to these packages, and I hope it helps you out. Let me know if you have any questions!
