Loan data analysis
Loan Data
This dataset contains records for almost 10,000 borrowers who took out loans; some have been paid back while others are still in progress. It was extracted from lendingclub.com, an organization that connects borrowers with investors.
Understand your data
Variable | Class | Description |
---|---|---|
credit_policy | numeric | 1 if the customer meets the credit underwriting criteria; 0 otherwise. |
purpose | character | The purpose of the loan. |
int_rate | numeric | The interest rate of the loan (more risky borrowers are assigned higher interest rates). |
installment | numeric | The monthly installments owed by the borrower if the loan is funded. |
log_annual_inc | numeric | The natural log of the self-reported annual income of the borrower. |
dti | numeric | The debt-to-income ratio of the borrower (amount of debt divided by annual income). |
fico | numeric | The FICO credit score of the borrower. |
days_with_cr_line | numeric | The number of days the borrower has had a credit line. |
revol_bal | numeric | The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle). |
revol_util | numeric | The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available). |
inq_last_6mths | numeric | The borrower’s number of inquiries by creditors in the last 6 months. |
delinq_2yrs | numeric | The number of times the borrower had been 30+ days past due on a payment in the past 2 years. |
pub_rec | numeric | The borrower’s number of derogatory public records. |
not_fully_paid | numeric | 1 if the loan is not fully paid; 0 otherwise. |
Load packages
# install.packages('ggridges')
# install.packages('patchwork')
# install.packages('gt')
# install.packages('xgboost')
# install.packages('ranger')
# install.packages('vip')
# install.packages('usemodels')
# install.packages('GGally')
# install.packages('doParallel')
# install.packages('glmnet')
library(skimr)
library(tidyverse)
library(tidymodels)
library(scales)
library(ggridges)
library(patchwork)
library(gt)
library(xgboost)
library(ranger)
library(vip)
library(usemodels)
library(GGally)
library(glmnet)
doParallel::registerDoParallel()
theme_set(theme_light())
Load the Data
loans <- readr::read_csv('loans.csv.gz')
Exploratory data analysis
As a strong visual learner, it helps me to see the data in both table and plot formats. Let's get started by looking at a few of the summary statistics generated by skimr::skim() combined with summary(). I will also use glimpse() to get a concise view of the variable names, data types, and the first few entries.
loans %>%
skim() %>%
select(-(numeric.p0:numeric.p100)) %>%
select(-(complete_rate)) %>%
summary()
Name | Piped data |
Number of rows | 9578 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 13 |
________________________ | |
Group variables | None |
loans %>%
glimpse()
## Rows: 9,578
## Columns: 14
## $ credit_policy <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ purpose <chr> "debt_consolidation", "credit_card", "debt_consolida…
## $ int_rate <dbl> 0.1189, 0.1071, 0.1357, 0.1008, 0.1426, 0.0788, 0.14…
## $ installment <dbl> 829.10, 228.22, 366.86, 162.34, 102.92, 125.13, 194.…
## $ log_annual_inc <dbl> 11.350407, 11.082143, 10.373491, 11.350407, 11.29973…
## $ dti <dbl> 19.48, 14.29, 11.63, 8.10, 14.97, 16.98, 4.00, 11.08…
## $ fico <dbl> 737, 707, 682, 712, 667, 727, 667, 722, 682, 707, 67…
## $ days_with_cr_line <dbl> 5639.958, 2760.000, 4710.000, 2699.958, 4066.000, 61…
## $ revol_bal <dbl> 28854, 33623, 3511, 33667, 4740, 50807, 3839, 24220,…
## $ revol_util <dbl> 52.1, 76.7, 25.6, 73.2, 39.5, 51.0, 76.8, 68.6, 51.1…
## $ inq_last_6mths <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 2, 2, 0, 0, 0, 1, 0, 1…
## $ delinq_2yrs <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0…
## $ pub_rec <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ not_fully_paid <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
loans %>%
head() %>%
gt() %>%
tab_header(title = md("**First 6 rows of loan data from lendingclub.com**")) %>%
tab_options(container.height = 400,
container.overflow.y = TRUE,
heading.background.color = "#21908CFF",
table.width = "75%",
column_labels.background.color = "black",
table.font.color = "black") %>%
tab_style(style = list(cell_fill(color = "#35B779FF")),
locations = cells_body())
First 6 rows of loan data from lendingclub.com

credit_policy | purpose | int_rate | installment | log_annual_inc | dti | fico | days_with_cr_line | revol_bal | revol_util | inq_last_6mths | delinq_2yrs | pub_rec | not_fully_paid |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | debt_consolidation | 0.1189 | 829.10 | 11.35041 | 19.48 | 737 | 5639.958 | 28854 | 52.1 | 0 | 0 | 0 | 0 |
1 | credit_card | 0.1071 | 228.22 | 11.08214 | 14.29 | 707 | 2760.000 | 33623 | 76.7 | 0 | 0 | 0 | 0 |
1 | debt_consolidation | 0.1357 | 366.86 | 10.37349 | 11.63 | 682 | 4710.000 | 3511 | 25.6 | 1 | 0 | 0 | 0 |
1 | debt_consolidation | 0.1008 | 162.34 | 11.35041 | 8.10 | 712 | 2699.958 | 33667 | 73.2 | 1 | 0 | 0 | 0 |
1 | credit_card | 0.1426 | 102.92 | 11.29973 | 14.97 | 667 | 4066.000 | 4740 | 39.5 | 0 | 1 | 0 | 0 |
1 | credit_card | 0.0788 | 125.13 | 11.90497 | 16.98 | 727 | 6120.042 | 50807 | 51.0 | 0 | 0 | 0 | 0 |
The loan data consists of one character variable ('purpose' = loan type) and 13 numeric variables. Let's use geom_col() to plot the number of times each loan type is represented in the data.
loans %>%
count(purpose) %>%
mutate(purpose = fct_reorder(purpose, n)) %>%
ggplot(aes(n, purpose, fill = purpose)) +
geom_col() +
labs(title = 'Loan count by type',
x = 'Count',
y = '') +
theme(legend.position = 'none') +
scale_fill_viridis_d()
Numeric variables can be viewed as count distributions using geom_histogram(). Here, I use the gather() function to reshape the data into a long format, and then facet_wrap() is used to display each histogram in a single figure.
loans[c(3:13)] %>%
mutate(log_revol_bal = log(revol_bal)) %>%
select(-revol_bal) %>%
gather() %>%
ggplot(aes(value, fill = key)) +
geom_histogram(alpha = 0.7) +
facet_wrap(~key, scales = 'free') +
labs(title = 'Count distributions for numeric variables') +
theme(legend.position = 'none')
## Warning: Removed 321 rows containing non-finite values (stat_bin).
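Note that gather() has been superseded by pivot_longer() in recent tidyr releases. A minimal equivalent of the reshaping step above could look like this (a sketch only; the plotting code stays the same):
# pivot_longer() equivalent of the gather() step above (sketch)
loans[c(3:13)] %>%
mutate(log_revol_bal = log(revol_bal)) %>%
select(-revol_bal) %>%
pivot_longer(everything(), names_to = "key", values_to = "value")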
The credit_policy and not_fully_paid variables (not shown above) are better represented as factors. They can be converted using the recode_factor() function. Let's convert them to factors and explore their distributions with geom_boxplot().
loans <- loans %>%
mutate(not_fully_paid = recode_factor(not_fully_paid, '1' = 'No', '0' = 'Yes'),
credit_policy = recode_factor(credit_policy, '1' = 'pass', '0' = 'fail'),
purpose = factor(purpose))
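A quick count confirms the recoding worked and shows how the two new factors are distributed (a small sanity check, not part of the original analysis):
# Sanity check: tabulate the recoded factors
loans %>%
count(credit_policy, not_fully_paid)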
loans %>%
select(-c(12:13)) %>%
pivot_longer(!c(credit_policy, purpose, not_fully_paid)) %>%
#filter(name != 'revol_bal') %>%
ggplot(aes(value, fill = not_fully_paid)) +
geom_boxplot() +
facet_wrap(~name, scales = 'free') +
labs(title = "Comparing loan payment status across other variables",
x = "") +
guides(fill = guide_legend(reverse=TRUE))
library(ggpubr)
loans %>%
select(c(1:10)) %>%
pivot_longer(!c(purpose, credit_policy)) %>%
filter(name != 'revol_bal') %>%
ggplot(aes(value, fill = credit_policy)) +
geom_boxplot() +
facet_wrap(~name, scales = 'free') +
labs(title = "Comparing credit policy across other variables",
x = "") +
guides(fill = guide_legend(reverse=TRUE))
Loan type can also be viewed with geom_boxplot(), but the categories get a little crowded.
loans %>%
select(-c(9, 11:13)) %>%
pivot_longer(!c(credit_policy, purpose, not_fully_paid)) %>%
#filter(name != 'revol_bal') %>%
ggplot(aes(value, fill = purpose)) +
geom_boxplot() +
facet_wrap(~name, scales = 'free') +
labs(title = "Comparing loan payment status across other variables",
x = "") +
guides(fill = guide_legend(reverse=TRUE))
The geom_density_ridges() function from the ggridges package is an alternative way to view a numeric distribution across multiple categories.
a <- loans %>%
add_count(purpose) %>%
mutate(purpose = fct_reorder(purpose, n)) %>%
group_by(purpose) %>%
ggplot(aes(int_rate, purpose, fill = purpose)) +
geom_density_ridges() +
labs(title = 'Density of the interest rate by loan purpose',
x= 'Interest rate',
y= '') +
theme(legend.position = 'none') +
scale_fill_viridis_d(alpha = 0.8) +
theme(panel.grid.minor = element_blank(), panel.border = element_blank())
b <- loans %>%
add_count(purpose) %>%
mutate(purpose = fct_reorder(purpose, n)) %>%
group_by(purpose) %>%
ggplot(aes(dti, purpose, fill = purpose)) +
geom_density_ridges() +
# scale_y_discrete(expand = c(0.01, 0)) +
# scale_x_continuous(expand = c(0.01, 0)) +
#theme_ridges() +
labs(title = 'Density of the debt-to-income ratio by loan purpose',
x= 'dti',
y= '') +
theme(legend.position = 'none') +
scale_fill_viridis_d(alpha = 0.8) +
theme(panel.grid.minor = element_blank(), panel.border = element_blank())
(a / b)
## Picking joint bandwidth of 0.00625
## Picking joint bandwidth of 1.57
c <- loans %>%
add_count(purpose) %>%
mutate(purpose = fct_reorder(purpose, n)) %>%
group_by(purpose) %>%
ggplot(aes(revol_util, purpose, fill = purpose)) +
geom_density_ridges() +
labs(title = 'Density of the revol_util by loan purpose',
x= 'revol_util',
y= '') +
theme(legend.position = 'none') +
scale_fill_viridis_d(alpha = 0.8) +
theme(panel.grid.minor = element_blank(), panel.border = element_blank())
d <- loans %>%
add_count(purpose) %>%
mutate(purpose = fct_reorder(purpose, n)) %>%
group_by(purpose) %>%
ggplot(aes(installment, purpose, fill = purpose)) +
geom_density_ridges() +
labs(title = 'Density of the installment by loan purpose',
x= 'installment',
y= '') +
theme(legend.position = 'none') +
scale_fill_viridis_d(alpha = 0.8) +
theme(panel.grid.minor = element_blank(), panel.border = element_blank())
(c / d)
## Picking joint bandwidth of 6.58
## Picking joint bandwidth of 41.9
The ggpairs() function from the GGally package allows us to summarize correlation values between our variables.
loans[c(3:8, 10)] %>%
sample_n(size = 500) %>%
ggpairs()
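For a plain numeric summary of the same relationships, a base cor() matrix over those numeric columns is a lightweight complement to ggpairs() (a minimal sketch using the same column selection):
# Pairwise Pearson correlations for the same numeric columns (sketch)
loans[c(3:8, 10)] %>%
cor() %>%
round(2)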
Modeling challenges
Now that we have a feel for our data, let’s create some models!
- Create a model that predicts the loan type (purpose)
- Create a model that predicts whether the loan has been paid back (not_fully_paid)
- Create a model that predicts whether the borrower meets the accepted level of risk (credit_policy)
Splitting the data
The data is split into training and test sets using the default split (prop = 0.75). The training set is then down-sampled to 500 rows to keep model tuning fast, and 5-fold cross-validation is used to create resamples from this training subset.
split <- loans %>% initial_split()
train <- training(split) %>% sample_n(size = 500)
test <- testing(split)
resamps <- vfold_cv(train, v = 5)
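For reproducibility, the split could also be seeded and stratified so that the class proportions of the outcome are similar in both sets. A hedged sketch (not the workflow used below, which keeps the default split and the 500-row training subsample), stratifying on purpose as an example:
# Alternative: seeded, stratified split (sketch only)
set.seed(123)
split_strat <- initial_split(loans, prop = 0.75, strata = purpose)
train_strat <- training(split_strat)
test_strat <- testing(split_strat)
resamps_strat <- vfold_cv(train_strat, v = 5, strata = purpose)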
Predicting loan type
To predict the loan type we will need to build a multiclass classification model. I am going to try three different models and compare the results to determine which model has the best predictive power.
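Before fitting anything, it is worth knowing the no-information baseline: a model that always predicts the most common loan purpose scores that class's share of the data, so the tuned models' accuracy should be read against it (a quick check, not part of the original write-up):
# Majority-class baseline for the multiclass problem (illustrative check)
loans %>%
count(purpose, sort = TRUE) %>%
mutate(prop = n / sum(n))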
Model 1: Xgboost
#usemodels::use_xgboost(purpose ~ ., data = train)
xgboost_recipe <-
recipe(formula = purpose ~ ., data = train) %>%
step_novel(all_nominal(), -all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
step_zv(all_predictors())
xgboost_spec <-
boost_tree(
trees = 1000,
tree_depth = tune(),
min_n = tune(),
mtry = tune(),
sample_size = tune(),
learn_rate = tune()
) %>%
set_engine("xgboost") %>%
set_mode("classification")
xgboost_workflow <-
workflow() %>%
add_recipe(xgboost_recipe) %>%
add_model(xgboost_spec)
# xgboost_grid_1 <- grid_regular(
# parameters(xgboost_spec),
# levels = 3 )
# xgboost_grid_2 <- grid_regular(
# tree_depth(c(5L, 10L)),
# min_n(c(10L, 40L)),
# mtry(c(5L, 10L)),
# learn_rate(c(-2, -1)),
# sample_size=20
# )
set.seed(8390)
xgboost_tune <-
tune_grid(xgboost_workflow, resamples = resamps, grid = 20)
## i Creating pre-processing data to finalize unknown parameter: mtry
xgboost_tune %>%
autoplot()
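Beyond the autoplot, show_best() lists the top tuning configurations for a chosen metric (a small optional addition; the original flow goes straight to select_best()):
# Inspect the best-performing parameter combinations (sketch)
xgboost_tune %>%
show_best(metric = 'accuracy')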
xg_best_params <- xgboost_tune %>%
select_best(metric = 'accuracy')
xg_best_spec <- xgboost_spec %>%
finalize_model(xg_best_params)
xg_final_fit <- workflow() %>%
add_recipe(xgboost_recipe) %>%
add_model(xg_best_spec) %>%
last_fit(split = split)
xg_final_fit %>%
collect_metrics()
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy multiclass 0.489 Preprocessor1_Model1
## 2 roc_auc hand_till 0.695 Preprocessor1_Model1
Model 2: Random forest
# usemodels::use_ranger(purpose ~ ., data = train)
ranger_recipe <-
recipe(formula = purpose ~ ., data = train)
ranger_spec <-
rand_forest(mtry = tune(),
min_n = tune(),
trees = 1000) %>%
set_mode("classification") %>%
set_engine("ranger")
# Fit a default random forest to look at variable importance for predicting loan purpose
rand_forest(mode = "classification") %>%
set_engine("ranger", importance = "impurity") %>%
fit(purpose ~ .,
data = train) %>%
vip::vip(geom = 'point',
aesthetics = list(color = 'midnightblue', size = 4)) +
labs(title = 'Variable importance for predicting loan purpose (random forest)')
ranger_workflow <-
workflow() %>%
add_recipe(ranger_recipe) %>%
add_model(ranger_spec)
# ranger_grid <- grid_regular(
# parameters(ranger_spec),
# mtry(range = c(10, 30)),
# min_n(range = c(2, 8)),
# levels = 3)
set.seed(57341)
ranger_tune <-
tune_grid(ranger_workflow,
resamples = resamps,
grid = 20)
## i Creating pre-processing data to finalize unknown parameter: mtry
ranger_tune %>%
autoplot()
rf_best_params <- ranger_tune %>%
select_best(metric = 'accuracy')
rf_best_spec <- ranger_spec %>%
finalize_model(rf_best_params)
rf_final_fit <- workflow() %>%
add_recipe(ranger_recipe) %>%
add_model(rf_best_spec) %>%
last_fit(split = split)
rf_final_fit %>%
collect_metrics()
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy multiclass 0.489 Preprocessor1_Model1
## 2 roc_auc hand_till 0.693 Preprocessor1_Model1
Model 3: K-nearest neighbors
#usemodels::use_kknn(purpose ~ ., data = train)
kknn_recipe <-
recipe(formula = purpose ~ ., data = train) %>%
step_novel(all_nominal(), -all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors(), -all_nominal())
kknn_spec <-
nearest_neighbor(neighbors = tune(), weight_func = tune()) %>%
set_mode("classification") %>%
set_engine("kknn")
kknn_workflow <-
workflow() %>%
add_recipe(kknn_recipe) %>%
add_model(kknn_spec)
# mtry is not a nearest_neighbor() parameter, so the grid only covers neighbors and weight_func
knn_grid <- grid_regular(
parameters(kknn_spec),
levels = 5)
set.seed(78590)
kknn_tune <-
tune_grid(kknn_workflow,
resamples = resamps,
grid = knn_grid)
kknn_tune %>%
autoplot()
knn_best_params <- kknn_tune %>%
select_best(metric = 'accuracy')
knn_best_spec <- kknn_spec %>%
finalize_model(knn_best_params)
knn_final_fit <- workflow() %>%
add_recipe(kknn_recipe) %>%
add_model(knn_best_spec) %>%
last_fit(split = split)
knn_final_fit %>%
collect_metrics()
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy multiclass 0.445 Preprocessor1_Model1
## 2 roc_auc hand_till 0.623 Preprocessor1_Model1
knn_final_fit %>%
collect_predictions()
## # A tibble: 2,395 × 12
## id .pred_all_other .pred_credit_ca… .pred_debt_cons… .pred_education…
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 train/test split 0.2 0.333 0.333 0
## 2 train/test split 0.267 0.133 0.467 0.0667
## 3 train/test split 0.2 0.533 0.0667 0.0667
## 4 train/test split 0.2 0.2 0.4 0
## 5 train/test split 0.0667 0.333 0.267 0.2
## 6 train/test split 0.0667 0.0667 0.667 0.0667
## 7 train/test split 0.267 0.267 0.4 0
## 8 train/test split 0.333 0.0667 0.333 0.0667
## 9 train/test split 0.333 0.0667 0.0667 0.0667
## 10 train/test split 0.467 0.267 0.2 0.0667
## # … with 2,385 more rows, and 7 more variables: .pred_home_improvement <dbl>,
## # .pred_major_purchase <dbl>, .pred_small_business <dbl>, .row <int>,
## # .pred_class <fct>, purpose <fct>, .config <chr>
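The collected predictions can also be summarized as a confusion matrix with yardstick's conf_mat(), which makes it easier to see which loan purposes the model confuses (a sketch based on the columns shown above):
# Confusion matrix of predicted vs. true loan purpose (sketch)
knn_final_fit %>%
collect_predictions() %>%
conf_mat(truth = purpose, estimate = .pred_class)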
models <- c('xgboost', 'xgboost', 'random forest', 'random forest', 'knn', 'knn')
bind_rows(xg_final_fit %>% collect_metrics(), rf_final_fit %>% collect_metrics(), knn_final_fit %>%
collect_metrics()) %>%
mutate(models = models) %>%
select(models, everything(), -.config) %>%
gt() %>%
tab_header(title = 'Model Results') %>%
tab_options(container.height = 400,
container.overflow.y = TRUE,
heading.background.color = "#21908CFF",
table.width = "75%",
column_labels.background.color = "black",
table.font.color = "black") %>%
tab_style(style = list(cell_fill(color = "#35B779FF")),
locations = cells_body())
Model Results

models | .metric | .estimator | .estimate |
---|---|---|---|
xgboost | accuracy | multiclass | 0.4885177 |
xgboost | roc_auc | hand_till | 0.6949143 |
random forest | accuracy | multiclass | 0.4893528 |
random forest | roc_auc | hand_till | 0.6930445 |
knn | accuracy | multiclass | 0.4446764 |
knn | roc_auc | hand_till | 0.6231237 |
Predicting loan risk status (credit policy)
For the next two models, I will use xgboost.
#usemodels::use_xgboost(credit_policy ~ ., data = train)
xgboost_recipe <-
recipe(formula = credit_policy ~ ., data = train) %>%
step_novel(all_nominal(), -all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
step_zv(all_predictors())
xgboost_spec <-
boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = tune(),
loss_reduction = tune(), sample_size = tune()) %>%
set_mode("classification") %>%
set_engine("xgboost")
xgboost_workflow <-
workflow() %>%
add_recipe(xgboost_recipe) %>%
add_model(xgboost_spec)
set.seed(10500)
xgboost_tune <-
tune_grid(xgboost_workflow, resamples = resamps, grid = 10)
xgboost_tune %>%
autoplot()
xg_best_params <- xgboost_tune %>%
select_best(metric = 'accuracy')
xg_best_spec <- xgboost_spec %>%
finalize_model(xg_best_params)
xg_final_fit <- workflow() %>%
add_recipe(xgboost_recipe) %>%
add_model(xg_best_spec) %>%
last_fit(split = split)
xg_final_fit %>%
collect_metrics() %>%
gt() %>%
tab_header(title = 'Loan risk prediction results') %>%
tab_options(container.height = 400,
container.overflow.y = TRUE,
heading.background.color = "#21908CFF",
table.width = "75%",
column_labels.background.color = "black",
table.font.color = "black") %>%
tab_style(style = list(cell_fill(color = "#35B779FF")),
locations = cells_body())
Loan risk prediction results

.metric | .estimator | .estimate | .config |
---|---|---|---|
accuracy | binary | 0.9787056 | Preprocessor1_Model1 |
roc_auc | binary | 0.9833877 | Preprocessor1_Model1 |
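With accuracy and AUC this high, it is worth plotting the ROC curve from the collected predictions as a quick sanity check (a sketch; 'pass' is the first factor level, so it is treated as the event class by default):
# ROC curve for the credit_policy model (sketch)
xg_final_fit %>%
collect_predictions() %>%
roc_curve(truth = credit_policy, .pred_pass) %>%
autoplot()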
Predicting loan payment status
#usemodels::use_xgboost(not_fully_paid ~ ., data = train)
xgboost_recipe <-
recipe(formula = not_fully_paid ~ ., data = train) %>%
step_string2factor(purpose, not_fully_paid) %>%
step_novel(all_nominal(), -all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
step_zv(all_predictors())
xgboost_spec <-
boost_tree(trees = tune(), learn_rate = tune()) %>%
set_mode("classification") %>%
set_engine("xgboost")
# Fit a default boosted tree for variable importance (note: this fits purpose ~ ., as in the random forest section above)
boost_tree(mode = "classification") %>%
set_engine("xgboost", importance = "impurity") %>%
fit(purpose ~ .,
data = train) %>%
vip::vip(geom = 'point',
aesthetics = list(color = 'midnightblue', size = 4)) +
labs(title = 'Variable importance for predicting loan purpose (xgboost)')
## [10:20:02] WARNING: amalgamation/../src/learner.cc:573:
## Parameters: { "importance" } might not be used.
##
## This may not be accurate due to some parameters are only used in language bindings but
## passed down to XGBoost core. Or some parameters are not used but slip through this
## verification. Please open an issue if you find above cases.
##
##
## [10:20:02] WARNING: amalgamation/../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
xgboost_workflow <-
workflow() %>%
add_recipe(xgboost_recipe) %>%
add_model(xgboost_spec)
xgboost_grid <- grid_regular(
parameters(xgboost_spec),
levels = 3 )
set.seed(0826)
xgboost_tune <-
tune_grid(xgboost_workflow,
resamples = resamps,
control = control_grid(),
grid = xgboost_grid)
xgboost_tune %>%
autoplot()
xg_best_params <- xgboost_tune %>%
select_best(metric = 'accuracy')
xg_best_spec <- xgboost_spec %>%
finalize_model(xg_best_params)
xg_final_fit <- workflow() %>%
add_recipe(xgboost_recipe) %>%
add_model(xg_best_spec) %>%
last_fit(split = split)
xg_final_fit %>%
collect_metrics() %>%
gt() %>%
tab_header(title = 'Payment status prediction results') %>%
tab_options(container.height = 400,
container.overflow.y = TRUE,
heading.background.color = "#21908CFF",
table.width = "75%",
column_labels.background.color = "black",
table.font.color = "black") %>%
tab_style(style = list(cell_fill(color = "#35B779FF")),
locations = cells_body())
Payment status prediction results

.metric | .estimator | .estimate | .config |
---|---|---|---|
accuracy | binary | 0.8258873 | Preprocessor1_Model1 |
roc_auc | binary | 0.6187459 | Preprocessor1_Model1 |
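Accuracy looks high here while the ROC AUC is modest, which often points to an imbalanced outcome: if most loans are fully paid, always predicting the majority class already scores well. A quick look at the class balance and the confusion matrix makes this easy to check (a sketch, not part of the original post):
# Class balance of the outcome and confusion matrix for the final fit (sketch)
loans %>%
count(not_fully_paid) %>%
mutate(prop = n / sum(n))
xg_final_fit %>%
collect_predictions() %>%
conf_mat(truth = not_fully_paid, estimate = .pred_class)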