Modelling loan interest rates

Suggested answers


In this application exercise we will be studying loan interest rates. The dataset is one you’ve come across before in your reading – the dataset about loans from the peer-to-peer lender, Lending Club, from the openintro package. We will use tidyverse and tidymodels for data exploration and modeling, respectively.

library(tidyverse)
library(tidymodels)
library(openintro)

Before we use the dataset, we’ll make a few transformations to it.

loans <- loans_full_schema %>%
  mutate(
    credit_util = total_credit_utilized / total_credit_limit,
    bankruptcy  = as.factor(if_else(public_record_bankrupt == 0, 0, 1)),
    verified_income = droplevels(verified_income),
    homeownership = str_to_title(homeownership),
    homeownership = fct_relevel(homeownership, "Rent", "Mortgage", "Own")
    ) %>%
  rename(credit_checks = inquiries_last_12m) %>%
  select(interest_rate, verified_income, debt_to_income, credit_util, bankruptcy, term, credit_checks, issue_month, homeownership) 

Here is a glimpse at the data:

glimpse(loans)

Interest rate vs. credit utilization ratio

The regression model for interest rate vs. credit utilization is as follows.

rate_util_fit <- linear_reg() |>
  fit(interest_rate ~ credit_util, data = loans)

tidy(rate_util_fit)
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)    10.5     0.0871     121.  0        
2 credit_util     4.73    0.180       26.3 1.18e-147

And here is the model visualized:

ggplot(loans, aes(x = credit_util, y = interest_rate)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

  • Your turn: What is the estimated interest rate for a loan applicant with credit utilization of 0.8, i.e. someone whose total credit balance is 80% of their total available credit?
credit_util_80 <- tibble(credit_util = 0.8)

predict(rate_util_fit, new_data = credit_util_80)
# A tibble: 1 × 1
  .pred
  <dbl>
1  14.3
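The prediction can be checked by hand by plugging 0.8 into the fitted line. The sketch below uses the coefficients as rounded in the tidy output, so it can differ from `predict()` in later decimal places; the object names are only illustrative.

```r
# Coefficients as rounded in the tidy() output above
b0 <- 10.5   # intercept
b1 <- 4.73   # slope for credit_util

# Credit utilization of interest
credit_util <- 0.8

# Plug into the fitted line: intercept + slope * credit_util
pred_by_hand <- b0 + b1 * credit_util
pred_by_hand            # 14.284
round(pred_by_hand, 1)  # 14.3, matching the predict() output
```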

Interest rate vs. homeownership

Next we predict interest rates from homeownership, which is a categorical predictor with three levels:

levels(loans$homeownership)
[1] "Rent"     "Mortgage" "Own"     
  • Demo: Fit the linear regression model to predict interest rate from homeownership and display a tidy summary of the model. Write the estimated model output below.
rate_home_fit <- linear_reg() |>
  fit(interest_rate ~ homeownership, data = loans)

tidy(rate_home_fit)
# A tibble: 3 × 5
  term                  estimate std.error statistic  p.value
  <chr>                    <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)             12.9      0.0803    161.   0       
2 homeownershipMortgage   -0.866    0.108      -8.03 1.08e-15
3 homeownershipOwn        -0.611    0.158      -3.88 1.06e- 4
  • Your turn: Interpret each coefficient in context of the problem.

    • Intercept: Loan applicants who rent are predicted to receive an interest rate of 12.9%, on average.

    • Slopes:

      • The model predicts that loan applicants who have a mortgage on their home receive an interest rate that is 0.866 percentage points lower than those who rent their home, on average.

      • The model predicts that loan applicants who own their home receive an interest rate that is 0.611 percentage points lower than those who rent their home, on average.
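With a single categorical predictor, the fitted value for each level is the intercept plus that level's coefficient. The sketch below recovers the three predicted rates from the rounded coefficients in the tidy output; the variable names are illustrative.

```r
# Rounded coefficients from tidy(rate_home_fit)
intercept <- 12.9     # baseline level: Rent
mortgage  <- -0.866   # difference from Rent for Mortgage
own       <- -0.611   # difference from Rent for Own

# Predicted average interest rate for each homeownership level
group_means <- c(
  Rent     = intercept,
  Mortgage = intercept + mortgage,
  Own      = intercept + own
)
group_means
```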

Interest rate vs. credit utilization and homeownership

Main effects model

  • Demo: Fit a model to predict interest rate from credit utilization and homeownership, without an interaction effect between the two predictors. Display the summary output and write out the estimated regression equation.
rate_util_home_fit <- linear_reg() |>
  fit(interest_rate ~ credit_util + homeownership, data = loans)

tidy(rate_util_home_fit)
# A tibble: 4 × 5
  term                  estimate std.error statistic   p.value
  <chr>                    <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)              9.93      0.140    70.8   0        
2 credit_util              5.34      0.207    25.7   2.20e-141
3 homeownershipMortgage    0.696     0.121     5.76  8.71e-  9
4 homeownershipOwn         0.128     0.155     0.827 4.08e-  1

\[ \widehat{interest~rate} = 9.93 + 5.34 \times credit~util + 0.696 \times Mortgage + 0.128 \times Own \]

  • Demo: Write the estimated regression equation for loan applications from each of the homeownership groups separately.
    • Rent: \(\widehat{interest~rate} = 9.93 + 5.34 \times credit~util\)
    • Mortgage: \(\widehat{interest~rate} = 10.626 + 5.34 \times credit~util\)
    • Own: \(\widehat{interest~rate} = 10.058 + 5.34 \times credit~util\)
  • Question: How does the model predict the interest rate to vary as credit utilization varies for loan applicants with different homeownership status? Are the rates the same or different?

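The same. The slope for credit utilization is 5.34 in all three groups; only the intercepts differ, so the three lines are parallel.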
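One way to see this is to compute the predicted change in interest rate over the same change in credit utilization for each group. The sketch below uses the rounded main effects coefficients; `predict_main()` is a hypothetical helper written for this illustration, not part of tidymodels.

```r
# Rounded coefficients from tidy(rate_util_home_fit)
coefs <- c(intercept = 9.93, credit_util = 5.34, mortgage = 0.696, own = 0.128)

# Hypothetical helper: predicted rate for a given utilization and group
predict_main <- function(credit_util, group = c("Rent", "Mortgage", "Own")) {
  group <- match.arg(group)
  unname(
    coefs["intercept"] +
      coefs["credit_util"] * credit_util +
      coefs["mortgage"] * (group == "Mortgage") +
      coefs["own"] * (group == "Own")
  )
}

# Change in predicted rate when utilization goes from 0.2 to 0.7
groups <- c("Rent", "Mortgage", "Own")
change <- sapply(groups, function(g) predict_main(0.7, g) - predict_main(0.2, g))
change  # the same value (5.34 * 0.5) for every group
```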

Interaction effects model

  • Demo: Fit a model to predict interest rate from credit utilization and homeownership, with an interaction effect between the two predictors. Display the summary output and write out the estimated regression equation.
rate_util_home_int_fit <- linear_reg() |>
  fit(interest_rate ~ credit_util * homeownership, data = loans)

tidy(rate_util_home_int_fit)
# A tibble: 6 × 5
  term                              estimate std.error statistic  p.value
  <chr>                                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                          9.44      0.199     47.5  0       
2 credit_util                          6.20      0.325     19.1  1.01e-79
3 homeownershipMortgage                1.39      0.228      6.11 1.04e- 9
4 homeownershipOwn                     0.697     0.316      2.20 2.75e- 2
5 credit_util:homeownershipMortgage   -1.64      0.457     -3.58 3.49e- 4
6 credit_util:homeownershipOwn        -1.06      0.590     -1.80 7.24e- 2

\[ \widehat{interest~rate} = 9.44 + 6.20 \times credit~util + 1.39 \times Mortgage + 0.697 \times Own - 1.64 \times credit~util \times Mortgage - 1.06 \times credit~util \times Own \]

  • Demo: Write the estimated regression equation for loan applications from each of the homeownership groups separately.
    • Rent: \(\widehat{interest~rate} = 9.44 + 6.20 \times credit~util\)
    • Mortgage: \(\widehat{interest~rate} = 10.83 + 4.56 \times credit~util\)
    • Own: \(\widehat{interest~rate} = 10.137 + 5.14 \times credit~util\)
  • Question: How does the model predict the interest rate to vary as credit utilization varies for loan applicants with different homeownership status? Are the rates the same or different?

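Different. The slope for credit utilization is 6.20 for applicants who rent, 4.56 for those with a mortgage, and 5.14 for those who own, so the lines are no longer parallel.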
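The group-specific intercepts and slopes come from adding each group's main effect to the baseline intercept and each group's interaction term to the baseline slope. The sketch below does that arithmetic with the rounded coefficients from the tidy output; `group_lines` is just an illustrative name.

```r
# Rounded coefficients from tidy(rate_util_home_int_fit)
b_intercept     <- 9.44
b_util          <- 6.20
b_mortgage      <- 1.39
b_own           <- 0.697
b_util_mortgage <- -1.64
b_util_own      <- -1.06

# Intercept and slope of the fitted line for each homeownership group
group_lines <- data.frame(
  group     = c("Rent", "Mortgage", "Own"),
  intercept = c(b_intercept, b_intercept + b_mortgage, b_intercept + b_own),
  slope     = c(b_util, b_util + b_util_mortgage, b_util + b_util_own)
)
group_lines
```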

Choosing a model

Rule of thumb: Occam’s Razor - Don’t overcomplicate the situation! We prefer the simplest best model.

glance(rate_util_home_fit)
# A tibble: 1 × 12
  r.squared adj.r.…¹ sigma stati…²   p.value    df  logLik    AIC    BIC devia…³
      <dbl>    <dbl> <dbl>   <dbl>     <dbl> <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
1    0.0682   0.0679  4.83    244. 1.25e-152     3 -29926. 59861. 59897. 232954.
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#   variable names ¹​adj.r.squared, ²​statistic, ³​deviance
glance(rate_util_home_int_fit)
# A tibble: 1 × 12
  r.squared adj.r.…¹ sigma stati…²   p.value    df  logLik    AIC    BIC devia…³
      <dbl>    <dbl> <dbl>   <dbl>     <dbl> <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
1    0.0694   0.0689  4.83    149. 4.79e-153     5 -29919. 59852. 59903. 232652.
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#   variable names ¹​adj.r.squared, ²​statistic, ³​deviance
  • Review: What is R-squared? What is adjusted R-squared?

R-squared is the proportion of variability in the response that is explained by the model. It never decreases when a predictor is added, so for model selection it is only useful for comparing models with the same number of predictors.

Adjusted R-squared is similar, but includes a penalty for the number of predictors in the model. It should be used for model selection when the models have different numbers of predictors.
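For reference, the usual formula for adjusted R-squared in a linear model makes the penalty explicit. The symbols \(n\) (number of observations used in the fit) and \(p\) (number of predictor terms, excluding the intercept) are introduced here and are not part of the output above.

\[ R^2_{adj} = 1 - (1 - R^2) \times \frac{n - 1}{n - p - 1} \]

Because the fraction grows as \(p\) grows, adding a term raises adjusted R-squared only if the gain in \(R^2\) outweighs the penalty.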

  • Question: Based on the adjusted \(R^2\)s of these two models, which one do we prefer?

The interaction effects model, though just barely: its adjusted R-squared is 0.0689, compared to 0.0679 for the main effects model.

Another model to consider

  • Your turn: Let’s add one more variable to the model: issue month. Should we add this variable to the interaction effects model from earlier?
linear_reg() |>
  fit(interest_rate ~ credit_util * homeownership + issue_month, data = loans) |>
  glance()
# A tibble: 1 × 12
  r.squared adj.r.…¹ sigma stati…²   p.value    df  logLik    AIC    BIC devia…³
      <dbl>    <dbl> <dbl>   <dbl>     <dbl> <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
1    0.0694   0.0688  4.83    106. 5.62e-151     7 -29919. 59856. 59921. 232641.
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#   variable names ¹​adj.r.squared, ²​statistic, ³​deviance

No, the adjusted R-squared goes down slightly, from 0.0689 to 0.0688.