classification

In a classification problem the response is a categorical variable

rather than predicting the value of \(Y\), one wants to estimate the posterior probability

\[P(Y=k\mid X=x_{i})\]

that is, the probability that the observation \(i\) belongs the the class \(k\), given that the predictor value for \(i\) is \(x_{i}\)

will a credit card owner default?

Code
set.seed(1234)
default=read_csv("./data/Default.csv")
default |> slice_sample(n=6) |> kbl() |> kable_styling(font_size=12)
default student balance income
No Yes 311.32186 22648.76
No Yes 697.13558 18377.15
No Yes 470.10718 16014.11
No No 1200.04162 56081.08
No No 553.64902 47021.49
No No 10.23149 27237.38

Do balance and income bring useful info to identify deafulters?

Code
p1 = default |> ggplot(aes(x = balance, y=income,color=default))+
  geom_point(alpha=.5,size=3)+theme_minimal() 
p1

Do balance and income bring useful info to identify deafulters?

Code
p2 = default |> pivot_longer(names_to = "predictor",values_to="value",cols=c(balance,income)) |> 
  ggplot(aes(x = default, y = value))+geom_boxplot(aes(fill=default))+
  facet_wrap(~predictor,scales="free")+theme_minimal()
p2

balance is useful, income is not

Code
p1 | p2

Note: to arrange multiple plots together, give a look at the patchwork package

linear regression for non-numeric response?

if \(Y\) is categorical

With \(K\) categories, one could code \(Y\) as an integer vector

  • for non ordinal factors it would not make sense

 

  • even for ordinal factors, one would implicitly assume constant differences between categories

linear regression for non-numeric response?

if \(Y\) is binary

The goal is estimate \(P(Y=1|X)\), which is, in fact, numeric . . .

  • So one can model \(P(Y=1|X)=\beta_{0}+\beta_{1}X\) . . .

 

  • What can possibly be wrong with this approach?

linear regression for non-numeric response?

\(P(\texttt{default}=\texttt{yes}|\texttt{balance})=\beta_{0}+\beta_{1}\texttt{balance}\)

Code
default |> mutate(def_01 = ifelse(default== "Yes",1,0)) |>  
  ggplot(aes(x=balance, y=def_01)) + geom_point(aes(color = default),size=4, alpha=.5) + 
  geom_smooth(method="lm") + theme_minimal()
\(P(\texttt{default}=\texttt{yes}|\texttt{balance})\) is bounded in \([0, 1]\), therefore the linear fit cannot be used

use the logistic function instead

\(P(\texttt{default}=\texttt{yes}|\texttt{balance})=\frac{e^{\beta_{0}+\beta_{1}\texttt{balance}}}{1+e^{\beta_{0}+\beta_{1}\texttt{balance}}}\)

modeling the posterior \(P(Y=1|X)\) by means of a logistic function is the goal of logistic regression

conditional expectation

just like in linear regression, the fit refers to the conditional expectation of \(Y\) given \(X\); since \(Y\in\{0,1\}\), it results that \[E[Y|X] \equiv P(Y=1|X)\]

The odds and the logit

\[ \begin{split} p(X)&=\frac{e^{\beta_{0}+\beta_{1}X}}{1+e^{\beta_{0}+\beta_{1}X}}\\ \left(1+e^{\beta_{0}+\beta_{1}X}\right)p(X)&=e^{\beta_{0}+\beta_{1}X}\\ p(X)+e^{\beta_{0}+\beta_{1}X}p(X)&=e^{\beta_{0}+\beta_{1}X}\\ p(X)&=e^{\beta_{0}+\beta_{1}X}+e^{\beta_{0}-\beta_{1}X}p(X)\\ p(X)&=e^{\beta_{0}+\beta_{1}X}\left(1-p(X)\right)\\ \frac{p(X)}{\left(1-p(X)\right)}&=e^{\beta_{0}+\beta_{1}X} \end{split} \]

  • \(\frac{p(X)}{\left(1-p(X)\right)}\) are the odds, that take value in \([0,\infty]\)
  • \(log\left(\frac{p(X)}{\left(1-p(X)\right)}\right)=\beta_{0}+\beta_{1}X\) is the logit: there is a linear relation between the logit and the predictor

Estimating the posterior probability via the logistic function

a toy sample

Estimating the posterior probability via the logistic function

a toy sample : fit the logistic function

Estimating the posterior probability via the logistic function

a toy sample : for a new point \(\texttt{balance}=1400\)

Estimating the posterior probability via the logistic function

a toy sample: one can estimate \(P(\texttt{default=Yes}|\texttt{balance}=1400)\)

Estimating the posterior probability via the logistic function

a toy sample: one can estimate \(P(\texttt{default=Yes}|\texttt{balance}=1400)=.62\)

Estimating the posterior probability via the logistic function

How to find the logistic function? estimate its parameters \(P(Y=1|X) = \frac{e^{\beta_{0}+\beta_{1}X}}{1 + e^{\beta_{0}+\beta_{1}X}}\)

Fitting the logistic regression

pre-process: specify the recipe

Code
def_rec = recipe(default~balance, data=default)

specify the model

Code
def_model_spec = logistic_reg(mode="classification", engine="glm")

put them together in the workflow

Code
def_wflow = workflow() |> add_recipe(def_rec) |> add_model(def_model_spec)

fit the model

Code
def_fit = def_wflow |> fit(data=default) 

Fitting the logistic regression

Look at the results

Code
def_fit |>  tidy() |> kbl() |> kable_styling(font_size=12)
term estimate std.error statistic p.value
(Intercept) -10.6513306 0.3611574 -29.49221 0
balance 0.0054989 0.0002204 24.95309 0
Code
def_fit |>  glance() |> kbl() |> kable_styling(font_size=12)
null.deviance df.null logLik AIC BIC deviance df.residual nobs
2920.65 9999 -798.2258 1600.452 1614.872 1596.452 9998 10000

Fitting the logistic regression: a qualitative predictor

Suppose you want to use \(\texttt{student}\) as the qualitative predictor for your logistic regression. You can update, within the workflow, the recipe only.

update the recipe in the workflow and re-fit

Code
def_wflow2 = def_wflow |> update_recipe(recipe(default~student,data=default))
def_fit2 = def_wflow2 |> fit(data=default) 

look at the results

Code
def_fit2 |>  tidy() |> kbl() |> kable_styling(font_size=12)
term estimate std.error statistic p.value
(Intercept) -3.5041278 0.0707130 -49.554219 0.0000000
studentYes 0.4048871 0.1150188 3.520181 0.0004313

Fitting the logistic regression: a qualitative predictor

It appears that if a customer is a student, he is more likely to default ( \(\hat{\beta}_{1} = 0.4\) ).

  • For non students \[ \begin{split} log\left(\frac{p(X)}{1-p(X)}\right) = -3.504 &\rightarrow& \ \frac{p(X)}{1-p(X)} = e^{-3.504}\rightarrow\\ p(X) = e^{-3.504}-e^{-3.504}p(X)&\rightarrow& \ p(X) +e^{-3.504}p(X) = e^{-3.504}\rightarrow\\ p(X)(1 +e^{-3.504}) = e^{-3.504} &\rightarrow & \ p(X) = \frac{e^{-3.504}}{1 +e^{-3.504}}=\color{blue}{0.029} \end{split} \]
  • For students \[ p(X) = \frac{e^{-3.504+0.4}}{1 +e^{-3.504+0.4}}=\color{red}{0.042} \]

Multiple logistic regression

In case of multiple predictors

\[log\left(\frac{p(X)}{1-p(X)} \right)=\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\ldots+\beta_{p}X_{p}\]

and following relation holds

\[p(X)=\frac{e^{{\beta}_{0}+{\beta}_{1}X_{1}+{\beta}_{2}X_{2}+\ldots+{\beta}_{p}X_{p}}}{1+e^{{\beta}_{0}+{\beta}_{1}X_{1}+{\beta}_{2}X_{2}+\ldots+{\beta}_{p}X_{p}}}\]

Multiple logistic regression

Let’s consider two predictors \(\texttt{balance}\) and \(\texttt{student}\), again we just update the recipe within the workflow

update the recipe in the workflow and re-fit

Code
def_wflow3 = def_wflow |> update_recipe(recipe(default~balance+student,data=default))
def_fit3 = def_wflow3 |> fit(data=default) 

look at the results

Code
def_fit3 |>  tidy() |> kbl() |> kable_styling(font_size=12)
term estimate std.error statistic p.value
(Intercept) -10.7494959 0.3691914 -29.116326 0.0e+00
balance 0.0057381 0.0002318 24.749526 0.0e+00
studentYes -0.7148776 0.1475190 -4.846003 1.3e-06

 

anything weird?

Multiple logistic regression

Code
p_stu = def_fit2 |> augment(new_data=default) |> 
  ggplot(aes(x=balance, y= .pred_Yes,color=student))+geom_line(size=3)+
  theme_minimal() + ylim(c(0,.1))+xlab("balance")+ylab("P(default=yes|student)")

p_stu 

Multiple logistic regression

Code
p_bal_stu = def_fit3 |> augment(new_data=default) |> ggplot(aes(x=balance, y= .pred_Yes,color=student)) +
  geom_line(size=3) + theme_minimal() + xlab("balance") + ylab("P(default=yes|balance,student)")

p_bal_stu

Multiple logistic regression

Multiple logistic regression

Two ways to estimate class probabilities

Suppose we want to estimate:

\[ P(Y = k \mid X = x) \]

that is, the probability that an observation with predictors (x) belongs to class (k).

Discriminative approach

Model the posterior probability directly:

\[ P(Y = k \mid X = x) \]

Examples:

  • logistic regression
  • multinomial logistic regression

Generative approach

Model how the data are distributed within each class: \(P(X = x \mid Y = k)\)

and combine this with the class proportions: \(P(Y = k)\)

Examples:

  • LDA
  • QDA
  • naive Bayes

Bayes rule for classification

Generative classifiers estimate class probabilities indirectly using Bayes’ rule:

\[ P(Y = k \mid X = x) = \frac{ P(X = x \mid Y = k)P(Y = k) }{ P(X = x) } \]

For classification, the denominator is the same for all classes, so we compare:

\[ P(X = x \mid Y = k)P(Y = k) \]

and assign the observation to the class with the largest value:

\[ \widehat{Y} = \arg\max_k P(X = x \mid Y = k)P(Y = k) \]

A generative classifier combines:

  1. how likely the observation is under each class;
  2. how frequent each class is overall.

From Bayes rule to LDA

LDA is obtained by making a specific assumption about the class-conditional distributions:

\[ X \mid Y = k \sim N(\mu_k, \Sigma) \]

That is, within each class the predictors are approximately normally distributed, and all classes share the same covariance structure.

Linear discriminant analysis: the idea

LDA is a generative classifier.

It assumes that, within each class, the predictor follows a normal distribution:

\[ X \mid Y = k \sim N(\mu_k, \sigma^2) \]

For each class, LDA estimates: the class mean \(\mu_k\); the common variability \(\sigma^2\) and the class proportion \(\pi_k\).

Then it assigns a new observation to the class with the largest estimated posterior probability:

\[ \widehat{Y} = \arg\max_k \widehat{P}(Y = k \mid X = x) \]

LDA with one predictor

With two classes, equal priors and common variance:

\[ X \mid Y = 1 \sim N(\mu_1, \sigma^2) \]

\[ X \mid Y = 2 \sim N(\mu_2, \sigma^2) \]

the LDA decision boundary is:

\[ x = \frac{\mu_1 + \mu_2}{2} \]

simple case

LDA classifies an observation according to which class mean it is closer to.

The boundary is estimated from data

In practice, the true means are unknown.

LDA estimates the class means from the training data:

\[ \hat{\mu}_1, \hat{\mu}_2 \]

and places the estimated boundary at:

\[ \frac{\hat{\mu}_1 + \hat{\mu}_2}{2} \]

Approximate Bayes boundary

The estimated boundary may be close to, but not exactly equal to, the theoretical Bayes boundary.

LDA with one predictor

Code
set.seed(1234)

p_1 <- ggplot() +
  xlim(-10, 10) +
  theme_minimal() +
  stat_function(
    fun = dnorm,
    args = list(mean = 4, sd = 2),
    geom = "area",
    fill = "dodgerblue",
    alpha = .25
  ) +
  stat_function(
    fun = dnorm,
    args = list(mean = -4, sd = 2),
    geom = "area",
    fill = "indianred",
    alpha = .25
  ) +
  geom_vline(xintercept = 0, size = 2, alpha = .5) +
  geom_vline(xintercept = -4, color = "grey", size = 3, alpha = .5) +
  geom_vline(xintercept = 4, color = "grey", size = 3, alpha = .5) +
  geom_point(
    aes(x = -2, y = 0),
    inherit.aes = FALSE,
    size = 10,
    alpha = .5,
    color = "darkgreen"
  ) +
  geom_point(
    aes(x = 1, y = 0),
    inherit.aes = FALSE,
    size = 10,
    alpha = .5,
    color = "magenta"
  ) +
  xlab("x")

p_1

The \(\color{darkgreen}{\text{green point}}\) is assigned to class 1; the \(\color{magenta}{\text{pink point}}\) is assigned to class 2.

The boundary is estimated from data

In practice, the true class means are unknown. LDA estimates them from the training data \(\hat{\mu}_1, \hat{\mu}_2\) - the estimated boundary at \(\frac{\hat{\mu}_1 + \hat{\mu}_2}{2}\)

Code
set.seed(1234)

class_12 <- tibble(
  class_1 = rnorm(50, mean = -4, sd = 2),
  class_2 = rnorm(50, mean = 4, sd = 2)
) |>
  pivot_longer(
    names_to = "classes",
    values_to = "values",
    cols = 1:2
  )

mu_12 <- class_12 |>
  group_by(classes) |>
  summarise(means = mean(values), .groups = "drop")

mu_12_mean <- mean(mu_12$means)

p_2 <- class_12 |>
  ggplot(aes(x = values, fill = classes)) +
  theme_minimal() +
  geom_histogram(aes(y = after_stat(density)), alpha = .5, color = "grey") +
  xlim(-10, 10) +
  geom_vline(
    xintercept = mu_12 |> pull(means),
    color = "grey",
    size = 3,
    alpha = .75
  ) +
  geom_vline(xintercept = mu_12_mean, size = 2, alpha = .75) +
  theme(legend.position = "none")

p_2

Optimal vs estimanted boundary

The theoretical Bayes boundary is at \(0\).
The estimated boundary is slightly off, at -0.31.

Fitting LDA

Code
default <- read_csv("./data/Default.csv") |>
  mutate(default = as.factor(default))

set.seed(1234)

def_split <- initial_split(default, prop = 3/4, strata = default)

default_train <- training(def_split)
default_test  <- testing(def_split)

def_rec <- recipe(default ~ balance, data = default_train)

def_lda_spec <- discrim_linear(
  mode = "classification",
  engine = "MASS"
)

def_wflow_lda <- workflow() |>
  add_recipe(def_rec) |>
  add_model(def_lda_spec)

def_fit_lda <- def_wflow_lda |>
  fit(data = default_train)

just change the model spec

The workflow is the same as before.
Only the model specification changes.

LDA with more than one predictor

With several predictors, LDA assumes that within each class:

\[ X \mid Y = k \sim N(\mu_k, \Sigma) \]

Class-specific errors

A classifier can produce predicted probabilities.

To obtain predicted classes, we choose a threshold.

For example, in the default problem:

\[ \widehat{P}(\text{default} = \text{Yes} \mid X) > c \]

where \(c\) is the classification threshold.

Code
lda_pred <- def_fit_lda |>
  augment(new_data = default_test) |>
  dplyr::select(default, .pred_class, .pred_Yes) |>
  mutate(
    .pred_class_0_05 = as.factor(ifelse(.pred_Yes > .05, "Yes", "No")),
    .pred_class_0_1  = as.factor(ifelse(.pred_Yes > .1,  "Yes", "No")),
    .pred_class_0_2  = as.factor(ifelse(.pred_Yes > .2,  "Yes", "No")),
    .pred_class_0_3  = as.factor(ifelse(.pred_Yes > .3,  "Yes", "No")),
    .pred_class_0_4  = as.factor(ifelse(.pred_Yes > .4,  "Yes", "No")),
    .pred_class_0_5  = as.factor(ifelse(.pred_Yes > .5,  "Yes", "No"))
  )

Moving the threshold

Different thresholds lead to different types of classification errors.

Class-specific errors

Moving the threshold

Lowering the threshold usually identifies more defaulters, but may also increase false alarms.

ROC curve

The ROC curve shows how sensitivity and specificity change as the threshold varies.

Code
def_fit_lda |>
  augment(new_data = default_test) |>
  mutate(default = factor(default, levels = c("Yes", "No"))) |>
  roc_curve(truth = default, .pred_Yes) |>
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  ggtitle("LDA ROC curve") +
  geom_path(color = "indianred") +
  geom_abline(lty = 3) +
  coord_equal() +
  theme_minimal()

Quadratic discriminant analysis

QDA: the idea

QDA is similar to LDA, but it relaxes one assumption.

LDA assumes a common covariance matrix:

\[ X \mid Y = k \sim N(\mu_k, \Sigma) \]

QDA allows each class to have its own covariance matrix:

\[ X \mid Y = k \sim N(\mu_k, \Sigma_k) \]

Note

Because the covariance matrix can change across classes, QDA can produce curved decision boundaries.

Fitting QDA

Code
def_rec <- recipe(default ~ balance, data = default_train)

def_qda_spec <- discrim_quad(
  mode = "classification",
  engine = "MASS"
)

def_wflow_qda <- workflow() |>
  add_recipe(def_rec) |>
  add_model(def_qda_spec)

def_fit_qda <- def_wflow_qda |>
  fit(data = default_train)
Code
def_fit_qda |>
  augment(new_data = default_test) |>
  dplyr::select(default, .pred_class, .pred_Yes) |>
  mutate(default = factor(default, levels = c("Yes", "No"))) |>
  roc_curve(truth = default, .pred_Yes) |>
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  ggtitle("QDA ROC curve") +
  geom_path(color = "dodgerblue") +
  geom_abline(lty = 3) +
  coord_equal() +
  theme_minimal()

Naive Bayes

Naive Bayes: the idea

Naive Bayes is also a generative classifier.

Like LDA and QDA, it uses Bayes’ rule.
However, it makes a simplifying assumption:

\[ f_k(X) = f_{k1}(X_1) \times f_{k2}(X_2) \times \cdots \times f_{kp}(X_p) \]

That is, predictors are assumed to be independent within each class.

Note

The assumption is often unrealistic, but it makes the model simple and stable, especially with many predictors or small samples.

Naive Bayes with different predictors

For each predictor and each class, Naive Bayes estimates a separate distribution.

For continuous predictors, one can use:

\[ f_{kj}(X_j) \sim N(\mu_{kj}, \sigma_{kj}^2) \]

or a kernel density estimator.

For categorical predictors, one can use class-specific relative frequencies.

Note

Naive Bayes combines many simple one-variable models into a full classifier.

Fitting Naive Bayes

Code
def_rec <- recipe(default ~ balance, data = default_train)

def_nb_spec <- naive_Bayes(
  mode = "classification",
  engine = "naivebayes"
)

def_wflow_nb <- workflow() |>
  add_recipe(def_rec) |>
  add_model(def_nb_spec)

def_fit_nb <- def_wflow_nb |>
  fit(data = default_train)
Code
def_fit_nb |>
  augment(new_data = default_test) |>
  dplyr::select(default, .pred_class, .pred_Yes) |>
  mutate(default = factor(default, levels = c("Yes", "No"))) |>
  roc_curve(truth = default, .pred_Yes) |>
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  ggtitle("Naive Bayes ROC curve") +
  geom_path(color = "darkgreen") +
  geom_abline(lty = 3) +
  coord_equal() +
  theme_minimal()

Comparing classifiers

Comparing classifiers

method .estimate
logistic regression 0.0517932
LDA 0.0517932
QDA 0.0517932
naive Bayes 0.0517837

Important

The same workflow can be used to compare different classifiers:

model specification → fitting → predicted probabilities → ROC / AUC.