Essential Statistical Data Analysis

Class 2 — classification workflows

Alfonso Iodice D’Enza

2026-05-18

classification

In a classification problem the response is a categorical variable

rather than predicting the value of \(Y\), one wants to estimate the posterior probability

\[P(Y=k\mid X=x_{i})\]

that is, the probability that the observation \(i\) belongs the the class \(k\), given that the predictor value for \(i\) is \(x_{i}\)

will a credit card owner default?

Code

set.seed(1234)
default=read_csv("./data/Default.csv")
default |> slice_sample(n=6) |> kbl() |> kable_styling(font_size=12)

default	student	balance	income
No	Yes	311.32186	22648.76
No	Yes	697.13558	18377.15
No	Yes	470.10718	16014.11
No	No	1200.04162	56081.08
No	No	553.64902	47021.49
No	No	10.23149	27237.38

Do balance and income bring useful info to identify deafulters?

Code

p1 = default |> ggplot(aes(x = balance, y=income,color=default))+
  geom_point(alpha=.5,size=3)+theme_minimal() 
p1

Do balance and income bring useful info to identify deafulters?

Code

p2 = default |> pivot_longer(names_to = "predictor",values_to="value",cols=c(balance,income)) |> 
  ggplot(aes(x = default, y = value))+geom_boxplot(aes(fill=default))+
  facet_wrap(~predictor,scales="free")+theme_minimal()
p2

balance is useful, income is not

Code

p1 | p2

Note: to arrange multiple plots together, give a look at the patchwork package

linear regression for non-numeric response?

if \(Y\) is categorical

With \(K\) categories, one could code \(Y\) as an integer vector

for non ordinal factors it would not make sense

even for ordinal factors, one would implicitly assume constant differences between categories

linear regression for non-numeric response?

if \(Y\) is binary

The goal is estimate \(P(Y=1|X)\), which is, in fact, numeric . . .

So one can model \(P(Y=1|X)=\beta_{0}+\beta_{1}X\) . . .

What can possibly be wrong with this approach?

linear regression for non-numeric response?

\(P(\texttt{default}=\texttt{yes}|\texttt{balance})=\beta_{0}+\beta_{1}\texttt{balance}\)

Code

default |> mutate(def_01 = ifelse(default== "Yes",1,0)) |>  
  ggplot(aes(x=balance, y=def_01)) + geom_point(aes(color = default),size=4, alpha=.5) + 
  geom_smooth(method="lm") + theme_minimal()

\(P(\texttt{default}=\texttt{yes}|\texttt{balance})\) is bounded in \([0, 1]\), therefore the linear fit cannot be used

use the logistic function instead

\(P(\texttt{default}=\texttt{yes}|\texttt{balance})=\frac{e^{\beta_{0}+\beta_{1}\texttt{balance}}}{1+e^{\beta_{0}+\beta_{1}\texttt{balance}}}\)

modeling the posterior \(P(Y=1|X)\) by means of a logistic function is the goal of logistic regression

conditional expectation

just like in linear regression, the fit refers to the conditional expectation of \(Y\) given \(X\); since \(Y\in\{0,1\}\), it results that \[E[Y|X] \equiv P(Y=1|X)\]

The odds and the logit

\[ \begin{split} p(X)&=\frac{e^{\beta_{0}+\beta_{1}X}}{1+e^{\beta_{0}+\beta_{1}X}}\\ \left(1+e^{\beta_{0}+\beta_{1}X}\right)p(X)&=e^{\beta_{0}+\beta_{1}X}\\ p(X)+e^{\beta_{0}+\beta_{1}X}p(X)&=e^{\beta_{0}+\beta_{1}X}\\ p(X)&=e^{\beta_{0}+\beta_{1}X}+e^{\beta_{0}-\beta_{1}X}p(X)\\ p(X)&=e^{\beta_{0}+\beta_{1}X}\left(1-p(X)\right)\\ \frac{p(X)}{\left(1-p(X)\right)}&=e^{\beta_{0}+\beta_{1}X} \end{split} \]

\(\frac{p(X)}{\left(1-p(X)\right)}\) are the odds, that take value in \([0,\infty]\)

\(log\left(\frac{p(X)}{\left(1-p(X)\right)}\right)=\beta_{0}+\beta_{1}X\) is the logit: there is a linear relation between the logit and the predictor

Estimating the posterior probability via the logistic function

a toy sample

Estimating the posterior probability via the logistic function

a toy sample : fit the logistic function

Estimating the posterior probability via the logistic function

a toy sample : for a new point \(\texttt{balance}=1400\)

Estimating the posterior probability via the logistic function

a toy sample: one can estimate \(P(\texttt{default=Yes}|\texttt{balance}=1400)\)

Estimating the posterior probability via the logistic function

a toy sample: one can estimate \(P(\texttt{default=Yes}|\texttt{balance}=1400)=.62\)

Estimating the posterior probability via the logistic function

How to find the logistic function? estimate its parameters \(P(Y=1|X) = \frac{e^{\beta_{0}+\beta_{1}X}}{1 + e^{\beta_{0}+\beta_{1}X}}\)

Fitting the logistic regression

pre-process: specify the recipe

Code

def_rec = recipe(default~balance, data=default)

specify the model

Code

def_model_spec = logistic_reg(mode="classification", engine="glm")

put them together in the workflow

Code

def_wflow = workflow() |> add_recipe(def_rec) |> add_model(def_model_spec)

fit the model

Code

def_fit = def_wflow |> fit(data=default)

Fitting the logistic regression

Look at the results

Code

def_fit |>  tidy() |> kbl() |> kable_styling(font_size=12)

term	estimate	std.error	statistic	p.value
(Intercept)	-10.6513306	0.3611574	-29.49221	0
balance	0.0054989	0.0002204	24.95309	0

Code

def_fit |>  glance() |> kbl() |> kable_styling(font_size=12)

null.deviance	df.null	logLik	AIC	BIC	deviance	df.residual	nobs
2920.65	9999	-798.2258	1600.452	1614.872	1596.452	9998	10000

Fitting the logistic regression: a qualitative predictor

Suppose you want to use \(\texttt{student}\) as the qualitative predictor for your logistic regression. You can update, within the workflow, the recipe only.

update the recipe in the workflow and re-fit

Code

def_wflow2 = def_wflow |> update_recipe(recipe(default~student,data=default))
def_fit2 = def_wflow2 |> fit(data=default)

look at the results

Code

def_fit2 |>  tidy() |> kbl() |> kable_styling(font_size=12)

term	estimate	std.error	statistic	p.value
(Intercept)	-3.5041278	0.0707130	-49.554219	0.0000000
studentYes	0.4048871	0.1150188	3.520181	0.0004313

Fitting the logistic regression: a qualitative predictor

It appears that if a customer is a student, he is more likely to default ( \(\hat{\beta}_{1} = 0.4\) ).

For non students \[ \begin{split} log\left(\frac{p(X)}{1-p(X)}\right) = -3.504 &\rightarrow& \ \frac{p(X)}{1-p(X)} = e^{-3.504}\rightarrow\\ p(X) = e^{-3.504}-e^{-3.504}p(X)&\rightarrow& \ p(X) +e^{-3.504}p(X) = e^{-3.504}\rightarrow\\ p(X)(1 +e^{-3.504}) = e^{-3.504} &\rightarrow & \ p(X) = \frac{e^{-3.504}}{1 +e^{-3.504}}=\color{blue}{0.029} \end{split} \]

For students \[ p(X) = \frac{e^{-3.504+0.4}}{1 +e^{-3.504+0.4}}=\color{red}{0.042} \]

Multiple logistic regression

In case of multiple predictors

\[log\left(\frac{p(X)}{1-p(X)} \right)=\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\ldots+\beta_{p}X_{p}\]

and following relation holds

\[p(X)=\frac{e^{{\beta}_{0}+{\beta}_{1}X_{1}+{\beta}_{2}X_{2}+\ldots+{\beta}_{p}X_{p}}}{1+e^{{\beta}_{0}+{\beta}_{1}X_{1}+{\beta}_{2}X_{2}+\ldots+{\beta}_{p}X_{p}}}\]

Multiple logistic regression

Let’s consider two predictors \(\texttt{balance}\) and \(\texttt{student}\), again we just update the recipe within the workflow

update the recipe in the workflow and re-fit

Code

def_wflow3 = def_wflow |> update_recipe(recipe(default~balance+student,data=default))
def_fit3 = def_wflow3 |> fit(data=default)

look at the results

Code

def_fit3 |>  tidy() |> kbl() |> kable_styling(font_size=12)

term	estimate	std.error	statistic	p.value
(Intercept)	-10.7494959	0.3691914	-29.116326	0.0e+00
balance	0.0057381	0.0002318	24.749526	0.0e+00
studentYes	-0.7148776	0.1475190	-4.846003	1.3e-06

anything weird?

Multiple logistic regression

Code

p_stu = def_fit2 |> augment(new_data=default) |> 
  ggplot(aes(x=balance, y= .pred_Yes,color=student))+geom_line(size=3)+
  theme_minimal() + ylim(c(0,.1))+xlab("balance")+ylab("P(default=yes|student)")

p_stu

Multiple logistic regression

Code

p_bal_stu = def_fit3 |> augment(new_data=default) |> ggplot(aes(x=balance, y= .pred_Yes,color=student)) +
  geom_line(size=3) + theme_minimal() + xlab("balance") + ylab("P(default=yes|balance,student)")

p_bal_stu

Multiple logistic regression

Two ways to estimate class probabilities

Suppose we want to estimate:

\[ P(Y = k \mid X = x) \]

that is, the probability that an observation with predictors (x) belongs to class (k).

Discriminative approach

Model the posterior probability directly:

\[ P(Y = k \mid X = x) \]

Examples:

logistic regression
multinomial logistic regression

Generative approach

Model how the data are distributed within each class: \(P(X = x \mid Y = k)\)

and combine this with the class proportions: \(P(Y = k)\)

Examples:

LDA
QDA
naive Bayes

Bayes rule for classification

Generative classifiers estimate class probabilities indirectly using Bayes’ rule:

\[ P(Y = k \mid X = x) = \frac{ P(X = x \mid Y = k)P(Y = k) }{ P(X = x) } \]

For classification, the denominator is the same for all classes, so we compare:

\[ P(X = x \mid Y = k)P(Y = k) \]

and assign the observation to the class with the largest value:

\[ \widehat{Y} = \arg\max_k P(X = x \mid Y = k)P(Y = k) \]

A generative classifier combines:

how likely the observation is under each class;
how frequent each class is overall.

From Bayes rule to LDA

LDA is obtained by making a specific assumption about the class-conditional distributions:

\[ X \mid Y = k \sim N(\mu_k, \Sigma) \]

That is, within each class the predictors are approximately normally distributed, and all classes share the same covariance structure.

Linear discriminant analysis: the idea

LDA is a generative classifier.

It assumes that, within each class, the predictor follows a normal distribution:

\[ X \mid Y = k \sim N(\mu_k, \sigma^2) \]

For each class, LDA estimates: the class mean \(\mu_k\); the common variability \(\sigma^2\) and the class proportion \(\pi_k\).

Then it assigns a new observation to the class with the largest estimated posterior probability:

\[ \widehat{Y} = \arg\max_k \widehat{P}(Y = k \mid X = x) \]

LDA with one predictor

With two classes, equal priors and common variance:

\[ X \mid Y = 1 \sim N(\mu_1, \sigma^2) \]

\[ X \mid Y = 2 \sim N(\mu_2, \sigma^2) \]

the LDA decision boundary is:

\[ x = \frac{\mu_1 + \mu_2}{2} \]

simple case

LDA classifies an observation according to which class mean it is closer to.

The boundary is estimated from data

In practice, the true means are unknown.

LDA estimates the class means from the training data:

\[ \hat{\mu}_1, \hat{\mu}_2 \]

and places the estimated boundary at:

\[ \frac{\hat{\mu}_1 + \hat{\mu}_2}{2} \]

Approximate Bayes boundary

The estimated boundary may be close to, but not exactly equal to, the theoretical Bayes boundary.

LDA with one predictor

Code

set.seed(1234)

p_1 <- ggplot() +
  xlim(-10, 10) +
  theme_minimal() +
  stat_function(
    fun = dnorm,
    args = list(mean = 4, sd = 2),
    geom = "area",
    fill = "dodgerblue",
    alpha = .25
  ) +
  stat_function(
    fun = dnorm,
    args = list(mean = -4, sd = 2),
    geom = "area",
    fill = "indianred",
    alpha = .25
  ) +
  geom_vline(xintercept = 0, size = 2, alpha = .5) +
  geom_vline(xintercept = -4, color = "grey", size = 3, alpha = .5) +
  geom_vline(xintercept = 4, color = "grey", size = 3, alpha = .5) +
  geom_point(
    aes(x = -2, y = 0),
    inherit.aes = FALSE,
    size = 10,
    alpha = .5,
    color = "darkgreen"
  ) +
  geom_point(
    aes(x = 1, y = 0),
    inherit.aes = FALSE,
    size = 10,
    alpha = .5,
    color = "magenta"
  ) +
  xlab("x")

p_1

The \(\color{darkgreen}{\text{green point}}\) is assigned to class 1; the \(\color{magenta}{\text{pink point}}\) is assigned to class 2.

The boundary is estimated from data

In practice, the true class means are unknown. LDA estimates them from the training data \(\hat{\mu}_1, \hat{\mu}_2\) - the estimated boundary at \(\frac{\hat{\mu}_1 + \hat{\mu}_2}{2}\)

Code

set.seed(1234)

class_12 <- tibble(
  class_1 = rnorm(50, mean = -4, sd = 2),
  class_2 = rnorm(50, mean = 4, sd = 2)
) |>
  pivot_longer(
    names_to = "classes",
    values_to = "values",
    cols = 1:2
  )

mu_12 <- class_12 |>
  group_by(classes) |>
  summarise(means = mean(values), .groups = "drop")

mu_12_mean <- mean(mu_12$means)

p_2 <- class_12 |>
  ggplot(aes(x = values, fill = classes)) +
  theme_minimal() +
  geom_histogram(aes(y = after_stat(density)), alpha = .5, color = "grey") +
  xlim(-10, 10) +
  geom_vline(
    xintercept = mu_12 |> pull(means),
    color = "grey",
    size = 3,
    alpha = .75
  ) +
  geom_vline(xintercept = mu_12_mean, size = 2, alpha = .75) +
  theme(legend.position = "none")

p_2

Optimal vs estimanted boundary

The theoretical Bayes boundary is at \(0\).
The estimated boundary is slightly off, at -0.31.

Fitting LDA

Code

default <- read_csv("./data/Default.csv") |>
  mutate(default = as.factor(default))

set.seed(1234)

def_split <- initial_split(default, prop = 3/4, strata = default)

default_train <- training(def_split)
default_test  <- testing(def_split)

def_rec <- recipe(default ~ balance, data = default_train)

def_lda_spec <- discrim_linear(
  mode = "classification",
  engine = "MASS"
)

def_wflow_lda <- workflow() |>
  add_recipe(def_rec) |>
  add_model(def_lda_spec)

def_fit_lda <- def_wflow_lda |>
  fit(data = default_train)

just change the model spec

The workflow is the same as before.
Only the model specification changes.

LDA with more than one predictor

With several predictors, LDA assumes that within each class:

\[ X \mid Y = k \sim N(\mu_k, \Sigma) \]

Class-specific errors

A classifier can produce predicted probabilities.

To obtain predicted classes, we choose a threshold.

For example, in the default problem:

\[ \widehat{P}(\text{default} = \text{Yes} \mid X) > c \]

where \(c\) is the classification threshold.

Code

lda_pred <- def_fit_lda |>
  augment(new_data = default_test) |>
  dplyr::select(default, .pred_class, .pred_Yes) |>
  mutate(
    .pred_class_0_05 = as.factor(ifelse(.pred_Yes > .05, "Yes", "No")),
    .pred_class_0_1  = as.factor(ifelse(.pred_Yes > .1,  "Yes", "No")),
    .pred_class_0_2  = as.factor(ifelse(.pred_Yes > .2,  "Yes", "No")),
    .pred_class_0_3  = as.factor(ifelse(.pred_Yes > .3,  "Yes", "No")),
    .pred_class_0_4  = as.factor(ifelse(.pred_Yes > .4,  "Yes", "No")),
    .pred_class_0_5  = as.factor(ifelse(.pred_Yes > .5,  "Yes", "No"))
  )

Moving the threshold

Different thresholds lead to different types of classification errors.

Class-specific errors

Moving the threshold

Lowering the threshold usually identifies more defaulters, but may also increase false alarms.

ROC curve

The ROC curve shows how sensitivity and specificity change as the threshold varies.

Code

def_fit_lda |>
  augment(new_data = default_test) |>
  mutate(default = factor(default, levels = c("Yes", "No"))) |>
  roc_curve(truth = default, .pred_Yes) |>
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  ggtitle("LDA ROC curve") +
  geom_path(color = "indianred") +
  geom_abline(lty = 3) +
  coord_equal() +
  theme_minimal()

Quadratic discriminant analysis

QDA: the idea

QDA is similar to LDA, but it relaxes one assumption.

LDA assumes a common covariance matrix:

\[ X \mid Y = k \sim N(\mu_k, \Sigma) \]

QDA allows each class to have its own covariance matrix:

\[ X \mid Y = k \sim N(\mu_k, \Sigma_k) \]

Note

Because the covariance matrix can change across classes, QDA can produce curved decision boundaries.

Fitting QDA

Code

def_rec <- recipe(default ~ balance, data = default_train)

def_qda_spec <- discrim_quad(
  mode = "classification",
  engine = "MASS"
)

def_wflow_qda <- workflow() |>
  add_recipe(def_rec) |>
  add_model(def_qda_spec)

def_fit_qda <- def_wflow_qda |>
  fit(data = default_train)

Code

def_fit_qda |>
  augment(new_data = default_test) |>
  dplyr::select(default, .pred_class, .pred_Yes) |>
  mutate(default = factor(default, levels = c("Yes", "No"))) |>
  roc_curve(truth = default, .pred_Yes) |>
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  ggtitle("QDA ROC curve") +
  geom_path(color = "dodgerblue") +
  geom_abline(lty = 3) +
  coord_equal() +
  theme_minimal()

Naive Bayes

Naive Bayes: the idea

Naive Bayes is also a generative classifier.

Like LDA and QDA, it uses Bayes’ rule.
However, it makes a simplifying assumption:

\[ f_k(X) = f_{k1}(X_1) \times f_{k2}(X_2) \times \cdots \times f_{kp}(X_p) \]

That is, predictors are assumed to be independent within each class.

Note

The assumption is often unrealistic, but it makes the model simple and stable, especially with many predictors or small samples.

Naive Bayes with different predictors

For each predictor and each class, Naive Bayes estimates a separate distribution.

For continuous predictors, one can use:

\[ f_{kj}(X_j) \sim N(\mu_{kj}, \sigma_{kj}^2) \]

or a kernel density estimator.

For categorical predictors, one can use class-specific relative frequencies.

Note

Naive Bayes combines many simple one-variable models into a full classifier.

Fitting Naive Bayes

Code

def_rec <- recipe(default ~ balance, data = default_train)

def_nb_spec <- naive_Bayes(
  mode = "classification",
  engine = "naivebayes"
)

def_wflow_nb <- workflow() |>
  add_recipe(def_rec) |>
  add_model(def_nb_spec)

def_fit_nb <- def_wflow_nb |>
  fit(data = default_train)

Code

def_fit_nb |>
  augment(new_data = default_test) |>
  dplyr::select(default, .pred_class, .pred_Yes) |>
  mutate(default = factor(default, levels = c("Yes", "No"))) |>
  roc_curve(truth = default, .pred_Yes) |>
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  ggtitle("Naive Bayes ROC curve") +
  geom_path(color = "darkgreen") +
  geom_abline(lty = 3) +
  coord_equal() +
  theme_minimal()

Comparing classifiers

method	.estimate
logistic regression	0.0517932
LDA	0.0517932
QDA	0.0517932
naive Bayes	0.0517837

Important

The same workflow can be used to compare different classifiers:

model specification → fitting → predicted probabilities → ROC / AUC.