Class 2 — classification workflows
2026-05-18
In a classification problem the response is a categorical variable.
Rather than predicting the value of \(Y\), one wants to estimate the posterior probability
\[P(Y=k\mid X=x_{i})\]
that is, the probability that observation \(i\) belongs to class \(k\), given that the predictor value for \(i\) is \(x_{i}\).
| default | student | balance | income |
|---|---|---|---|
| No | Yes | 311.32186 | 22648.76 |
| No | Yes | 697.13558 | 18377.15 |
| No | Yes | 470.10718 | 16014.11 |
| No | No | 1200.04162 | 56081.08 |
| No | No | 553.64902 | 47021.49 |
| No | No | 10.23149 | 27237.38 |
Note: to arrange multiple plots together, take a look at the patchwork package.
if \(Y\) is categorical
With \(K\) categories, one could code \(Y\) as an integer vector
if \(Y\) is binary
The goal is to estimate \(P(Y=1|X)\), which is, in fact, numeric.
\(P(\texttt{default}=\texttt{yes}|\texttt{balance})=\beta_{0}+\beta_{1}\texttt{balance}\)
\(P(\texttt{default}=\texttt{yes}|\texttt{balance})=\frac{e^{\beta_{0}+\beta_{1}\texttt{balance}}}{1+e^{\beta_{0}+\beta_{1}\texttt{balance}}}\)
modeling the posterior \(P(Y=1|X)\) by means of a logistic function is the goal of logistic regression
conditional expectation
just like in linear regression, the fit refers to the conditional expectation of \(Y\) given \(X\); since \(Y\in\{0,1\}\), it follows that \[E[Y|X] \equiv P(Y=1|X)\]
\[ \begin{split} p(X)&=\frac{e^{\beta_{0}+\beta_{1}X}}{1+e^{\beta_{0}+\beta_{1}X}}\\ \left(1+e^{\beta_{0}+\beta_{1}X}\right)p(X)&=e^{\beta_{0}+\beta_{1}X}\\ p(X)+e^{\beta_{0}+\beta_{1}X}p(X)&=e^{\beta_{0}+\beta_{1}X}\\ p(X)&=e^{\beta_{0}+\beta_{1}X}-e^{\beta_{0}+\beta_{1}X}p(X)\\ p(X)&=e^{\beta_{0}+\beta_{1}X}\left(1-p(X)\right)\\ \frac{p(X)}{1-p(X)}&=e^{\beta_{0}+\beta_{1}X} \end{split} \]
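Taking logarithms of the last line of the derivation gives the log-odds (logit), which is linear in \(X\). This step is added here as a short worked equation, using the same symbols as the derivation:

\[ \log\left(\frac{p(X)}{1-p(X)}\right)=\beta_{0}+\beta_{1}X \]

As a consequence, increasing \(X\) by one unit multiplies the odds by \(e^{\beta_{1}}\):

\[ \frac{e^{\beta_{0}+\beta_{1}(X+1)}}{e^{\beta_{0}+\beta_{1}X}}=e^{\beta_{1}} \]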
a toy sample: fit the logistic function; then, for a new point \(\texttt{balance}=1400\), one can estimate \(P(\texttt{default=Yes}|\texttt{balance}=1400)=.62\)
How to find the logistic function? Estimate its parameters \(\beta_{0}\) and \(\beta_{1}\) in \(P(Y=1|X) = \frac{e^{\beta_{0}+\beta_{1}X}}{1 + e^{\beta_{0}+\beta_{1}X}}\)
pre-process: specify the recipe
put them together in the workflow
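The steps above (model specification, recipe, workflow) can be sketched as follows. This is a minimal sketch, not the code behind the slides: `toy_train` is a hypothetical simulated stand-in for the training data, and the object names `def_glm_spec`, `def_rec`, `def_wflow`, `def_fit` and `def_coefs` are assumptions chosen to mirror the naming used elsewhere in these notes.

```r
library(tidymodels)

# Hypothetical stand-in for the training data: simulated balances and
# default labels (NOT the real Default data set).
set.seed(1234)
toy_train <- tibble(
  balance = runif(500, min = 0, max = 2500),
  default = factor(
    ifelse(rbinom(500, 1, plogis(-6 + 0.004 * balance)) == 1, "Yes", "No"),
    levels = c("No", "Yes")
  )
)

# Model specification: logistic regression fitted with stats::glm()
def_glm_spec <- logistic_reg(mode = "classification", engine = "glm")

# Pre-processing: the recipe declares the response and the predictor
def_rec <- recipe(default ~ balance, data = toy_train)

# Workflow: bundle recipe and model specification, then fit
def_wflow <- workflow() |>
  add_recipe(def_rec) |>
  add_model(def_glm_spec)

def_fit <- def_wflow |>
  fit(data = toy_train)

# Coefficient table: term, estimate, std.error, statistic, p.value
def_coefs <- tidy(def_fit)
def_coefs
```

Because the data are simulated, the estimates will not match the table reported for the real data; only the structure of the output is comparable.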
Look at the results
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -10.6513306 | 0.3611574 | -29.49221 | 0 |
| balance | 0.0054989 | 0.0002204 | 24.95309 | 0 |
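The estimates in the table can be plugged into the logistic function to obtain predicted probabilities. The sketch below does this by hand in base R; the helper `p_default()` and the chosen balances are illustrative assumptions. These are the estimates from the table above, so the result need not match the toy sample shown earlier.

```r
# Estimates copied from the coefficient table above
b0 <- -10.6513306
b1 <- 0.0054989

# Logistic function: p(x) = exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))
p_default <- function(balance) {
  eta <- b0 + b1 * balance
  exp(eta) / (1 + exp(eta))
}

# Illustrative balances (hypothetical customers)
balances <- c(1000, 1400, 2000)
probs <- p_default(balances)
round(probs, 3)

# The built-in logistic CDF gives the same values
all.equal(probs, plogis(b0 + b1 * balances))
```

The predicted probability increases with the balance because the estimated slope is positive.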
Suppose you want to use \(\texttt{student}\) as a qualitative predictor in your logistic regression. Within the workflow, only the recipe needs to be updated.
update the recipe in the workflow and re-fit
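A minimal sketch of this update step, assuming the workflow was built with `add_recipe()` so that `update_recipe()` from the workflows package applies. The data frame `toy_train` is again a hypothetical simulated stand-in, and the names `def_rec_student`, `def_fit_student` and `student_coefs` are illustrative.

```r
library(tidymodels)

# Hypothetical stand-in data (simulated, NOT the real Default data set)
set.seed(1234)
toy_train <- tibble(
  student = factor(sample(c("No", "Yes"), 500, replace = TRUE)),
  balance = runif(500, min = 0, max = 2500),
  default = factor(
    ifelse(rbinom(500, 1, plogis(-6 + 0.004 * balance)) == 1, "Yes", "No"),
    levels = c("No", "Yes")
  )
)

# Starting workflow: default ~ balance
def_wflow <- workflow() |>
  add_recipe(recipe(default ~ balance, data = toy_train)) |>
  add_model(logistic_reg(mode = "classification", engine = "glm"))

# New recipe with student as the only predictor
def_rec_student <- recipe(default ~ student, data = toy_train)

# Swap the recipe, keep the model specification, and re-fit
def_fit_student <- def_wflow |>
  update_recipe(def_rec_student) |>
  fit(data = toy_train)

student_coefs <- tidy(def_fit_student)
student_coefs
```

The model specification is untouched; only the pre-processing step changes, which is the point of keeping both inside one workflow object.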
It appears that a customer who is a student is more likely to default (\(\hat{\beta}_{1} = 0.4\)).
In case of multiple predictors
\[\log\left(\frac{p(X)}{1-p(X)} \right)=\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\ldots+\beta_{p}X_{p}\]
and the following relation holds
\[p(X)=\frac{e^{{\beta}_{0}+{\beta}_{1}X_{1}+{\beta}_{2}X_{2}+\ldots+{\beta}_{p}X_{p}}}{1+e^{{\beta}_{0}+{\beta}_{1}X_{1}+{\beta}_{2}X_{2}+\ldots+{\beta}_{p}X_{p}}}\]
Let’s consider two predictors, \(\texttt{balance}\) and \(\texttt{student}\). Again, we just update the recipe within the workflow.
update the recipe in the workflow and re-fit
look at the results
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -10.7494959 | 0.3691914 | -29.116326 | 0.0e+00 |
| balance | 0.0057381 | 0.0002318 | 24.749526 | 0.0e+00 |
| studentYes | -0.7148776 | 0.1475190 | -4.846003 | 1.3e-06 |
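To read the two-predictor table, one can compare a student and a non-student with the same balance. The sketch below uses the estimates from the table; the balance of 1500 and the helper `p_default()` are illustrative assumptions.

```r
# Estimates copied from the two-predictor table above
b0        <- -10.7494959
b_balance <- 0.0057381
b_student <- -0.7148776

# Predicted probability of default; (student == "Yes") is the 0/1 dummy
p_default <- function(balance, student) {
  plogis(b0 + b_balance * balance + b_student * (student == "Yes"))
}

# Same (hypothetical) balance, different student status
p_non_student <- p_default(1500, "No")
p_student     <- p_default(1500, "Yes")
round(c(non_student = p_non_student, student = p_student), 3)

# Odds ratio, student vs non-student, at any fixed balance
exp(b_student)
```

At a fixed balance the negative `studentYes` coefficient gives the student a lower predicted probability of default than the non-student.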
Suppose we want to estimate:
\[ P(Y = k \mid X = x) \]
that is, the probability that an observation with predictors \(x\) belongs to class \(k\).
Discriminative classifiers model the posterior probability directly:
\[ P(Y = k \mid X = x) \]
Examples: logistic regression.
Generative classifiers model how the data are distributed within each class: \(P(X = x \mid Y = k)\)
and combine this with the class proportions: \(P(Y = k)\)
Examples: LDA, QDA, naive Bayes.
Generative classifiers estimate class probabilities indirectly using Bayes’ rule:
\[ P(Y = k \mid X = x) = \frac{ P(X = x \mid Y = k)P(Y = k) }{ P(X = x) } \]
For classification, the denominator is the same for all classes, so we compare:
\[ P(X = x \mid Y = k)P(Y = k) \]
and assign the observation to the class with the largest value:
\[ \widehat{Y} = \arg\max_k P(X = x \mid Y = k)P(Y = k) \]
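The rule above can be illustrated numerically. This is a minimal sketch with made-up ingredients: two classes, one predictor, normal class-conditional densities, and unequal class proportions. All values and the helper `bayes_classify()` are hypothetical.

```r
# Hypothetical class proportions P(Y = k) and class-conditional normals
priors <- c(class_1 = 0.7, class_2 = 0.3)
means  <- c(class_1 = -4, class_2 = 4)
sds    <- c(class_1 = 2, class_2 = 2)

bayes_classify <- function(x) {
  # Numerator of Bayes' rule for each class: P(X = x | Y = k) * P(Y = k)
  scores <- priors * dnorm(x, mean = means, sd = sds)
  # Dividing by the sum (the denominator P(X = x)) gives the posteriors
  posterior <- scores / sum(scores)
  list(class = names(which.max(scores)), posterior = posterior)
}

res <- bayes_classify(1)
res$class
round(res$posterior, 3)
```

The predicted class is the same whether one compares the numerators or the posteriors, since the denominator is shared by all classes.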
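A generative classifier combines the class-conditional distributions \(P(X = x \mid Y = k)\) with the class proportions \(P(Y = k)\).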
LDA is obtained by making a specific assumption about the class-conditional distributions:
\[ X \mid Y = k \sim N(\mu_k, \Sigma) \]
That is, within each class the predictors are approximately normally distributed, and all classes share the same covariance structure.
LDA is a generative classifier.
It assumes that, within each class, the predictor follows a normal distribution:
\[ X \mid Y = k \sim N(\mu_k, \sigma^2) \]
For each class, LDA estimates the class mean \(\mu_k\) and the class proportion \(\pi_k\); the common variance \(\sigma^2\) is estimated once for all classes.
Then it assigns a new observation to the class with the largest estimated posterior probability:
\[ \widehat{Y} = \arg\max_k \widehat{P}(Y = k \mid X = x) \]
With two classes, equal priors and common variance:
\[ X \mid Y = 1 \sim N(\mu_1, \sigma^2) \]
\[ X \mid Y = 2 \sim N(\mu_2, \sigma^2) \]
the LDA decision boundary is:
\[ x = \frac{\mu_1 + \mu_2}{2} \]
simple case
LDA classifies an observation according to which class mean it is closer to.
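A short derivation of the midpoint rule, under the stated assumptions (two classes, equal priors, common variance, and \(\mu_1 \neq \mu_2\)). With equal priors, the boundary is the point where the two class densities coincide:

\[ \begin{split} \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu_1)^2}{2\sigma^2}}&=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu_2)^2}{2\sigma^2}}\\ (x-\mu_1)^2&=(x-\mu_2)^2\\ -2x\mu_1+\mu_1^2&=-2x\mu_2+\mu_2^2\\ 2x(\mu_2-\mu_1)&=\mu_2^2-\mu_1^2\\ x&=\frac{\mu_1+\mu_2}{2} \end{split} \]

On either side of this point the density of the nearer class mean is larger, which is why the rule reduces to picking the closer mean.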
In practice, the true means are unknown.
LDA estimates the class means from the training data:
\[ \hat{\mu}_1, \hat{\mu}_2 \]
and places the estimated boundary at:
\[ \frac{\hat{\mu}_1 + \hat{\mu}_2}{2} \]
Approximate Bayes boundary
The estimated boundary may be close to, but not exactly equal to, the theoretical Bayes boundary.
library(ggplot2)  # linewidth below assumes ggplot2 >= 3.4.0

set.seed(1234)
# Two class-conditional densities, N(-4, 2^2) and N(4, 2^2), with equal priors
p_1 <- ggplot() +
  xlim(-10, 10) +
  theme_minimal() +
  stat_function(
    fun = dnorm,
    args = list(mean = 4, sd = 2),
    geom = "area",
    fill = "dodgerblue",
    alpha = .25
  ) +
  stat_function(
    fun = dnorm,
    args = list(mean = -4, sd = 2),
    geom = "area",
    fill = "indianred",
    alpha = .25
  ) +
  # Boundary at the midpoint of the two means, and the means themselves
  geom_vline(xintercept = 0, linewidth = 2, alpha = .5) +
  geom_vline(xintercept = -4, color = "grey", linewidth = 3, alpha = .5) +
  geom_vline(xintercept = 4, color = "grey", linewidth = 3, alpha = .5) +
  # Two new observations to classify
  geom_point(
    aes(x = -2, y = 0),
    inherit.aes = FALSE,
    size = 10,
    alpha = .5,
    color = "darkgreen"
  ) +
  geom_point(
    aes(x = 1, y = 0),
    inherit.aes = FALSE,
    size = 10,
    alpha = .5,
    color = "magenta"
  ) +
  xlab("x")
p_1
The \(\color{darkgreen}{\text{green point}}\) is assigned to class 1 (mean \(-4\)); the \(\color{magenta}{\text{pink point}}\) is assigned to class 2 (mean \(4\)).
In practice, the true class means are unknown. LDA estimates them from the training data, \(\hat{\mu}_1, \hat{\mu}_2\), and places the estimated boundary at \(\frac{\hat{\mu}_1 + \hat{\mu}_2}{2}\).
library(tidyverse)  # tibble, tidyr, dplyr, ggplot2 (linewidth assumes ggplot2 >= 3.4.0)

set.seed(1234)
# Simulate 50 training observations per class
class_12 <- tibble(
  class_1 = rnorm(50, mean = -4, sd = 2),
  class_2 = rnorm(50, mean = 4, sd = 2)
) |>
  pivot_longer(
    names_to = "classes",
    values_to = "values",
    cols = 1:2
  )
# Estimated class means and the estimated boundary (their midpoint)
mu_12 <- class_12 |>
  group_by(classes) |>
  summarise(means = mean(values), .groups = "drop")
mu_12_mean <- mean(mu_12$means)
p_2 <- class_12 |>
  ggplot(aes(x = values, fill = classes)) +
  theme_minimal() +
  geom_histogram(aes(y = after_stat(density)), alpha = .5, color = "grey") +
  xlim(-10, 10) +
  geom_vline(
    xintercept = mu_12 |> pull(means),
    color = "grey",
    linewidth = 3,
    alpha = .75
  ) +
  geom_vline(xintercept = mu_12_mean, linewidth = 2, alpha = .75) +
  theme(legend.position = "none")
p_2
Optimal vs estimated boundary
The theoretical Bayes boundary is at \(0\).
The estimated boundary is slightly off, at -0.31.
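library(tidymodels)
library(readr)    # read_csv()
library(discrim)  # discrim_linear() and its MASS engine

default <- read_csv("./data/Default.csv") |>
  mutate(default = as.factor(default))
# Stratified 75/25 train/test split
set.seed(1234)
def_split <- initial_split(default, prop = 3/4, strata = default)
default_train <- training(def_split)
default_test <- testing(def_split)
# Recipe: a single predictor, balance
def_rec <- recipe(default ~ balance, data = default_train)
# Model specification: linear discriminant analysis
def_lda_spec <- discrim_linear(
  mode = "classification",
  engine = "MASS"
)
# Workflow: bundle recipe and model specification, then fit
def_wflow_lda <- workflow() |>
  add_recipe(def_rec) |>
  add_model(def_lda_spec)
def_fit_lda <- def_wflow_lda |>
  fit(data = default_train)
just change the model spec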
The workflow is the same as before.
Only the model specification changes.
With several predictors, LDA assumes that within each class:
\[ X \mid Y = k \sim N(\mu_k, \Sigma) \]
A classifier can produce predicted probabilities.
To obtain predicted classes, we choose a threshold.
For example, in the default problem, we predict \(\text{default} = \text{Yes}\) when:
\[ \widehat{P}(\text{default} = \text{Yes} \mid X) > c \]
where \(c\) is the classification threshold.
# Predicted classes on the test set for several thresholds;
# explicit levels keep both classes even if one is never predicted
lda_pred <- def_fit_lda |>
  augment(new_data = default_test) |>
  dplyr::select(default, .pred_class, .pred_Yes) |>
  mutate(
    .pred_class_0_05 = factor(ifelse(.pred_Yes > .05, "Yes", "No"), levels = c("No", "Yes")),
    .pred_class_0_1 = factor(ifelse(.pred_Yes > .1, "Yes", "No"), levels = c("No", "Yes")),
    .pred_class_0_2 = factor(ifelse(.pred_Yes > .2, "Yes", "No"), levels = c("No", "Yes")),
    .pred_class_0_3 = factor(ifelse(.pred_Yes > .3, "Yes", "No"), levels = c("No", "Yes")),
    .pred_class_0_4 = factor(ifelse(.pred_Yes > .4, "Yes", "No"), levels = c("No", "Yes")),
    .pred_class_0_5 = factor(ifelse(.pred_Yes > .5, "Yes", "No"), levels = c("No", "Yes"))
  )
Moving the threshold
Different thresholds lead to different types of classification errors.
Lowering the threshold usually identifies more defaulters, but may also increase false alarms.
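The trade-off can be made concrete by computing, for each threshold, the share of true defaulters that are flagged (sensitivity) and the share of non-defaulters that are flagged (false alarms, that is, 1 - specificity). The sketch below uses simulated truth and simulated predicted probabilities; every object in it is hypothetical and independent of the fitted models in these notes.

```r
set.seed(1234)
n <- 1000

# Hypothetical truth: roughly 10% defaulters
is_default <- rbinom(n, 1, 0.1) == 1

# Hypothetical predicted probabilities, higher on average for defaulters
pred_yes <- ifelse(is_default, rbeta(n, 3, 3), rbeta(n, 1, 8))

threshold_metrics <- function(cutoff) {
  pred_default <- pred_yes > cutoff
  data.frame(
    threshold   = cutoff,
    sensitivity = mean(pred_default[is_default]),   # defaulters identified
    false_alarm = mean(pred_default[!is_default])   # 1 - specificity
  )
}

cutoffs <- c(0.05, 0.1, 0.2, 0.3, 0.4, 0.5)
metrics <- do.call(rbind, lapply(cutoffs, threshold_metrics))
metrics
```

Reading the rows from the largest threshold to the smallest, neither sensitivity nor the false alarm rate can decrease, because lowering the threshold only adds observations to the set predicted as defaulters.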
The ROC curve shows how sensitivity and specificity change as the threshold varies.
def_fit_lda |>
  augment(new_data = default_test) |>
  # put "Yes" first: yardstick treats the first factor level as the event by default
  mutate(default = factor(default, levels = c("Yes", "No"))) |>
  roc_curve(truth = default, .pred_Yes) |>
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  ggtitle("LDA ROC curve") +
  geom_path(color = "indianred") +
  geom_abline(lty = 3) +
  coord_equal() +
  theme_minimal()
QDA is similar to LDA, but it relaxes one assumption.
LDA assumes a common covariance matrix:
\[ X \mid Y = k \sim N(\mu_k, \Sigma) \]
QDA allows each class to have its own covariance matrix:
\[ X \mid Y = k \sim N(\mu_k, \Sigma_k) \]
Note
Because the covariance matrix can change across classes, QDA can produce curved decision boundaries.
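Fitting QDA follows the same pattern as LDA, with only the model specification swapped. The sketch below assumes the discrim package and its `discrim_quad()` specification with the MASS engine; `toy_train` is a hypothetical simulated stand-in for the training data, and `toy_fit_qda` and `toy_pred_qda` are illustrative names, not the objects used in the slides.

```r
library(tidymodels)
library(discrim)  # discrim_quad() and its MASS engine

# Hypothetical stand-in data (simulated, NOT the real Default data set)
set.seed(1234)
toy_train <- tibble(
  balance = runif(500, min = 0, max = 2500),
  default = factor(
    ifelse(rbinom(500, 1, plogis(-6 + 0.004 * balance)) == 1, "Yes", "No"),
    levels = c("No", "Yes")
  )
)

# Only the model specification differs from the LDA workflow
def_qda_spec <- discrim_quad(mode = "classification", engine = "MASS")

toy_fit_qda <- workflow() |>
  add_recipe(recipe(default ~ balance, data = toy_train)) |>
  add_model(def_qda_spec) |>
  fit(data = toy_train)

# Predicted classes and class probabilities
toy_pred_qda <- augment(toy_fit_qda, new_data = toy_train)
head(toy_pred_qda)
```

The output of `augment()` has the same `.pred_class`, `.pred_No` and `.pred_Yes` columns as for LDA, so the threshold and ROC code can be reused unchanged.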
# def_fit_qda: fitted QDA workflow, assumed to be built like def_fit_lda
# with a discrim_quad() model specification
def_fit_qda |>
  augment(new_data = default_test) |>
  dplyr::select(default, .pred_class, .pred_Yes) |>
  mutate(default = factor(default, levels = c("Yes", "No"))) |>
  roc_curve(truth = default, .pred_Yes) |>
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  ggtitle("QDA ROC curve") +
  geom_path(color = "dodgerblue") +
  geom_abline(lty = 3) +
  coord_equal() +
  theme_minimal()
Naive Bayes is also a generative classifier.
Like LDA and QDA, it uses Bayes’ rule.
However, it makes a simplifying assumption:
\[ f_k(X) = f_{k1}(X_1) \times f_{k2}(X_2) \times \cdots \times f_{kp}(X_p) \]
That is, predictors are assumed to be independent within each class.
Note
The assumption is often unrealistic, but it makes the model simple and stable, especially with many predictors or small samples.
For each predictor and each class, Naive Bayes estimates a separate distribution.
For continuous predictors, one can use:
\[ f_{kj}(X_j) \sim N(\mu_{kj}, \sigma_{kj}^2) \]
or a kernel density estimator.
For categorical predictors, one can use class-specific relative frequencies.
Note
Naive Bayes combines many simple one-variable models into a full classifier.
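To make the "many one-variable models" idea concrete, here is a hand-rolled Gaussian naive Bayes in base R. It is a sketch for illustration only, not the implementation used by any package: the simulated data, the class labels `A` and `B`, and the helper `nb_posterior()` are all hypothetical.

```r
# Hypothetical training data: two classes, two continuous predictors
set.seed(1234)
n <- 200
train <- data.frame(
  class = rep(c("A", "B"), each = n / 2),
  x1 = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 2)),
  x2 = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 1))
)

classes <- c("A", "B")
predictors <- c("x1", "x2")

# For each class: the prior, plus one mean and one sd per predictor
params <- lapply(classes, function(k) {
  rows <- train[train$class == k, predictors]
  list(
    prior = nrow(rows) / nrow(train),
    mu    = sapply(rows, mean),
    sigma = sapply(rows, sd)
  )
})
names(params) <- classes

nb_posterior <- function(x_new) {
  # x_new: numeric vector in the order of `predictors`
  scores <- sapply(params, function(p) {
    # prior times the product of the univariate normal densities
    p$prior * prod(dnorm(x_new, mean = p$mu, sd = p$sigma))
  })
  scores / sum(scores)
}

post <- nb_posterior(c(2, 1))
round(post, 3)
```

Each class needs only a mean and a standard deviation per predictor, so the number of parameters grows linearly with the number of predictors.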
# def_fit_nb: fitted naive Bayes workflow, assumed to be built like def_fit_lda
# with a naive_Bayes() model specification
def_fit_nb |>
  augment(new_data = default_test) |>
  dplyr::select(default, .pred_class, .pred_Yes) |>
  mutate(default = factor(default, levels = c("Yes", "No"))) |>
  roc_curve(truth = default, .pred_Yes) |>
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  ggtitle("Naive Bayes ROC curve") +
  geom_path(color = "darkgreen") +
  geom_abline(lty = 3) +
  coord_equal() +
  theme_minimal()
| method | .estimate |
|---|---|
| logistic regression | 0.0517932 |
| LDA | 0.0517932 |
| QDA | 0.0517932 |
| naive Bayes | 0.0517837 |
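The estimates in the table are well below 0.5. Assuming they are ROC AUC values, this is the pattern one would expect if "No" (the first factor level) was treated as the event while `.pred_Yes` was supplied as the score. The ROC curve code relevels `default` to `c("Yes", "No")` for this reason, and yardstick also offers an `event_level` argument. In that case the AUC with "Yes" as the event would be one minus the reported value, roughly 0.948 for 0.0517932. The base R sketch below illustrates this relation with a rank-based AUC on simulated scores; the data and the helper `auc_rank()` are hypothetical.

```r
# Rank-based (Mann-Whitney) AUC: probability that a randomly chosen event
# receives a higher score than a randomly chosen non-event
auc_rank <- function(score, is_event) {
  r <- rank(score)  # average ranks handle ties
  n_event <- sum(is_event)
  n_other <- sum(!is_event)
  (sum(r[is_event]) - n_event * (n_event + 1) / 2) / (n_event * n_other)
}

# Hypothetical truth and predicted probabilities of "Yes"
set.seed(1234)
is_default <- rbinom(500, 1, 0.1) == 1
pred_yes <- ifelse(is_default, rbeta(500, 3, 3), rbeta(500, 1, 8))

auc_yes <- auc_rank(pred_yes, is_default)    # "Yes" treated as the event
auc_no  <- auc_rank(pred_yes, !is_default)   # "No" treated as the event, same score
c(auc_yes = auc_yes, auc_no = auc_no, sum = auc_yes + auc_no)
```

The two values always sum to one, so an AUC far below 0.5 usually signals a mismatch between the event level and the probability column rather than a poor classifier.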
Important
The same workflow can be used to compare different classifiers:
model specification → fitting → predicted probabilities → ROC / AUC.