Deterministic vs. Probabilistic Thinking

  • Deterministic Thinking assumes outcomes are certain and repeatable:
    “If I do this, that will definitely happen.”

  • Probabilistic Thinking accepts uncertainty:
    “If I do this, there’s an 80% chance that will happen.”

Examples in Everyday Life

| Scenario  | Deterministic Thinking         | Probabilistic Thinking                   |
|-----------|--------------------------------|------------------------------------------|
| Weather   | “It won’t rain today.”         | “There’s a 30% chance of rain.”          |
| Medicine  | “This treatment will cure me.” | “This treatment works 75% of the time.”  |
| Coin flip | “It will land heads next.”     | “50% chance of heads.”                   |
| Exams     | “If I study, I will pass.”     | “If I study, I’m likely to pass.”        |

Probability Basics

Key Concepts

  • Experiment: A repeatable process with uncertain outcome
  • Sample space (Ω): All possible outcomes
  • Event: A subset of outcomes
  • Probability: Number between 0 and 1, quantifying likelihood

Probability

  • The word probability is used to quantify the likelihood that a particular outcome of an experiment will occur.

  • The term “experiment” is used broadly: any procedure that results in an observation.

  • Typically, we do not know in advance what the result will be.

Sample Space

  • The set of all possible outcomes of an experiment is called the sample space, denoted by Ω.

  • For a coin toss: Ω = {Heads, Tails}.

  • For a die roll: Ω = {1, 2, 3, 4, 5, 6}.

Event

  • An event is any subset of the sample space.

  • Examples:

    • “Getting an even number” in a dice roll: A = {2, 4, 6}.

    • “Getting Heads” in a coin toss: B = {Heads}.

Frequency Interpretation

  • The probability of an event is interpreted as the relative frequency of that event in a long series of repeated experiments.

  • Example: If we roll a die many times, the relative frequency of getting a 3 converges to 1/6.
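
A quick simulation of this idea (a minimal sketch; the number of rolls is arbitrary):

Code
# Simulate many die rolls and track the relative frequency of rolling a 3
set.seed(1)
rolls <- sample(1:6, size = 10000, replace = TRUE)
mean(rolls == 3)   # relative frequency; should be close to 1/6 ≈ 0.167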

Classical Interpretation

  • If all outcomes are equally likely:

    \[ P(A) = \frac{\text{Number of favorable outcomes}}{\text{Total number of outcomes}} \]

  • Example: Probability of rolling an even number on a die:

    \[ P(\{2,4,6\}) = \frac{3}{6} = 0.5 \]
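
In R, the classical formula is just counting favorable outcomes (a minimal sketch):

Code
# Classical probability: favorable outcomes / total outcomes
omega <- 1:6               # sample space of a die roll
A     <- c(2, 4, 6)        # event: even number
length(A) / length(omega)  # 3/6 = 0.5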

Kolmogorov Axioms

  1. \(0 \leq P(A) \leq 1\) for any event \(A\)

  2. \(P(\Omega) = 1\)

  3. For mutually exclusive events \(A_1, A_2, \ldots\):
    \[ P\left(\bigcup_i A_i\right) = \sum_i P(A_i) \]

The sample space and the event A

Code
library(tidyverse)
library(VennDiagram)
library(grid)

a_color <- "indianred"
b_color <- "dodgerblue"

# Helper: draw a filled ellipse on the current grid page
draw_ellipse <- function(center = c(0.5, 0.5), a = 0.25, b = 0.15, n = 100, col = "white") {
  t <- seq(0, 2 * pi, length.out = n)
  x <- center[1] + a * cos(t)
  y <- center[2] + b * sin(t)
  grid.polygon(x = x, y = y, gp = gpar(fill = col, col = NA))
}

grid.newpage()
grid.rect(gp = gpar(fill = "white", col = NA))                           # background = sample space
draw_ellipse(center = c(0.5, 0.5), a = 0.25, b = 0.15, col = a_color)    # ellipse = event A
grid.text("A", x = 0.5, y = 0.5, gp = gpar(fontsize = 16, col = "white"))

Complement Rule

  • If \(A\) is an event, then the complement of \(A\), denoted \(A^c\), is the event that \(A\) does not occur.
  • Rule:
    \[ P(A^c) = 1 - P(A) \]
Code
library(grid)

# Complement: fill the page with A's color, then cut A out in white
grid.newpage()
grid.rect(gp = gpar(fill = a_color, col = NA))                           # shaded background = A^c
draw_ellipse(center = c(0.5, 0.5), a = 0.25, b = 0.15, col = "white")    # white ellipse = A
grid.text("Ac", x = 0.1, y = 0.9, gp = gpar(fontsize = 16, col = "white"))

Union

  • Union \(A \cup B\): event that A or B occurs.
Code
grid.newpage()
# Both circles share the same fill, so the whole union A ∪ B appears shaded
invisible(draw.pairwise.venn(
  area1 = 10,
  area2 = 10,
  cross.area = 5,
  category = c("A", "B"),
  fill = c(a_color, a_color),
  lty = "blank",
  print.mode = "none"
))

Intersection

  • Intersection \(A \cap B\): event that both A and B occur.
Code
grid.newpage()
# Draw both circles in light grey ...
invisible(draw.pairwise.venn(
  area1 = 10,
  area2 = 10,
  cross.area = 5,
  category = c("A", "B"),
  fill = c("lightgrey", "lightgrey"),
  alpha = 0.25, col = "darkgrey",
  print.mode = "none"
))

# ... then highlight the central overlap region (approximated here by a circle)
grid.circle(x = 0.5, y = 0.5, r = 0.30, gp = gpar(fill = a_color, col = NA, alpha = 0.6))

# Label the intersection
grid.text("A ∩ B", x = 0.5, y = 0.5, gp = gpar(fontsize = 20))

Mutually Exclusive Events

Code
grid.newpage()
# cross.area = 0: the circles do not overlap, so A and B cannot occur together
invisible(draw.pairwise.venn(
  area1 = 10,
  area2 = 10,
  cross.area = 0,
  category = c("A", "B"),
  fill = c(a_color, b_color),
  lty = "blank",
  print.mode = "none"
))

Probability of Union

  • For mutually exclusive events:
    \[ P(A \cup B) = P(A) + P(B) \]
  • In general:
    \[ P(A \cup B) = P(A) + P(B) - P(A \cap B) \]
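
As a small worked example, take a single die roll with A = “even number” and B = “greater than 3” (a minimal sketch of the general rule):

Code
# Inclusion-exclusion on a die roll
omega <- 1:6
A <- c(2, 4, 6)   # even number
B <- c(4, 5, 6)   # greater than 3
p <- function(ev) length(ev) / length(omega)   # classical probability
p(union(A, B))                     # direct:  P(A ∪ B) = 4/6
p(A) + p(B) - p(intersect(A, B))   # formula: 3/6 + 3/6 - 2/6 = 4/6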

Conditional Probability

  • Probability of \(A\) given that \(B\) occurred:
    \[ P(A|B) = \frac{P(A \cap B)}{P(B)} \]
  • Interpretation: how probable is \(A\) when we know \(B\) has occurred.
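
With the same die-roll events (A = “even number”, B = “greater than 3”), a minimal sketch of the definition:

Code
# Conditional probability: P(A | B) = P(A ∩ B) / P(B)
omega <- 1:6
A <- c(2, 4, 6)   # even number
B <- c(4, 5, 6)   # greater than 3
p <- function(ev) length(ev) / length(omega)
p(intersect(A, B)) / p(B)   # (2/6) / (3/6) = 2/3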

Independence

  • Events \(A\) and \(B\) are independent if:
    \[ P(A \cap B) = P(A) \cdot P(B) \]

  • Intuition: knowledge of \(B\) tells us nothing about \(A\), and vice versa.

  • Note: independence is not the same as mutual exclusivity; in fact, two independent events with positive probability cannot be mutually exclusive, since \(P(A \cap B) = P(A) \cdot P(B) > 0\). A quick check is sketched below.
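
A quick check with two dice, which are independent by construction (a minimal sketch):

Code
# Two fair dice: A = "first die is even", B = "second die shows 6"
omega <- expand.grid(d1 = 1:6, d2 = 1:6)          # 36 equally likely outcomes
pA  <- mean(omega$d1 %% 2 == 0)                   # P(A) = 18/36 = 1/2
pB  <- mean(omega$d2 == 6)                        # P(B) = 6/36  = 1/6
pAB <- mean(omega$d1 %% 2 == 0 & omega$d2 == 6)   # P(A ∩ B) = 3/36 = 1/12
isTRUE(all.equal(pAB, pA * pB))                   # TRUE: A and B are independent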

Total Probability Theorem

  • If \(B_1, B_2, \ldots, B_n\) form a partition of the sample space:
    \[ P(A) = \sum_{i=1}^{n} P(A\cap B_{i}) = \sum_{i=1}^{n} P(A|B_i)\, P(B_i) \] Recall that \(P(A|B_i) = \frac{P(A \cap B_i)}{P(B_i)}\), so \(P(A \cap B_i)=P(A| B_{i})\,P(B_{i})\).
Code
library(grid)

grid.newpage()

# Draw three adjacent rectangles: B1, B2, B3
grid.rect(x = 1/6, width = 1/3, height = 1, just = "center", gp = gpar(fill = "lightblue", col = "darkgrey"))
grid.rect(x = 0.5, width = 1/3, height = 1, just = "center", gp = gpar(fill = "lightgreen", col = "darkgrey"))
grid.rect(x = 5/6, width = 1/3, height = 1, just = "center", gp = gpar(fill = "lightpink", col = "darkgrey"))

# Label each B_i
grid.text("B1", x = 1/6, y = 0.95, gp = gpar(fontsize = 14))
grid.text("B2", x = 0.5, y = 0.95, gp = gpar(fontsize = 14))
grid.text("B3", x = 5/6, y = 0.95, gp = gpar(fontsize = 14))

# Draw event A as a horizontal ellipse overlapping all three rectangles
draw_ellipse <- function(center, a, b, col = "orchid", alpha = 0.4) {
  t <- seq(0, 2 * pi, length.out = 200)
  x <- center[1] + a * cos(t)
  y <- center[2] + b * sin(t)
  grid.polygon(x = x, y = y, gp = gpar(fill = col, col = NA, alpha = alpha))
}

# A overlaps portions of B1, B2, B3
draw_ellipse(center = c(0.5, 0.5), a = 0.5, b = 0.15)

# Label A
grid.text("A", x = 0.5, y = 0.7, gp = gpar(fontsize = 16, col = "black"))
grid.text("(A ∩ B1)", x = 0.2, y = 0.5, gp = gpar(fontsize = 16, col = "black"))
grid.text("(A ∩ B2)", x = 0.5, y = 0.5, gp = gpar(fontsize = 16, col = "black"))
grid.text("(A ∩ B3)", x = 0.8, y = 0.5, gp = gpar(fontsize = 16, col = "black"))

Bayes’ Theorem

  • Allows inversion of conditional probabilities:
    \[ P(B_j|A) = \frac{P(A|B_j) \cdot P(B_j)}{\sum_{i=1}^n P(A|B_i) \cdot P(B_i)} \]
  • Widely used in diagnostic testing and decision-making.

Diagnostic Tests – Key Definitions

  • Sensitivity:
    \[ \text{Sens} = P(\text{Test} + | \text{Disease}) \]
  • Specificity:
    \[ \text{Spec} = P(\text{Test} - | \text{No Disease}) \]
  • False Positive Rate:
    \[ 1 - \text{Specificity} \]
  • False Negative Rate:
    \[ 1 - \text{Sensitivity} \]

Predictive Values

  • Positive Predictive Value (PPV):
    \[ P(\text{Disease} | \text{Test} +) \]
  • Negative Predictive Value (NPV):
    \[ P(\text{No Disease} | \text{Test} -) \]
  • These depend on disease prevalence.

Example: Bayes’ Theorem in Medical Testing

Suppose there is a disease that affects 1% of the population.

A medical test exists that is:

  • 99% accurate if the person has the disease (sensitivity: true positive rate)

  • 95% accurate if the person does not have the disease (specificity: true negative rate)

You take the test and it comes back positive. What is the probability that you actually have the disease?

Example: Bayes’ Theorem in Medical Testing

Let:

  • \(D\): you have the disease
  • \(\neg D\): you do not have the disease
  • \(T\): the test result is positive

Given:

  • \(P(D) = 0.01\)
  • \(P(\neg D) = 0.99\)
  • \(P(T \mid D) = 0.99\) (true positive rate = sensitivity)
  • \(P(T \mid \neg D) = 0.05\)
    • (false positive rate = \(1 - 0.95\))

Applying Bayes’ Theorem:

\[ \begin{align} P(D \mid T) &= \frac{P(T \mid D) \cdot P(D)}{P(T \mid D) \cdot P(D) + P(T \mid \neg D) \cdot P(\neg D)} \\ &= \frac{0.99 \cdot 0.01}{0.99 \cdot 0.01 + 0.05 \cdot 0.99} = \frac{0.0099}{0.0099 + 0.0495} = \frac{0.0099}{0.0594} \approx 0.1667 \end{align} \]

Even after testing positive, there’s only a 16.7% chance that you actually have the disease, due to the low base rate of the disease in the population.
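
The same computation in R (a minimal sketch using the numbers above):

Code
# Bayes' theorem for the medical test example
prev  <- 0.01   # P(D): disease prevalence
sens  <- 0.99   # P(T | D): sensitivity
spec  <- 0.95   # P(T- | no D): specificity
p_pos <- sens * prev + (1 - spec) * (1 - prev)   # P(T), by total probability
sens * prev / p_pos                              # P(D | T) ≈ 0.167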

Intermission

The Fallibility of the Judge

The Robbery

In 1964, an elderly lady, returning from the supermarket, is pushed to the ground and robbed of her purse.

  • She manages to notice that the assailant is a blonde woman with a ponytail
  • A passerby, who witnessed the scene, notices that the robber escapes in a yellow car driven by a man with a beard, mustache, and dark skin

The Suspects

A few days later, Janet and Malcolm Collins are stopped—they match the descriptions.

During the trial, a mathematician calculates the probability that an innocent couple would match each of the following characteristics:

  • A dark-skinned man with a beard
  • A man with a mustache
  • A woman with a ponytail
  • A woman with blonde hair
  • A yellow car
  • An interracial couple traveling in the same car

The Suspects (cont’d)

Estimated probabilities:

  • Dark-skinned man with a beard: \(\frac{1}{10}\)
  • Man with a mustache: \(\frac{1}{4}\)
  • Woman with a ponytail: \(\frac{1}{10}\)
  • Blonde-haired woman: \(\frac{1}{3}\)
  • Yellow car: \(\frac{1}{10}\)
  • Interracial couple traveling together: \(\frac{1}{1000}\)

\[ \frac{1}{10} \times \frac{1}{4} \times \frac{1}{10} \times \frac{1}{3} \times \frac{1}{10} \times \frac{1}{1000} = \frac{1}{12,000,000} \]

Janet and Malcolm are found guilty.

Example

keep in mind that

\[P(A \mid B) \neq P(B \mid A)\]

suppose that

  • \(A\): the animal is a dog
  • \(B\): the animal has 4 legs

then

  • \(P(A \mid B)\): Given that the animal has 4 legs, what’s the probability it’s a dog?
  • \(P(B \mid A)\): Given that it’s a dog, what’s the probability it has 4 legs?

of course they are not the same!

Back to the Trial

Define:

  • \(A\): The couple is innocent
  • \(B\): The couple matches the witnesses’ description

The mathematician calculated:

  • \(P(B \mid A)\): Probability a couple matches the description if they are innocent

What should have been calculated:

  • \(P(A \mid B)\): Probability a couple is innocent given they match the description

Facts:

  • There were 10 couples in the city matching the description

  • Of those, 9 were innocent

Therefore:

  • \(P(B \mid A) = \frac{1}{12,000,000}\)
  • \(P(A \mid B) = \frac{9}{10}\)

True News

Dropout Rates

The dropout rate increased by 100%

The dropout rate increased from 0.001 to 0.002

Both statements are true, but the first has a greater emotional impact.

End of Intermission

Random Variables

  • Often we’re not directly interested in the outcome of a particular experiment, but in the numerical value that the result determines (e.g., when playing dice, what matters is the sum of the dice faces).
  • The quantities determined based on the outcome of an experiment are called random variables.
  • Each value of a random variable corresponds to one or more outcomes of an experiment, so each value of a random variable is associated with a probability.
  • In general, we don’t know what value the random variable will take, but by studying the probability distribution we can understand what to expect.

Probability and Random Variables

Consider the experiment tossing four coins and define the random variable \(X =\) number of heads (H).

The possible outcomes of the experiment are:

\[ \begin{equation} \begin{split} \Omega = \{ & HHHH, HHHT, HHTH, HHTT, \\ & HTHH, HTHT, HTTH, HTTT, \\ & THHH, THHT, THTH, THTT,\\ & TTHH, TTHT, TTTH, TTTT\} \end{split} \end{equation} \]

The corresponding values of \(X\) for these outcomes are:

\[ \begin{equation} \begin{split} X = \{ & 4, 3, 3, 2, \\ & 3, 2, 2, 1, \\ & 3, 2, 2, 1,\\ & 2, 1, 1, 0\} \end{split} \end{equation} \]

The values that \(X\) can assume are \(\{0, 1, 2, 3, 4\}\).

Probability Distribution of X (also called probability mass function, pmf)

Since the 16 outcomes are equally likely, each has probability 1/16, and \(P(X = x)\) is the number of outcomes with \(x\) heads divided by 16. We can summarize these probabilities in a table:

| \(x_i\) | \(p_i\) |
|---------|---------|
| 0       | 1/16    |
| 1       | 4/16    |
| 2       | 6/16    |
| 3       | 4/16    |
| 4       | 1/16    |

(they look like relative frequencies associated with the values of \(X\), don’t they?)
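
The same table can be obtained by enumerating the 16 equally likely outcomes in R (a minimal sketch):

Code
# Enumerate all outcomes of four coin tosses and count heads in each
tosses <- expand.grid(rep(list(c("H", "T")), 4), stringsAsFactors = FALSE)
x <- rowSums(tosses == "H")   # value of X for each of the 16 outcomes
table(x) / length(x)          # pmf: 1/16, 4/16, 6/16, 4/16, 1/16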

Plot of the pmf

Code
df <- data.frame(
  x_i = c(0, 1, 2, 3, 4),
  p_i = c(1/16, 4/16, 6/16, 4/16, 1/16)
)

# Create the lollipop plot
ggplot(df, aes(x = factor(x_i), y = p_i)) +
  geom_segment(aes(xend = factor(x_i), y = 0, yend = p_i), color = "dodgerblue", linewidth = 1.5) +
  geom_point(color = "indianred", size = 4) +
  labs(
    title = "Probability Distribution of Number of Heads (X)",
    x = "Number of Heads (xᵢ)",
    y = "Probability (pᵢ)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16),
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14)
  )

Event Probabilities from Distribution

Knowing the probability distribution of \(X\), we can compute probabilities of events derived from the sample space \(\Omega\).

Event A: at least 3 heads:

\[ P(A) = P(X\geq 3)= P(X=3) + P(X=4) = \frac{4}{16} + \frac{1}{16} = \frac{5}{16} \] you add up the probabilities associated with the values of \(X\) that satisfy the event.

Code
# Highlight the values of X that satisfy the event (X >= 3)
df |> mutate(active = ifelse(x_i >= 3, "yes", "no")) |>
  ggplot(aes(x = factor(x_i), y = p_i, colour = active)) +
  geom_segment(aes(xend = factor(x_i), y = 0, yend = p_i), linewidth = 1.5) +
  geom_point(size = 4, color = "indianred") +
  labs(
    title = "X >= 3",
    x = "Number of Heads (xᵢ)",
    y = "Probability (pᵢ)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16),
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    legend.position = "none"
  ) +
  scale_color_manual(values = c("yes" = "lightgreen", "no" = "grey"))

Event Probabilities from Distribution

Event B: less than 3 heads:

\[ P(B) = P(X=0) + P(X=1) + P(X=2) = \frac{1}{16} + \frac{4}{16} + \frac{6}{16} = \frac{11}{16} \]

Code
# Highlight the values of X that satisfy the event (X < 3)
df |> mutate(active = ifelse(x_i < 3, "yes", "no")) |>
  ggplot(aes(x = factor(x_i), y = p_i, colour = active)) +
  geom_segment(aes(xend = factor(x_i), y = 0, yend = p_i), linewidth = 1.5) +
  geom_point(size = 4, color = "indianred") +
  labs(
    title = "X < 3",
    x = "Number of Heads (xᵢ)",
    y = "Probability (pᵢ)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16),
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    legend.position = "none"
  ) +
  scale_color_manual(values = c("yes" = "lightgreen", "no" = "grey"))

Types of Random Variables

  • The number of heads (in four tosses) is a discrete random variable: it maps the outcomes of an experiment to a finite or countable set of real numbers.

  • Of course, some random variables may map to a continuous set of real numbers.

    • the time it takes for a bus to arrive at a bus stop is a continuous random variable, as it can take any value within a certain range.

Probability distribution:

For discrete random variables, there is the probability mass function

| \(x_i\)    | \(p_i\)    |
|------------|------------|
| \(x_1\)    | \(p_1\)    |
| \(x_2\)    | \(p_2\)    |
| \(\ldots\) | \(\ldots\) |
| \(x_i\)    | \(p_i\)    |
| \(\ldots\) | \(\ldots\) |

It must satisfy:

\[ p_i \geq 0, \quad \forall i = 1, 2, \ldots \quad \text{and} \quad \sum_{i=1}^{+\infty} p_i = 1 \]

Probability density function (PDF):

For continuous random variables, the probability of observing exactly one specific value (the bus takes exactly 63.4 seconds…) is zero

  • one rather refers to a small interval of values, and the probability density function is used.

The probability of observing a value in the interval \([x_1, x_2]\) is given by the area under the curve of the PDF between \(x_1\) and \(x_2\):

\[ \int_{x_{1}}^{x_{2}} f(x)dx \] where \(f(x)\) is the probability density function (PDF) of the random variable \(X\).

Probability density function (PDF):

Consider a generic density \(f(x)\) that takes values between -6 and 6.

  • What is the probability of observing a value between -0.5 and 1?
Code
# Define a bimodal density: a mixture of two normals (the weights must sum to 1)
bimodal_density <- function(x) {
  0.4 * dnorm(x, mean = -1, sd = 1) +  # first peak
  0.6 * dnorm(x, mean = 2, sd = 0.8)   # second peak
}

# Generate data for plotting
x_vals <- seq(-6, 6, length.out = 1000)
density_vals <- bimodal_density(x_vals)
df <- tibble(x = x_vals, y = density_vals)
highlight_df <- df |> filter(x>-.5 & x <= 1)

# Plot
ggplot(df, aes(x = x, y = y)) +
  geom_line(color = "steelblue", linewidth = 1.2) +
  geom_area(data = highlight_df, fill = "indianred", alpha = 0.5) +
  labs(
    title = "generic pdf: probability between -0.5 and 1",
    x = "x", y = "Density"
  ) +
  theme_minimal()

Just like the PMF, the PDF must satisfy:

\[ f(x) \geq 0 \quad \text{and} \quad \int_{-\infty}^{+\infty} f(x)\,dx = 1 \]

Taking the integral over the whole range plays the same role as summing the probabilities of all possible outcomes in the PMF: we integrate over a continuous range of values instead of summing over discrete values.
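
Both conditions can be checked numerically with integrate(), using the bimodal_density() defined above (a minimal sketch):

Code
# The density integrates to 1 over the whole real line,
# and interval probabilities are areas under the curve
integrate(bimodal_density, lower = -Inf, upper = Inf)   # ≈ 1
integrate(bimodal_density, lower = -0.5, upper = 1)     # P(-0.5 ≤ X ≤ 1)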

A super easy probability density function

Suppose that a random variable describes at which minute of the game an NBA basketball player is on the court.

  • The game consists of 4 quarters, each of 12 minutes.

  • Based on last season, here is the probability distribution for the player being in play during the game

Code
set.seed(123)
# Simulate, for each of 82 games, whether the player was in play in each quarter
playtime <- tibble(
         `[1, 12]`  = ifelse(runif(82) > .5,  "yes", "no"),
         `[12, 24]` = ifelse(runif(82) > .75, "yes", "no"),
         `[24, 36]` = ifelse(runif(82) > .9,  "yes", "no"),
         `[36, 48]` = ifelse(runif(82) > .6,  "yes", "no")
         ) |>
  pivot_longer(cols = `[1, 12]`:`[36, 48]`, names_to = "quarter", values_to = "in_play") |>
  filter(in_play == "yes") |>
  mutate(
    # a representative minute inside each quarter, used only to place the histogram bars
    playing = case_when(
      quarter == "[1, 12]"  ~ 2,
      quarter == "[12, 24]" ~ 14,
      quarter == "[24, 36]" ~ 25,
      quarter == "[36, 48]" ~ 47,
    )
  )

# Bar heights: proportion of "in play" quarters divided by the bar width (12 minutes)
density_play <- playtime |> count(quarter) |> mutate(prop = n / sum(n), density = prop / 12) |> pull(density)

playtime |> ggplot(aes(x = playing)) +
  geom_histogram(aes(y = after_stat(density)), breaks = c(0, 12, 24, 36, 48), fill = "steelblue", alpha = .5, closed = "right") +
  labs(title = "Histogram of Player's Playtime", x = "Playtime (minutes)", y = "Density") +
  theme_minimal()

A super easy probability density function

What is the probability that the player is on the court in minutes 10 to 15?

Code
playtime |> ggplot(aes(x=playing)) +
  geom_histogram(aes(y=after_stat(density)),breaks=c(0,12,24,36,48)  , fill="steelblue",alpha=.2) +
  annotate(geom="rect", xmin=10, xmax=12, ymin=0, ymax=density_play[1], alpha=0.5, fill="indianred") +
  annotate(geom="rect", xmin=12, xmax=15, ymin=0, ymax=density_play[2], alpha=0.5, fill="indianred") +
  labs(title="Histogram of Player's Playtime", x="Playtime (minutes)", y="Density") +
  theme_minimal()
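
Numerically, the shaded probability is just width × height of each highlighted bar (a minimal sketch using the density_play values computed above):

Code
# P(10 <= X <= 15): 2 minutes of the first bar plus 3 minutes of the second
(12 - 10) * density_play[1] + (15 - 12) * density_play[2]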

A super easy probability density function

What is the probability that the player is on the court in minutes 20 to 40?

Code
playtime |> ggplot(aes(x=playing)) +
  geom_histogram(aes(y=after_stat(density)),breaks=c(0,12,24,36,48)  , fill="steelblue",alpha=.2) +
  annotate(geom="rect", xmin=20, xmax=24, ymin=0, ymax=density_play[2], alpha=0.5, fill="indianred") +
  annotate(geom="rect", xmin=24, xmax=36, ymin=0, ymax=density_play[3], alpha=0.5, fill="indianred") +
  annotate(geom="rect", xmin=36, xmax=40, ymin=0, ymax=density_play[4], alpha=0.5, fill="indianred") +
  labs(title="Histogram of Player's Playtime", x="Playtime (minutes)", y="Density") +
  theme_minimal()

Expected value

The mean of a random variable \(X\) is its expected value \(E[X]\): that is, the best single-number guess for the value of \(X\).

Going back to the number of heads in 4 coin tosses, the expected value is:

\[ \begin{align*} E[X]=\sum_{i=1}^{n}{x_{i}p(x_{i})}= & 0 \times (1/16) +1 \times (4/16) + \\ & 2 \times (6/16) + 3 \times (4/16) + 4 \times (1/16) = 2 \end{align*} \]

The variance of a random variable is the expected value of the squared deviation of the random variable from its mean.

\[ \begin{align*} Var(X) &= \sum_{i=1}^{n}{\left(x_{i}-E[X]\right)^{2}p(x_{i})} \\ &= (0-2)^2 \times (1/16) + (1-2)^2 \times (4/16) + (2-2)^2 \times (6/16) \\ &\quad + (3-2)^2 \times (4/16) + (4-2)^2 \times (1/16) = 1 \end{align*} \]
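
The same computations in R, using the pmf of the number of heads:

Code
# Expected value and variance of the number of heads in four tosses
x <- 0:4
p <- c(1, 4, 6, 4, 1) / 16
EX   <- sum(x * p)            # E[X]  = 2
VarX <- sum((x - EX)^2 * p)   # Var(X) = 1
c(EX = EX, VarX = VarX)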

Expected value

not new

For discrete variables, the expected value is computed just like the mean of a variable, with the weights being the probabilities of each value instead of their relative frequencies.

E[X] for continuous variables

For continuous variables, the rationale is the same, but integration replaces summation, as sketched below.
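
In symbols, the sum is replaced by an integral:

\[ E[X] = \int_{-\infty}^{+\infty} x \, f(x)\, dx \]

As a minimal numeric sketch, the expected value of the bimodal density used earlier can be obtained with integrate():

Code
# Numerical expected value of the bimodal density
integrate(function(x) x * bimodal_density(x), lower = -Inf, upper = Inf)$value
# mixture mean: 0.4 * (-1) + 0.6 * 2 = 0.8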