From Probability to Inference

The Bridge

  • Probability: Known model → predict likelihood of outcomes
  • Inference: Observed data → draw conclusions about model

“Is what we observed surprising if the coin is fair?”

How confident are you in a poll predicting an election?

  • Polls report point estimates, e.g., “Candidate A: 52%, Candidate B: 48%”, but always with uncertainty
  • That 4-point lead might seem convincing — but with a margin of error (say ±3%), the true difference might be much smaller
  • There’s a real probability that the candidate who appears behind is actually ahead
Code
set.seed(42)
library(tidyverse)

# Simulate 10,000 poll leads: true lead 4 points, noise sd 3 points
sim_poll <- rnorm(10000, mean = 0.04, sd = 0.03)

ggplot(data.frame(lead = sim_poll), aes(x = lead)) +
  geom_histogram(bins = 60, fill = "#3182bd", color = "grey") +
  geom_vline(xintercept = 0, color = "red", linetype = "dashed") +  # tie: lead = 0
  labs(title = "Simulated Election Poll Lead (4% ± 3%)",
       x = "Lead for Candidate A", y = "Frequency") +
  theme_minimal()

Inference Essentials

Population and Sample

  • The population is a large set of objects with measurable quantities.

  • A sample is a small subset of the population.

  • The goal of the statistical approach is to analyze the sample to draw conclusions about the population.

Random Sample

  • To make inferences from the sample to the population, we assume the population follows a probability distribution F, with unknown parameter(s).

    • For example, the population might follow a Normal distribution with parameters \(\mu\) and \(\sigma^2\): knowing the parameters is all it takes to characterize the distribution of the population (the parametric approach).
  • By randomly selecting elements, the associated values are treated as random variables distributed according to F.

Definition:
A set of independent, identically distributed random variables \(X_1, X_2, \ldots, X_n \sim F\) is called a random sample from distribution F.

From Sample to Population

  • Point Estimate: Uses sample data to estimate a population parameter.
  • Confidence Interval: Adds a margin of error to the point estimate.
  • Hypothesis Test: Tests two opposing hypotheses about a population parameter.

Definition of a Point Estimator

  • Let \(X\) be a r.v. from a known distribution family (Normal, Binomial…), with unknown parameter \(\theta\).
  • Use a random sample \(X_1, X_2, \ldots, X_n\).
  • A known function \(T(\cdot)\) estimates the parameter \(\theta\):
    \(\hat{\theta} = T(X_1, \ldots, X_n)\)

Point Estimators

  • Estimator A: no bias, low variability
  • Estimator B: no bias, high variability
  • Estimator C: bias, high variability
  • Estimator D: bias, low variability

Ideal Estimator

An ideal estimator is:

  • Unbiased: Its expected value equals the true parameter.

  • Efficient: Has low variance around the parameter.

Natural estimator

  • If the target parameter is the population mean, the natural estimator is the sample mean.

  • If the target parameter is the population proportion, the natural estimator is the sample proportion.

Sample Mean Distribution

Given a population (e.g., workers) with a measurable variable (e.g., income):
Let \(X_1, X_2, \ldots, X_n\) be i.i.d. with distribution \(F\) and parameters \(\mu\), \(\sigma^2\).

Sample mean:
\(\bar{X} = \frac{1}{n} \sum X_i\)

Population and Random Samples

Code
library(tidyverse)
library(patchwork)
set.seed(123)
n <- 50     # sample size
n_s <- 15   # number of samples
toy_data <- tibble(values = rnorm(1000000, mean = 5, sd = 2)) |> mutate(what="population")
sample_data <- sample_n(toy_data, n*n_s, replace = TRUE) |> 
  mutate(what=fct_inorder(rep(paste0("sample_",1:n_s),each=n)))

# plot population and sample
pop_pl <- toy_data |> ggplot(aes(x = values)) +
  geom_histogram(fill="gold",bins = 100,color="darkgrey",alpha=.5) +
  labs(title = "population distribution N(5, 2)",
       x = "", y = "count") + geom_vline(aes(xintercept = 5), color = "indianred", linetype = "dashed") +
  theme_minimal()+ facet_wrap(~what, scales = "free_y",ncol=1) 
 
sample_means <- sample_data |> 
  group_by(what) |> 
  summarise(mean_val = mean(values))

sam_pl <- sample_data |> 
  ggplot(aes(x = values, y = after_stat(density))) +
  geom_histogram(aes(fill = what), bins = 100, alpha = 0.5) +

  # Sample mean line
  geom_vline(data = sample_means, aes(xintercept = mean_val), 
             color = "indianred", linetype = "dashed", linewidth = 0.3) +

  # Population mean line (μ = 5)
  geom_vline(xintercept = 5, color = "blue", linetype = "solid", linewidth = 0.3) +

  labs(
    title = "Random Samples (size = 100)",
    x = "",
    y = "density"
  ) + 
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = 0.5, size = 5)
  ) +
  facet_wrap(~what, ncol = 5)

(pop_pl/sam_pl)

The sample mean \(\bar{X}\)

\(\bar{X}\) bounces around the population mean \(\mu\), but there is more to it…

Code
library(tidyverse)
library(gganimate)

set.seed(123)
n <- 50       # sample size
n_s <- 100      # number of samples
true_mean <- 5

# Generate samples
sample_data <- tibble(
  sample_id = rep(1:n_s, each = n),
  values = rnorm(n * n_s, mean = true_mean, sd = 2)
) |>
  group_by(sample_id) |>
  mutate(sample_mean = mean(values))

# Plot
anim <- ggplot(sample_data, aes(x = values, fill = as.factor(sample_id))) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, color = "white") +
  geom_vline(aes(xintercept = sample_mean), color = "dodgerblue", linetype ="solid" ) +
  geom_vline(xintercept = true_mean, color = "indianred", linetype = "dashed") +
  labs(title = 'Sample {closest_state}', x = 'Value', y = 'Density') +
  theme_minimal() + theme(
    legend.position = "none"
  ) +
  transition_states(sample_id, transition_length = 1, state_length = 1) +
  ease_aes('sine-in-out')

# Render and save the animation
anim_file <- animate(anim, nframes = 50, fps = 5, width = 600, height = 400, renderer = gifski_renderer())
anim_save("figures/sample_animation.gif", animation = anim_file)

The sample mean \(\bar{X}\)

By drawing many samples from the population (say 10,000), computing the mean of each sample, and plotting the distribution of these sample means, we can see that:

Code
n <- 50       # sample size
n_s <- 10000  # number of samples

sample_data <- sample_n(toy_data, n*n_s, replace = TRUE) |> 
  mutate(what=fct_inorder(rep(paste0("sample_",1:n_s),each=n)))

sample_means <- sample_data |> 
  group_by(what) |> 
  summarise(mean_val = mean(values))

bar_x_pl <- sample_means |> ggplot(aes(x = mean_val)) +
  geom_histogram(aes(y = after_stat(density)), bins = 100, fill = "forestgreen", color = "darkgrey", alpha = .5) +
  expand_limits(x = c(-5, 15)) +
  theme_minimal() +
  geom_vline(aes(xintercept = mean(mean_val)), color = "gold", linetype = "dashed")


(pop_pl/bar_x_pl)

the sample mean distribution approaches a normal distribution with mean \(\mu=5\) and standard deviation \(\frac{\sigma}{\sqrt{n}}\).

(In this case, \(\sigma=2\) and \(n=50\), so \(\frac{\sigma}{\sqrt{n}}=\frac{2}{\sqrt{50}}=0.283\).)

Interval Estimators - Confidence Intervals

Confidence Interval Example

For sample mean \(\bar{X}\): \[ P(\bar{X} - \text{ME} < \mu < \bar{X} + \text{ME}) = \text{confidence level} \] ME = Margin of Error

Consider the sample mean estimator \(\bar{X}\), whose expected value is \(E(\bar{X}) = \mu\) and whose standard deviation is \[\text{sd}(\bar{X}) = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}.\]

Interval Estimators - Confidence Intervals

The standardized value of \(\bar{X}\) is

\[Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \sqrt{n} \cdot \frac{\bar{X} - \mu}{\sigma}\]

Therefore, the confidence level \((1 - \alpha)\) corresponds to:

\[P\left( \frac{\sqrt{n}}{\sigma} \cdot |\bar{X} - \mu| \leq Z_{\alpha/2} \right) = 1 - \alpha\]

Equivalently:

\[P\left( \bar{X} - Z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + Z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \right) = 1 - \alpha\]

Visual: 95% Confidence Interval

CI for the Mean

Example 1 - Known Variance

Suppose that the signal from an MRI (Magnetic Resonance Imaging) machine is transmitted from a source A to a destination B. The signal is emitted with intensity \(\mu\).

The intensity perceived at B follows a Normal distribution with mean \(\mu\) and standard deviation \(\sigma = 3\).

In other words, due to transmission noise, the signal intensity received at B differs from the one emitted at A by a noise term with mean \(0\) and standard deviation \(\sigma = 3\).

Suppose that \(n = 10\) transmissions were performed from A to B, and the signal intensity at the destination was recorded.

Example 1 - Known Variance

These are the quantities needed:

\(\{17,21,20,18,19,22,20,21,16,19\}\), \(n = 10\), \(\sigma = 3\)

  • 95% CI:

\[ \bar{X} = 19.3,\ Z_{0.025} = 1.96,\ ME = \frac{\sigma}{\sqrt{n}}Z_{0.025}=\frac{3}{\sqrt{10}}1.96 = 1.86 \rightarrow \bar{X}\pm1.86\rightarrow[17.44, 21.16] \]

  • 90% CI:

\[Z_{0.05} = 1.645 \Rightarrow [17.74, 20.86]\]

  • 99% CI:
    \[Z_{0.005} = 2.576 \Rightarrow [16.86, 21.74]\]
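
As a check, these intervals can be reproduced in R; the helper function `ci_known_var` below is introduced here for illustration.

Code
# Signal intensities recorded at B
x <- c(17, 21, 20, 18, 19, 22, 20, 21, 16, 19)
sigma <- 3            # known population standard deviation
n <- length(x)
x_bar <- mean(x)      # 19.3

# z-based confidence interval for the mean at a given confidence level
ci_known_var <- function(conf_level) {
  z <- qnorm(1 - (1 - conf_level) / 2)  # e.g., 1.96 for 95%
  me <- z * sigma / sqrt(n)             # margin of error
  c(lower = x_bar - me, upper = x_bar + me)
}

ci_known_var(0.95)  # [17.44, 21.16]
ci_known_var(0.90)  # [17.74, 20.86]
ci_known_var(0.99)  # [16.86, 21.74]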

Sample Size Calculation

Given a confidence interval, and fixing the width of the interval at \(b\), determine the sample size \(n\) needed.

Recall that the bounds of the confidence interval for the mean are given by: \[\bar{X} \pm Z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\]

Therefore, the total width of the interval is: \[2 \cdot Z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} = b\]

Solving for \(n\): \[2 \cdot Z_{\alpha/2} \cdot \frac{\sigma}{b} = \sqrt{n} \quad \Longrightarrow \quad n = \left(2 \cdot Z_{\alpha/2} \cdot \frac{\sigma}{b}\right)^2\]

What sample size is needed to obtain a 95% confidence interval for the population mean \(\mu\), with a total width of \(b = 0.01\), given that \(\sigma = 2\)? \[ n = (784)^2 = 614,656 \]
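
The same arithmetic in R (a minimal sketch; `z` is rounded to 1.96 as in the text):

Code
z <- 1.96      # z_{0.025}, rounded as above
sigma <- 2
b <- 0.01      # total interval width
n <- (2 * z * sigma / b)^2
n              # 614656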

Unknown Variance - Use Student’s t

If the population standard deviation \(\sigma\) is not known, it must be estimated using the sample standard deviation \(s\), given by:

\[ s = \sqrt{ \frac{ \sum_{i=1}^{n} (X_i - \bar{X})^2 }{n - 1} } \]

The standardized quantity is:

\[ T_{n-1} = \frac{\bar{X} - \mu}{s / \sqrt{n}} \]

and it follows a Student’s t-distribution with \(n - 1\) degrees of freedom.

Unknown Variance - Use Student’s t

Example 3 - Unknown Variance

An agency needs to assess the concentration level of a toxic substance, PCB, in breast milk. To do this, a sample of 20 mothers is considered, and the concentration level of the substance in their milk is studied.

PCB concentration in 20 samples:

\[16,0,0,2,3,6,8,2,5,0,12,10,5,7,2,3,8,17,9,1\]

\[ \bar{x} = 5.8,\ S = 5.085, \ n = 20, t_{\alpha/2,n-1} = 2.093 \]

We calculate the confidence interval as:

\[ \bar{x} \pm t_{\alpha/2,\,n-1} \cdot \frac{S}{\sqrt{n}} = 5.8 \pm 2.093 \cdot \frac{5.085}{\sqrt{20}} = 5.8 \pm 2.38 \]

Therefore, the confidence interval estimates are: \(\left[3.42,\ 8.18\right]\)

  • 95% CI:
    \(t_{0.025,19} = 2.093 \Rightarrow [3.42, 8.18]\)

  • 99% CI:
    \(t_{0.005,19} = 2.861 \Rightarrow [2.55, 9.05]\)
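
In R, `t.test()` reproduces these intervals from the raw data (a quick check, not part of the original derivation):

Code
pcb <- c(16, 0, 0, 2, 3, 6, 8, 2, 5, 0, 12, 10, 5, 7, 2, 3, 8, 17, 9, 1)
mean(pcb)                                # 5.8
sd(pcb)                                  # 5.085
t.test(pcb, conf.level = 0.95)$conf.int  # [3.42, 8.18]
t.test(pcb, conf.level = 0.99)$conf.int  # [2.55, 9.05]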

CI for Proportion

Let \(p\) be the proportion of statistical units in the population that exhibit a certain characteristic. The corresponding sample statistic is the sample proportion \(\hat{p} = \frac{x}{n}\), where \(x\) is the number of units in the sample that exhibit the characteristic.

The sample proportion \(\hat{p}\) satisfies:

  • Expected value: \(E[\hat{p}] = p\)
  • Standard deviation: \(\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}\)

The confidence interval for the population proportion is given by: \[\hat{p} \pm Z_{\alpha/2} \cdot \sigma_{\hat{p}}\]

Since the true value of \(p\) is unknown, the standard deviation is estimated by: \[S_{\hat{p}} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\]

Therefore, the confidence interval becomes: \[\hat{p} \pm Z_{\alpha/2} \cdot \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\]

Example - CI for Proportion

A survey was conducted to determine the proportion of students who passed a certain exam. 82 out of 100 students passed. What is the 99% confidence interval for the proportion of students who passed?

\[ \hat{p} = 0.82,\ Z_{0.005} = 2.576 \Rightarrow CI: [0.721, 0.919] \]
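
A minimal R sketch of the same computation:

Code
p_hat <- 82 / 100
n <- 100
z <- qnorm(1 - 0.01 / 2)              # 2.576 for 99% confidence
se <- sqrt(p_hat * (1 - p_hat) / n)   # estimated sd of p_hat
p_hat + c(-1, 1) * z * se             # [0.721, 0.919]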

Example 5 - Sample Size for Proportion

A certain newspaper reports the result of a survey, according to which 46% of the population intends to get vaccinated against the flu virus. It is stated that the margin of error is 3%, and the confidence level used is \((1 - \alpha) = 0.95\). How many people were interviewed?

Given \(\hat{p} = 0.46\), ME = 0.03, CI 95%: \[ n = \frac{1.96^2 \cdot 0.46 \cdot 0.54}{0.03^2} \approx 1060 \]
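
And in R (sketch):

Code
p_hat <- 0.46
me <- 0.03
z <- 1.96                             # z_{0.025}
n <- z^2 * p_hat * (1 - p_hat) / me^2
round(n)                              # about 1060 interviewees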

Hypothesis Testing

Definition

A statistical hypothesis is a statement concerning one or more parameters of the population distribution. It is a hypothesis rather than a fact because it cannot be established a priori whether it is true or not.

Testing Procedure

We aim to determine whether the sample values are compatible with the statistical hypothesis in question.

Accepting a hypothesis based on the observed sample does not mean affirming that it is true, but rather that the collected data do not provide evidence against it.

Hypothesis Testing: parameter and sample Spaces

Parameter Space

The parameter space is the set of all values a population parameter can take.

By formulating a hypothesis (e.g., \(\mu = 1\) or \(\mu \geq 1\)), we create a bipartition of this space: values for which the hypothesis is true and those for which it is false.

Sample Space

The sample space is the set of all possible samples of size \(n\) that can be observed.

The decision rule divides this space: some samples lead to acceptance of the hypothesis, others to rejection.

Null and Alternative Hypotheses

Competing Hypotheses

Every hypothesis test includes two competing hypotheses:

  • \(H_0\): null hypothesis (e.g., \(\mu = 1\), \(\mu \geq 1\))
  • \(H_1\): alternative hypothesis (e.g., \(\mu \neq 1\), \(\mu < 1\))

Parameter Space Bipartition

Let \(\omega_0\) be the set of values of the parameter \(\theta\) defined by \(H_0\):

  • If \(\theta \in \omega_0\), then \(H_0\) is true.
  • If \(\theta \notin \omega_0\), then \(H_0\) is false.

Simple vs Composite Hypotheses

  • If \(H_0: \mu = 1\) vs. \(H_1: \mu \neq 1\), then \(H_0\) is a simple hypothesis (only one value), and \(H_1\) is two-sided.
  • If \(H_0: \mu \geq 1\) vs. \(H_1: \mu < 1\), then \(H_0\) is a composite hypothesis (a range of values), and \(H_1\) is one-sided.

Decision rule

The observed sample yields a value of the test statistic \(T_n\). There is a set \(C_{0}\) of possible values of \(T_{n}\) (the acceptance region) that lead to the acceptance of \(H_0\).

The decision can be right or wrong:

Code
library(ggplot2)

# Define rectangles
rects <- data.frame(
  xmin = c(0, 5, 0, 5),
  xmax = c(5, 10, 5, 10),
  ymin = c(5, 5, 0, 0),
  ymax = c(10, 10, 5, 5),
  fill = c("palegreen", "indianred", "indianred", "palegreen")
)

# Labels (now fully valid plotmath syntax)
labels <- data.frame(
  x = c(2.5, 7.5, 2.5, 7.5),
  y = c(7.5, 7.5, 2.5, 2.5),
  label = c(
    "T[n] %in% C[0] * ',' * theta %in% omega[0]",
    "T[n] %notin% C[0] * ',' * theta %in% omega[0]",
    "T[n] %in% C[0] * ',' * theta %notin% omega[0]",
    "T[n] %notin% C[0] * ',' * theta %notin% omega[0]"
  ),
  color = c("dodgerblue", "white", "white", "dodgerblue")
)

labels_2 <- data.frame(
  x = c(1, 9, 1, 9),
  y = c(7.5, 7.5, 2.5, 2.5),
  label = c(
    "accept * ' ' * H[0] * ':' * correct",
    "reject * ' ' * H[0] * ':' * wrong",
    "accept * ' ' * H[0] * ':' * wrong",
    "reject * ' ' * H[0] * ':' * correct"
  ),
  color = c("dodgerblue", "white", "white", "dodgerblue")
)


# Plot
ggplot() +
  geom_rect(data = rects, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = fill), alpha = 0.6) +
  geom_text(
    data = labels,
    aes(x = x, y = y, label = label, color = color),
    parse = TRUE, angle = 45, size = 7, fontface = "bold"
  ) +
  geom_text(
    data = labels_2,
    aes(x = x, y = y, label = label, color = color),
    parse = TRUE, angle = 90, size = 5, fontface = "bold"
  ) +
  scale_fill_identity() +
  scale_color_identity() +
  theme_void()

Errors in Testing

  • Type I Error (α): False positive → reject H₀ when it’s true
  • Type II Error (β): False negative → fail to reject H₀ when H₁ is true
Code
library(ggplot2)

# Define rectangles
rects <- data.frame(
  xmin = c(0, 5, 0, 5),
  xmax = c(5, 10, 5, 10),
  ymin = c(5, 5, 0, 0),
  ymax = c(10, 10, 5, 5),
  fill = c("palegreen", "indianred", "indianred", "palegreen")
)

# Labels (now fully valid plotmath syntax)
labels <- data.frame(
  x = c(2.5, 7.5, 2.5, 7.5),
  y = c(7.5, 7.5, 2.5, 2.5),
  label = c(
    "T[n] %in% C[0] * ',' * theta %in% omega[0]",
    "'type I error'",
    "'type II error'",
    "T[n] %notin% C[0] * ',' * theta %notin% omega[0]"
  ),
  color = c("dodgerblue", "white", "white", "dodgerblue")
)

labels_2 <- data.frame(
  x = c(1, 9, 1, 9),
  y = c(7.5, 7.5, 2.5, 2.5),
  label = c(
    "accept * ' ' * H[0] * ':' * correct",
    "reject * ' ' * H[0] * ':' * wrong",
    "accept * ' ' * H[0] * ':' * wrong",
    "reject * ' ' * H[0] * ':' * correct"
  ),
  color = c("dodgerblue", "white", "white", "dodgerblue")
)


# Plot
ggplot() +
  geom_rect(data = rects, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = fill), alpha = 0.6) +
  geom_text(
    data = labels,
    aes(x = x, y = y, label = label, color = color),
    parse = TRUE, angle = 45, size = 7, fontface = "bold"
  ) +
  geom_text(
    data = labels_2,
    aes(x = x, y = y, label = label, color = color),
    parse = TRUE, angle = 90, size = 5, fontface = "bold"
  ) +
  scale_fill_identity() +
  scale_color_identity() +
  theme_void()


Why Type I Error Matters More

Example

\(H_0\): The defendant is innocent

\(H_1\): The defendant is guilty

  • Type I error → Convicting an innocent person

  • Type II error → Letting a guilty person go free

Example

  • \(H_0\): The new drug is no better than the old

  • \(H_1\): The new drug is better

  • Type I error → Approving an ineffective drug

  • Type II error → Discarding an effective drug

Probability of Errors

One never knows whether \(H_{0}\) is true or not. The probability of making an error is given by the following:

  • \(P(\text{reject } H_0 \mid H_0 \text{ is true}) = \alpha\)

  • \(P(\text{accept } H_0 \mid H_1 \text{ is true}) = \beta\)

Code
# Parameters
mu0 <- 0       # Mean under H0
mu1 <- 2       # Mean under H1
sd <- 1        # Standard deviation (same for both)
alpha <- 0.05  # Significance level
z_crit <- qnorm(1 - alpha)  # Critical value for one-sided test

# x range
x_vals <- seq(-4, 6, length.out = 1000)

# Data for the two distributions
df <- data.frame(
  x = x_vals,
  H0 = dnorm(x_vals, mean = mu0, sd = sd),
  H1 = dnorm(x_vals, mean = mu1, sd = sd)
)

# Data for alpha area (under H0, right of critical value)
alpha_area <- df %>%
  filter(x > z_crit) %>%
  mutate(density = H0)

# Data for beta area (under H1, left of critical value)
beta_area <- df %>%
  filter(x < z_crit) %>%
  mutate(density = H1)

# Plot
ggplot(df, aes(x = x)) +
  geom_line(aes(y = H0), color = "dodgerblue", linewidth = 1.2, linetype = "solid") +
  geom_line(aes(y = H1), color = "indianred", linewidth = 1.2, linetype = "dashed") +
  geom_area(data = alpha_area, aes(y = density), fill = "dodgerblue", alpha = 0.3) +
  geom_area(data = beta_area, aes(y = density), fill = "indianred", alpha = 0.3) +
  geom_vline(xintercept = z_crit, linetype = "dotted") +
  annotate("text", x = z_crit + 0.3, y = 0.05, label = "Critical value", angle = 90, hjust = 0) +
  annotate("text", x = 4.2, y = 0.15, label = "H1", color = "indianred", size = 5) +
  annotate("text", x = -2, y = 0.15, label = "H0", color = "dodgerblue", size = 5) +
  annotate("text", x = 1.5, y = 0.02, label = expression(beta), color = "indianred", size = 6) +
  annotate("text", x = 3.5, y = 0.02, label = expression(alpha), color = "dodgerblue", size = 6) +
  theme_minimal() +
  labs(title = "Visualization of Alpha (Type I) and Beta (Type II) Errors",
       y = "Density", x = "Test Statistic")

Probability of Errors

Since the Type I error is the one to be minimized, one sets its probability of occurrence (the significance level \(\alpha\)) to a small value. The lower \(\alpha\), the higher \(\beta\).

if \(\alpha\) is set to 0.1

Code
# Parameters
mu0 <- 0       # Mean under H0
mu1 <- 2       # Mean under H1
sd <- 1        # Standard deviation (same for both)
alpha <- 0.1  # Significance level
z_crit <- qnorm(1 - alpha)  # Critical value for one-sided test

# x range
x_vals <- seq(-4, 6, length.out = 1000)

# Data for the two distributions
df <- data.frame(
  x = x_vals,
  H0 = dnorm(x_vals, mean = mu0, sd = sd),
  H1 = dnorm(x_vals, mean = mu1, sd = sd)
)

# Data for alpha area (under H0, right of critical value)
alpha_area <- df %>%
  filter(x > z_crit) %>%
  mutate(density = H0)

# Data for beta area (under H1, left of critical value)
beta_area <- df %>%
  filter(x < z_crit) %>%
  mutate(density = H1)

# Plot
ggplot(df, aes(x = x)) +
  geom_line(aes(y = H0), color = "dodgerblue", linewidth = 1.2, linetype = "solid") +
  geom_line(aes(y = H1), color = "indianred", linewidth = 1.2, linetype = "dashed") +
  geom_area(data = alpha_area, aes(y = density), fill = "dodgerblue", alpha = 0.3) +
  geom_area(data = beta_area, aes(y = density), fill = "indianred", alpha = 0.3) +
  geom_vline(xintercept = z_crit, linetype = "dotted") +
  annotate("text", x = z_crit + 0.3, y = 0.05, label = "Critical value", angle = 90, hjust = 0) +
  annotate("text", x = 4.2, y = 0.15, label = "H1", color = "indianred", size = 5) +
  annotate("text", x = -2, y = 0.15, label = "H0", color = "dodgerblue", size = 5) +
  annotate("text", x = 1.5, y = 0.02, label = expression(beta), color = "indianred", size = 6) +
  annotate("text", x = 3.5, y = 0.02, label = expression(alpha), color = "dodgerblue", size = 6) +
  theme_minimal() +
  labs(title = "Visualization of Alpha (Type I) and Beta (Type II) Errors",
       y = "Density", x = "Test Statistic")

Probability of Errors

Since the Type I error is the one to be minimized, one sets its probability of occurrence (the significance level \(\alpha\)) to a small value. The lower \(\alpha\), the higher \(\beta\).

if \(\alpha\) is set to 0.01

Code
# Parameters
mu0 <- 0       # Mean under H0
mu1 <- 2       # Mean under H1
sd <- 1        # Standard deviation (same for both)
alpha <- 0.01  # Significance level
z_crit <- qnorm(1 - alpha)  # Critical value for one-sided test

# x range
x_vals <- seq(-4, 6, length.out = 1000)

# Data for the two distributions
df <- data.frame(
  x = x_vals,
  H0 = dnorm(x_vals, mean = mu0, sd = sd),
  H1 = dnorm(x_vals, mean = mu1, sd = sd)
)

# Data for alpha area (under H0, right of critical value)
alpha_area <- df %>%
  filter(x > z_crit) %>%
  mutate(density = H0)

# Data for beta area (under H1, left of critical value)
beta_area <- df %>%
  filter(x < z_crit) %>%
  mutate(density = H1)

# Plot
ggplot(df, aes(x = x)) +
  geom_line(aes(y = H0), color = "dodgerblue", linewidth = 1.2, linetype = "solid") +
  geom_line(aes(y = H1), color = "indianred", linewidth = 1.2, linetype = "dashed") +
  geom_area(data = alpha_area, aes(y = density), fill = "dodgerblue", alpha = 0.3) +
  geom_area(data = beta_area, aes(y = density), fill = "indianred", alpha = 0.3) +
  geom_vline(xintercept = z_crit, linetype = "dotted") +
  annotate("text", x = z_crit + 0.3, y = 0.05, label = "Critical value", angle = 90, hjust = 0) +
  annotate("text", x = 4.2, y = 0.15, label = "H1", color = "indianred", size = 5) +
  annotate("text", x = -2, y = 0.15, label = "H0", color = "dodgerblue", size = 5) +
  annotate("text", x = 1.5, y = 0.02, label = expression(beta), color = "indianred", size = 6) +
  annotate("text", x = 3.5, y = 0.02, label = expression(alpha), color = "dodgerblue", size = 6) +
  theme_minimal() +
  labs(title = "Visualization of Alpha (Type I) and Beta (Type II) Errors",
       y = "Density", x = "Test Statistic")

The p-value

Setting the significance level \(\alpha\) defines the threshold (critical value): if the observed value of the test statistic is greater than the critical value, reject the null hypothesis.

Say the observed value is \(z = 2\) and \(\alpha = 0.01\):

Code
# Parameters
mu0 <- 0       # Mean under H0
mu1 <- 2       # Mean under H1
sd <- 1        # Standard deviation (same for both)
alpha <- 0.01  # Significance level
z_crit <- qnorm(1 - alpha)  # Critical value for one-sided test

# x range
x_vals <- seq(-4, 6, length.out = 1000)

# Data for the two distributions
df <- data.frame(
  x = x_vals,
  H0 = dnorm(x_vals, mean = mu0, sd = sd),
  H1 = dnorm(x_vals, mean = mu1, sd = sd)
)

# Data for alpha area (under H0, right of critical value)
alpha_area <- df %>%
  filter(x > z_crit) %>%
  mutate(density = H0)

# Data for beta area (under H1, left of critical value)
beta_area <- df %>%
  filter(x < z_crit) %>%
  mutate(density = H1)

# Plot
ggplot(df, aes(x = x)) +
  geom_line(aes(y = H0), color = "dodgerblue", linewidth = 1.2, linetype = "solid") +
  # geom_line(aes(y = H1), color = "indianred", linewidth = 1.2, linetype = "dashed") +
  geom_area(data = alpha_area, aes(y = density), fill = "dodgerblue", alpha = 0.3) +
  # geom_area(data = beta_area, aes(y = density), fill = "indianred", alpha = 0.3) +
  geom_vline(xintercept = z_crit, linetype = "dotted") +
  geom_vline(xintercept = 2, linetype = "solid",color="forestgreen") +
  annotate("text", x = 2 - 0.1, y = 0.05, label = "observed value", angle = 90, hjust = 0) +
  annotate("text", x = z_crit + 0.3, y = 0.05, label = "Critical value", angle = 90, hjust = 0) +
  annotate("text", x = 3, y = 0.15, label = "no reject H0", color = "indianred", size = 5) +
  annotate("text", x = -2, y = 0.15, label = "H0", color = "dodgerblue", size = 5) +
  # annotate("text", x = 1.5, y = 0.02, label = expression(beta), color = "indianred", size = 6) +
  annotate("text", x = 3.5, y = 0.02, label = expression(alpha), color = "dodgerblue", size = 6) +
  theme_minimal() +
  labs(title = "Visualization of Alpha (Type I) and Beta (Type II) Errors",
       y = "Density", x = "Test Statistic")

The p-value

Setting the significance level \(\alpha\) defines the threshold (critical value): if the observed value of the test statistic is greater than the critical value, reject the null hypothesis.

Say the observed value is \(z = 2\) and \(\alpha = 0.05\):

Code
# Parameters
mu0 <- 0       # Mean under H0
mu1 <- 2       # Mean under H1
sd <- 1        # Standard deviation (same for both)
alpha <- 0.05  # Significance level
z_crit <- qnorm(1 - alpha)  # Critical value for one-sided test

# x range
x_vals <- seq(-4, 6, length.out = 1000)

# Data for the two distributions
df <- data.frame(
  x = x_vals,
  H0 = dnorm(x_vals, mean = mu0, sd = sd),
  H1 = dnorm(x_vals, mean = mu1, sd = sd)
)

# Data for alpha area (under H0, right of critical value)
alpha_area <- df %>%
  filter(x > z_crit) %>%
  mutate(density = H0)

# Data for beta area (under H1, left of critical value)
beta_area <- df %>%
  filter(x < z_crit) %>%
  mutate(density = H1)

# Plot
ggplot(df, aes(x = x)) +
  geom_line(aes(y = H0), color = "dodgerblue", linewidth = 1.2, linetype = "solid") +
  # geom_line(aes(y = H1), color = "indianred", linewidth = 1.2, linetype = "dashed") +
  geom_area(data = alpha_area, aes(y = density), fill = "dodgerblue", alpha = 0.3) +
  # geom_area(data = beta_area, aes(y = density), fill = "indianred", alpha = 0.3) +
  geom_vline(xintercept = z_crit, linetype = "dotted") +
  geom_vline(xintercept = 2, linetype = "solid",color="forestgreen") +
  annotate("text", x = 2 + 0.1, y = 0.05, label = "observed value", angle = 90, hjust = 0) +
  annotate("text", x = z_crit + 0.1, y = 0.05, label = "Critical value", angle = 90, hjust = 0) +
  annotate("text", x = 3, y = 0.15, label = "reject H0", color = "indianred", size = 5) +
  annotate("text", x = -2, y = 0.15, label = "H0", color = "dodgerblue", size = 5) +
  # annotate("text", x = 1.5, y = 0.02, label = expression(beta), color = "indianred", size = 6) +
  annotate("text", x = 3.5, y = 0.02, label = expression(alpha), color = "dodgerblue", size = 6) +
  theme_minimal() +
  labs(title = "Visualization of Alpha (Type I) and Beta (Type II) Errors",
       y = "Density", x = "Test Statistic")

The p-value

Instead of setting \(\alpha\), we can set the observed value and calculate the p-value as the area under the curve to the right.

If the observed value is \(z = 2\), the p-value is \(P(Z > 2) = 0.0228\)

  • for \(\alpha=0.05\), reject the null hypothesis since \(0.0228 < 0.05\)

  • for \(\alpha=0.01\), do not reject the null hypothesis since \(0.0228 > 0.01\)
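
The p-value itself is a one-liner in R:

Code
pnorm(2, lower.tail = FALSE)  # 0.0228: area under the standard normal to the right of z = 2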

Code
# Parameters
mu0 <- 0       # Mean under H0
mu1 <- 2       # Mean under H1
sd <- 1        # Standard deviation (same for both)
alpha <- 0.05  # Significance level
z_crit <- qnorm(1 - alpha)  # Critical value for one-sided test



# x range
x_vals <- seq(-4, 6, length.out = 1000)

# Data for the two distributions
df <- data.frame(
  x = x_vals,
  H0 = dnorm(x_vals, mean = mu0, sd = sd),
  H1 = dnorm(x_vals, mean = mu1, sd = sd)
)

# Data for alpha area (under H0, right of critical value)
alpha_area <- df %>%
  filter(x > z_crit) %>%
  mutate(density = H0)

p_area <- df %>%
  filter(x > 2) %>%
  mutate(density = H0)

# Data for beta area (under H1, left of critical value)
beta_area <- df %>%
  filter(x < z_crit) %>%
  mutate(density = H1)

# Plot
ggplot(df, aes(x = x)) +
  geom_line(aes(y = H0), color = "dodgerblue", linewidth = 1.2, linetype = "solid") +
  # geom_line(aes(y = H1), color = "indianred", linewidth = 1.2, linetype = "dashed") +
  # geom_area(data = alpha_area, aes(y = density), fill = "dodgerblue", alpha = 0.3) +
  geom_area(data = p_area, aes(y = density), fill = "indianred", alpha = 0.3) +
  # geom_vline(xintercept = z_crit, linetype = "dotted") +
  geom_vline(xintercept = 2, linetype = "solid",color="forestgreen") +
  annotate("text", x = 2 + 0.1, y = 0.05, label = "observed value", angle = 90, hjust = 0) +
  # annotate("text", x = z_crit + 0.1, y = 0.05, label = "Critical value", angle = 90, hjust = 0) +
  # annotate("text", x = 3, y = 0.15, label = "reject H0", color = "indianred", size = 5) +
  annotate("text", x = -2, y = 0.15, label = "H0", color = "dodgerblue", size = 5) +
  # annotate("text", x = 1.5, y = 0.02, label = expression(beta), color = "indianred", size = 6) +
  annotate("text", x = 2.8, y = 0.05, label = expression(p-value), color = "indianred", size = 6) +
  theme_minimal() +
  labs(title = "Visualization of Alpha (Type I) and Beta (Type II) Errors",
       y = "Density", x = "Test Statistic")

Hypothesis Testing on the Mean (Known Variance)

Example 1: One-Sided Test (Known Variance)

Hip prosthetics have a mean resistance \(\mu = 1800\) N and standard deviation \(\sigma = 100\) N. After a process improvement, a sample of \(n=50\) prosthetics has \(\bar{X} = 1850\) N. Test, at \(\alpha = 0.01\), whether resistance improved.

Hypotheses:

  • \(H_0: \mu = 1800\)

  • \(H_1: \mu > 1800\)

Test Statistic

\[ Z_{obs} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{1850 - 1800}{100/\sqrt{50}} = \boxed{3.54} \]

Critical value: \(Z_c = Z_{1 - \alpha} = Z_{0.99} = 2.33\)

Since \(3.54 > 2.33\), reject \(H_0\).

Critical value in terms of \(X\):

\[ X_c = Z_c \cdot \frac{\sigma}{\sqrt{n}} + \mu = 2.33 \cdot \frac{100}{\sqrt{50}} + 1800 = \boxed{1832.9} \]

Since \(1850 > 1832.9\), reject \(H_0\).
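
A minimal R sketch of this one-sided z-test:

Code
x_bar <- 1850; mu0 <- 1800; sigma <- 100; n <- 50
z_obs <- (x_bar - mu0) / (sigma / sqrt(n))  # 3.54
z_crit <- qnorm(0.99)                       # 2.33 for alpha = 0.01
z_obs > z_crit                              # TRUE: reject H0
pnorm(z_obs, lower.tail = FALSE)            # one-sided p-value, about 0.0002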

Hypothesis Testing on the Mean (Unknown Variance)

Example 2: Test on the Mean (Unknown Variance)

A group of volunteers among hospitalized patients with high cholesterol levels (at least 240 mg/dL) was selected to test the effectiveness of a cholesterol-lowering drug. 40 volunteers were treated with the drug for 60 days, and their cholesterol levels were measured again. The sample showed an average decrease in cholesterol of 6.8 with a sample standard deviation of 12.1. Use a 5% significance level.

Solution

To test if the observed reduction is significant:

\[ H_0: \mu = 0 \ \ \ \ \ \ \ \ H_1: \mu \neq 0 \]

where \(\mu\) is the mean reduction in cholesterol.

Test Statistic

The test statistic is:

\[ T_{obs} = \frac{\bar{X} - \mu}{S / \sqrt{n}} = \frac{6.8 - 0}{12.1 / \sqrt{40}} = 3.554 \]

The critical value at 5% significance with 39 degrees of freedom is:

\[ t_{39, 0.025} = 2.02 \]

Since \(|T_{obs}| > t_{c}\), we reject \(H_0\). Thus, the reduction is statistically significant — though not necessarily due to the drug alone. A placebo effect or other causes could be responsible.

\[ p\text{-value} = 2P(T_{39} > 3.554) \approx 0.001 \]
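
Since only summary statistics are given, the test can be sketched in R as follows (with the raw data, `t.test()` would do the same):

Code
x_bar <- 6.8; s <- 12.1; n <- 40
t_obs <- x_bar / (s / sqrt(n))                      # 3.554 (mu0 = 0)
t_crit <- qt(0.975, df = n - 1)                     # 2.02
abs(t_obs) > t_crit                                 # TRUE: reject H0
2 * pt(abs(t_obs), df = n - 1, lower.tail = FALSE)  # two-sided p-value, about 0.001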

Hypothesis Test on a Proportion

Example 3: Proportion Test

A subject A is tested for extrasensory abilities. They are shown 50 cards (red or blue) and asked to guess the color chosen by a subject B in another room. A guesses correctly 32 times. Can we say A has paranormal abilities at the 5% significance level?

Solution

Assume random guessing: \(p = 0.5\).

\[ H_0: p = 0.5 \\ H_1: p > 0.5 \]

This is a one-sided test.

Calculating the Test Statistic

Significance level \(\alpha = 0.05\) ⇒ critical value \(z_c = 1.645\)

Under \(H_0\), we expect:

\[ \mu = Np = 25 \ \ \ \ \ \ \ \ \sigma = \sqrt{Np(1-p)} = \sqrt{12.5} = 3.54 \]

thus, the observed statistic is

\[ z_{obs} = \frac{32 - 25}{3.54} = 1.98 \]

Since \(z_{obs} > z_c\), we reject \(H_0\): A’s performance is significantly better than chance.

The corresponding \(p\)-value is 0.024, which is lower than 0.05.
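
A sketch of the normal-approximation test in R; `binom.test()` gives an exact alternative:

Code
x <- 32; n <- 50; p0 <- 0.5
z_obs <- (x - n * p0) / sqrt(n * p0 * (1 - p0))  # 1.98
pnorm(z_obs, lower.tail = FALSE)                 # one-sided p-value, 0.024
# exact version (not used above): binom.test(32, 50, p = 0.5, alternative = "greater")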

Hypothesis Testing: Comparing Two Populations

The Importance of Control Groups

When testing the effect of a treatment (e.g., a drug), it’s important that all other factors are held constant. This way, any difference in outcomes can be attributed to the treatment itself.

This is often not feasible, so we use:

  • A treatment group (receives the drug)

  • A control group (receives a placebo)

We then test whether the difference in outcomes is statistically significant.

Comparing Means (Known Variances)

We have two independent samples:

  • \(X_1, \dots, X_n \sim N(\mu_x, \sigma_x^2)\)
  • \(Y_1, \dots, Y_m \sim N(\mu_y, \sigma_y^2)\)
  • \(X\) and \(Y\) are independent

We test:

\[ H_0: \mu_x = \mu_y \quad vs. \quad H_1: \mu_x \neq \mu_y \]

Test statistic:

\[ Z = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}}} \]

Example: testing the difference between two populations

The goal is to study the effectiveness of a new drug in reducing cholesterol levels. To test the drug, 100 volunteers were recruited and divided into two groups of 50 each. The first group was given the new drug, while the second group (the control group) was administered lovastatin, a commonly used substance for lowering cholesterol. Each volunteer was instructed to take one pill every 12 hours for three months. None of the patients knew whether they were taking the new drug or lovastatin.

The first group (which took the new drug) recorded an average cholesterol reduction of 8.8, with a sample variance of 4.5. The second group recorded an average reduction of 8.2, with a sample variance of 5.4. Do these results support the hypothesis that, at a 5% significance level, the new drug leads to a greater average reduction in cholesterol levels?

We want to test the hypotheses:

\[ H_0: \mu_x \leq \mu_y \quad \text{vs.} \quad H_1: \mu_x > \mu_y \]

The data provided are: \(\bar{X} = 8.8\), \(\bar{Y} = 8.2\), \(n = 50\), \(m = 50\), \(S^2_x = 4.5\), and \(S^2_y = 5.4\).

Example: testing the difference between two populations

The test statistic is calculated as:

\[ Z_{obs} = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{S^2_x}{n} + \frac{S^2_y}{m}}} = \frac{8.8 - 8.2}{\sqrt{\frac{4.5}{50} + \frac{5.4}{50}}} = 1.3484 \]

The critical value at a significance level of \(\alpha = 0.05\) is:

\[ Z_c = 1.645 \]

Since \(Z_{obs} < Z_c\), we do not reject the null hypothesis \(H_0\).

The p-value is:

\[ P(Z > Z_{obs}) = 0.089 \]
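
The same numbers in R (a minimal sketch using the summary statistics):

Code
x_bar <- 8.8; y_bar <- 8.2
s2_x <- 4.5; s2_y <- 5.4
n <- 50; m <- 50
z_obs <- (x_bar - y_bar) / sqrt(s2_x / n + s2_y / m)  # 1.3484
z_crit <- qnorm(0.95)                                 # 1.645
z_obs > z_crit                                        # FALSE: do not reject H0
pnorm(z_obs, lower.tail = FALSE)                      # p-value, about 0.089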

Wrapping Up

  • Probability models uncertainty

  • Inference lets us generalize from data

  • Core concepts: sample space, distributions, CI, p-values