Essential statistics across disciplines

bivariate interdependence

Alfonso Iodice D’Enza

Interdependence and Dependence

If the components of a bivariate variable (X,Y) play the same role in the analysis, we study their interdependence.

If, instead, we are interested in how Y changes with respect to X, we refer to the dependence of Y on X.

Y is called the dependent variable
X is called the independent variable

Measure of Interdependence

Given a bivariate variable (X,Y), the type of association between components X and Y is defined as:

association if X and Y are categorical variables
correlation if X and Y are continuous variables

Two categorical variables

Ufo abductions

A sci-fi example: let’s study the interdependence of the species of aliens and the method of abduction.

The most common way to describe the joint distribution of a pair of categorical variables is the joint frequency table.

Code

# Load tidyverse
library(tidyverse)
library(janitor)
library(kableExtra)
# Set seed for reproducibility
set.seed(42)

# Simulate data
ufo_data <- tibble(
  alien_species = sample(c("Martian", "Venusian", "Zeta Reticulan", "Reptilian"),
                         size = 100, replace = TRUE, prob = c(0.4, 0.2, 0.3, 0.1)),
  abduction_method = sample(c("Tractor Beam", "Teleportation", "Sleep Gas", "Invisibility Cloak"),
                            size = 100, replace = TRUE)
)

# View a few rows
# ufo_data %>% head()

# Create joint frequency table
ufo_table <- ufo_data |> 
  tabyl(alien_species, abduction_method)


ufo_table %>%
  kbl(caption = "Abduction Method by Alien Species", align = "c") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = FALSE,
                position = "center") |> column_spec(1, bold = TRUE)

Abduction Method by Alien Species
alien_species	Invisibility Cloak	Sleep Gas	Teleportation	Tractor Beam
Martian	10	8	9	11
Reptilian	6	5	4	1
Venusian	3	2	8	5
Zeta Reticulan	5	3	10	10

Ufo abductions

besides counting all the combination of species and method, we can also calculate the marginal frequencies of each variable (that is, compute the rows and columns total)

Code

ufo_table |> adorn_totals(where = c("row","col"))|> 
  kbl(caption = "Abduction Method by Alien Species", align = "c") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = FALSE,
                position = "center") |> 
  column_spec(1, bold = TRUE) |> 
  column_spec(6, color = "dodgerblue") |> 
  row_spec(5, color = "forestgreen")

Abduction Method by Alien Species
alien_species	Invisibility Cloak	Sleep Gas	Teleportation	Tractor Beam	Total
Martian	10	8	9	11	38
Reptilian	6	5	4	1	16
Venusian	3	2	8	5	18
Zeta Reticulan	5	3	10	10	28
Total	24	18	31	27	100

The row totals represent the distribution of the variable alien_species, while the column totals is the distribution of the variable abduction_method.

need for relative frequencies

Looking at the absolute value may be misleading:

an observed frequency of 11 for the Tractor Beam method is not very informative, unless we know how many abductions were made in total.

Ufo abductions

in relative terms, the cross-table can be computed as:

Code

ufo_table |> 
  adorn_totals(where = c("row","col"))|> 
  adorn_percentages(denominator = "all") |>
  adorn_rounding(digits = 2) |>
  kbl(caption = "Abduction Method by Alien Species", align = "c") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = FALSE,
                position = "center") |> 
  column_spec(1, bold = TRUE) |> 
  column_spec(6, color = "dodgerblue") |> 
  row_spec(5, color = "forestgreen")

Abduction Method by Alien Species
alien_species	Invisibility Cloak	Sleep Gas	Teleportation	Tractor Beam	Total
Martian	0.10	0.08	0.09	0.11	0.38
Reptilian	0.06	0.05	0.04	0.01	0.16
Venusian	0.03	0.02	0.08	0.05	0.18
Zeta Reticulan	0.05	0.03	0.10	0.10	0.28
Total	0.24	0.18	0.31	0.27	1.00

need for conditional frequencies

It is often more interesting to look at the conditional relative frequencies

the Tractor Beam has been used 11 (0.11) times by Martians, 10 (0.1) times by Zeta Reticulans
can we say that the Tractor Beam is the most used method by Martians?

Ufo abductions

For each alien species, what is the relative frequency of each abduction method?

Code

  ufo_table |> 
  adorn_totals(where = c("row","col"))|> 
  adorn_percentages(denominator = "row") |>
  adorn_rounding(digits = 2) |>
  kbl(caption = "Abduction Method by Alien Species", align = "c") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = FALSE,
                position = "center") |> 
  column_spec(1, bold = TRUE) |> 
  column_spec(6, color = "dodgerblue") |> 
  row_spec(5, color = "forestgreen")

Abduction Method by Alien Species
alien_species	Invisibility Cloak	Sleep Gas	Teleportation	Tractor Beam	Total
Martian	0.26	0.21	0.24	0.29	1
Reptilian	0.38	0.31	0.25	0.06	1
Venusian	0.17	0.11	0.44	0.28	1
Zeta Reticulan	0.18	0.11	0.36	0.36	1
Total	0.24	0.18	0.31	0.27	1

row profiles

The last table shows the row profiles: the distribution of the abduction_method by alien_species.

Ufo abductions

For each abduction method, what is the relative frequency of abduction made by alien species?

Code

  ufo_table |> 
  adorn_totals(where = c("row","col"))|> 
  adorn_percentages(denominator = "col") |>
  adorn_rounding(digits = 2) |>
  kbl(caption = "Abduction Method by Alien Species", align = "c") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = FALSE,
                position = "center") |> 
  column_spec(1, bold = TRUE) |> 
  column_spec(6, color = "dodgerblue") |> 
  row_spec(5, color = "forestgreen")

Abduction Method by Alien Species
alien_species	Invisibility Cloak	Sleep Gas	Teleportation	Tractor Beam	Total
Martian	0.42	0.44	0.29	0.41	0.38
Reptilian	0.25	0.28	0.13	0.04	0.16
Venusian	0.12	0.11	0.26	0.19	0.18
Zeta Reticulan	0.21	0.17	0.32	0.37	0.28
Total	1.00	1.00	1.00	1.00	1.00

column profiles

The last table shows the column profiles: the distribution of the alien_species by abduction_method.

Ufo abductions: measuring association

asking if the two variables are associated or not is like asking

knowing the abduction method, does one have a better guess on the alien species responsible?
knowing the alien species, does one have a better guess on the abduction method used?

Suppose one observes these row profiles

Code

library(tidyverse)
library(janitor)

# Step 1: Simulate data
set.seed(42)
ufo_data <- tibble(
  alien_species = sample(c("Martian", "Venusian", "Zeta Reticulan", "Reptilian"),
                         size = 100, replace = TRUE, prob = c(0.4, 0.2, 0.3, 0.1)),
  abduction_method = sample(c("Tractor Beam", "Teleportation", "Sleep Gas", "Invisibility Cloak"),
                            size = 100, replace = TRUE)
)

# Step 2: Create observed tabyl with totals
observed_tab <- ufo_data %>%
  tabyl(alien_species, abduction_method) %>%
  adorn_totals("both")

# Step 3: Extract row and column totals
row_totals <- observed_tab %>%
  filter(alien_species != "Total") %>%
  pull(Total)

col_totals <- observed_tab %>%
  filter(alien_species == "Total") %>%
  select(-alien_species, -Total) %>%
  as.numeric()

grand_total <- sum(row_totals)

# Step 4: Compute expected frequencies under independence
expected_matrix <- outer(row_totals, col_totals) / grand_total
dimnames(expected_matrix) <- list(
  observed_tab$alien_species[observed_tab$alien_species != "Total"],
  names(observed_tab)[2:(ncol(observed_tab) - 1)]
)

# Step 5: Convert expected matrix to tabyl-style tibble
expected_tab <- as_tibble(expected_matrix, rownames = "alien_species") %>%
  adorn_totals("both") 


# Print expected frequencies under independence
expected_tab |> adorn_percentages("row") |>  
  kbl(caption = "Abduction Method by Alien Species", align = "c") |> 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = FALSE,
                position = "center") |> 
  column_spec(1, bold = TRUE) |> 
  column_spec(6, color = "dodgerblue") |> 
  row_spec(5, color = "forestgreen")

Abduction Method by Alien Species
alien_species	Invisibility Cloak	Sleep Gas	Teleportation	Tractor Beam	Total
Martian	0.24	0.18	0.31	0.27	1
Reptilian	0.24	0.18	0.31	0.27	1
Venusian	0.24	0.18	0.31	0.27	1
Zeta Reticulan	0.24	0.18	0.31	0.27	1
Total	0.24	0.18	0.31	0.27	1

Ufo abductions: measuring association

asking if the two variables are associated or not is like asking

knowing the abduction method, does one have a better guess on the alien species responsible?
knowing the alien species, does one have a better guess on the abduction method used?

Suppose one observes these column profiles

Code

expected_tab |> adorn_percentages("col") |>  
  kbl(caption = "Abduction Method by Alien Species", align = "c") |> 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = FALSE,
                position = "center") |> 
  column_spec(1, bold = TRUE) |> 
  column_spec(6, color = "dodgerblue") |> 
  row_spec(5, color = "forestgreen")

Abduction Method by Alien Species
alien_species	Invisibility Cloak	Sleep Gas	Teleportation	Tractor Beam	Total
Martian	0.38	0.38	0.38	0.38	0.38
Reptilian	0.16	0.16	0.16	0.16	0.16
Venusian	0.18	0.18	0.18	0.18	0.18
Zeta Reticulan	0.28	0.28	0.28	0.28	0.28
Total	1.00	1.00	1.00	1.00	1.00

Ufo abductions: measuring association

the previous tables have been computed from this joint distribution

Code

expected_tab |> adorn_percentages("all") |>  
  kbl(caption = "Abduction Method by Alien Species", align = "c") |> 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = FALSE,
                position = "center") |> 
  column_spec(1, bold = TRUE) |> 
  column_spec(6, color = "dodgerblue") |> 
  row_spec(5, color = "forestgreen")

Abduction Method by Alien Species
alien_species	Invisibility Cloak	Sleep Gas	Teleportation	Tractor Beam	Total
Martian	0.0912	0.0684	0.1178	0.1026	0.38
Reptilian	0.0384	0.0288	0.0496	0.0432	0.16
Venusian	0.0432	0.0324	0.0558	0.0486	0.18
Zeta Reticulan	0.0672	0.0504	0.0868	0.0756	0.28
Total	0.2400	0.1800	0.3100	0.2700	1.00

Ufo abductions: measuring association

and the previous joint distribution has been computed from the marginal distributions of the example we saw before

Code

ufo_table |> 
  adorn_totals(where = c("row","col"))|> 
  adorn_percentages(denominator = "all") |>
  adorn_rounding(digits = 2) |>
  kbl(caption = "Abduction Method by Alien Species", align = "c") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = FALSE,
                position = "center") |> 
  column_spec(1, bold = TRUE) |> 
  column_spec(6, color = "dodgerblue") |> 
  row_spec(5, color = "forestgreen")

Abduction Method by Alien Species
alien_species	Invisibility Cloak	Sleep Gas	Teleportation	Tractor Beam	Total
Martian	0.10	0.08	0.09	0.11	0.38
Reptilian	0.06	0.05	0.04	0.01	0.16
Venusian	0.03	0.02	0.08	0.05	0.18
Zeta Reticulan	0.05	0.03	0.10	0.10	0.28
Total	0.24	0.18	0.31	0.27	1.00

Ufo abductions: measuring association

observed frequencies

Code

ufo_table |> 
  adorn_totals(where = c("row","col"))|> 
  adorn_percentages(denominator = "all") |>
  adorn_rounding(digits = 2) |>
  kbl(caption = "Abduction Method by Alien Species", align = "c") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = FALSE,
                position = "center") |> 
  column_spec(1, bold = TRUE) |> 
  column_spec(6, color = "dodgerblue") |> 
  row_spec(5, color = "forestgreen")

Abduction Method by Alien Species
alien_species	Invisibility Cloak	Sleep Gas	Teleportation	Tractor Beam	Total
Martian	0.10	0.08	0.09	0.11	0.38
Reptilian	0.06	0.05	0.04	0.01	0.16
Venusian	0.03	0.02	0.08	0.05	0.18
Zeta Reticulan	0.05	0.03	0.10	0.10	0.28
Total	0.24	0.18	0.31	0.27	1.00

expercted frequencies under independence

Code

expected_tab |> adorn_percentages("all") |>  
  adorn_rounding(digits = 2) |>
  kbl(caption = "Abduction Method by Alien Species", align = "c") |> 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = FALSE,
                position = "center") |> 
  column_spec(1, bold = TRUE) |> 
  column_spec(6, color = "dodgerblue") |> 
  row_spec(5, color = "forestgreen")

Abduction Method by Alien Species
alien_species	Invisibility Cloak	Sleep Gas	Teleportation	Tractor Beam	Total
Martian	0.09	0.07	0.12	0.10	0.38
Reptilian	0.04	0.03	0.05	0.04	0.16
Venusian	0.04	0.03	0.06	0.05	0.18
Zeta Reticulan	0.07	0.05	0.09	0.08	0.28
Total	0.24	0.18	0.31	0.27	1.00

the expected frequencies are obtained by multiplying the margins: e.g.
- the expected frequency of Martian and Tractor Beam is \(.38\times .27 =\) 0.1
- the expected frequency of Venusian and Teleportation is \(.18\times .31 =\) 0.06

Ufo abductions: measuring association

observed frequencies

Code

ufo_table |> 
  adorn_totals(where = c("row","col"))|> 
  adorn_percentages(denominator = "all") |>
  adorn_rounding(digits = 2) |>
  kbl(caption = "Abduction Method by Alien Species", align = "c") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = FALSE,
                position = "center") |> 
  column_spec(1, bold = TRUE) |> 
  column_spec(6, color = "dodgerblue") |> 
  row_spec(5, color = "forestgreen")

Abduction Method by Alien Species
alien_species	Invisibility Cloak	Sleep Gas	Teleportation	Tractor Beam	Total
Martian	0.10	0.08	0.09	0.11	0.38
Reptilian	0.06	0.05	0.04	0.01	0.16
Venusian	0.03	0.02	0.08	0.05	0.18
Zeta Reticulan	0.05	0.03	0.10	0.10	0.28
Total	0.24	0.18	0.31	0.27	1.00

expercted frequencies under independence

Code

expected_tab |> adorn_percentages("all") |>  
  adorn_rounding(digits = 2) |>
  kbl(caption = "Abduction Method by Alien Species", align = "c") |> 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = FALSE,
                position = "center") |> 
  column_spec(1, bold = TRUE) |> 
  column_spec(6, color = "dodgerblue") |> 
  row_spec(5, color = "forestgreen")

Abduction Method by Alien Species
alien_species	Invisibility Cloak	Sleep Gas	Teleportation	Tractor Beam	Total
Martian	0.09	0.07	0.12	0.10	0.38
Reptilian	0.04	0.03	0.05	0.04	0.16
Venusian	0.04	0.03	0.06	0.05	0.18
Zeta Reticulan	0.07	0.05	0.09	0.08	0.28
Total	0.24	0.18	0.31	0.27	1.00

the higher the difference between observed and expected frequencies, the higher the association
- the mere difference between frequencies would depend on their order of magnitude; also, the sign of the difference does not matter

Ufo abductions: measuring association

to measure the association one computes, for each cell of the table, the residuals as

\[\frac{\text{observed freq}-\text{expected freq}}{\sqrt{\text{expected freq}}}\]

Code

exp_tab = expected_tab |> adorn_percentages("all") |>  
  adorn_rounding(digits = 2)|> select(-1,-Total) |> data.matrix()

obs_tab = ufo_table |> 
  adorn_totals(where = c("row","col"))|> 
  adorn_percentages(denominator = "all") |>
  adorn_rounding(digits = 2)|> select(-1,-Total) |> data.matrix()

chi_resid_mat = (obs_tab -exp_tab)/sqrt(exp_tab)


chi_resid_tab <- as.data.frame(chi_resid_mat) |> slice(-5) |> 
  rownames_to_column("alien_species") |> 
  mutate(alien_species = ufo_table |> 
           filter(alien_species!="Total") |> 
           pull(alien_species)
         )


chi_resid_tab |> 
  pivot_longer(-alien_species, names_to = "abduction_method", values_to = "residual") |> 
  ggplot(aes(x = abduction_method, y = alien_species, fill = residual)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "dodgerblue", mid = "white", high = "indianred", midpoint = 0) +
  labs(title = "Chi-Squared Pearson Residuals",
       x = "Abduction Method", y = "Alien Species", fill = "Residual") +
  theme_minimal()

the sum of all the squared residuals is the chi-square statistic

Code

chisq.test(table(ufo_data), correct = FALSE) |> 
  broom::tidy() |> 
  select(statistic, parameter, p.value) |> 
  mutate(across(everything(), ~ round(.x, 2))) |> kbl()

statistic	parameter	p.value
10.69	9	0.3

Two continuous variables

Sharks and ice creams

Two continuous variables are correlated if large values of one variable correspond to large (small) values of the other variable.

by large is meant above average; by small is meant below average

Code

library(tidyverse)

set.seed(123)

# 1. Strong Positive: Ice cream vs shark attacks
df1 <- tibble(
  ice_cream_sales = seq(100, 1000, length.out = 50),
  shark_attacks = ice_cream_sales * 0.05 + rnorm(50, 0, 10),
  pair = "Ice Cream vs Shark Attacks"
)

df1 |> 
  ggplot(aes(x = ice_cream_sales, y = shark_attacks)) +
  geom_point(color="indianred") +
  # geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(x = "Ice Cream Sales", y = "Shark Attacks") + #title = "Ice Cream Sales vs Shark Attacks",
  theme_minimal()

Sharks and ice creams: measuring correlation

building the index

the index is supposed to increase when both the variables are above average or both variables below average
the index is supposed to decrease when one variable is above average and the other is below average
the further away from the average, the higher the contribution to the correlation

Code

df1 |> 
  mutate(
    # Sign of contribution to correlation (+ if both above or both below mean, - otherwise)
    sign_index = ifelse(
      (ice_cream_sales > mean(ice_cream_sales) & shark_attacks > mean(shark_attacks)) |
      (ice_cream_sales < mean(ice_cream_sales) & shark_attacks < mean(shark_attacks)),
      "+", "-"
    ),
    # Size index based on distance from means (scaled absolute contribution)
    size_index = abs((ice_cream_sales - mean(ice_cream_sales)) * 
                     (shark_attacks - mean(shark_attacks)))
  ) |>
  ggplot(aes(x = ice_cream_sales, y = shark_attacks, color = sign_index)) +
  geom_text(aes(label = sign_index, size = size_index), alpha = 0.8) +
  geom_vline(xintercept = mean(df1$ice_cream_sales), color = "dodgerblue",size=1.5,alpha=.5) +
  geom_hline(yintercept = mean(df1$shark_attacks), color = "forestgreen",size=1.5,alpha=.5) +
  labs(
    # title = "Contribution Signs to Correlation",
    # subtitle = "Positive ( + ) when above/below both means, Negative ( - ) otherwise",
    x = "Ice Cream Sales",
    y = "Shark Attacks"
  ) +
  scale_size(range = c(3, 8)) +
  theme_minimal() +
  theme(legend.position = "none")

Sharks and ice creams: computing correlation

building the index

to compute the index, just take the average of the contributions products \(\sigma_{xy}=\sum (x_i - \bar{x})(y_i - \bar{y})\) and divide it by it’s maximum value \(\sigma_x \sigma_y\).
the linear correlation index is then \[\rho_{xy} =\frac{\sigma_{xy}}{\sigma_x \sigma_y} \ \ \ \text{and it varies between -1 and 1}\]
in the example, the value of the index is 0.82, and since the maximum is 1, it is an high value

Code

df1 |> 
  mutate(
    # Sign of contribution to correlation (+ if both above or both below mean, - otherwise)
    sign_index = ifelse(
      (ice_cream_sales > mean(ice_cream_sales) & shark_attacks > mean(shark_attacks)) |
      (ice_cream_sales < mean(ice_cream_sales) & shark_attacks < mean(shark_attacks)),
      "+", "-"
    ),
    # Size index based on distance from means (scaled absolute contribution)
    size_index = abs((ice_cream_sales - mean(ice_cream_sales)) * 
                     (shark_attacks - mean(shark_attacks)))
  ) |>
  ggplot(aes(x = ice_cream_sales, y = shark_attacks, color = sign_index)) +
  geom_text(aes(label = sign_index, size = size_index), alpha = 0.8) +
  geom_vline(xintercept = mean(df1$ice_cream_sales), color = "dodgerblue",size=1.5,alpha=.5) +
  geom_hline(yintercept = mean(df1$shark_attacks), color = "forestgreen",size=1.5,alpha=.5) +
  labs(
    # title = "Contribution Signs to Correlation",
    # subtitle = "Positive ( + ) when above/below both means, Negative ( - ) otherwise",
    x = "Ice Cream Sales",
    y = "Shark Attacks"
  ) +
  scale_size(range = c(3, 8)) +
  theme_minimal() +
  theme(legend.position = "none")

beachgoers and sunburns: measuring correlation

Code

# 2. Moderate Positive: Beachgoers vs sunburns
df2 <- tibble(
  beachgoers = seq(50, 500, length.out = 50),
  sunburns = beachgoers * 0.045 + rnorm(50, 0, 20),
  pair = "Beachgoers vs Sunburns"
)

df2 |> 
  mutate(
    # Sign of contribution to correlation (+ if both above or both below mean, - otherwise)
    sign_index = ifelse(
      (beachgoers > mean(beachgoers) & sunburns > mean(sunburns)) |
      (beachgoers < mean(beachgoers) & sunburns < mean(sunburns)),
      "+", "-"
    ),
    # Size index based on distance from means (scaled absolute contribution)
    size_index = abs((beachgoers - mean(beachgoers)) * 
                     (sunburns - mean(sunburns)))
  ) |>
  ggplot(aes(x = beachgoers, y = sunburns, color = sign_index)) +
  geom_text(aes(label = sign_index, size = size_index), alpha = 0.8) +
  geom_vline(xintercept = mean(df2$beachgoers), color = "dodgerblue",size=1.5,alpha=.5) +
  geom_hline(yintercept = mean(df2$sunburns), color = "forestgreen",size=1.5,alpha=.5) +
  labs(
    # title = "Contribution Signs to Correlation",
    # subtitle = "Positive ( + ) when above/below both means, Negative ( - ) otherwise",
    x = "beachgoers",
    y = "sunburns"
  ) +
  scale_size(range = c(3, 8)) +
  theme_minimal() +
  theme(legend.position = "none")

in the example, the value of the index is 0.42

ufo sightings and potato sales: measuring correlation

Code

# 2. Moderate Positive: Beachgoers vs sunburns
df3 <- tibble(
  ufo_sightings = rnorm(50, 20, 5),
  potato_sales = rnorm(50, 500, 30),
  pair = "UFO Sightings vs Potato Sales"
)


df3 |> 
  mutate(
    # Sign of contribution to correlation (+ if both above or both below mean, - otherwise)
    sign_index = ifelse(
      (ufo_sightings > mean(ufo_sightings) & potato_sales > mean(potato_sales)) |
      (ufo_sightings < mean(ufo_sightings) & potato_sales < mean(potato_sales)),
      "+", "-"
    ),
    # Size index based on distance from means (scaled absolute contribution)
    size_index = abs((ufo_sightings - mean(ufo_sightings)) * 
                     (potato_sales - mean(potato_sales)))
  ) |>
  ggplot(aes(x = ufo_sightings, y = potato_sales, color = sign_index)) +
  geom_text(aes(label = sign_index, size = size_index), alpha = 0.8) +
  geom_vline(xintercept = mean(df3$ufo_sightings), color = "dodgerblue",size=1.5,alpha=.5) +
  geom_hline(yintercept = mean(df3$potato_sales), color = "forestgreen",size=1.5,alpha=.5) +
  labs(
    # title = "Contribution Signs to Correlation",
    # subtitle = "Positive ( + ) when above/below both means, Negative ( - ) otherwise",
    x = "ufo_sightings",
    y = "potato_sales"
  ) +
  scale_size(range = c(3, 8)) +
  theme_minimal() +
  theme(legend.position = "none")

in the example, the value of the index is -0.01

Alien documentaries vs social life: measuring correlation

Code

df4 <- tibble(
  doc_hours = seq(0, 10, length.out = 50),
  social_life = 100 - doc_hours * 8 + rnorm(50, 0, 5),
  pair = "Alien Docs vs Social Life"
)

df4 |> 
  mutate(
    # Sign of contribution to correlation (+ if both above or both below mean, - otherwise)
    sign_index = ifelse(
      (doc_hours > mean(doc_hours) & social_life > mean(social_life)) |
      (doc_hours < mean(doc_hours) & social_life < mean(social_life)),
      "+", "-"
    ),
    # Size index based on distance from means (scaled absolute contribution)
    size_index = abs((doc_hours - mean(doc_hours)) * 
                     (social_life - mean(social_life)))
  ) |>
  ggplot(aes(x = doc_hours, y = social_life, color = sign_index)) +
  geom_text(aes(label = sign_index, size = size_index), alpha = 0.8) +
  geom_vline(xintercept = mean(df4$doc_hours), color = "dodgerblue",size=1.5,alpha=.5) +
  geom_hline(yintercept = mean(df4$social_life), color = "forestgreen",size=1.5,alpha=.5) +
  labs(
    # title = "Contribution Signs to Correlation",
    # subtitle = "Positive ( + ) when above/below both means, Negative ( - ) otherwise",
    x = "doc_hours",
    y = "social_life"
  ) +
  scale_size(range = c(3, 8)) +
  theme_minimal() +
  theme(legend.position = "none")

in the example, the value of the index is -0.98

Nonlinear correlation: the correlation index will fail

Code

# Simulate non-linear relationship (parabolic)
set.seed(123)
df_nonlinear <- tibble(
  x = seq(-10, 10, length.out = 100),
  y = x^2 + rnorm(100, 0, 5)  # strong but non-linear
)


# Plot
ggplot(df_nonlinear, aes(x = x, y = y)) +
  geom_point(color = "indianred", alpha = 0.7) +
  geom_vline(xintercept = mean(df_nonlinear$x), color = "dodgerblue",size=1.5,alpha=.5) +
  geom_hline(yintercept = mean(df_nonlinear$y), color = "forestgreen",size=1.5,alpha=.5)+
  # geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "blue") +  # linear fit
  # geom_smooth(method = "loess", se = FALSE, color = "green") +  # non-linear fit
  labs(
    title = "Example of Non-Linear Correlation",
    # subtitle = "Quadratic relationship: linear fit fails, LOESS succeeds",
    x = "Variable X",
    y = "Variable Y"
  ) +
  theme_minimal()

in the example, the value of the index is 0.01

One continuous, one categorical

Alien species and laser power preference

\[\eta^2 = \frac{BetweenSS}{TotalSS}=\frac{\sum_{j=1}^{k} n_j (\bar{Y}j - \bar{Y})^2}{\sum_{i=1}^{N}(Y_i - \bar{Y})^2}\]

Code

library(tidyverse)
library(broom)

# Simulate data
set.seed(42)
alien_data <- tibble(
  alien_species = rep(c("Martian", "Venusian", "Zeta Reticulan", "Reptilian"), each = 30),
  laser_power = c(
    rnorm(30, mean = 100, sd = 10),  # Martians like moderate
    rnorm(30, mean = 130, sd = 15),  # Venusians love powerful beams
    rnorm(30, mean = 90, sd = 12),   # Zetas are conservative
    rnorm(30, mean = 110, sd = 8)    # Reptilians are balanced
  )
)

alien_groups=alien_data |> group_by(alien_species) |> 
  summarise(
    mean_laser_power = mean(laser_power),
    size_alien = n()
  ) 



# Visualize
alien_data |> ggplot(aes(x = alien_species, y = laser_power, fill = alien_species)) +
  # geom_violin(trim = FALSE, alpha = 0.7) +
  geom_jitter(aes(color=alien_species),width = 0.05, alpha = 0.5, size = 1.5) +
  geom_boxplot(width = 0.1, outlier.shape = NA, color = "black") +
  labs(
    title = "Laser Power Preferences by Alien Species",
    x = "Alien Species",
    y = "Laser Gun Power Setting"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Code

# Run ANOVA
anova_model <- aov(laser_power ~ alien_species, data = alien_data)
anova_summary <- summary(anova_model)

# Calculate eta squared manually
ss_total <- sum((alien_data$laser_power - mean(alien_data$laser_power))^2)
ss_between <- sum(alien_groups$size_alien * (alien_groups$mean_laser_power - mean(alien_data$laser_power))^2)
eta_squared <- ss_between / ss_total

The value for the index in this example is 0.57

Alien species and sleep hours

Code

library(tidyverse)
library(broom)
library(glue)

# Simulate data — no meaningful group differences
set.seed(1337)
alien_sleep <- tibble(
  alien_species = rep(c("Martian", "Venusian", "Zeta Reticulan", "Reptilian"), each = 30),
  sleep_hours = rnorm(120, mean = 8, sd = 1.5)  # same distribution for all
)

# Plot distributions
ggplot(alien_sleep, aes(x = alien_species, y = sleep_hours, fill = alien_species)) +
  geom_jitter(aes(color=alien_species),width = 0.05, alpha = 0.5, size = 1.5) +
  geom_boxplot(width = 0.1, outlier.shape = NA, color = "black") +
  labs(
    title = "Sleep Hours by Alien Species (No Real Difference)",
    x = "Alien Species",
    y = "Sleep Hours"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Code

# ANOVA
anova_model <- aov(sleep_hours ~ alien_species, data = alien_sleep)
anova_summary <- summary(anova_model)

# Compute eta squared
ss_total <- sum((alien_sleep$sleep_hours - mean(alien_sleep$sleep_hours))^2)
ss_between <- anova_summary[[1]][["Sum Sq"]][1]
eta_squared <- ss_between / ss_total

The value for the index in this example is 0.01