besides counting all the combination of species and method, we can also calculate the marginal frequencies of each variable (that is, compute the rows and columns total)
Code
ufo_table |>adorn_totals(where =c("row","col"))|>kbl(caption ="Abduction Method by Alien Species", align ="c") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed", "responsive"),full_width =FALSE,position ="center") |>column_spec(1, bold =TRUE) |>column_spec(6, color ="dodgerblue") |>row_spec(5, color ="forestgreen")
Abduction Method by Alien Species
alien_species
Invisibility Cloak
Sleep Gas
Teleportation
Tractor Beam
Total
Martian
10
8
9
11
38
Reptilian
6
5
4
1
16
Venusian
3
2
8
5
18
Zeta Reticulan
5
3
10
10
28
Total
24
18
31
27
100
The row totals represent the distribution of the variable alien_species, while the column totals is the distribution of the variable abduction_method.
need for relative frequencies
Looking at the absolute value may be misleading:
an observed frequency of 11 for the Tractor Beam method is not very informative, unless we know how many abductions were made in total.
Ufo abductions
in relative terms, the cross-table can be computed as:
Code
ufo_table |>adorn_totals(where =c("row","col"))|>adorn_percentages(denominator ="all") |>adorn_rounding(digits =2) |>kbl(caption ="Abduction Method by Alien Species", align ="c") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed", "responsive"),full_width =FALSE,position ="center") |>column_spec(1, bold =TRUE) |>column_spec(6, color ="dodgerblue") |>row_spec(5, color ="forestgreen")
Abduction Method by Alien Species
alien_species
Invisibility Cloak
Sleep Gas
Teleportation
Tractor Beam
Total
Martian
0.10
0.08
0.09
0.11
0.38
Reptilian
0.06
0.05
0.04
0.01
0.16
Venusian
0.03
0.02
0.08
0.05
0.18
Zeta Reticulan
0.05
0.03
0.10
0.10
0.28
Total
0.24
0.18
0.31
0.27
1.00
need for conditional frequencies
It is often more interesting to look at the conditional relative frequencies
the Tractor Beam has been used 11 (0.11) times by Martians, 10 (0.1) times by Zeta Reticulans
can we say that the Tractor Beam is the most used method by Martians?
Ufo abductions
For each alien species, what is the relative frequency of each abduction method?
Code
ufo_table |>adorn_totals(where =c("row","col"))|>adorn_percentages(denominator ="row") |>adorn_rounding(digits =2) |>kbl(caption ="Abduction Method by Alien Species", align ="c") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed", "responsive"),full_width =FALSE,position ="center") |>column_spec(1, bold =TRUE) |>column_spec(6, color ="dodgerblue") |>row_spec(5, color ="forestgreen")
Abduction Method by Alien Species
alien_species
Invisibility Cloak
Sleep Gas
Teleportation
Tractor Beam
Total
Martian
0.26
0.21
0.24
0.29
1
Reptilian
0.38
0.31
0.25
0.06
1
Venusian
0.17
0.11
0.44
0.28
1
Zeta Reticulan
0.18
0.11
0.36
0.36
1
Total
0.24
0.18
0.31
0.27
1
row profiles
The last table shows the row profiles: the distribution of the abduction_method by alien_species.
Ufo abductions
For each abduction method, what is the relative frequency of abduction made by alien species?
Code
ufo_table |>adorn_totals(where =c("row","col"))|>adorn_percentages(denominator ="col") |>adorn_rounding(digits =2) |>kbl(caption ="Abduction Method by Alien Species", align ="c") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed", "responsive"),full_width =FALSE,position ="center") |>column_spec(1, bold =TRUE) |>column_spec(6, color ="dodgerblue") |>row_spec(5, color ="forestgreen")
Abduction Method by Alien Species
alien_species
Invisibility Cloak
Sleep Gas
Teleportation
Tractor Beam
Total
Martian
0.42
0.44
0.29
0.41
0.38
Reptilian
0.25
0.28
0.13
0.04
0.16
Venusian
0.12
0.11
0.26
0.19
0.18
Zeta Reticulan
0.21
0.17
0.32
0.37
0.28
Total
1.00
1.00
1.00
1.00
1.00
column profiles
The last table shows the column profiles: the distribution of the alien_species by abduction_method.
Ufo abductions: measuring association
asking if the two variables are associated or not is like asking
knowing the abduction method, does one have a better guess on the alien species responsible?
knowing the alien species, does one have a better guess on the abduction method used?
Two continuous variables are correlated if large values of one variable correspond to large (small) values of the other variable.
by large is meant above average; by small is meant below average
Code
library(tidyverse)set.seed(123)# 1. Strong Positive: Ice cream vs shark attacksdf1 <-tibble(ice_cream_sales =seq(100, 1000, length.out =50),shark_attacks = ice_cream_sales *0.05+rnorm(50, 0, 10),pair ="Ice Cream vs Shark Attacks")df1 |>ggplot(aes(x = ice_cream_sales, y = shark_attacks)) +geom_point(color="indianred") +# geom_smooth(method = "lm", se = FALSE, color = "blue") +labs(x ="Ice Cream Sales", y ="Shark Attacks") +#title = "Ice Cream Sales vs Shark Attacks",theme_minimal()
Sharks and ice creams: measuring correlation
building the index
the index is supposed to increase when both the variables are above average or both variables below average
the index is supposed to decrease when one variable is above average and the other is below average
the further away from the average, the higher the contribution to the correlation
Code
df1 |>mutate(# Sign of contribution to correlation (+ if both above or both below mean, - otherwise)sign_index =ifelse( (ice_cream_sales >mean(ice_cream_sales) & shark_attacks >mean(shark_attacks)) | (ice_cream_sales <mean(ice_cream_sales) & shark_attacks <mean(shark_attacks)),"+", "-" ),# Size index based on distance from means (scaled absolute contribution)size_index =abs((ice_cream_sales -mean(ice_cream_sales)) * (shark_attacks -mean(shark_attacks))) ) |>ggplot(aes(x = ice_cream_sales, y = shark_attacks, color = sign_index)) +geom_text(aes(label = sign_index, size = size_index), alpha =0.8) +geom_vline(xintercept =mean(df1$ice_cream_sales), color ="dodgerblue",size=1.5,alpha=.5) +geom_hline(yintercept =mean(df1$shark_attacks), color ="forestgreen",size=1.5,alpha=.5) +labs(# title = "Contribution Signs to Correlation",# subtitle = "Positive ( + ) when above/below both means, Negative ( - ) otherwise",x ="Ice Cream Sales",y ="Shark Attacks" ) +scale_size(range =c(3, 8)) +theme_minimal() +theme(legend.position ="none")
Sharks and ice creams: computing correlation
building the index
to compute the index, just take the average of the contributions products \(\sigma_{xy}=\sum (x_i - \bar{x})(y_i - \bar{y})\) and divide it by it’s maximum value \(\sigma_x \sigma_y\).
the linear correlation index is then \[\rho_{xy} =\frac{\sigma_{xy}}{\sigma_x \sigma_y} \ \ \ \text{and it varies between -1 and 1}\]
in the example, the value of the index is 0.82, and since the maximum is 1, it is an high value
Code
df1 |>mutate(# Sign of contribution to correlation (+ if both above or both below mean, - otherwise)sign_index =ifelse( (ice_cream_sales >mean(ice_cream_sales) & shark_attacks >mean(shark_attacks)) | (ice_cream_sales <mean(ice_cream_sales) & shark_attacks <mean(shark_attacks)),"+", "-" ),# Size index based on distance from means (scaled absolute contribution)size_index =abs((ice_cream_sales -mean(ice_cream_sales)) * (shark_attacks -mean(shark_attacks))) ) |>ggplot(aes(x = ice_cream_sales, y = shark_attacks, color = sign_index)) +geom_text(aes(label = sign_index, size = size_index), alpha =0.8) +geom_vline(xintercept =mean(df1$ice_cream_sales), color ="dodgerblue",size=1.5,alpha=.5) +geom_hline(yintercept =mean(df1$shark_attacks), color ="forestgreen",size=1.5,alpha=.5) +labs(# title = "Contribution Signs to Correlation",# subtitle = "Positive ( + ) when above/below both means, Negative ( - ) otherwise",x ="Ice Cream Sales",y ="Shark Attacks" ) +scale_size(range =c(3, 8)) +theme_minimal() +theme(legend.position ="none")
beachgoers and sunburns: measuring correlation
Code
# 2. Moderate Positive: Beachgoers vs sunburnsdf2 <-tibble(beachgoers =seq(50, 500, length.out =50),sunburns = beachgoers *0.045+rnorm(50, 0, 20),pair ="Beachgoers vs Sunburns")df2 |>mutate(# Sign of contribution to correlation (+ if both above or both below mean, - otherwise)sign_index =ifelse( (beachgoers >mean(beachgoers) & sunburns >mean(sunburns)) | (beachgoers <mean(beachgoers) & sunburns <mean(sunburns)),"+", "-" ),# Size index based on distance from means (scaled absolute contribution)size_index =abs((beachgoers -mean(beachgoers)) * (sunburns -mean(sunburns))) ) |>ggplot(aes(x = beachgoers, y = sunburns, color = sign_index)) +geom_text(aes(label = sign_index, size = size_index), alpha =0.8) +geom_vline(xintercept =mean(df2$beachgoers), color ="dodgerblue",size=1.5,alpha=.5) +geom_hline(yintercept =mean(df2$sunburns), color ="forestgreen",size=1.5,alpha=.5) +labs(# title = "Contribution Signs to Correlation",# subtitle = "Positive ( + ) when above/below both means, Negative ( - ) otherwise",x ="beachgoers",y ="sunburns" ) +scale_size(range =c(3, 8)) +theme_minimal() +theme(legend.position ="none")
in the example, the value of the index is 0.42
ufo sightings and potato sales: measuring correlation
Code
# 2. Moderate Positive: Beachgoers vs sunburnsdf3 <-tibble(ufo_sightings =rnorm(50, 20, 5),potato_sales =rnorm(50, 500, 30),pair ="UFO Sightings vs Potato Sales")df3 |>mutate(# Sign of contribution to correlation (+ if both above or both below mean, - otherwise)sign_index =ifelse( (ufo_sightings >mean(ufo_sightings) & potato_sales >mean(potato_sales)) | (ufo_sightings <mean(ufo_sightings) & potato_sales <mean(potato_sales)),"+", "-" ),# Size index based on distance from means (scaled absolute contribution)size_index =abs((ufo_sightings -mean(ufo_sightings)) * (potato_sales -mean(potato_sales))) ) |>ggplot(aes(x = ufo_sightings, y = potato_sales, color = sign_index)) +geom_text(aes(label = sign_index, size = size_index), alpha =0.8) +geom_vline(xintercept =mean(df3$ufo_sightings), color ="dodgerblue",size=1.5,alpha=.5) +geom_hline(yintercept =mean(df3$potato_sales), color ="forestgreen",size=1.5,alpha=.5) +labs(# title = "Contribution Signs to Correlation",# subtitle = "Positive ( + ) when above/below both means, Negative ( - ) otherwise",x ="ufo_sightings",y ="potato_sales" ) +scale_size(range =c(3, 8)) +theme_minimal() +theme(legend.position ="none")
in the example, the value of the index is -0.01
Alien documentaries vs social life: measuring correlation
Code
df4 <-tibble(doc_hours =seq(0, 10, length.out =50),social_life =100- doc_hours *8+rnorm(50, 0, 5),pair ="Alien Docs vs Social Life")df4 |>mutate(# Sign of contribution to correlation (+ if both above or both below mean, - otherwise)sign_index =ifelse( (doc_hours >mean(doc_hours) & social_life >mean(social_life)) | (doc_hours <mean(doc_hours) & social_life <mean(social_life)),"+", "-" ),# Size index based on distance from means (scaled absolute contribution)size_index =abs((doc_hours -mean(doc_hours)) * (social_life -mean(social_life))) ) |>ggplot(aes(x = doc_hours, y = social_life, color = sign_index)) +geom_text(aes(label = sign_index, size = size_index), alpha =0.8) +geom_vline(xintercept =mean(df4$doc_hours), color ="dodgerblue",size=1.5,alpha=.5) +geom_hline(yintercept =mean(df4$social_life), color ="forestgreen",size=1.5,alpha=.5) +labs(# title = "Contribution Signs to Correlation",# subtitle = "Positive ( + ) when above/below both means, Negative ( - ) otherwise",x ="doc_hours",y ="social_life" ) +scale_size(range =c(3, 8)) +theme_minimal() +theme(legend.position ="none")
in the example, the value of the index is -0.98
Nonlinear correlation: the correlation index will fail
Code
# Simulate non-linear relationship (parabolic)set.seed(123)df_nonlinear <-tibble(x =seq(-10, 10, length.out =100),y = x^2+rnorm(100, 0, 5) # strong but non-linear)# Plotggplot(df_nonlinear, aes(x = x, y = y)) +geom_point(color ="indianred", alpha =0.7) +geom_vline(xintercept =mean(df_nonlinear$x), color ="dodgerblue",size=1.5,alpha=.5) +geom_hline(yintercept =mean(df_nonlinear$y), color ="forestgreen",size=1.5,alpha=.5)+# geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "blue") + # linear fit# geom_smooth(method = "loess", se = FALSE, color = "green") + # non-linear fitlabs(title ="Example of Non-Linear Correlation",# subtitle = "Quadratic relationship: linear fit fails, LOESS succeeds",x ="Variable X",y ="Variable Y" ) +theme_minimal()
library(tidyverse)library(broom)# Simulate dataset.seed(42)alien_data <-tibble(alien_species =rep(c("Martian", "Venusian", "Zeta Reticulan", "Reptilian"), each =30),laser_power =c(rnorm(30, mean =100, sd =10), # Martians like moderaternorm(30, mean =130, sd =15), # Venusians love powerful beamsrnorm(30, mean =90, sd =12), # Zetas are conservativernorm(30, mean =110, sd =8) # Reptilians are balanced ))alien_groups=alien_data |>group_by(alien_species) |>summarise(mean_laser_power =mean(laser_power),size_alien =n() ) # Visualizealien_data |>ggplot(aes(x = alien_species, y = laser_power, fill = alien_species)) +# geom_violin(trim = FALSE, alpha = 0.7) +geom_jitter(aes(color=alien_species),width =0.05, alpha =0.5, size =1.5) +geom_boxplot(width =0.1, outlier.shape =NA, color ="black") +labs(title ="Laser Power Preferences by Alien Species",x ="Alien Species",y ="Laser Gun Power Setting" ) +theme_minimal() +theme(legend.position ="none")
Code
# Run ANOVAanova_model <-aov(laser_power ~ alien_species, data = alien_data)anova_summary <-summary(anova_model)# Calculate eta squared manuallyss_total <-sum((alien_data$laser_power -mean(alien_data$laser_power))^2)ss_between <-sum(alien_groups$size_alien * (alien_groups$mean_laser_power -mean(alien_data$laser_power))^2)eta_squared <- ss_between / ss_total
The value for the index in this example is 0.57
Alien species and sleep hours
Code
library(tidyverse)library(broom)library(glue)# Simulate data — no meaningful group differencesset.seed(1337)alien_sleep <-tibble(alien_species =rep(c("Martian", "Venusian", "Zeta Reticulan", "Reptilian"), each =30),sleep_hours =rnorm(120, mean =8, sd =1.5) # same distribution for all)# Plot distributionsggplot(alien_sleep, aes(x = alien_species, y = sleep_hours, fill = alien_species)) +geom_jitter(aes(color=alien_species),width =0.05, alpha =0.5, size =1.5) +geom_boxplot(width =0.1, outlier.shape =NA, color ="black") +labs(title ="Sleep Hours by Alien Species (No Real Difference)",x ="Alien Species",y ="Sleep Hours" ) +theme_minimal() +theme(legend.position ="none")