Part I — Theory and guided demonstration

What is statistics?

“Statistics is the science of learning from data.”
David Hand

“Statistics is the grammar of science.”
Karl Pearson

For this course, the practical question is:

How do we move from raw information to useful statistical learning workflows?

Data vs information

Just because one has data, it does not mean one has information.

  • Data: raw facts, numbers, texts, measurements, labels, events.
  • Information: data that have been collected, processed, organised, and interpreted in relation to a question.

Statistical work often starts with transforming raw data into something we can analyse.

A small dataset

Code
set.seed(123)

toy_data <- tibble(
  age = sample(22:60, size = 90, replace = TRUE),
  
  height_cm = round(rnorm(90, mean = 171, sd = 8), 1),
  
  gender = factor(
    sample(
      c("Male", "Female"),
      size = 90,
      replace = TRUE,
      prob = c(0.45, 0.55)
    )
  ),
  
  education = factor(
    sample(
      c("High School", "Bachelor's", "Master's", "PhD"),
      size = 90,
      replace = TRUE,
      prob = c(0.20, 0.35, 0.30, 0.15)
    ),
    levels = c("High School", "Bachelor's", "Master's", "PhD"),
    ordered = TRUE
  ),
  
  commute_by = sample(
    c("Car", "Bus", "Bike", "Walk"),
    size = 90,
    replace = TRUE,
    prob = c(0.40, 0.25, 0.20, 0.15)
  ),
  
  commute_cost = case_when(
    commute_by == "Car"  ~ round(rnorm(90, mean = 120, sd = 25)),
    commute_by == "Bus"  ~ round(rnorm(90, mean = 55, sd = 10)),
    commute_by == "Bike" ~ 0,
    commute_by == "Walk" ~ 0
  )
)

toy_data  |> slice_sample(n = 10) |>  kbl() |> kable_styling(full_width = FALSE)
age height_cm gender education commute_by commute_cost
50 176.5 Female Bachelor's Bus 60
29 168.8 Female Bachelor's Walk 0
46 154.6 Male High School Car 184
46 186.3 Female Bachelor's Bike 0
60 171.6 Male Bachelor's Bike 0
46 173.4 Female Bachelor's Bike 0
48 161.2 Male PhD Bus 64
38 175.4 Female PhD Car 117
33 166.0 Female Bachelor's Car 162
56 176.2 Male Master's Walk 0

Dataset elements

Observation
A single unit of data: a person, a country, a patient, a firm, a text, an experimental run.

Variable
A characteristic measured on each observation.

Code
toy_data |>
  kbl() |>
  kable_styling(full_width = FALSE) |>
  row_spec(1, background = "#D9FDEC") |>
  row_spec(3, background = "#CAF6FC") |>
  row_spec(8, background = "#FDB5BA")
age height_cm gender education commute_by commute_cost
52 187.4 Female Bachelor's Bike 0
36 167.1 Male High School Bike 0
35 152.5 Male Bachelor's Car 167
24 179.0 Female High School Car 86
58 165.3 Female Bachelor's Car 121
35 165.5 Male Master's Car 151
46 179.2 Female Bachelor's Car 102
47 168.7 Female Bachelor's Car 101
48 161.2 Male PhD Bus 64
26 172.5 Female Bachelor's Bus 47
48 169.9 Female Master's Walk 0
49 171.0 Male Bachelor's Car 128
30 174.1 Male Master's Bus 47
50 168.0 Female Master's Car 125
56 176.2 Male Master's Walk 0
29 169.2 Male Bachelor's Car 171
47 173.7 Female PhD Bike 0
28 179.8 Female Master's Car 139
30 174.5 Male Bachelor's Bus 49
40 168.4 Male Master's Bike 0
57 180.2 Male Bachelor's Car 111
35 178.9 Male PhD Bike 0
38 175.4 Female PhD Car 117
60 172.9 Female Master's Car 89
33 166.0 Female Bachelor's Car 162
36 181.9 Female PhD Car 143
53 166.2 Female Bachelor's Bus 60
28 188.5 Female Master's Bike 0
30 183.3 Male Bachelor's Walk 0
31 169.1 Female Master's Car 137
44 162.8 Female Bachelor's Walk 0
48 165.3 Male Bachelor's Car 137
28 173.1 Female Bachelor's Walk 0
48 169.0 Male Master's Bike 0
53 168.2 Female High School Bus 60
59 163.4 Female Bachelor's Bike 0
46 170.6 Male PhD Car 145
55 164.7 Male High School Bike 0
50 157.7 Male High School Car 102
26 168.0 Female Bachelor's Bike 0
29 178.4 Female Master's Bus 48
33 166.4 Male High School Car 69
34 175.9 Female Master's Bike 0
39 158.1 Female Bachelor's Car 115
54 170.6 Male Bachelor's Bike 0
48 175.2 Female Bachelor's Car 117
46 173.4 Female Bachelor's Bike 0
59 171.8 Male PhD Walk 0
42 165.9 Female High School Car 162
36 164.2 Female High School Walk 0
47 162.8 Female Bachelor's Bike 0
52 171.9 Female Bachelor's Car 76
37 163.4 Female PhD Car 122
51 167.1 Female Master's Car 106
27 169.0 Male Master's Bus 33
29 185.8 Female Bachelor's Walk 0
43 165.8 Female Master's Bus 43
43 172.9 Male Bachelor's Bus 60
60 171.6 Male Bachelor's Bike 0
52 163.3 Female High School Bus 53
38 170.4 Female Bachelor's Bus 49
55 182.6 Male High School Car 128
25 174.6 Male Bachelor's Car 130
34 171.3 Male High School Walk 0
26 167.6 Female Master's Bike 0
46 154.6 Male High School Car 184
43 180.1 Female High School Bike 0
46 159.3 Male Master's Bike 0
53 176.9 Female Master's Car 127
46 186.3 Female Bachelor's Bike 0
44 159.4 Female Bachelor's Bus 64
56 176.6 Female High School Walk 0
51 168.9 Male Bachelor's Bus 66
33 158.4 Female High School Walk 0
52 158.9 Male Bachelor's Bus 57
51 158.2 Male High School Bus 52
56 166.8 Female PhD Walk 0
35 159.3 Female Bachelor's Bike 0
50 176.5 Female Bachelor's Bus 60
53 187.8 Female High School Car 141
28 160.7 Female Master's Bus 44
24 177.3 Male High School Bus 67
44 177.2 Male PhD Walk 0
36 173.7 Female Bachelor's Bike 0
42 162.9 Male Master's Walk 0
58 170.0 Male High School Car 75
29 168.8 Female Bachelor's Walk 0
31 175.5 Female PhD Car 125
55 168.0 Male Bachelor's Car 124
31 178.8 Female Bachelor's Bike 0

Variable types

Numerical variables

  • continuous: height, income, speed;
  • discrete: number of children, number of goals.

Categorical variables

  • nominal: gender, species, country;
  • ordinal: education level, satisfaction scale.

The type of variable determines which summaries and visualisations are meaningful.

Describe and infer

Descriptive statistics

Describe what is observed:

  • summaries;
  • tables;
  • plots;
  • exploratory patterns.

Inferential statistics

Use observed data to reason about a larger process or population:

  • uncertainty;
  • sampling variability;
  • modelling;
  • prediction.

In this first class, we focus mostly on description and exploration.

observational vs experimental studies

Suppose a researcher wants to study the effect of the use of blue-screens (smartphone/tablet) before sleeping on the attention span.

observational studies

the researcher observes the statistical units without intervening.

  • assign the statistical units that use the blue-screens to one group, the remaining to another group.

  • measure the attention span of the two groups after two weeks.

  • if the group using blue-screens has a lower attention span, the researcher can conclude that the use of blue-screens is correlated to the attention span.

experimental studies

the researcher manipulates the statistical units to observe the effect of the manipulation.

  • split the statistical units into two groups via random assignment: one group is asked to use a blue-screen device before sleep, whereas the other group is asked NOT to use the device.

  • measure the attention span of the two groups after two weeks.

  • if the group using blue-screens has a lower attention span, the researcher can conclude that the use of blue-screens causes a lower attention span.

observational vs experimental studies

Suppose a researcher wants to study the effect of the use of blue-screens (smartphone/tablet) before sleeping on the attention span.

observational studies

the researcher observes the statistical units without intervening.

  • assign the statistical units that use the blue-screens to one group, the remaining to another group.

  • measure the attention span of the two groups after two weeks.

  • if the group using blue-screens has a lower attention span, the researcher can conclude that the use of blue-screens is correlated to the attention span.

experimental studies

the researcher manipulates the statistical units to observe the effect of the manipulation.

  • split the statistical units into two groups via random assignment: one group is asked to use a blue-screen device before sleep, whereas the other group is asked NOT to use the device.

  • measure the attention span of the two groups after two weeks.

  • if the group using blue-screens has a lower attention span, the researcher can conclude that the use of blue-screens causes a lower attention span.

random sampling and random assignment

In the setup of a study, two key concepts are desireable: random sampling and random assignment.

random sampling

the statistical units are randomly selected from the population.

  • this is to ensure that the sample is representative of the population.

  • the sample is used to make inferences about the population.

it makes it possible to generalize the results of the study to the population

random assigment

the statistical units are randomly assigned to case/control groups.

it makes it possible to detect causality (e.g. between blue-screen use and attention span)

random assigment

  • random assignment is a key feature of experimental studies and random sampling grants the generalizability of the results to the population.

  • to have both is the ideal setup for a study, but it is not always possible. (you cannot select people at random and force them to use or not use blue-screens before sleep!)

Frequency distributions: categorical variables

Code
toy_data |>
  count(commute_by, name = "frequency") |>
  mutate(relative_frequency = frequency / sum(frequency)) |>
  kbl(digits = 2) |>
  kable_styling(full_width = FALSE)
commute_by frequency relative_frequency
Bike 22 0.24
Bus 19 0.21
Car 34 0.38
Walk 15 0.17

Frequency distributions: numerical variables

Numerical variables are often grouped into intervals.

Code
toy_data |>
  mutate(height_group = cut(height_cm, breaks = seq(160, 185, by = 5), include.lowest = TRUE)) |>
  count(height_group, name = "frequency") |>
  kbl() |>
  kable_styling(full_width = FALSE)
height_group frequency
[160,165] 10
(165,170] 26
(170,175] 18
(175,180] 16
(180,185] 5
NA 15

Informative plots

An effective data visualisation should be:

  • appropriate: suited to the variable type and the question;
  • clear: easy to understand;
  • simple: not cluttered;
  • accurate: not misleading.

Non-informative plots: lying with graphics

Code
tibble(company = c("company A", "company B"), revenue = c(800, 1000)) |>
  ggplot(aes(x = company, y = revenue, fill = company)) +
  geom_col() +
  coord_cartesian(ylim = c(750, 1000)) +
  theme_minimal() +
  theme(legend.position = "none", axis.text.y = element_blank(), axis.ticks.y = element_blank())

Code
tibble(company = c("company A", "company B"), revenue = c(800, 1000)) |>
  ggplot(aes(x = company, y = revenue, fill = company)) +
  geom_col() +
  theme_minimal() +
  theme(legend.position = "none")

visualization for numerical variables

Code
toy_data |> ggplot(aes(x = height_cm)) +
    geom_histogram( binwidth= 5,
                  boundary = 150,
                  fill = "indianred", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)", y = "Frequency") + theme_minimal() 

visualization for numerical variables

Code
toy_data |> mutate(height_cm=cut(height_cm, 
                                         breaks=seq(150,200,by=5),include.lowest=T)) |> 
  count(height_cm) |> kbl() |> kable_styling()
height_cm n
[150,155] 2
(155,160] 8
(160,165] 10
(165,170] 26
(170,175] 18
(175,180] 16
(180,185] 5
(185,190] 5
Code
toy_data |> ggplot(aes(x = height_cm)) +
  geom_histogram( binwidth= 5,
                  boundary = 150,
                  fill = "indianred", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)") + theme_minimal() 

visualization for numerical variables

Code
toy_data |> mutate(height_cm=cut(height_cm, 
                                         breaks=seq(150,200,by=5),include.lowest=T)) |> 
  count(height_cm) |> mutate(prop=n/sum(n)) |> select(-n) |> kbl() |> kable_styling()
height_cm prop
[150,155] 0.0222222
(155,160] 0.0888889
(160,165] 0.1111111
(165,170] 0.2888889
(170,175] 0.2000000
(175,180] 0.1777778
(180,185] 0.0555556
(185,190] 0.0555556
Code
toy_data |> ggplot(aes(x = height_cm)) +
  geom_histogram(aes(y = after_stat(count / sum(count))),
                  binwidth= 5,
                  boundary = 150,
                  fill = "indianred", color = "black",alpha=.5) +
  # geom_histogram( aes(y=after_stat(density)), binwidth= 5,
  #                 boundary = 150,
  #                 fill = "indianred", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)") + theme_minimal() 

visualization for numerical variables

Code
toy_data |> mutate(height_cm=cut(height_cm, 
                                         breaks=seq(150,200,by=5),include.lowest=T)) |> 
  count(height_cm) |> mutate(prop=n/sum(n)) |> select(-n) |> kbl() |> kable_styling()
height_cm prop
[150,155] 0.0222222
(155,160] 0.0888889
(160,165] 0.1111111
(165,170] 0.2888889
(170,175] 0.2000000
(175,180] 0.1777778
(180,185] 0.0555556
(185,190] 0.0555556
Code
toy_data |> ggplot(aes(x = height_cm)) +
  # geom_histogram(aes(y = after_stat(count / sum(count))),
  #                 binwidth= 5,
  #                 boundary = 150,
  #                 fill = "dodgerblue", color = "black",alpha=.5) +
  geom_histogram( aes(y=after_stat(density)), binwidth= 5,
                  boundary = 150,
                  fill = "dodgerblue", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)") + theme_minimal() 

visualization for numerical variables

Code
custom_breaks <- c(150,170,175,195,200)
toy_data |> mutate(height_cm=cut(height_cm, 
                                         breaks=custom_breaks,include.lowest=T)) |> 
  count(height_cm) |> mutate(prop=n/sum(n)) |> select(-n) |> kbl() |> kable_styling()
height_cm prop
[150,170] 0.5111111
(170,175] 0.2000000
(175,195] 0.2888889
Code
toy_data |> ggplot(aes(x = height_cm)) +
  geom_histogram(aes(y = after_stat(count / sum(count))),
                  breaks=custom_breaks,
                  boundary = 150,
                  fill = "indianred", color = "black",alpha=.5) +
  # geom_histogram( aes(y=after_stat(density)), binwidth= 5,
  #                 boundary = 150,
  #                 fill = "dodgerblue", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)") + theme_minimal() 

visualization for numerical variables

Code
toy_data |> mutate(height_cm=cut(height_cm, 
                                         breaks=custom_breaks,include.lowest=T)) |> 
  count(height_cm) |> mutate(prop=n/sum(n)) |> select(-n) |> kbl() |> kable_styling()
height_cm prop
[150,170] 0.5111111
(170,175] 0.2000000
(175,195] 0.2888889
Code
toy_data |> ggplot(aes(x = height_cm)) +
  geom_histogram(aes(y=after_stat(density)),
                  breaks=custom_breaks,
                  boundary = 150,
                  fill = "dodgerblue", color = "black",alpha=.5) +
  # geom_histogram( aes(y=after_stat(density)), binwidth= 5,
  #                 boundary = 150,
  #                 fill = "dodgerblue", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)") + theme_minimal() 

visualization for numerical variables

Code
toy_data |> ggplot(aes(x = height_cm)) +
  geom_histogram(aes(y = after_stat(count / sum(count))),
                  breaks=custom_breaks,
                  boundary = 150,
                  fill = "indianred", color = "black",alpha=.5) +
  # geom_histogram( aes(y=after_stat(density)), binwidth= 5,
  #                 boundary = 150,
  #                 fill = "dodgerblue", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)") + theme_minimal() 

the heights of the bars add up to 1

Code
toy_data |> ggplot(aes(x = height_cm)) +
  geom_histogram(aes(y=after_stat(density)),
                  breaks=custom_breaks,
                  boundary = 150,
                  fill = "dodgerblue", color = "black",alpha=.5) +
  # geom_histogram( aes(y=after_stat(density)), binwidth= 5,
  #                 boundary = 150,
  #                 fill = "dodgerblue", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)") + theme_minimal() 

the area of the bars adds up to 1

  • more appropriate when the bins, intervals of values, are of different widths

visualization for numerical variables (integers)

While one can still consider bins and an histogram, since the variable is not continuous, a more appropriate option is the lollipop plot.

Code
toy_data |> 
count(age ) %>%
  ggplot(aes(x = age, y = n)) +
  geom_segment(aes(xend = age, yend = 0), color = "dodgerblue") +
  geom_point(color = "indianred", size = 3) + theme_minimal() 

visualization for categorical variables

Barplots are the most common way to visualize categorical variables.

Code
toy_data |> ggplot(aes(x = factor(education))) +
  geom_bar(fill = "dodgerblue") +
  labs(title = "barplot", x = "education", y = "Count")+ theme_minimal()

visualization for categorical variables

pie charts are also popular, but less easily interpretable.

Code
 data_pie <- toy_data |> 
  count(education) %>%
  mutate(perc = (n / sum(n)),
         label = paste0(education, " ", scales::percent(perc)))

# Plot
ggplot(data_pie, aes(x = "", y = perc, fill = education)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  labs(title = "Pie Chart of eduction", fill = "education") +
  geom_text(aes(label = label), position = position_stack(vjust = 0.5)) +
  theme_void()

visualization for categorical variables

similar to piecharts, but more interpretable, is the donut chart.

Code
data_pie |> ggplot(aes(x = 2, y = perc, fill = education)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  xlim(0.5, 2.5) +  # create the hole
  geom_text(aes(label = label), position = position_stack(vjust = 0.5)) +
  labs(title = "Donut Chart of education", fill = "education") +
  theme_void()

visualization for categorical variables

If the factor levels are not ordered as in education, the bars of the barplot can be re-ordered in an ascending or descending order.

Code
toy_data |> ggplot(aes(x = fct_rev(fct_infreq(commute_by)))) +
  geom_bar(fill = "dodgerblue") +
  labs(title = "barplot", x = "commute_by", y = "Count")+ theme_minimal()

Code
toy_data |> ggplot(aes(x = fct_infreq(commute_by))) +
  geom_bar(fill = "dodgerblue") +
  labs(title = "barplot", x = "commute_by", y = "Count")+ theme_minimal()

This is useful for categorical variables with many levels, to put in evidence the most/least frequent ones.

summarise (numerical) data: indexes

four distributions

Suppose to have the distribution of employees heights from four companies.

Code
n=1000
set.seed(123)
library("sn")



four_heigths_data = tibble(
                      `A&co.`= rsn(n = n, xi=140, omega=25,alpha=15),
                      `B&co.`= rnorm(n, 170, 25),
                      `C&co.`= rnorm(n, 170, 9),
                      `D&co.`= rnorm(n, 190, 9)
)

four_heigths_data |> 
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |> 
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram( aes(y=after_stat(density)), binwidth= 5,colour="black",alpha=.5)+
  scale_fill_manual(values = c("indianred","dodgerblue",  "darkorange", "darkgreen")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") + theme_minimal() + facet_grid(company~.)

you can tell they are different from one another, but how?

company C&co. and D&co.

the two distributions look similar, but for their position

Code
n=1000
set.seed(123)
library("sn")

four_heigths_data |> 
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |> 
  filter(company %in% c("C&co.", "D&co.")) |>
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram( aes(y=after_stat(density)), binwidth= 5,colour="black",alpha=.5)+
  scale_fill_manual(values = c("darkorange", "darkgreen")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") + theme_minimal() + facet_grid(company~.)

company B&co. and C&co.

the two distributions seem to have

  • a similar position (centered around 170 cm)

  • different spread (the first one is more spread out than the second one)

Code
n=1000
set.seed(123)
library("sn")

four_heigths_data |> 
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |> 
  filter(company %in% c("B&co.", "C&co.")) |>
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram( aes(y=after_stat(density)), binwidth= 5,colour="black",alpha=.5)+
  scale_fill_manual(values = c("dodgerblue","darkorange")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") + theme_minimal() + facet_grid(company~.)

company A&co. and D&co.

the two distributions seem to have

  • different position

  • different spread

  • different shape (the first one is more skewed to the right than the second one)

Code
n=1000
set.seed(123)
library("sn")

four_heigths_data |> 
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |> 
  filter(company %in% c("A&co.", "D&co.")) |>
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram( aes(y=after_stat(density)), binwidth= 5,colour="black",alpha=.5)+
  scale_fill_manual(values = c("indianred", "darkgreen")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") + theme_minimal() + facet_grid(company~.)

Measuring position, spread and shape

position

  • mean

  • median and quartiles

spread

  • variance (standard deviation)

  • interquartile range (IQR)

shape

  • skewness

Visualising relationships

Interdependence and dependence

If the components of a bivariate variable \((X,Y)\) play the same role in the analysis, we study their interdependence.

If we are interested in how \(Y\) changes with respect to \(X\), we refer to the dependence of \(Y\) on \(X\).

  • \(Y\) is often called the dependent variable.
  • \(X\) is often called the independent variable.

Types of bivariate relationships

Given a pair of variables \((X,Y)\):

  • if both are categorical, we study association;
  • if both are continuous, we study correlation;
  • if one is categorical and one is continuous, we compare distributions across groups.

Two categorical variables

UFO abductions: joint frequency table

A sci-fi example: the relationship between alien species and method of abduction.

Code
set.seed(42)

ufo_data <- tibble(
  alien_species = sample(
    c("Martian", "Venusian", "Zeta Reticulan", "Reptilian"),
    size = 100, replace = TRUE, prob = c(0.4, 0.2, 0.3, 0.1)
  ),
  abduction_method = sample(
    c("Tractor Beam", "Teleportation", "Sleep Gas", "Invisibility Cloak"),
    size = 100, replace = TRUE
  )
)

ufo_table <- ufo_data |>
  count(alien_species, abduction_method) |>
  pivot_wider(names_from = abduction_method, values_from = n, values_fill = 0)

ufo_table |> kbl() |> kable_styling(full_width = FALSE)
alien_species Invisibility Cloak Sleep Gas Teleportation Tractor Beam
Martian 10 8 9 11
Reptilian 6 5 4 1
Venusian 3 2 8 5
Zeta Reticulan 5 3 10 10

Row profiles

Row profiles answer: within each alien species, how are abduction methods distributed?

Code
ufo_data |>
  count(alien_species, abduction_method) |>
  group_by(alien_species) |>
  mutate(row_profile = n / sum(n)) |>
  select(alien_species, abduction_method, row_profile) |>
  pivot_wider(names_from = abduction_method, values_from = row_profile, values_fill = 0) |>
  kbl(digits = 2) |>
  kable_styling(full_width = FALSE)
alien_species Invisibility Cloak Sleep Gas Teleportation Tractor Beam
Martian 0.26 0.21 0.24 0.29
Reptilian 0.38 0.31 0.25 0.06
Venusian 0.17 0.11 0.44 0.28
Zeta Reticulan 0.18 0.11 0.36 0.36

Visualising categorical association

Code
ufo_data |>
  ggplot(aes(x = alien_species, fill = abduction_method)) +
  geom_bar(position = "fill") +
  labs(x = "Alien species", y = "Proportion", fill = "Method") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

Residual heatmap

Association can also be visualised by comparing observed cells with what we would expect under independence.

Code
chisq_ufo <- chisq.test(
  xtabs(~ alien_species + abduction_method, data = ufo_data)
)

as.data.frame(as.table(chisq_ufo$stdres)) |>
  rename(residual = Freq) |>
  ggplot(aes(abduction_method, alien_species, fill = residual)) +
  geom_tile() +
  geom_text(aes(label = round(residual, 1))) +
  coord_equal() +
  labs(
    x = "Abduction method",
    y = "Alien species",
    fill = "Std. residual"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

Two continuous variables

Correlation: shark attacks and ice cream

Two continuous variables are correlated when large values of one variable tend to correspond to large or small values of the other.

Code
set.seed(123)

df1 <- tibble(
  ice_cream_sales = seq(100, 1000, length.out = 50),
  shark_attacks = ice_cream_sales * 0.05 + rnorm(50, 0, 10)
)

df1 |>
  ggplot(aes(x = ice_cream_sales, y = shark_attacks)) +
  geom_point(color = "indianred") +
  labs(x = "Ice cream sales", y = "Shark attacks") +
  theme_minimal()

Correlation: building the index

Correlation increases when both variables are above their average or both are below their average.

Code
df1 |>
  mutate(
    sign_index = if_else(
      (ice_cream_sales > mean(ice_cream_sales) & shark_attacks > mean(shark_attacks)) |
        (ice_cream_sales < mean(ice_cream_sales) & shark_attacks < mean(shark_attacks)),
      "+", "-"
    ),
    size_index = abs((ice_cream_sales - mean(ice_cream_sales)) * (shark_attacks - mean(shark_attacks)))
  ) |>
  ggplot(aes(x = ice_cream_sales, y = shark_attacks, color = sign_index)) +
  geom_text(aes(label = sign_index, size = size_index), alpha = 0.8) +
  geom_vline(xintercept = mean(df1$ice_cream_sales), color = "dodgerblue", linewidth = 1.2, alpha = 0.5) +
  geom_hline(yintercept = mean(df1$shark_attacks), color = "forestgreen", linewidth = 1.2, alpha = 0.5) +
  scale_size(range = c(3, 8)) +
  labs(x = "Ice cream sales", y = "Shark attacks") +
  theme_minimal() +
  theme(legend.position = "none")

Pearson correlation

The linear correlation index is:

\[ \rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \]

where \(\sigma_{xy}\) is the covariance and \(\sigma_x\), \(\sigma_y\) are the standard deviations of the two variables.

For the example:

Code
cor(df1$ice_cream_sales, df1$shark_attacks) |> round(2)
[1] 0.82

Correlation is not causation

The ice cream and shark example is deliberately provocative.

spurious correlations

A strong association does not automatically imply that one variable causes the other. A third variable, such as temperature or beach attendance, may explain both.

For some funny examples, see Tyler Vigen’s Spurious Correlations:
https://www.tylervigen.com/spurious-correlations

Negative and weak correlation

Code
df_negative <- tibble(
  doc_hours = seq(0, 10, length.out = 50),
  social_life = 100 - doc_hours * 8 + rnorm(50, 0, 5)
)

df_negative |>
  ggplot(aes(doc_hours, social_life)) +
  geom_point(color = "indianred") +
  labs(
    x = "Alien documentary hours",
    y = "Social life index",
    title = paste("Correlation =", round(cor(df_negative$doc_hours, df_negative$social_life), 2))
  ) +
  theme_minimal()

When correlation fails: nonlinear relationships

Code
set.seed(123)

df_nonlinear <- tibble(
  x = seq(-10, 10, length.out = 100),
  y = x^2 + rnorm(100, 0, 5)
)

df_nonlinear |>
  ggplot(aes(x = x, y = y)) +
  geom_point(color = "indianred", alpha = 0.7) +
  geom_vline(xintercept = mean(df_nonlinear$x), color = "dodgerblue", linewidth = 1.2, alpha = 0.5) +
  geom_hline(yintercept = mean(df_nonlinear$y), color = "forestgreen", linewidth = 1.2, alpha = 0.5) +
  labs(
    title = paste("Linear correlation =", round(cor(df_nonlinear$x, df_nonlinear$y), 2)),
    x = "X",
    y = "Y"
  ) +
  theme_minimal()

One categorical and one continuous variable

Comparing distributions across groups

Code
set.seed(42)

alien_power <- tibble(
  alien_species = rep(c("Martian", "Venusian", "Zeta Reticulan", "Reptilian"), each = 30),
  laser_power = c(
    rnorm(30, mean = 50, sd = 7),
    rnorm(30, mean = 65, sd = 8),
    rnorm(30, mean = 55, sd = 10),
    rnorm(30, mean = 80, sd = 6)
  )
)

alien_power |>
  ggplot(aes(x = alien_species, y = laser_power, fill = alien_species)) +
  geom_boxplot(alpha = 0.6, show.legend = FALSE) +
  labs(x = "Alien species", y = "Preferred laser power") +
  theme_minimal()

More expressive view: violin + boxplot + jitter

Code
alien_power |>
  ggplot(aes(x = alien_species, y = laser_power, fill = alien_species)) +
  geom_violin(alpha = 0.35, show.legend = FALSE) +
  geom_boxplot(width = 0.12, outlier.shape = NA, show.legend = FALSE) +
  geom_jitter(width = 0.08, alpha = 0.4, show.legend = FALSE) +
  labs(x = "Alien species", y = "Preferred laser power") +
  theme_minimal()

Fancier but useful visualisations

Multivariate view: pair plots

Pair plots are useful before clustering or classification because they help us see several relationships at once.

Code
library(palmerpenguins)
library(GGally)

palmerpenguins::penguins |>
  drop_na(species, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |>
  select(species, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |>
  GGally::ggpairs(aes(colour = species, alpha = 0.5))

Note

If GGally is not installed, this slide can be skipped or shown from the rendered instructor version.

Correlation heatmap

Code
library(palmerpenguins)

palmerpenguins::penguins |>
  select(where(is.numeric)) |>
  cor(use = "complete.obs") |>
  as.data.frame() |>
  rownames_to_column("var1") |>
  pivot_longer(-var1, names_to = "var2", values_to = "correlation") |>
  ggplot(aes(var1, var2, fill = correlation)) +
  geom_tile() +
  geom_text(aes(label = round(correlation, 2))) +
  coord_equal() +
  labs(x = NULL, y = NULL, fill = "Correlation") +
  theme_minimal()

Text as data

Why text?

Text is a useful example of non-standard data.

  • It does not arrive as a clean numerical matrix.
  • We must decide what the observational units are: documents, paragraphs, sentences, words.
  • We transform text into variables before analysing it.

In text analysis, the statistical unit may change during the workflow:

  • at first, each row is a document;
  • after tokenisation, each row is a word occurrence;
  • after counting, each row may be a word, a document-word pair, or a field-word pair.

a small text dataset

Code
texts <- tibble::tribble(
  ~doc_id, ~field, ~text,
  1, "engineering", "The structure was tested under dynamic stress and material fatigue.",
  2, "engineering", "Mechanical systems require models of vibration, load and failure.",
  3, "engineering", "Sensors collect signals from machines under stress and vibration.",
  4, "sport", "Training load, recovery and fatigue affect athletic performance.",
  5, "sport", "Injury prevention requires monitoring intensity and recovery.",
  6, "sport", "Athletes improve performance through training, recovery and adaptation.",
  7, "economics", "Innovation and sustainability influence firms, markets and management.",
  8, "economics", "Firms compete in markets through strategy, investment and innovation.",
  9, "economics", "Management decisions affect costs, productivity and sustainability.",
  10, "humanities", "Texts, sources and contexts shape historical interpretation.",
  11, "humanities", "Archives preserve documents, memories and cultural heritage.",
  12, "humanities", "Interpretation depends on language, context and historical sources."
)

texts |> 
  kbl() |> 
  kable_styling(full_width = FALSE)
doc_id field text
1 engineering The structure was tested under dynamic stress and material fatigue.
2 engineering Mechanical systems require models of vibration, load and failure.
3 engineering Sensors collect signals from machines under stress and vibration.
4 sport Training load, recovery and fatigue affect athletic performance.
5 sport Injury prevention requires monitoring intensity and recovery.
6 sport Athletes improve performance through training, recovery and adaptation.
7 economics Innovation and sustainability influence firms, markets and management.
8 economics Firms compete in markets through strategy, investment and innovation.
9 economics Management decisions affect costs, productivity and sustainability.
10 humanities Texts, sources and contexts shape historical interpretation.
11 humanities Archives preserve documents, memories and cultural heritage.
12 humanities Interpretation depends on language, context and historical sources.

From documents to words

The first transformation is called tokenisation: we split each text into smaller units.

Code
library(tidytext)

tokens_raw <- texts |>
  tidytext::unnest_tokens(word, text)

tokens_raw |>
  slice_head(n = 15) |>
  kbl() |>
  kable_styling(full_width = FALSE)
doc_id field word
1 engineering the
1 engineering structure
1 engineering was
1 engineering tested
1 engineering under
1 engineering dynamic
1 engineering stress
1 engineering and
1 engineering material
1 engineering fatigue
2 engineering mechanical
2 engineering systems
2 engineering require
2 engineering models
2 engineering of

Removing common words

Many frequent words are not informative: the, and, of, to, …

These are usually called stop words.

Code
tokens <- tokens_raw |>
  anti_join(tidytext::stop_words, by = "word")

tokens |>
  count(word, sort = TRUE) |>
  slice_head(n = 15) |>
  kbl() |>
  kable_styling(full_width = FALSE)
word n
recovery 3
affect 2
fatigue 2
firms 2
historical 2
innovation 2
interpretation 2
load 2
management 2
markets 2
performance 2
sources 2
stress 2
sustainability 2
training 2

Word frequencies

After tokenisation, we can count how often each word appears.

Code
tokens |>
  count(word, sort = TRUE) |>
  slice_max(n, n = 15) |>
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col(fill = "indianred", alpha = 0.7) +
  labs(x = "Frequency", y = NULL) +
  theme_minimal()

Word frequencies by field

Now we compare which words are common in different fields.

Code
tokens |>
  count(field, word, sort = TRUE) |>
  group_by(field) |>
  slice_max(n, n = 5, with_ties = FALSE) |>
  ungroup() |>
  ggplot(aes(n, tidytext::reorder_within(word, n, field))) +
  geom_col(fill = "indianred", alpha = 0.7) +
  facet_wrap(~ field, scales = "free_y") +
  tidytext::scale_y_reordered() +
  labs(x = "Frequency", y = NULL) +
  theme_minimal()

From text to a data matrix

Many statistical methods require a rectangular data table.

A common representation is the document-term matrix:

  • rows are documents;
  • columns are words;
  • values are word frequencies.
Code
document_term_matrix <- tokens |>
  count(doc_id, word) |>
  pivot_wider(
    names_from = word,
    values_from = n,
    values_fill = 0
  )

document_term_matrix |>
  select(1:min(10, ncol(document_term_matrix))) |>
  kbl() |>
  kable_styling(full_width = FALSE)
doc_id dynamic fatigue material stress structure tested failure load mechanical
1 1 1 1 1 1 1 0 0 0
2 0 0 0 0 0 0 1 1 1
3 0 0 0 1 0 0 0 0 0
4 0 1 0 0 0 0 0 1 0
5 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0

Term frequency by field

We can also build a field-term matrix.

Code
field_term_matrix <- tokens |>
  count(field, word) |>
  pivot_wider(
    names_from = word,
    values_from = n,
    values_fill = 0
  )

field_term_matrix |>
  select(1:min(12, ncol(field_term_matrix))) |>
  kbl() |>
  kable_styling(full_width = FALSE)
field affect compete costs decisions firms influence innovation investment management markets productivity
economics 1 1 1 1 2 1 2 1 2 2 1
engineering 0 0 0 0 0 0 0 0 0 0 0
humanities 0 0 0 0 0 0 0 0 0 0 0
sport 1 0 0 0 0 0 0 0 0 0 0

What makes a word distinctive?

A word may be frequent overall, but not distinctive.

For example, a word that appears in every field is less useful for distinguishing fields.

Code
field_words <- tokens |>
  count(field, word, sort = TRUE)

field_words |>
  bind_tf_idf(word, field, n) |>
  group_by(field) |>
  slice_max(tf_idf, n = 5, with_ties = FALSE) |>
  ungroup() |>
  ggplot(aes(tf_idf, tidytext::reorder_within(word, tf_idf, field))) +
  geom_col(fill = "dodgerblue", alpha = 0.7) +
  facet_wrap(~ field, scales = "free_y") +
  tidytext::scale_y_reordered() +
  labs(x = "tf-idf", y = NULL) +
  theme_minimal()

What did we do?

Starting from raw text, we created variables.

documents → words → word counts → document-term matrix

This is why text can be analysed statistically: after preprocessing, text becomes structured

Optional wordcloud

Code
library(tidytext)
library(wordcloud)

tokens |>
  count(word, sort = TRUE) |>
  filter(n >= 1) |>
  with(
    wordcloud(
      words = word,
      freq = n,
      max.words = 50,
      min.freq = 1,
      random.order = FALSE,
      scale = c(3.5, 0.8)
    )
  )

Note

A word cloud is useful as a quick visual impression, but a bar chart is often easier to read and compare.