Essential Statistical Data Analysis

Class 1 — Exploring, visualising, and constructing data

Alfonso Iodice D’Enza

2026-05-04

Part I — Theory and guided demonstration

What is statistics?

“Statistics is the science of learning from data.”
— David Hand

“Statistics is the grammar of science.”
— Karl Pearson

For this course, the practical question is:

How do we move from raw information to useful statistical learning workflows?

Data vs information

Just because one has data, it does not mean one has information.

Data: raw facts, numbers, texts, measurements, labels, events.
Information: data that have been collected, processed, organised, and interpreted in relation to a question.

Statistical work often starts with transforming raw data into something we can analyse.

A small dataset

Code

set.seed(123)

toy_data <- tibble(
  age = sample(22:60, size = 90, replace = TRUE),
  
  height_cm = round(rnorm(90, mean = 171, sd = 8), 1),
  
  gender = factor(
    sample(
      c("Male", "Female"),
      size = 90,
      replace = TRUE,
      prob = c(0.45, 0.55)
    )
  ),
  
  education = factor(
    sample(
      c("High School", "Bachelor's", "Master's", "PhD"),
      size = 90,
      replace = TRUE,
      prob = c(0.20, 0.35, 0.30, 0.15)
    ),
    levels = c("High School", "Bachelor's", "Master's", "PhD"),
    ordered = TRUE
  ),
  
  commute_by = sample(
    c("Car", "Bus", "Bike", "Walk"),
    size = 90,
    replace = TRUE,
    prob = c(0.40, 0.25, 0.20, 0.15)
  ),
  
  commute_cost = case_when(
    commute_by == "Car"  ~ round(rnorm(90, mean = 120, sd = 25)),
    commute_by == "Bus"  ~ round(rnorm(90, mean = 55, sd = 10)),
    commute_by == "Bike" ~ 0,
    commute_by == "Walk" ~ 0
  )
)

toy_data  |> slice_sample(n = 10) |>  kbl() |> kable_styling(full_width = FALSE)

age	height_cm	gender	education	commute_by	commute_cost
50	176.5	Female	Bachelor's	Bus	60
29	168.8	Female	Bachelor's	Walk	0
46	154.6	Male	High School	Car	184
46	186.3	Female	Bachelor's	Bike	0
60	171.6	Male	Bachelor's	Bike	0
46	173.4	Female	Bachelor's	Bike	0
48	161.2	Male	PhD	Bus	64
38	175.4	Female	PhD	Car	117
33	166.0	Female	Bachelor's	Car	162
56	176.2	Male	Master's	Walk	0

Dataset elements

Observation
A single unit of data: a person, a country, a patient, a firm, a text, an experimental run.

Variable
A characteristic measured on each observation.

Code

toy_data |>
  kbl() |>
  kable_styling(full_width = FALSE) |>
  row_spec(1, background = "#D9FDEC") |>
  row_spec(3, background = "#CAF6FC") |>
  row_spec(8, background = "#FDB5BA")

age	height_cm	gender	education	commute_by	commute_cost
52	187.4	Female	Bachelor's	Bike	0
36	167.1	Male	High School	Bike	0
35	152.5	Male	Bachelor's	Car	167
24	179.0	Female	High School	Car	86
58	165.3	Female	Bachelor's	Car	121
35	165.5	Male	Master's	Car	151
46	179.2	Female	Bachelor's	Car	102
47	168.7	Female	Bachelor's	Car	101
48	161.2	Male	PhD	Bus	64
26	172.5	Female	Bachelor's	Bus	47
48	169.9	Female	Master's	Walk	0
49	171.0	Male	Bachelor's	Car	128
30	174.1	Male	Master's	Bus	47
50	168.0	Female	Master's	Car	125
56	176.2	Male	Master's	Walk	0
29	169.2	Male	Bachelor's	Car	171
47	173.7	Female	PhD	Bike	0
28	179.8	Female	Master's	Car	139
30	174.5	Male	Bachelor's	Bus	49
40	168.4	Male	Master's	Bike	0
57	180.2	Male	Bachelor's	Car	111
35	178.9	Male	PhD	Bike	0
38	175.4	Female	PhD	Car	117
60	172.9	Female	Master's	Car	89
33	166.0	Female	Bachelor's	Car	162
36	181.9	Female	PhD	Car	143
53	166.2	Female	Bachelor's	Bus	60
28	188.5	Female	Master's	Bike	0
30	183.3	Male	Bachelor's	Walk	0
31	169.1	Female	Master's	Car	137
44	162.8	Female	Bachelor's	Walk	0
48	165.3	Male	Bachelor's	Car	137
28	173.1	Female	Bachelor's	Walk	0
48	169.0	Male	Master's	Bike	0
53	168.2	Female	High School	Bus	60
59	163.4	Female	Bachelor's	Bike	0
46	170.6	Male	PhD	Car	145
55	164.7	Male	High School	Bike	0
50	157.7	Male	High School	Car	102
26	168.0	Female	Bachelor's	Bike	0
29	178.4	Female	Master's	Bus	48
33	166.4	Male	High School	Car	69
34	175.9	Female	Master's	Bike	0
39	158.1	Female	Bachelor's	Car	115
54	170.6	Male	Bachelor's	Bike	0
48	175.2	Female	Bachelor's	Car	117
46	173.4	Female	Bachelor's	Bike	0
59	171.8	Male	PhD	Walk	0
42	165.9	Female	High School	Car	162
36	164.2	Female	High School	Walk	0
47	162.8	Female	Bachelor's	Bike	0
52	171.9	Female	Bachelor's	Car	76
37	163.4	Female	PhD	Car	122
51	167.1	Female	Master's	Car	106
27	169.0	Male	Master's	Bus	33
29	185.8	Female	Bachelor's	Walk	0
43	165.8	Female	Master's	Bus	43
43	172.9	Male	Bachelor's	Bus	60
60	171.6	Male	Bachelor's	Bike	0
52	163.3	Female	High School	Bus	53
38	170.4	Female	Bachelor's	Bus	49
55	182.6	Male	High School	Car	128
25	174.6	Male	Bachelor's	Car	130
34	171.3	Male	High School	Walk	0
26	167.6	Female	Master's	Bike	0
46	154.6	Male	High School	Car	184
43	180.1	Female	High School	Bike	0
46	159.3	Male	Master's	Bike	0
53	176.9	Female	Master's	Car	127
46	186.3	Female	Bachelor's	Bike	0
44	159.4	Female	Bachelor's	Bus	64
56	176.6	Female	High School	Walk	0
51	168.9	Male	Bachelor's	Bus	66
33	158.4	Female	High School	Walk	0
52	158.9	Male	Bachelor's	Bus	57
51	158.2	Male	High School	Bus	52
56	166.8	Female	PhD	Walk	0
35	159.3	Female	Bachelor's	Bike	0
50	176.5	Female	Bachelor's	Bus	60
53	187.8	Female	High School	Car	141
28	160.7	Female	Master's	Bus	44
24	177.3	Male	High School	Bus	67
44	177.2	Male	PhD	Walk	0
36	173.7	Female	Bachelor's	Bike	0
42	162.9	Male	Master's	Walk	0
58	170.0	Male	High School	Car	75
29	168.8	Female	Bachelor's	Walk	0
31	175.5	Female	PhD	Car	125
55	168.0	Male	Bachelor's	Car	124
31	178.8	Female	Bachelor's	Bike	0

Variable types

Numerical variables

continuous: height, income, speed;
discrete: number of children, number of goals.

Categorical variables

nominal: gender, species, country;
ordinal: education level, satisfaction scale.

The type of variable determines which summaries and visualisations are meaningful.

Describe and infer

Descriptive statistics

Describe what is observed:

summaries;
tables;
plots;
exploratory patterns.

Inferential statistics

Use observed data to reason about a larger process or population:

uncertainty;
sampling variability;
modelling;
prediction.

In this first class, we focus mostly on description and exploration.

observational vs experimental studies

Suppose a researcher wants to study the effect of the use of blue-screens (smartphone/tablet) before sleeping on the attention span.

observational studies

the researcher observes the statistical units without intervening.

assign the statistical units that use the blue-screens to one group, the remaining to another group.
measure the attention span of the two groups after two weeks.
if the group using blue-screens has a lower attention span, the researcher can conclude that the use of blue-screens is correlated to the attention span.

experimental studies

the researcher manipulates the statistical units to observe the effect of the manipulation.

split the statistical units into two groups via random assignment: one group is asked to use a blue-screen device before sleep, whereas the other group is asked NOT to use the device.
measure the attention span of the two groups after two weeks.
if the group using blue-screens has a lower attention span, the researcher can conclude that the use of blue-screens causes a lower attention span.

observational vs experimental studies

Suppose a researcher wants to study the effect of the use of blue-screens (smartphone/tablet) before sleeping on the attention span.

observational studies

the researcher observes the statistical units without intervening.

assign the statistical units that use the blue-screens to one group, the remaining to another group.
measure the attention span of the two groups after two weeks.
if the group using blue-screens has a lower attention span, the researcher can conclude that the use of blue-screens is correlated to the attention span.

experimental studies

the researcher manipulates the statistical units to observe the effect of the manipulation.

split the statistical units into two groups via random assignment: one group is asked to use a blue-screen device before sleep, whereas the other group is asked NOT to use the device.
measure the attention span of the two groups after two weeks.
if the group using blue-screens has a lower attention span, the researcher can conclude that the use of blue-screens causes a lower attention span.

random sampling and random assignment

In the setup of a study, two key concepts are desireable: random sampling and random assignment.

random sampling

the statistical units are randomly selected from the population.

this is to ensure that the sample is representative of the population.
the sample is used to make inferences about the population.

it makes it possible to generalize the results of the study to the population

random assigment

the statistical units are randomly assigned to case/control groups.

it makes it possible to detect causality (e.g. between blue-screen use and attention span)

random assigment

random assignment is a key feature of experimental studies and random sampling grants the generalizability of the results to the population.
to have both is the ideal setup for a study, but it is not always possible. (you cannot select people at random and force them to use or not use blue-screens before sleep!)

Frequency distributions: categorical variables

Code

toy_data |>
  count(commute_by, name = "frequency") |>
  mutate(relative_frequency = frequency / sum(frequency)) |>
  kbl(digits = 2) |>
  kable_styling(full_width = FALSE)

commute_by	frequency	relative_frequency
Bike	22	0.24
Bus	19	0.21
Car	34	0.38
Walk	15	0.17

Frequency distributions: numerical variables

Numerical variables are often grouped into intervals.

Code

toy_data |>
  mutate(height_group = cut(height_cm, breaks = seq(160, 185, by = 5), include.lowest = TRUE)) |>
  count(height_group, name = "frequency") |>
  kbl() |>
  kable_styling(full_width = FALSE)

height_group	frequency
[160,165]	10
(165,170]	26
(170,175]	18
(175,180]	16
(180,185]	5
NA	15

Informative plots

An effective data visualisation should be:

appropriate: suited to the variable type and the question;
clear: easy to understand;
simple: not cluttered;
accurate: not misleading.

Non-informative plots: lying with graphics

Code

tibble(company = c("company A", "company B"), revenue = c(800, 1000)) |>
  ggplot(aes(x = company, y = revenue, fill = company)) +
  geom_col() +
  coord_cartesian(ylim = c(750, 1000)) +
  theme_minimal() +
  theme(legend.position = "none", axis.text.y = element_blank(), axis.ticks.y = element_blank())

Code

tibble(company = c("company A", "company B"), revenue = c(800, 1000)) |>
  ggplot(aes(x = company, y = revenue, fill = company)) +
  geom_col() +
  theme_minimal() +
  theme(legend.position = "none")

visualization for numerical variables

Code

toy_data |> ggplot(aes(x = height_cm)) +
    geom_histogram( binwidth= 5,
                  boundary = 150,
                  fill = "indianred", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)", y = "Frequency") + theme_minimal()

visualization for numerical variables

Code

toy_data |> mutate(height_cm=cut(height_cm, 
                                         breaks=seq(150,200,by=5),include.lowest=T)) |> 
  count(height_cm) |> kbl() |> kable_styling()

height_cm	n
[150,155]	2
(155,160]	8
(160,165]	10
(165,170]	26
(170,175]	18
(175,180]	16
(180,185]	5
(185,190]	5

Code

toy_data |> ggplot(aes(x = height_cm)) +
  geom_histogram( binwidth= 5,
                  boundary = 150,
                  fill = "indianred", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)") + theme_minimal()

visualization for numerical variables

Code

toy_data |> mutate(height_cm=cut(height_cm, 
                                         breaks=seq(150,200,by=5),include.lowest=T)) |> 
  count(height_cm) |> mutate(prop=n/sum(n)) |> select(-n) |> kbl() |> kable_styling()

height_cm	prop
[150,155]	0.0222222
(155,160]	0.0888889
(160,165]	0.1111111
(165,170]	0.2888889
(170,175]	0.2000000
(175,180]	0.1777778
(180,185]	0.0555556
(185,190]	0.0555556

Code

toy_data |> ggplot(aes(x = height_cm)) +
  geom_histogram(aes(y = after_stat(count / sum(count))),
                  binwidth= 5,
                  boundary = 150,
                  fill = "indianred", color = "black",alpha=.5) +
  # geom_histogram( aes(y=after_stat(density)), binwidth= 5,
  #                 boundary = 150,
  #                 fill = "indianred", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)") + theme_minimal()

visualization for numerical variables

Code

toy_data |> mutate(height_cm=cut(height_cm, 
                                         breaks=seq(150,200,by=5),include.lowest=T)) |> 
  count(height_cm) |> mutate(prop=n/sum(n)) |> select(-n) |> kbl() |> kable_styling()

height_cm	prop
[150,155]	0.0222222
(155,160]	0.0888889
(160,165]	0.1111111
(165,170]	0.2888889
(170,175]	0.2000000
(175,180]	0.1777778
(180,185]	0.0555556
(185,190]	0.0555556

Code

toy_data |> ggplot(aes(x = height_cm)) +
  # geom_histogram(aes(y = after_stat(count / sum(count))),
  #                 binwidth= 5,
  #                 boundary = 150,
  #                 fill = "dodgerblue", color = "black",alpha=.5) +
  geom_histogram( aes(y=after_stat(density)), binwidth= 5,
                  boundary = 150,
                  fill = "dodgerblue", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)") + theme_minimal()

visualization for numerical variables

Code

custom_breaks <- c(150,170,175,195,200)
toy_data |> mutate(height_cm=cut(height_cm, 
                                         breaks=custom_breaks,include.lowest=T)) |> 
  count(height_cm) |> mutate(prop=n/sum(n)) |> select(-n) |> kbl() |> kable_styling()

height_cm	prop
[150,170]	0.5111111
(170,175]	0.2000000
(175,195]	0.2888889

Code

toy_data |> ggplot(aes(x = height_cm)) +
  geom_histogram(aes(y = after_stat(count / sum(count))),
                  breaks=custom_breaks,
                  boundary = 150,
                  fill = "indianred", color = "black",alpha=.5) +
  # geom_histogram( aes(y=after_stat(density)), binwidth= 5,
  #                 boundary = 150,
  #                 fill = "dodgerblue", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)") + theme_minimal()

visualization for numerical variables

Code

toy_data |> mutate(height_cm=cut(height_cm, 
                                         breaks=custom_breaks,include.lowest=T)) |> 
  count(height_cm) |> mutate(prop=n/sum(n)) |> select(-n) |> kbl() |> kable_styling()

height_cm	prop
[150,170]	0.5111111
(170,175]	0.2000000
(175,195]	0.2888889

Code

toy_data |> ggplot(aes(x = height_cm)) +
  geom_histogram(aes(y=after_stat(density)),
                  breaks=custom_breaks,
                  boundary = 150,
                  fill = "dodgerblue", color = "black",alpha=.5) +
  # geom_histogram( aes(y=after_stat(density)), binwidth= 5,
  #                 boundary = 150,
  #                 fill = "dodgerblue", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)") + theme_minimal()

visualization for numerical variables

Code

toy_data |> ggplot(aes(x = height_cm)) +
  geom_histogram(aes(y = after_stat(count / sum(count))),
                  breaks=custom_breaks,
                  boundary = 150,
                  fill = "indianred", color = "black",alpha=.5) +
  # geom_histogram( aes(y=after_stat(density)), binwidth= 5,
  #                 boundary = 150,
  #                 fill = "dodgerblue", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)") + theme_minimal()

the heights of the bars add up to 1

Code

toy_data |> ggplot(aes(x = height_cm)) +
  geom_histogram(aes(y=after_stat(density)),
                  breaks=custom_breaks,
                  boundary = 150,
                  fill = "dodgerblue", color = "black",alpha=.5) +
  # geom_histogram( aes(y=after_stat(density)), binwidth= 5,
  #                 boundary = 150,
  #                 fill = "dodgerblue", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)") + theme_minimal()

the area of the bars adds up to 1

more appropriate when the bins, intervals of values, are of different widths

visualization for numerical variables (integers)

While one can still consider bins and an histogram, since the variable is not continuous, a more appropriate option is the lollipop plot.

Code

toy_data |> 
count(age ) %>%
  ggplot(aes(x = age, y = n)) +
  geom_segment(aes(xend = age, yend = 0), color = "dodgerblue") +
  geom_point(color = "indianred", size = 3) + theme_minimal()

visualization for categorical variables

Barplots are the most common way to visualize categorical variables.

Code

toy_data |> ggplot(aes(x = factor(education))) +
  geom_bar(fill = "dodgerblue") +
  labs(title = "barplot", x = "education", y = "Count")+ theme_minimal()

visualization for categorical variables

pie charts are also popular, but less easily interpretable.

Code

 data_pie <- toy_data |> 
  count(education) %>%
  mutate(perc = (n / sum(n)),
         label = paste0(education, " ", scales::percent(perc)))

# Plot
ggplot(data_pie, aes(x = "", y = perc, fill = education)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  labs(title = "Pie Chart of eduction", fill = "education") +
  geom_text(aes(label = label), position = position_stack(vjust = 0.5)) +
  theme_void()

visualization for categorical variables

similar to piecharts, but more interpretable, is the donut chart.

Code

data_pie |> ggplot(aes(x = 2, y = perc, fill = education)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  xlim(0.5, 2.5) +  # create the hole
  geom_text(aes(label = label), position = position_stack(vjust = 0.5)) +
  labs(title = "Donut Chart of education", fill = "education") +
  theme_void()

visualization for categorical variables

If the factor levels are not ordered as in education, the bars of the barplot can be re-ordered in an ascending or descending order.

Code

toy_data |> ggplot(aes(x = fct_rev(fct_infreq(commute_by)))) +
  geom_bar(fill = "dodgerblue") +
  labs(title = "barplot", x = "commute_by", y = "Count")+ theme_minimal()

Code

toy_data |> ggplot(aes(x = fct_infreq(commute_by))) +
  geom_bar(fill = "dodgerblue") +
  labs(title = "barplot", x = "commute_by", y = "Count")+ theme_minimal()

This is useful for categorical variables with many levels, to put in evidence the most/least frequent ones.

summarise (numerical) data: indexes

four distributions

Suppose to have the distribution of employees heights from four companies.

Code

n=1000
set.seed(123)
library("sn")



four_heigths_data = tibble(
                      `A&co.`= rsn(n = n, xi=140, omega=25,alpha=15),
                      `B&co.`= rnorm(n, 170, 25),
                      `C&co.`= rnorm(n, 170, 9),
                      `D&co.`= rnorm(n, 190, 9)
)

four_heigths_data |> 
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |> 
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram( aes(y=after_stat(density)), binwidth= 5,colour="black",alpha=.5)+
  scale_fill_manual(values = c("indianred","dodgerblue",  "darkorange", "darkgreen")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") + theme_minimal() + facet_grid(company~.)

you can tell they are different from one another, but how?

company `C&co.` and `D&co.`

the two distributions look similar, but for their position

Code

n=1000
set.seed(123)
library("sn")

four_heigths_data |> 
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |> 
  filter(company %in% c("C&co.", "D&co.")) |>
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram( aes(y=after_stat(density)), binwidth= 5,colour="black",alpha=.5)+
  scale_fill_manual(values = c("darkorange", "darkgreen")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") + theme_minimal() + facet_grid(company~.)

company `B&co.` and `C&co.`

the two distributions seem to have

a similar position (centered around 170 cm)
different spread (the first one is more spread out than the second one)

Code

n=1000
set.seed(123)
library("sn")

four_heigths_data |> 
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |> 
  filter(company %in% c("B&co.", "C&co.")) |>
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram( aes(y=after_stat(density)), binwidth= 5,colour="black",alpha=.5)+
  scale_fill_manual(values = c("dodgerblue","darkorange")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") + theme_minimal() + facet_grid(company~.)

company `A&co.` and `D&co.`

the two distributions seem to have

different position
different spread
different shape (the first one is more skewed to the right than the second one)

Code

n=1000
set.seed(123)
library("sn")

four_heigths_data |> 
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |> 
  filter(company %in% c("A&co.", "D&co.")) |>
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram( aes(y=after_stat(density)), binwidth= 5,colour="black",alpha=.5)+
  scale_fill_manual(values = c("indianred", "darkgreen")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") + theme_minimal() + facet_grid(company~.)

Measuring `position`, `spread` and `shape`

position

mean
median and quartiles

spread

variance (standard deviation)
interquartile range (IQR)

shape

skewness

Visualising relationships

Interdependence and dependence

If the components of a bivariate variable \((X,Y)\) play the same role in the analysis, we study their interdependence.

If we are interested in how \(Y\) changes with respect to \(X\), we refer to the dependence of \(Y\) on \(X\).

\(Y\) is often called the dependent variable.
\(X\) is often called the independent variable.

Types of bivariate relationships

Given a pair of variables \((X,Y)\):

if both are categorical, we study association;
if both are continuous, we study correlation;
if one is categorical and one is continuous, we compare distributions across groups.

Two categorical variables

UFO abductions: joint frequency table

A sci-fi example: the relationship between alien species and method of abduction.

Code

set.seed(42)

ufo_data <- tibble(
  alien_species = sample(
    c("Martian", "Venusian", "Zeta Reticulan", "Reptilian"),
    size = 100, replace = TRUE, prob = c(0.4, 0.2, 0.3, 0.1)
  ),
  abduction_method = sample(
    c("Tractor Beam", "Teleportation", "Sleep Gas", "Invisibility Cloak"),
    size = 100, replace = TRUE
  )
)

ufo_table <- ufo_data |>
  count(alien_species, abduction_method) |>
  pivot_wider(names_from = abduction_method, values_from = n, values_fill = 0)

ufo_table |> kbl() |> kable_styling(full_width = FALSE)

alien_species	Invisibility Cloak	Sleep Gas	Teleportation	Tractor Beam
Martian	10	8	9	11
Reptilian	6	5	4	1
Venusian	3	2	8	5
Zeta Reticulan	5	3	10	10

Row profiles

Row profiles answer: within each alien species, how are abduction methods distributed?

Code

ufo_data |>
  count(alien_species, abduction_method) |>
  group_by(alien_species) |>
  mutate(row_profile = n / sum(n)) |>
  select(alien_species, abduction_method, row_profile) |>
  pivot_wider(names_from = abduction_method, values_from = row_profile, values_fill = 0) |>
  kbl(digits = 2) |>
  kable_styling(full_width = FALSE)

alien_species	Invisibility Cloak	Sleep Gas	Teleportation	Tractor Beam
Martian	0.26	0.21	0.24	0.29
Reptilian	0.38	0.31	0.25	0.06
Venusian	0.17	0.11	0.44	0.28
Zeta Reticulan	0.18	0.11	0.36	0.36

Visualising categorical association

Code

ufo_data |>
  ggplot(aes(x = alien_species, fill = abduction_method)) +
  geom_bar(position = "fill") +
  labs(x = "Alien species", y = "Proportion", fill = "Method") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

Residual heatmap

Association can also be visualised by comparing observed cells with what we would expect under independence.

Code

chisq_ufo <- chisq.test(
  xtabs(~ alien_species + abduction_method, data = ufo_data)
)

as.data.frame(as.table(chisq_ufo$stdres)) |>
  rename(residual = Freq) |>
  ggplot(aes(abduction_method, alien_species, fill = residual)) +
  geom_tile() +
  geom_text(aes(label = round(residual, 1))) +
  coord_equal() +
  labs(
    x = "Abduction method",
    y = "Alien species",
    fill = "Std. residual"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

Two continuous variables

Correlation: shark attacks and ice cream

Two continuous variables are correlated when large values of one variable tend to correspond to large or small values of the other.

Code

set.seed(123)

df1 <- tibble(
  ice_cream_sales = seq(100, 1000, length.out = 50),
  shark_attacks = ice_cream_sales * 0.05 + rnorm(50, 0, 10)
)

df1 |>
  ggplot(aes(x = ice_cream_sales, y = shark_attacks)) +
  geom_point(color = "indianred") +
  labs(x = "Ice cream sales", y = "Shark attacks") +
  theme_minimal()

Correlation: building the index

Correlation increases when both variables are above their average or both are below their average.

Code

df1 |>
  mutate(
    sign_index = if_else(
      (ice_cream_sales > mean(ice_cream_sales) & shark_attacks > mean(shark_attacks)) |
        (ice_cream_sales < mean(ice_cream_sales) & shark_attacks < mean(shark_attacks)),
      "+", "-"
    ),
    size_index = abs((ice_cream_sales - mean(ice_cream_sales)) * (shark_attacks - mean(shark_attacks)))
  ) |>
  ggplot(aes(x = ice_cream_sales, y = shark_attacks, color = sign_index)) +
  geom_text(aes(label = sign_index, size = size_index), alpha = 0.8) +
  geom_vline(xintercept = mean(df1$ice_cream_sales), color = "dodgerblue", linewidth = 1.2, alpha = 0.5) +
  geom_hline(yintercept = mean(df1$shark_attacks), color = "forestgreen", linewidth = 1.2, alpha = 0.5) +
  scale_size(range = c(3, 8)) +
  labs(x = "Ice cream sales", y = "Shark attacks") +
  theme_minimal() +
  theme(legend.position = "none")

Pearson correlation

The linear correlation index is:

\[ \rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \]

where \(\sigma_{xy}\) is the covariance and \(\sigma_x\), \(\sigma_y\) are the standard deviations of the two variables.

For the example:

Code

cor(df1$ice_cream_sales, df1$shark_attacks) |> round(2)

[1] 0.82

Correlation is not causation

The ice cream and shark example is deliberately provocative.

spurious correlations

A strong association does not automatically imply that one variable causes the other. A third variable, such as temperature or beach attendance, may explain both.

For some funny examples, see Tyler Vigen’s Spurious Correlations:
https://www.tylervigen.com/spurious-correlations

Negative and weak correlation

Code

df_negative <- tibble(
  doc_hours = seq(0, 10, length.out = 50),
  social_life = 100 - doc_hours * 8 + rnorm(50, 0, 5)
)

df_negative |>
  ggplot(aes(doc_hours, social_life)) +
  geom_point(color = "indianred") +
  labs(
    x = "Alien documentary hours",
    y = "Social life index",
    title = paste("Correlation =", round(cor(df_negative$doc_hours, df_negative$social_life), 2))
  ) +
  theme_minimal()

When correlation fails: nonlinear relationships

Code

set.seed(123)

df_nonlinear <- tibble(
  x = seq(-10, 10, length.out = 100),
  y = x^2 + rnorm(100, 0, 5)
)

df_nonlinear |>
  ggplot(aes(x = x, y = y)) +
  geom_point(color = "indianred", alpha = 0.7) +
  geom_vline(xintercept = mean(df_nonlinear$x), color = "dodgerblue", linewidth = 1.2, alpha = 0.5) +
  geom_hline(yintercept = mean(df_nonlinear$y), color = "forestgreen", linewidth = 1.2, alpha = 0.5) +
  labs(
    title = paste("Linear correlation =", round(cor(df_nonlinear$x, df_nonlinear$y), 2)),
    x = "X",
    y = "Y"
  ) +
  theme_minimal()

One categorical and one continuous variable

Comparing distributions across groups

Code

set.seed(42)

alien_power <- tibble(
  alien_species = rep(c("Martian", "Venusian", "Zeta Reticulan", "Reptilian"), each = 30),
  laser_power = c(
    rnorm(30, mean = 50, sd = 7),
    rnorm(30, mean = 65, sd = 8),
    rnorm(30, mean = 55, sd = 10),
    rnorm(30, mean = 80, sd = 6)
  )
)

alien_power |>
  ggplot(aes(x = alien_species, y = laser_power, fill = alien_species)) +
  geom_boxplot(alpha = 0.6, show.legend = FALSE) +
  labs(x = "Alien species", y = "Preferred laser power") +
  theme_minimal()

More expressive view: violin + boxplot + jitter

Code

alien_power |>
  ggplot(aes(x = alien_species, y = laser_power, fill = alien_species)) +
  geom_violin(alpha = 0.35, show.legend = FALSE) +
  geom_boxplot(width = 0.12, outlier.shape = NA, show.legend = FALSE) +
  geom_jitter(width = 0.08, alpha = 0.4, show.legend = FALSE) +
  labs(x = "Alien species", y = "Preferred laser power") +
  theme_minimal()

Fancier but useful visualisations

Multivariate view: pair plots

Pair plots are useful before clustering or classification because they help us see several relationships at once.

Code

library(palmerpenguins)
library(GGally)

palmerpenguins::penguins |>
  drop_na(species, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |>
  select(species, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |>
  GGally::ggpairs(aes(colour = species, alpha = 0.5))

Note

If GGally is not installed, this slide can be skipped or shown from the rendered instructor version.

Correlation heatmap

Code

library(palmerpenguins)

palmerpenguins::penguins |>
  select(where(is.numeric)) |>
  cor(use = "complete.obs") |>
  as.data.frame() |>
  rownames_to_column("var1") |>
  pivot_longer(-var1, names_to = "var2", values_to = "correlation") |>
  ggplot(aes(var1, var2, fill = correlation)) +
  geom_tile() +
  geom_text(aes(label = round(correlation, 2))) +
  coord_equal() +
  labs(x = NULL, y = NULL, fill = "Correlation") +
  theme_minimal()

Text as data

Why text?

Text is a useful example of non-standard data.

It does not arrive as a clean numerical matrix.
We must decide what the observational units are: documents, paragraphs, sentences, words.
We transform text into variables before analysing it.

In text analysis, the statistical unit may change during the workflow:

at first, each row is a document;
after tokenisation, each row is a word occurrence;
after counting, each row may be a word, a document-word pair, or a field-word pair.

a small text dataset

Code

texts <- tibble::tribble(
  ~doc_id, ~field, ~text,
  1, "engineering", "The structure was tested under dynamic stress and material fatigue.",
  2, "engineering", "Mechanical systems require models of vibration, load and failure.",
  3, "engineering", "Sensors collect signals from machines under stress and vibration.",
  4, "sport", "Training load, recovery and fatigue affect athletic performance.",
  5, "sport", "Injury prevention requires monitoring intensity and recovery.",
  6, "sport", "Athletes improve performance through training, recovery and adaptation.",
  7, "economics", "Innovation and sustainability influence firms, markets and management.",
  8, "economics", "Firms compete in markets through strategy, investment and innovation.",
  9, "economics", "Management decisions affect costs, productivity and sustainability.",
  10, "humanities", "Texts, sources and contexts shape historical interpretation.",
  11, "humanities", "Archives preserve documents, memories and cultural heritage.",
  12, "humanities", "Interpretation depends on language, context and historical sources."
)

texts |> 
  kbl() |> 
  kable_styling(full_width = FALSE)

doc_id	field	text
1	engineering	The structure was tested under dynamic stress and material fatigue.
2	engineering	Mechanical systems require models of vibration, load and failure.
3	engineering	Sensors collect signals from machines under stress and vibration.
4	sport	Training load, recovery and fatigue affect athletic performance.
5	sport	Injury prevention requires monitoring intensity and recovery.
6	sport	Athletes improve performance through training, recovery and adaptation.
7	economics	Innovation and sustainability influence firms, markets and management.
8	economics	Firms compete in markets through strategy, investment and innovation.
9	economics	Management decisions affect costs, productivity and sustainability.
10	humanities	Texts, sources and contexts shape historical interpretation.
11	humanities	Archives preserve documents, memories and cultural heritage.
12	humanities	Interpretation depends on language, context and historical sources.

From documents to words

The first transformation is called tokenisation: we split each text into smaller units.

Code

library(tidytext)

tokens_raw <- texts |>
  tidytext::unnest_tokens(word, text)

tokens_raw |>
  slice_head(n = 15) |>
  kbl() |>
  kable_styling(full_width = FALSE)

doc_id	field	word
1	engineering	the
1	engineering	structure
1	engineering	was
1	engineering	tested
1	engineering	under
1	engineering	dynamic
1	engineering	stress
1	engineering	and
1	engineering	material
1	engineering	fatigue
2	engineering	mechanical
2	engineering	systems
2	engineering	require
2	engineering	models
2	engineering	of

Removing common words

Many frequent words are not informative: the, and, of, to, …

These are usually called stop words.

Code

tokens <- tokens_raw |>
  anti_join(tidytext::stop_words, by = "word")

tokens |>
  count(word, sort = TRUE) |>
  slice_head(n = 15) |>
  kbl() |>
  kable_styling(full_width = FALSE)

word	n
recovery	3
affect	2
fatigue	2
firms	2
historical	2
innovation	2
interpretation	2
load	2
management	2
markets	2
performance	2
sources	2
stress	2
sustainability	2
training	2

Word frequencies

After tokenisation, we can count how often each word appears.

Code

tokens |>
  count(word, sort = TRUE) |>
  slice_max(n, n = 15) |>
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col(fill = "indianred", alpha = 0.7) +
  labs(x = "Frequency", y = NULL) +
  theme_minimal()

Word frequencies by field

Now we compare which words are common in different fields.

Code

tokens |>
  count(field, word, sort = TRUE) |>
  group_by(field) |>
  slice_max(n, n = 5, with_ties = FALSE) |>
  ungroup() |>
  ggplot(aes(n, tidytext::reorder_within(word, n, field))) +
  geom_col(fill = "indianred", alpha = 0.7) +
  facet_wrap(~ field, scales = "free_y") +
  tidytext::scale_y_reordered() +
  labs(x = "Frequency", y = NULL) +
  theme_minimal()

From text to a data matrix

Many statistical methods require a rectangular data table.

A common representation is the document-term matrix:

rows are documents;
columns are words;
values are word frequencies.

Code

document_term_matrix <- tokens |>
  count(doc_id, word) |>
  pivot_wider(
    names_from = word,
    values_from = n,
    values_fill = 0
  )

document_term_matrix |>
  select(1:min(10, ncol(document_term_matrix))) |>
  kbl() |>
  kable_styling(full_width = FALSE)

doc_id	dynamic	fatigue	material	stress	structure	tested	failure	load	mechanical
1	1	1	1	1	1	1	0	0	0
2	0	0	0	0	0	0	1	1	1
3	0	0	0	1	0	0	0	0	0
4	0	1	0	0	0	0	0	1	0
5	0	0	0	0	0	0	0	0	0
6	0	0	0	0	0	0	0	0	0
7	0	0	0	0	0	0	0	0	0
8	0	0	0	0	0	0	0	0	0
9	0	0	0	0	0	0	0	0	0
10	0	0	0	0	0	0	0	0	0
11	0	0	0	0	0	0	0	0	0
12	0	0	0	0	0	0	0	0	0

Term frequency by field

We can also build a field-term matrix.

Code

field_term_matrix <- tokens |>
  count(field, word) |>
  pivot_wider(
    names_from = word,
    values_from = n,
    values_fill = 0
  )

field_term_matrix |>
  select(1:min(12, ncol(field_term_matrix))) |>
  kbl() |>
  kable_styling(full_width = FALSE)

field	affect	compete	costs	decisions	firms	influence	innovation	investment	management	markets	productivity
economics	1	1	1	1	2	1	2	1	2	2	1
engineering	0	0	0	0	0	0	0	0	0	0	0
humanities	0	0	0	0	0	0	0	0	0	0	0
sport	1	0	0	0	0	0	0	0	0	0	0

What makes a word distinctive?

A word may be frequent overall, but not distinctive.

For example, a word that appears in every field is less useful for distinguishing fields.

Code

field_words <- tokens |>
  count(field, word, sort = TRUE)

field_words |>
  bind_tf_idf(word, field, n) |>
  group_by(field) |>
  slice_max(tf_idf, n = 5, with_ties = FALSE) |>
  ungroup() |>
  ggplot(aes(tf_idf, tidytext::reorder_within(word, tf_idf, field))) +
  geom_col(fill = "dodgerblue", alpha = 0.7) +
  facet_wrap(~ field, scales = "free_y") +
  tidytext::scale_y_reordered() +
  labs(x = "tf-idf", y = NULL) +
  theme_minimal()

What did we do?

Starting from raw text, we created variables.

documents → words → word counts → document-term matrix

This is why text can be analysed statistically: after preprocessing, text becomes structured

Optional wordcloud

Code

library(tidytext)
library(wordcloud)

tokens |>
  count(word, sort = TRUE) |>
  filter(n >= 1) |>
  with(
    wordcloud(
      words = word,
      freq = n,
      max.words = 50,
      min.freq = 1,
      random.order = FALSE,
      scale = c(3.5, 0.8)
    )
  )

Note

A word cloud is useful as a quick visual impression, but a bar chart is often easier to read and compare.

doc_id	dynamic	fatigue	material	stress	structure	tested	failure	load	mechanical
1	1	1	1	1	1	1	0	0	0
2	0	0	0	0	0	0	1	1	1
3	0	0	0	1	0	0	0	0	0
4	0	1	0	0	0	0	0	1	0
5	0	0	0	0	0	0	0	0	0
6	0	0	0	0	0	0	0	0	0
7	0	0	0	0	0	0	0	0	0
8	0	0	0	0	0	0	0	0	0
9	0	0	0	0	0	0	0	0	0
10	0	0	0	0	0	0	0	0	0
11	0	0	0	0	0	0	0	0	0
12	0	0	0	0	0	0	0	0	0

doc_id	dynamic	fatigue	material	stress	structure	tested	failure	load	mechanical
1	1	1	1	1	1	1	0	0	0
2	0	0	0	0	0	0	1	1	1
3	0	0	0	1	0	0	0	0	0
4	0	1	0	0	0	0	0	1	0
5	0	0	0	0	0	0	0	0	0
6	0	0	0	0	0	0	0	0	0
7	0	0	0	0	0	0	0	0	0
8	0	0	0	0	0	0	0	0	0
9	0	0	0	0	0	0	0	0	0
10	0	0	0	0	0	0	0	0	0
11	0	0	0	0	0	0	0	0	0
12	0	0	0	0	0	0	0	0	0

Essential Statistical Data Analysis

Part I — Theory and guided demonstration

What is statistics?

Data vs information

A small dataset

Dataset elements

Variable types

Describe and infer

observational vs experimental studies

observational vs experimental studies

random sampling and random assignment

Frequency distributions: categorical variables

Frequency distributions: numerical variables

Informative plots

Non-informative plots: lying with graphics

visualization for numerical variables

visualization for numerical variables

visualization for numerical variables

visualization for numerical variables

visualization for numerical variables

visualization for numerical variables

visualization for numerical variables

visualization for numerical variables (integers)

visualization for categorical variables

visualization for categorical variables

visualization for categorical variables

visualization for categorical variables

summarise (numerical) data: indexes

four distributions

company C&co. and D&co.

company B&co. and C&co.

company A&co. and D&co.

Measuring position, spread and shape

Visualising relationships

Interdependence and dependence

Types of bivariate relationships

Two categorical variables

UFO abductions: joint frequency table

Row profiles

Visualising categorical association

Residual heatmap

Two continuous variables

Correlation: shark attacks and ice cream

Correlation: building the index

Pearson correlation

Correlation is not causation

Negative and weak correlation

When correlation fails: nonlinear relationships

One categorical and one continuous variable

Comparing distributions across groups

More expressive view: violin + boxplot + jitter

Fancier but useful visualisations

Multivariate view: pair plots

Correlation heatmap

Text as data

Why text?

a small text dataset

From documents to words

Removing common words

Word frequencies

Word frequencies by field

From text to a data matrix

Term frequency by field

What makes a word distinctive?

What did we do?

Optional wordcloud

company `C&co.` and `D&co.`

company `B&co.` and `C&co.`

company `A&co.` and `D&co.`

Measuring `position`, `spread` and `shape`

doc_id	dynamic	fatigue	material	stress	structure	tested	failure	load	mechanical
1	1	1	1	1	1	1	0	0	0
2	0	0	0	0	0	0	1	1	1
3	0	0	0	1	0	0	0	0	0
4	0	1	0	0	0	0	0	1	0
5	0	0	0	0	0	0	0	0	0
6	0	0	0	0	0	0	0	0	0
7	0	0	0	0	0	0	0	0	0
8	0	0	0	0	0	0	0	0	0
9	0	0	0	0	0	0	0	0	0
10	0	0	0	0	0	0	0	0	0
11	0	0	0	0	0	0	0	0	0
12	0	0	0	0	0	0	0	0	0