“Statistics is the science of learning from data.” — David Hand
“Statistics is the science which deals with the collection, classification, and tabulation of numerical facts as the basis for explanation, description, and comparison of phenomena.” — L. R. Connor
“Statistics is the branch of scientific method which deals with the data obtained by counting or measuring the properties of populations of natural phenomena.” — Maurice G. Kendall
“Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.” — Sir Ronald A. Fisher (paraphrased)
“Statistics is the grammar of science.” — Karl Pearson
What is Statistics: data vs information
Just because one has data doesn’t mean one has information.
Data: raw facts, numbers, measurements.
Information: Data that has been collected, processed, organized, and structured in a way that provides meaningful insights and understanding about a studied phenomenon.
Descriptive statistics
summarize and describe the main features of a dataset.
frequency tables
synthetic indexes
visualization tools
Inferential statistics
make inferences about the population based on the sample data.
population: e.g. all the employees of the company; all the microchips of a production line
sample: a set of statistical units (employees/microchips) randomly selected from the population
observational vs experimental studies
Suppose a researcher wants to study the effect of the use of blue-screens (smartphone/tablet) before sleeping on the attention span.
observational studies
the researcher observes the statistical units without intervening.
assign the statistical units that use the blue-screens to one group, the remaining to another group.
measure the attention span of the two groups after two weeks.
if the group using blue-screens has a lower attention span, the researcher can conclude that the use of blue-screens is correlated to the attention span.
experimental studies
the researcher manipulates the statistical units to observe the effect of the manipulation.
split the statistical units into two groups via random assignment: one group is asked to use a blue-screen device before sleep, whereas the other group is asked NOT to use the device.
measure the attention span of the two groups after two weeks.
if the group using blue-screens has a lower attention span, the researcher can conclude that the use of blue-screens causes a lower attention span.
random sampling and random assignment
In the setup of a study, two key concepts are desirable: random sampling and random assignment.
random sampling
the statistical units are randomly selected from the population.
this is to ensure that the sample is representative of the population.
the sample is used to make inferences about the population.
it makes it possible to generalize the results of the study to the population
random assignment
the statistical units are randomly assigned to case/control groups.
it makes it possible to detect causality (e.g. between blue-screen use and attention span)
random assignment
random assignment is a key feature of experimental studies, while random sampling ensures the generalizability of the results to the population.
to have both is the ideal setup for a study, but it is not always possible (you cannot select people at random and force them to use or not use blue-screens before sleep!).
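The two concepts can be sketched in base R with `sample()`; the population of 100 employee IDs and the group sizes below are made-up numbers for illustration.

```r
set.seed(42)

# random sampling: draw 10 units from a population of 100 employee IDs
population <- 1:100
sampled <- sample(population, size = 10)

# random assignment: split the sampled units into two groups at random
treatment <- sample(sampled, size = 5)
control <- setdiff(sampled, treatment)

length(treatment)  # 5 units asked to use blue-screens
length(control)    # 5 units asked not to
```

Because both steps use `sample()`, every unit has the same chance of being drawn and of landing in either group.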
summarise data: frequency tables
raw data
this is the data set from before: it only has 8 observations and 6 variables.
Code
toy_data |> kbl()
| age | height (cm) | gender | education   | commute by | commute cost |
|-----|-------------|--------|-------------|------------|--------------|
| 25  | 175.2       | Male   | High School | Car        | 100          |
| 40  | 160.5       | Female | Bachelor's  | Bus        | 50           |
| 30  | 180.0       | Female | Master's    | Bike       | 0            |
| 50  | 170.3       | Female | PhD         | Car        | 150          |
| 22  | 165.8       | Male   | PhD         | Bike       | 0            |
| 35  | 177.4       | Female | Bachelor's  | Car        | 120          |
| 29  | 172.0       | Male   | Master's    | Bus        | 60           |
| 55  | 168.2       | Female | High School | Walk       | 0            |
raw data
What if the data set had 250 observations and 6 variables?
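With many observations, a frequency table condenses the raw data into counts and proportions; a minimal sketch using base R's `table()` on the education column, with the eight values copied from the toy table above:

```r
education <- c("High School", "Bachelor's", "Master's", "PhD",
               "PhD", "Bachelor's", "Master's", "High School")

abs_freq <- table(education)      # absolute frequencies (counts)
rel_freq <- prop.table(abs_freq)  # relative frequencies (proportions)

abs_freq  # counts per level
rel_freq  # proportions, summing to 1
```

With 250 observations the table would stay the same size: one row per level, however many rows the raw data has.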
the cumulative distribution function \(F(x)\) returns the proportion of observed values that are less than or equal to \(x\).
note: \(F(x)\) reaches 1 at the largest observed value; therefore, if one is interested in the proportion of values greater than \(x\), one can simply compute \(1 - F(x)\).
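The empirical cumulative distribution function is available in base R as `ecdf()`; a quick check using the eight ages from the toy table:

```r
ages <- c(25, 40, 30, 50, 22, 35, 29, 55)
Fx <- ecdf(ages)  # step function: proportion of values <= x

Fx(30)      # proportion of ages <= 30: 4 out of 8 = 0.5
1 - Fx(30)  # proportion of ages > 30
```

`ecdf()` returns a function, so `Fx` can be evaluated at any value of interest, not just observed ones.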
it uses a non-zero baseline for the y-axis, which exaggerates the difference between the two companies.
no tick marks are reported on the y-axis. That is cheating!
visualization for numerical variables
Code
big_toy_data |>
  ggplot(aes(x = `height (cm)`)) +
  geom_histogram(binwidth = 5, boundary = 150,
                 fill = "indianred", color = "black", alpha = .5) +
  labs(title = "Histogram of height", x = "height (cm)", y = "Frequency") +
  theme_minimal()
Barplots are the most common way to visualize categorical variables.
Code
big_toy_data |>
  ggplot(aes(x = factor(education))) +
  geom_bar(fill = "dodgerblue") +
  labs(title = "barplot", x = "education", y = "Count") +
  theme_minimal()
visualization for categorical variables
pie charts are also popular, but less easily interpretable.
Code
data_pie <- big_toy_data |>
  count(education) %>%
  mutate(perc = (n / sum(n)),
         label = paste0(education, " ", scales::percent(perc)))

# Plot
ggplot(data_pie, aes(x = "", y = perc, fill = education)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  labs(title = "Pie Chart of education", fill = "education") +
  geom_text(aes(label = label), position = position_stack(vjust = 0.5)) +
  theme_void()
visualization for categorical variables
similar to pie charts, but more easily interpretable, is the donut chart.
Code
data_pie |>
  ggplot(aes(x = 2, y = perc, fill = education)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  xlim(0.5, 2.5) +  # create the hole
  geom_text(aes(label = label), position = position_stack(vjust = 0.5)) +
  labs(title = "Donut Chart of education", fill = "education") +
  theme_void()
visualization for categorical variables
If the factor levels have no natural order (unlike education), the bars of the barplot can be re-ordered in ascending or descending order.
Code
big_toy_data |>
  ggplot(aes(x = fct_rev(fct_infreq(`commute by`)))) +
  geom_bar(fill = "dodgerblue") +
  labs(title = "barplot", x = "commute_by", y = "Count") +
  theme_minimal()
Code
big_toy_data |>
  ggplot(aes(x = fct_infreq(`commute by`))) +
  geom_bar(fill = "dodgerblue") +
  labs(title = "barplot", x = "commute_by", y = "Count") +
  theme_minimal()
This is useful for categorical variables with many levels, to highlight the most/least frequent ones.
summarise (numerical) data: indexes
four distributions
Suppose we have the distributions of employees' heights from four companies.
Code
n = 1000
set.seed(123)
library("sn")
four_heigths_data = tibble(
  `A&co.` = rsn(n = n, xi = 140, omega = 25, alpha = 15),
  `B&co.` = rnorm(n, 170, 25),
  `C&co.` = rnorm(n, 170, 9),
  `D&co.` = rnorm(n, 190, 9)
)
four_heigths_data |>
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |>
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5,
                 colour = "black", alpha = .5) +
  scale_fill_manual(values = c("indianred", "dodgerblue", "darkorange", "darkgreen")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") +
  theme_minimal() +
  facet_grid(company ~ .)
you can tell they are different from one another, but how?
company C&co. and D&co.
the two distributions look similar, except for their position
Code
n = 1000
set.seed(123)
library("sn")
four_heigths_data |>
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |>
  filter(company %in% c("C&co.", "D&co.")) |>
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5,
                 colour = "black", alpha = .5) +
  scale_fill_manual(values = c("darkorange", "darkgreen")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") +
  theme_minimal() +
  facet_grid(company ~ .)
company B&co. and C&co.
the two distributions seem to have
a similar position (centered around 170 cm)
different spread (the first one is more spread out than the second one)
Code
n = 1000
set.seed(123)
library("sn")
four_heigths_data |>
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |>
  filter(company %in% c("B&co.", "C&co.")) |>
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5,
                 colour = "black", alpha = .5) +
  scale_fill_manual(values = c("dodgerblue", "darkorange")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") +
  theme_minimal() +
  facet_grid(company ~ .)
company A&co. and D&co.
the two distributions seem to have
different position
different spread
different shape (the first one is more skewed to the right than the second one)
Code
n = 1000
set.seed(123)
library("sn")
four_heigths_data |>
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |>
  filter(company %in% c("A&co.", "D&co.")) |>
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5,
                 colour = "black", alpha = .5) +
  scale_fill_manual(values = c("indianred", "darkgreen")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") +
  theme_minimal() +
  facet_grid(company ~ .)
Measuring position: the mean
The mean is defined as:
\[
\bar{x}=\frac{1}{n}\sum_{i=1}^{n} x_{i}
\] where \(x_i\) is the \(i\)-th observation and \(n\) is the number of observations.
In case of absolute frequencies of the values (counts), we can compute the mean as:
\[
\bar{x}=\frac{\sum_{j=1}^{k} x_{j}n_{j}}{\sum_{j=1}^{k} n_{j}}=\frac{1}{n}\sum_{j=1}^{k} x_{j}n_{j}
\] where \(x_j\) is the \(j\)-th value, \(n_j\) is the number of observations with value \(x_j\), and \(k\) is the number of different values.
In case of relative frequencies of the values (proportions), we can compute the mean as:
\[
\bar{x}=\sum_{j=1}^{k} x_{j}\frac{n_{j}}{n}=\sum_{j=1}^{k} x_{j}f_{j}
\] where \(f_j\) is the relative frequency of the value \(x_j\).
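The three formulas give the same result; a quick check in R, using a small made-up set of distinct values `xj` and counts `nj`:

```r
# distinct values and their absolute frequencies (made-up numbers)
xj <- c(0, 1, 2)
nj <- c(3, 5, 2)
n <- sum(nj)

x_raw <- rep(xj, times = nj)  # the raw data implied by the counts

mean_raw    <- mean(x_raw)        # plain mean of the raw data
mean_counts <- sum(xj * nj) / n   # weighted by absolute frequencies
mean_props  <- sum(xj * (nj / n)) # weighted by relative frequencies

c(mean_raw, mean_counts, mean_props)  # all equal: 0.9
```

The frequency-weighted versions avoid expanding the counts back into raw data, which matters when \(n\) is large but \(k\) is small.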
four_heigths_data |>
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |>
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5,
                 colour = "black", alpha = .5) +
  scale_fill_manual(values = c("indianred", "dodgerblue", "darkorange", "darkgreen")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") +
  theme_minimal() +
  facet_grid(company ~ .)
Code
four_heigths_summary |>
  select(company, mean) |>  # median, q1, q3, var, sd, IQR, skewness
  kbl() |>
  row_spec(1, color = "indianred") |>
  row_spec(2, color = "dodgerblue") |>
  row_spec(3, color = "darkorange") |>
  row_spec(4, color = "darkgreen") |>
  kable_styling(full_width = F)
| company | mean     |
|---------|----------|
| A&co.   | 159.3382 |
| B&co.   | 169.4972 |
| C&co.   | 169.9176 |
| D&co.   | 189.7105 |
Measuring position: the median (a.k.a. second quartile Q2)
The median is the value that separates the higher half from the lower half of a data sample (halves refer to the sorted values): \(50\%\) of the values are below the median and \(50\%\) of the values are above it.
to identify the middle position of the sorted values distribution, one can use the following formula:
\[
\left(\frac{n}{2},\ \frac{n}{2}+1\right) \text{ if } n \text{ is even}, \qquad \frac{n+1}{2} \text{ if } n \text{ is odd.}
\]
in the example, \(n=8\) is even, so the median is the average of the 4-th and 5-th values in the sorted list: \(\frac{30+35}{2}=32.5\).
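The worked example uses the eight ages from the toy table; base R's `median()` applies the same even-\(n\) rule:

```r
ages <- c(25, 40, 30, 50, 22, 35, 29, 55)
sorted <- sort(ages)  # 22 25 29 30 35 40 50 55

# n = 8 is even: average the 4th and 5th sorted values
n <- length(ages)
by_hand <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2

by_hand       # 32.5
median(ages)  # 32.5
```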
Measuring position: the median
Code
four_heigths_data |>
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |>
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5,
                 colour = "black", alpha = .5) +
  scale_fill_manual(values = c("indianred", "dodgerblue", "darkorange", "darkgreen")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") +
  theme_minimal() +
  facet_grid(company ~ .)
Code
four_heigths_summary |>
  select(company, median) |>  # median, q1, q3, var, sd, IQR, skewness
  kbl() |>
  row_spec(1, color = "indianred") |>
  row_spec(2, color = "dodgerblue") |>
  row_spec(3, color = "darkorange") |>
  row_spec(4, color = "darkgreen") |>
  kable_styling(full_width = F)
| company | median   |
|---------|----------|
| A&co.   | 156.1525 |
| B&co.   | 168.7356 |
| C&co.   | 169.9263 |
| D&co.   | 189.7048 |
Measuring spread: the variance
The variance is a measure of how far a set of numbers is spread out from its average value.
The variance is defined as:
\[
s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
\] it is essentially the same formula as the mean, with the values \(x_{i}\) replaced by the squared deviations from the mean \((x_{i} - \bar{x})^{2}\).
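Applying the \(1/n\) formula to the eight ages from the toy table; note that base R's `var()` instead divides by \(n-1\) (the sample variance), so the two differ slightly:

```r
x <- c(25, 40, 30, 50, 22, 35, 29, 55)  # ages from the toy table
n <- length(x)

pop_var <- sum((x - mean(x))^2) / n  # the 1/n formula above
pop_var                              # 121.9375

var(x)  # built-in: divides by n - 1, so it is slightly larger
```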