What is Statistics?

“Statistics is the science of learning from data.” (David Hand)

“Statistics is the science which deals with the collection, classification, and tabulation of numerical facts as the basis for explanation, description, and comparison of phenomena.” (L. R. Connor)

“Statistics is the branch of scientific method which deals with the data obtained by counting or measuring the properties of populations of natural phenomena.” (Maurice G. Kendall)

“Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.” (Sir Ronald A. Fisher, paraphrased)

“Statistics is the grammar of science.” (Karl Pearson)

What is Statistics: data vs information

Just because one has data, doesn’t mean one has information.

  • Data: raw facts, numbers, measurements.

  • Information: data that has been collected, processed, organized, and structured in a way that provides meaningful insights and understanding about the phenomenon under study.

here’s a dataset

Code
library(tidyverse)
library(kableExtra)

# Create the dataset
toy_data <- tibble(
  `age` = c(25, 40, 30, 50, 22, 35, 29, 55),
  `height (cm)` = c(175.2, 160.5, 180.0, 170.3, 165.8, 177.4, 172.0, 168.2),
  `gender` = factor(
    c("Male", "Female", "Female", "Female", "Male", "Female", "Male", "Female")
  ),
  `education` = factor(
    c("High School", "Bachelor's", "Master's", "PhD", 
      "PhD", "Bachelor's", "Master's", "High School"),
    levels = c("High School", "Bachelor's", "Master's", "PhD"),
    ordered = TRUE
  ),
  `commute by` = c("Car", "Bus", "Bike", "Car", "Bike", "Car", "Bus", "Walk"),
  `commute cost` = c(100, 50, 0, 150, 0, 120, 60, 0)
) 

toy_data |> kbl()
age height (cm) gender education commute by commute cost
25 175.2 Male High School Car 100
40 160.5 Female Bachelor's Bus 50
30 180.0 Female Master's Bike 0
50 170.3 Female PhD Car 150
22 165.8 Male PhD Bike 0
35 177.4 Female Bachelor's Car 120
29 172.0 Male Master's Bus 60
55 168.2 Female High School Walk 0

Let’s name names…of the data set elements

Observation: a single unit of data, e.g., a person, a day, a country.

  • in this example, the employees of a company; three of them are highlighted in the table
Code
toy_data |> kbl() |> 
  row_spec(1,background = "#D9FDEC") |> 
  row_spec(3,background = "#CAF6FC") |> 
  row_spec(8,background = "#FDB5BA")
age height (cm) gender education commute by commute cost
25 175.2 Male High School Car 100
40 160.5 Female Bachelor's Bus 50
30 180.0 Female Master's Bike 0
50 170.3 Female PhD Car 150
22 165.8 Male PhD Bike 0
35 177.4 Female Bachelor's Car 120
29 172.0 Male Master's Bus 60
55 168.2 Female High School Walk 0

Let’s name names…of the data set elements

Observation: a single unit of data, e.g., a person, a day, a country.

  • in this example, the employees of a company; three of them are highlighted in the table

Variable (attribute) : a characteristic of an observation.

  • the observed values can be numeric, or categorical
Code
toy_data |> kbl() |> 
  column_spec(2,background = "#D9FDEC") |> 
  column_spec(5,background = "#CAF6FC") 
age height (cm) gender education commute by commute cost
25 175.2 Male High School Car 100
40 160.5 Female Bachelor's Bus 50
30 180.0 Female Master's Bike 0
50 170.3 Female PhD Car 150
22 165.8 Male PhD Bike 0
35 177.4 Female Bachelor's Car 120
29 172.0 Male Master's Bus 60
55 168.2 Female High School Walk 0

Let’s name names…of the data set elements

discrete variables: the values are integers (counts).

  • age in completed years

continuous variables: the values can take any value within an interval (real numbers).

  • the measured height of a person is rounded according to the precision of the measuring tool.
Code
toy_data |> kbl() |> 
  column_spec(2,background = "#D9FDEC") |> 
  column_spec(1,background = "#CAF6FC") 
age height (cm) gender education commute by commute cost
25 175.2 Male High School Car 100
40 160.5 Female Bachelor's Bus 50
30 180.0 Female Master's Bike 0
50 170.3 Female PhD Car 150
22 165.8 Male PhD Bike 0
35 177.4 Female Bachelor's Car 120
29 172.0 Male Master's Bus 60
55 168.2 Female High School Walk 0

Let’s name names…of the data set elements

factors: each value belongs to a finite set of categories (levels).

  • Male/Female and Car/Bus/Bike/Walk

ordered factors: the categories are inherently ordered.

  • High School < Bachelor’s < Master’s < PhD
Code
toy_data |> kbl() |> 
  column_spec(3,background = "#D9FDEC") |> 
  column_spec(4,background = "#CAF6FC") |> 
  column_spec(5,background = "#D9FDEC")
age height (cm) gender education commute by commute cost
25 175.2 Male High School Car 100
40 160.5 Female Bachelor's Bus 50
30 180.0 Female Master's Bike 0
50 170.3 Female PhD Car 150
22 165.8 Male PhD Bike 0
35 177.4 Female Bachelor's Car 120
29 172.0 Male Master's Bus 60
55 168.2 Female High School Walk 0

statistics: describe and infer

Descriptive statistics

summarize and describe the main features of a dataset.

  • frequency tables

  • summary indexes

  • visualization tools

Inferential statistics

make inferences about the population based on the sample data.

  • population: e.g. all the employees of the company; all the microchips of a production line

  • sample: a set of statistical units (employees/microchips) randomly selected from the population

observational vs experimental studies

Suppose a researcher wants to study the effect of the use of blue-screens (smartphone/tablet) before sleeping on the attention span.

observational studies

the researcher observes the statistical units without intervening.

  • assign the statistical units that use the blue-screens to one group, the remaining to another group.

  • measure the attention span of the two groups after two weeks.

  • if the group using blue-screens has a lower attention span, the researcher can conclude that the use of blue-screens is correlated to the attention span.

experimental studies

the researcher manipulates the statistical units to observe the effect of the manipulation.

  • split the statistical units into two groups via random assignment: one group is asked to use a blue-screen device before sleep, whereas the other group is asked NOT to use the device.

  • measure the attention span of the two groups after two weeks.

  • if the group using blue-screens has a lower attention span, the researcher can conclude that the use of blue-screens causes a lower attention span.

random sampling and random assignment

In the setup of a study, two key properties are desirable: random sampling and random assignment.

random sampling

the statistical units are randomly selected from the population.

  • this is to ensure that the sample is representative of the population.

  • the sample is used to make inferences about the population.

it makes it possible to generalize the results of the study to the population
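
A minimal sketch of simple random sampling in base R (the population of unit IDs is hypothetical):

```r
set.seed(1)                            # for reproducibility
population <- 1:10000                  # hypothetical population of unit IDs
srs <- sample(population, size = 100)  # simple random sample, without replacement

length(srs)          # 100 sampled units
anyDuplicated(srs)   # 0: no unit is selected twice
```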

random assignment

the statistical units are randomly assigned to case/control groups.

it makes it possible to detect causality (e.g. between blue-screen use and attention span)

random assignment

  • random assignment is a key feature of experimental studies; random sampling grants the generalizability of the results to the population.

  • having both is the ideal setup for a study, but it is not always possible (you cannot select people at random and force them to use, or not use, blue-screens before sleep!)
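
Random assignment itself is straightforward; a minimal sketch in base R, using hypothetical unit IDs:

```r
set.seed(42)                    # for reproducibility
units <- paste0("unit_", 1:20)  # hypothetical statistical units
# randomly pick half of the units for the treatment (blue-screen) group
treatment <- sample(units, size = length(units) / 2)
control   <- setdiff(units, treatment)

length(treatment)  # 10 units per group
length(control)    # 10
```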

summarise data: frequency tables

raw data

this is the data set from before: it only has 8 observations and 6 variables.

Code
toy_data |> kbl()
age height (cm) gender education commute by commute cost
25 175.2 Male High School Car 100
40 160.5 Female Bachelor's Bus 50
30 180.0 Female Master's Bike 0
50 170.3 Female PhD Car 150
22 165.8 Male PhD Bike 0
35 177.4 Female Bachelor's Car 120
29 172.0 Male Master's Bus 60
55 168.2 Female High School Walk 0

raw data

What if the data set had 250 observations and 6 variables?

Code
n=250
set.seed(123)

raw_age <- rgamma(n, 2, 10)
raw_h <- rgamma(n, 15, 30)


big_toy_data = tibble(age = round(20 + (raw_age - min(raw_age)) * (80 - 20) / (max(raw_age) - min(raw_age)),0),
                      `height (cm)`=round(150 + (raw_h - min(raw_h)) * (200 - 150) / (max(raw_h) - min(raw_h)),2),
                      education=sample(c("high school", "bachelor", "master", "phd"), size=n,prob=c(.05,.5,.3,.1), replace=TRUE),
                      `commute by`=sample(c("car", "bus", "bike", "walk"), size=n,prob=c(.2,.4,.3,.1), replace=TRUE),
                      `commute cost`=sample(seq(10,100,by=1), size=n, replace=TRUE)) |> 
  mutate(`commute cost`=ifelse(`commute by`=="walk"|`commute by`=="bike", 0, `commute cost`))


big_toy_data |> kbl() |> kable_styling() |> scroll_box(height = "400px") 
age height (cm) education commute by commute cost
28 200.00 bachelor bike 0
50 180.90 phd bike 0
21 177.69 bachelor bike 0
35 167.47 phd car 74
59 160.41 bachelor car 98
39 172.63 bachelor bus 90
23 165.45 phd bus 94
21 174.57 bachelor bus 45
50 174.42 bachelor car 26
37 163.60 high school bus 48
38 176.55 master car 68
34 181.56 bachelor car 78
28 181.34 bachelor walk 0
51 166.77 bachelor bike 0
44 176.07 master walk 0
33 166.98 master car 50
28 176.57 bachelor car 40
31 171.87 bachelor bus 44
24 167.61 bachelor bus 82
32 173.77 bachelor bus 58
21 172.52 bachelor car 47
23 161.97 bachelor bus 91
23 166.99 bachelor bus 26
30 162.69 bachelor bus 86
30 192.28 master bike 0
32 172.25 master bus 62
44 159.02 high school car 25
42 184.82 bachelor bus 24
40 163.36 high school car 100
32 174.92 master car 31
37 177.51 phd bus 17
23 175.31 master bus 74
31 179.13 bachelor bike 0
31 171.63 bachelor car 95
50 178.84 master bike 0
24 180.07 master walk 0
38 178.45 bachelor bus 99
25 174.48 master bike 0
32 176.65 bachelor bus 40
29 161.68 phd bike 0
33 175.49 master bike 0
36 173.96 bachelor bus 41
24 185.64 phd car 62
21 166.37 bachelor bus 32
24 178.61 bachelor bike 0
36 178.28 bachelor bike 0
38 171.28 high school bike 0
28 160.60 phd bike 0
68 158.49 bachelor bus 34
22 175.63 phd bus 47
37 163.99 bachelor bike 0
39 173.90 master walk 0
34 164.41 master bus 13
45 183.62 bachelor bus 90
66 155.02 bachelor bus 79
28 175.98 master bike 0
25 187.18 phd bike 0
26 163.70 master car 36
41 170.92 bachelor walk 0
33 166.00 bachelor bike 0
23 176.98 master bus 30
26 181.50 bachelor bus 81
33 196.49 bachelor car 26
38 173.77 master car 87
29 174.77 master car 33
26 173.03 bachelor walk 0
37 185.96 bachelor bike 0
48 192.81 bachelor bike 0
38 170.06 bachelor bus 97
30 158.45 master car 88
36 177.78 master bus 100
28 169.21 master bike 0
36 180.52 bachelor walk 0
27 165.46 master car 55
51 190.70 master car 36
29 173.38 bachelor bus 21
56 184.05 master bike 0
31 193.61 master walk 0
35 170.19 bachelor bike 0
60 164.15 high school bus 14
34 170.80 master bus 86
49 163.95 master bus 92
28 169.79 bachelor car 35
33 159.81 bachelor bus 29
39 172.96 phd walk 0
40 175.76 phd bike 0
31 186.26 bachelor bike 0
28 173.91 bachelor walk 0
39 179.74 bachelor bus 94
30 180.41 master car 58
40 181.83 bachelor bus 59
37 167.60 master bike 0
34 153.52 bachelor bus 82
27 171.79 bachelor bus 20
46 175.86 master car 74
44 179.58 bachelor bus 84
38 163.86 high school bike 0
37 161.21 phd bike 0
41 170.44 master bus 87
62 176.25 master walk 0
41 178.11 bachelor bike 0
26 167.37 master bus 97
34 183.26 master walk 0
25 172.25 high school car 19
48 160.07 bachelor bike 0
50 169.04 master bus 61
58 156.05 bachelor car 80
35 176.23 master walk 0
29 167.51 master car 53
49 174.82 bachelor bus 20
22 181.49 bachelor walk 0
41 174.71 high school bike 0
32 167.24 master bus 80
42 173.79 bachelor bus 12
30 163.04 master bus 68
30 163.72 high school car 55
48 171.55 bachelor car 91
33 171.76 master walk 0
40 195.55 bachelor bus 58
29 185.26 phd walk 0
26 174.08 high school car 39
47 176.70 bachelor car 79
24 167.74 bachelor bike 0
43 164.51 master walk 0
40 160.74 master bike 0
38 163.24 master car 46
31 180.64 bachelor bike 0
26 184.48 bachelor bike 0
67 176.20 master walk 0
26 167.31 bachelor bike 0
61 169.39 high school bus 64
24 168.52 phd bus 71
38 167.32 bachelor bike 0
29 181.32 bachelor bike 0
25 181.30 bachelor walk 0
30 162.45 bachelor car 41
36 179.58 bachelor bus 10
48 175.79 bachelor car 43
34 165.34 bachelor walk 0
43 165.84 master bike 0
28 173.46 bachelor bike 0
42 188.55 master bus 14
43 176.74 bachelor bike 0
53 182.09 master bus 90
28 170.24 bachelor bike 0
64 170.95 bachelor bus 36
41 174.40 bachelor bus 71
23 165.13 master bike 0
23 163.33 bachelor bike 0
69 174.39 master bus 90
52 174.31 master bus 72
30 181.82 phd car 51
22 174.38 bachelor bike 0
28 179.43 bachelor bus 73
33 167.74 bachelor bus 66
58 182.31 bachelor walk 0
33 160.46 bachelor bus 60
38 170.57 phd bike 0
29 168.73 bachelor car 82
28 175.15 bachelor bus 59
48 187.03 master bus 82
35 169.78 bachelor bike 0
80 173.01 master car 68
37 169.87 master bus 67
33 166.52 bachelor bus 95
28 177.58 bachelor bike 0
24 163.72 bachelor walk 0
42 170.48 master bike 0
29 165.83 high school bike 0
29 170.38 high school bus 27
26 151.36 bachelor car 80
42 183.88 bachelor bus 100
23 163.21 bachelor bike 0
32 177.63 bachelor bike 0
27 162.79 bachelor bus 78
28 179.88 master walk 0
45 165.20 bachelor bike 0
42 192.26 bachelor car 51
32 183.51 high school bike 0
26 172.51 master bus 37
45 197.26 bachelor bike 0
39 164.01 master car 41
23 170.83 bachelor bike 0
31 172.95 bachelor bike 0
22 181.58 bachelor bike 0
22 181.14 bachelor bike 0
21 189.41 phd bike 0
29 173.29 bachelor bike 0
65 184.92 phd bus 88
57 158.33 bachelor walk 0
32 183.52 bachelor bike 0
30 189.70 master bus 65
32 158.55 bachelor car 57
45 188.68 bachelor bike 0
30 160.78 master car 12
34 171.38 bachelor bus 32
51 174.45 bachelor car 84
71 165.06 bachelor car 70
56 173.02 bachelor car 71
32 150.00 master car 76
41 160.98 bachelor bike 0
46 174.62 bachelor walk 0
27 175.55 bachelor bus 100
46 163.62 master bus 100
42 175.72 master bus 58
22 187.40 master bike 0
30 174.34 bachelor bus 37
35 174.06 bachelor walk 0
34 175.76 bachelor car 86
38 170.26 master bus 88
33 175.34 phd bike 0
21 174.25 master bike 0
25 178.89 master car 60
30 169.17 bachelor bike 0
50 173.39 bachelor bus 86
52 183.02 master bike 0
31 168.12 bachelor bus 84
31 182.62 master car 66
33 156.96 bachelor bus 55
47 182.52 master walk 0
49 174.07 bachelor car 69
28 163.27 bachelor bike 0
36 165.45 master bus 64
42 166.39 master bike 0
37 180.31 bachelor bike 0
48 173.90 phd walk 0
41 181.29 bachelor bus 53
26 186.83 master bike 0
45 193.50 bachelor walk 0
36 172.50 master bike 0
50 176.23 master bike 0
22 168.54 bachelor bus 58
22 171.33 high school car 53
42 179.28 high school bike 0
33 173.59 bachelor bus 46
48 171.75 master bike 0
61 176.58 bachelor car 86
25 167.81 bachelor bike 0
34 175.11 bachelor bike 0
41 171.42 bachelor bus 80
30 165.24 bachelor car 25
77 172.51 bachelor bus 92
31 182.53 master bus 65
49 166.30 bachelor bus 35
42 174.82 phd walk 0
37 164.70 master bike 0
20 170.57 bachelor bus 19
38 181.97 master walk 0
25 190.11 phd bus 19
71 175.48 master car 47

it becomes harder to grasp the distribution of the data, or any information about it.

frequency distribution: categorical case

  • the table displays the number of times each category is observed, and the proportion of times each category is observed
Code
big_toy_data |> count(`commute by`) |> mutate(prop = n/sum(n)) |>  kbl() |> kable_styling()
commute by n prop
bike 80 0.320
bus 86 0.344
car 53 0.212
walk 31 0.124
Code
big_toy_data |> count(education) |> mutate(prop = n/sum(n)) |>   kbl() |> kable_styling()
education n prop
bachelor 133 0.532
high school 16 0.064
master 79 0.316
phd 22 0.088

frequency distribution: numerical case

counting the number of times a value is observed is not very informative for numerical variables.

  • think of height: it is unlikely that two observations have the exact same height (e.g. 180.3)

  • possible solution: divide the range of values into non-overlapping, adjacent intervals and count the observations in each interval.

Code
big_toy_data |> mutate(`height (cm)`=cut(`height (cm)`, breaks=seq(150,200,by=10),include.lowest=T)) |> count(`height (cm)`) |> mutate(prop = n/sum(n)) |>  kbl() |> kable_styling()
height (cm) n prop
[150,160] 12 0.048
(160,170] 72 0.288
(170,180] 110 0.440
(180,190] 45 0.180
(190,200] 11 0.044
Code
big_toy_data |> mutate(age=cut(age, breaks=c(seq(20,80,by=12)),include.lowest=T)) |> count(age) |> mutate(prop = n/sum(n)) |>  kbl() |> kable_styling()
age n prop
[20,32] 110 0.440
(32,44] 87 0.348
(44,56] 35 0.140
(56,68] 13 0.052
(68,80] 5 0.020

frequency distribution: numerical case

  • divide the range of values into non-overlapping, adjacent intervals and count the observations in each interval.

  • one may find it easier to refer to percentages rather than proportions.

Code
big_toy_data |> mutate(`height (cm)`=cut(`height (cm)`, breaks=seq(150,200,by=10),include.lowest=T)) |> count(`height (cm)`) |> mutate(
  prop = n/sum(n),
  perc=round(prop*100,2)) |>  kbl() |> kable_styling()
height (cm) n prop perc
[150,160] 12 0.048 4.8
(160,170] 72 0.288 28.8
(170,180] 110 0.440 44.0
(180,190] 45 0.180 18.0
(190,200] 11 0.044 4.4
Code
big_toy_data |> mutate(age=cut(age, breaks=c(seq(20,80,by=12)),include.lowest=T)) |> count(age) |> 
  mutate(
  prop = n/sum(n),
  perc=round(prop*100,2)) |>  kbl() |> kable_styling()
age n prop perc
[20,32] 110 0.440 44.0
(32,44] 87 0.348 34.8
(44,56] 35 0.140 14.0
(56,68] 13 0.052 5.2
(68,80] 5 0.020 2.0

frequency distribution: numerical case

  • proportions or percentages in the table indicate the relative frequency of each interval.

  • but what if I want to know, e.g:

    • what percentage of people is older than 32?

    • what percentage of people is less than 180 cm tall?

Code
big_toy_data |> mutate(`height (cm)`=cut(`height (cm)`, breaks=seq(150,200,by=10),include.lowest=T)) |> count(`height (cm)`) |> mutate(
  prop = n/sum(n),
  perc=round(prop*100,2),
  `F(X)`=cumsum(prop)) |>  kbl() |> kable_styling()
height (cm) n prop perc F(X)
[150,160] 12 0.048 4.8 0.048
(160,170] 72 0.288 28.8 0.336
(170,180] 110 0.440 44.0 0.776
(180,190] 45 0.180 18.0 0.956
(190,200] 11 0.044 4.4 1.000
Code
big_toy_data |> mutate(age=cut(age, breaks=c(seq(20,80,by=12)),include.lowest=T)) |> count(age) |> 
  mutate(
  prop = n/sum(n),
  perc=round(prop*100,2),
  `F(X)` =cumsum(prop)) |>  kbl() |> kable_styling()
age n prop perc F(X)
[20,32] 110 0.440 44.0 0.440
(32,44] 87 0.348 34.8 0.788
(44,56] 35 0.140 14.0 0.928
(56,68] 13 0.052 5.2 0.980
(68,80] 5 0.020 2.0 1.000

the cumulative distribution function \(F(x)\) returns the proportion of observed values that are less than or equal to \(x\).

  • note: it reaches 1 in the last interval; therefore, if one is interested in the proportion of values greater than \(x\), one can simply compute \(1-F(x)\).
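
A sketch in base R, using the interval proportions from the age table above:

```r
prop <- c(0.440, 0.348, 0.140, 0.052, 0.020)  # intervals [20,32], (32,44], ...
Fx <- cumsum(prop)                            # cumulative distribution

Fx[1]      # F(32) = 0.44: proportion aged 32 or less
1 - Fx[1]  # 0.56: proportion older than 32
```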

frequency distribution: numerical case

Code
big_toy_data |> mutate(`height (cm)`=cut(`height (cm)`, breaks=seq(150,200,by=10),include.lowest=T)) |> count(`height (cm)`) |> mutate(
  prop = n/sum(n),
  perc=round(prop*100,2),
  `F(X)`=cumsum(prop)) |>  kbl() |> kable_styling()
height (cm) n prop perc F(X)
[150,160] 12 0.048 4.8 0.048
(160,170] 72 0.288 28.8 0.336
(170,180] 110 0.440 44.0 0.776
(180,190] 45 0.180 18.0 0.956
(190,200] 11 0.044 4.4 1.000
Code
big_toy_data |> mutate(age=cut(age, breaks=c(seq(20,80,by=12)),include.lowest=T)) |> count(age) |> 
  mutate(
  prop = n/sum(n),
  perc=round(prop*100,2),
  `F(X)` =cumsum(prop)) |>  kbl() |> kable_styling()
age n prop perc F(X)
[20,32] 110 0.440 44.0 0.440
(32,44] 87 0.348 34.8 0.788
(44,56] 35 0.140 14.0 0.928
(56,68] 13 0.052 5.2 0.980
(68,80] 5 0.020 2.0 1.000

The previous questions are now easy to answer:

  • what percentage of people is less than 180 cm tall? \(F(180)=.776\)

  • what percentage of people is older than 32? \(1-F(32)=1-.44=.56\)

summarise data: visualization tools

informative plots

An effective data visualization has to be

  • appropriate: the visualization depends on the nature of the variable(s) at hand, and should be selected accordingly.

  • clear: the visualization should be clear and easy to understand, and the reader should be able to grasp the information at a glance.

  • simple: the visualization should be simple and not cluttered with unnecessary details.

  • accurate: the visualization should accurately represent the data and not mislead the reader.

non-informative plots: lie with graphics

Suppose you have two competing companies, A and B, that are in the stock market:

Code
tibble(company = c("company A","company B"),
       revenue = c(800, 1000)) |>
  ggplot(aes(x=company, y=revenue,fill=company)) +geom_col()+
  coord_cartesian(ylim = c(750, 1000)) + theme_minimal() +
  theme(legend.position = "none",
        axis.text.y = element_blank(),
        axis.ticks.x = element_blank()) 

it appears company B is five times more profitable than company A!

Code
tibble(company = c("company A","company B"),
       revenue = c(800, 1000)) |>
  ggplot(aes(x=company, y=revenue,fill=company)) +geom_col()+
  theme_minimal() +
  theme(legend.position = "none")

wait, what?

same data, different visualization

The first plot is misleading

  • it uses a non-zero baseline for the y-axis, which exaggerates the difference between the two companies.

  • the y-axis labels are removed, so the reader cannot read the actual values. That is cheating!


visualization for numerical variables

Code
big_toy_data |> ggplot(aes(x = `height (cm)`)) +
    geom_histogram( binwidth= 5,
                  boundary = 150,
                  fill = "indianred", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)", y = "Frequency") + theme_minimal() 

visualization for numerical variables

Code
big_toy_data |> mutate(`height (cm)`=cut(`height (cm)`, 
                                         breaks=seq(150,200,by=5),include.lowest=T)) |> 
  count(`height (cm)`) |> kbl() |> kable_styling()
height (cm) n
[150,155] 3
(155,160] 9
(160,165] 32
(165,170] 40
(170,175] 66
(175,180] 44
(180,185] 33
(185,190] 12
(190,195] 7
(195,200] 4
Code
big_toy_data |> ggplot(aes(x = `height (cm)`)) +
  geom_histogram( binwidth= 5,
                  boundary = 150,
                  fill = "indianred", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)") + theme_minimal() 

visualization for numerical variables

Code
big_toy_data |> mutate(`height (cm)`=cut(`height (cm)`, 
                                         breaks=seq(150,200,by=5),include.lowest=T)) |> 
  count(`height (cm)`) |> mutate(prop=n/sum(n)) |> select(-n) |> kbl() |> kable_styling()
height (cm) prop
[150,155] 0.012
(155,160] 0.036
(160,165] 0.128
(165,170] 0.160
(170,175] 0.264
(175,180] 0.176
(180,185] 0.132
(185,190] 0.048
(190,195] 0.028
(195,200] 0.016
Code
big_toy_data |> ggplot(aes(x = `height (cm)`)) +
  geom_histogram(aes(y = after_stat(count / sum(count))),
                  binwidth= 5,
                  boundary = 150,
                  fill = "indianred", color = "black",alpha=.5) +
  # geom_histogram( aes(y=after_stat(density)), binwidth= 5,
  #                 boundary = 150,
  #                 fill = "indianred", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)") + theme_minimal() 

visualization for numerical variables

Code
big_toy_data |> mutate(`height (cm)`=cut(`height (cm)`, 
                                         breaks=seq(150,200,by=5),include.lowest=T)) |> 
  count(`height (cm)`) |> mutate(prop=n/sum(n)) |> select(-n) |> kbl() |> kable_styling()
height (cm) prop
[150,155] 0.012
(155,160] 0.036
(160,165] 0.128
(165,170] 0.160
(170,175] 0.264
(175,180] 0.176
(180,185] 0.132
(185,190] 0.048
(190,195] 0.028
(195,200] 0.016
Code
big_toy_data |> ggplot(aes(x = `height (cm)`)) +
  # geom_histogram(aes(y = after_stat(count / sum(count))),
  #                 binwidth= 5,
  #                 boundary = 150,
  #                 fill = "dodgerblue", color = "black",alpha=.5) +
  geom_histogram( aes(y=after_stat(density)), binwidth= 5,
                  boundary = 150,
                  fill = "dodgerblue", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)") + theme_minimal() 

visualization for numerical variables

Code
custom_breaks <- c(150,170,175,195,200)
big_toy_data |> mutate(`height (cm)`=cut(`height (cm)`, 
                                         breaks=custom_breaks,include.lowest=T)) |> 
  count(`height (cm)`) |> mutate(prop=n/sum(n)) |> select(-n) |> kbl() |> kable_styling()
height (cm) prop
[150,170] 0.336
(170,175] 0.264
(175,195] 0.384
(195,200] 0.016
Code
big_toy_data |> ggplot(aes(x = `height (cm)`)) +
  geom_histogram(aes(y = after_stat(count / sum(count))),
                  breaks=custom_breaks,
                  boundary = 150,
                  fill = "indianred", color = "black",alpha=.5) +
  # geom_histogram( aes(y=after_stat(density)), binwidth= 5,
  #                 boundary = 150,
  #                 fill = "dodgerblue", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)") + theme_minimal() 

visualization for numerical variables

Code
big_toy_data |> mutate(`height (cm)`=cut(`height (cm)`, 
                                         breaks=custom_breaks,include.lowest=T)) |> 
  count(`height (cm)`) |> mutate(prop=n/sum(n)) |> select(-n) |> kbl() |> kable_styling()
height (cm) prop
[150,170] 0.336
(170,175] 0.264
(175,195] 0.384
(195,200] 0.016
Code
big_toy_data |> ggplot(aes(x = `height (cm)`)) +
  geom_histogram(aes(y=after_stat(density)),
                  breaks=custom_breaks,
                  boundary = 150,
                  fill = "dodgerblue", color = "black",alpha=.5) +
  # geom_histogram( aes(y=after_stat(density)), binwidth= 5,
  #                 boundary = 150,
  #                 fill = "dodgerblue", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)") + theme_minimal() 

visualization for numerical variables

Code
big_toy_data |> ggplot(aes(x = `height (cm)`)) +
  geom_histogram(aes(y = after_stat(count / sum(count))),
                  breaks=custom_breaks,
                  boundary = 150,
                  fill = "indianred", color = "black",alpha=.5) +
  # geom_histogram( aes(y=after_stat(density)), binwidth= 5,
  #                 boundary = 150,
  #                 fill = "dodgerblue", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)") + theme_minimal() 

the heights of the bars add up to 1

Code
big_toy_data |> ggplot(aes(x = `height (cm)`)) +
  geom_histogram(aes(y=after_stat(density)),
                  breaks=custom_breaks,
                  boundary = 150,
                  fill = "dodgerblue", color = "black",alpha=.5) +
  # geom_histogram( aes(y=after_stat(density)), binwidth= 5,
  #                 boundary = 150,
  #                 fill = "dodgerblue", color = "black",alpha=.5) +
  labs(title = "Histogram of height", x = "height (cm)") + theme_minimal() 

the area of the bars adds up to 1

  • more appropriate when the bins, i.e. the intervals of values, are of different widths
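
One can check the area property directly: on the density scale, each bar's area is its width times its height, and the areas sum to 1. A sketch with the unequal custom breaks used above:

```r
breaks <- c(150, 170, 175, 195, 200)     # custom, unequal-width bins
prop   <- c(0.336, 0.264, 0.384, 0.016)  # proportions from the table
width  <- diff(breaks)                   # bin widths: 20, 5, 20, 5
dens   <- prop / width                   # bar heights on the density scale

sum(dens * width)  # the areas add up to 1
```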

visualization for numerical variables (integers)

While one can still bin the values and draw a histogram, since the variable is not continuous a more appropriate option is the lollipop plot.

Code
big_toy_data |> 
  count(age) |> 
  ggplot(aes(x = age, y = n)) +
  geom_segment(aes(xend = age, yend = 0), color = "dodgerblue") +
  geom_point(color = "indianred", size = 3) + theme_minimal() 

visualization for categorical variables

Barplots are the most common way to visualize categorical variables.

Code
big_toy_data |> ggplot(aes(x = factor(education))) +
  geom_bar(fill = "dodgerblue") +
  labs(title = "barplot", x = "education", y = "Count")+ theme_minimal()

visualization for categorical variables

pie charts are also popular, but less easily interpretable.

Code
 data_pie <- big_toy_data |> 
  count(education) %>%
  mutate(perc = (n / sum(n)),
         label = paste0(education, " ", scales::percent(perc)))

# Plot
ggplot(data_pie, aes(x = "", y = perc, fill = education)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  labs(title = "Pie Chart of education", fill = "education") +
  geom_text(aes(label = label), position = position_stack(vjust = 0.5)) +
  theme_void()

visualization for categorical variables

similar to pie charts, but more interpretable, is the donut chart.

Code
data_pie |> ggplot(aes(x = 2, y = perc, fill = education)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  xlim(0.5, 2.5) +  # create the hole
  geom_text(aes(label = label), position = position_stack(vjust = 0.5)) +
  labs(title = "Donut Chart of education", fill = "education") +
  theme_void()

visualization for categorical variables

If the factor levels have no inherent order (unlike education), the bars of the barplot can be re-ordered in ascending or descending order of frequency.

Code
big_toy_data |> ggplot(aes(x = fct_rev(fct_infreq(`commute by`)))) +
  geom_bar(fill = "dodgerblue") +
  labs(title = "barplot", x = "commute_by", y = "Count")+ theme_minimal()

Code
big_toy_data |> ggplot(aes(x = fct_infreq(`commute by`))) +
  geom_bar(fill = "dodgerblue") +
  labs(title = "barplot", x = "commute_by", y = "Count")+ theme_minimal()

This is useful for categorical variables with many levels, to highlight the most/least frequent ones.

summarise (numerical) data: indexes

four distributions

Suppose you have the distributions of employee heights from four companies.

Code
n=1000
set.seed(123)
library("sn")



four_heigths_data = tibble(
                      `A&co.`= rsn(n = n, xi=140, omega=25,alpha=15),
                      `B&co.`= rnorm(n, 170, 25),
                      `C&co.`= rnorm(n, 170, 9),
                      `D&co.`= rnorm(n, 190, 9)
)

four_heigths_data |> 
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |> 
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram( aes(y=after_stat(density)), binwidth= 5,colour="black",alpha=.5)+
  scale_fill_manual(values = c("indianred","dodgerblue",  "darkorange", "darkgreen")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") + theme_minimal() + facet_grid(company~.)

you can tell they are different from one another, but how?

company C&co. and D&co.

the two distributions look similar, but for their position

Code
n=1000
set.seed(123)
library("sn")

four_heigths_data |> 
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |> 
  filter(company %in% c("C&co.", "D&co.")) |>
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram( aes(y=after_stat(density)), binwidth= 5,colour="black",alpha=.5)+
  scale_fill_manual(values = c("darkorange", "darkgreen")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") + theme_minimal() + facet_grid(company~.)

company B&co. and C&co.

the two distributions seem to have

  • a similar position (centered around 170 cm)

  • different spread (the first one is more spread out than the second one)

Code

four_heigths_data |> 
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |> 
  filter(company %in% c("B&co.", "C&co.")) |>
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram( aes(y=after_stat(density)), binwidth= 5,colour="black",alpha=.5)+
  scale_fill_manual(values = c("dodgerblue","darkorange")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") + theme_minimal() + facet_grid(company~.)

company A&co. and D&co.

the two distributions seem to have

  • different position

  • different spread

  • different shape (the first one is skewed to the right, while the second one is roughly symmetric)

Code

four_heigths_data |> 
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |> 
  filter(company %in% c("A&co.", "D&co.")) |>
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram( aes(y=after_stat(density)), binwidth= 5,colour="black",alpha=.5)+
  scale_fill_manual(values = c("indianred", "darkgreen")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") + theme_minimal() + facet_grid(company~.)

Measuring position, spread and shape

position

  • mean

  • median and quartiles

spread

  • variance (standard deviation)

  • interquartile range (IQR)

shape

  • skewness
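A rough way to quantify shape (and the one used in the summary code later) is the median-based skew \((\bar{x} - \text{median})/s\): it is positive for right-skewed data, where the mean is pulled above the median, and near zero for symmetric data. A small sketch with simulated data:

```r
set.seed(1)
right_skewed <- rexp(1000)   # exponential: long right tail
symmetric    <- rnorm(1000)  # normal: symmetric

skew <- function(x) (mean(x) - median(x)) / sd(x)
skew(right_skewed)           # positive
skew(symmetric)              # close to zero
```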

Measuring position: the mean

In case of raw data, the mean is computed as:

\[ \bar{x} = \frac{x_{1}+x_{2}+ \ldots + x_{n}}{n}= \frac{1}{n} \sum_{i=1}^{n} x_i \]

where \(x_i\) is the \(i\)-th observation and \(n\) is the number of observations.

In case of absolute frequencies of the values (counts), we can compute the mean as:

\[ \bar{x}=\frac{\sum_{j=1}^{k} x_{j}n_{j}}{\sum_{j=1}^{k} n_{j}}=\frac{1}{n}\sum_{j=1}^{k} x_{j}n_{j} \] where \(x_j\) is the \(j\)-th value, \(n_j\) is the number of observations with value \(x_j\), and \(k\) is the number of different values.

In case of relative frequencies of the values (proportions), we can compute the mean as:

\[ \bar{x}=\sum_{j=1}^{k} x_{j}\frac{n_{j}}{n}=\sum_{j=1}^{k} x_{j}f_{j} \] where \(f_j\) is the relative frequency of the value \(x_j\).
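A quick sanity check, using a small made-up vector, that the three formulas give the same result:

```r
x <- c(2, 2, 3, 3, 3, 5)     # n = 6 raw observations

mean(x)                      # raw-data formula: (2+2+3+3+3+5)/6 = 3

tab  <- table(x)             # absolute frequencies n_j: 2 -> 2, 3 -> 3, 5 -> 1
vals <- as.numeric(names(tab))
sum(vals * tab) / sum(tab)   # counts formula: (2*2 + 3*3 + 5*1)/6 = 3

f <- tab / sum(tab)          # relative frequencies f_j
sum(vals * f)                # proportions formula: 3
```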

Measuring position: the mean

Code
four_heigths_summary = four_heigths_data |> 
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |>
  group_by(company) |>
  summarise(
    mean = mean(height),
    median = median(height),
    q1 = quantile(height, 0.25),
    q3 = quantile(height, 0.75),
    var = var(height),
    sd = sd(height),
    IQR = IQR(height),
    skewness = (mean(height)-median(height))/sd(height)  # median-based skew: (mean - median)/sd
  ) 
Code
four_heigths_data |> 
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |> 
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram( aes(y=after_stat(density)), binwidth= 5,colour="black",alpha=.5)+
  scale_fill_manual(values = c("indianred","dodgerblue",  "darkorange", "darkgreen")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") + theme_minimal() + facet_grid(company~.)

Code
four_heigths_summary |> 
  select(company, mean) |> #, median, q1, q3, var, sd, IQR, skewness) |>
  kbl() |> 
  row_spec(1, color = "indianred") |>
  row_spec(2, color = "dodgerblue") |>
  row_spec(3, color = "darkorange") |>
  row_spec(4, color = "darkgreen") |>
  kable_styling(full_width = F) 
company mean
A&co. 159.3382
B&co. 169.4972
C&co. 169.9176
D&co. 189.7105

Measuring position: the median (a.k.a. second quartile Q2)

The median is the value that separates the higher half from the lower half of a data sample, where higher/lower halves refer to the sorted values: \(50\%\) of the values are below the median and \(50\%\) are above it.

Code
toy_data |> select(age) |> mutate(`sorted age`=sort(age)) |> kbl()
age sorted age
25 22
40 25
30 29
50 30
22 35
35 40
29 50
55 55

To identify the middle position(s) of the sorted values, one can use the following rule:

\((\frac{n}{2}, \frac{n}{2}+1) \text{ if } n \text{ is even, and } (\frac{n+1}{2}) \text{ if } n \text{ is odd}\).

  • in the example, \(n=8\) is even, so the median is the average of the 4-th and 5-th values in the sorted list: \(\frac{30+35}{2}=32.5\).
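The same computation on the ages from the toy data, done by hand and then checked against R's built-in `median()`:

```r
ages <- sort(c(25, 40, 30, 50, 22, 35, 29, 55))  # sorted ages
n    <- length(ages)                             # n = 8, even

(ages[n / 2] + ages[n / 2 + 1]) / 2              # average of 4th and 5th values: 32.5
median(ages)                                     # 32.5, agrees
```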

Measuring position: the median

Code
four_heigths_data |> 
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |> 
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram( aes(y=after_stat(density)), binwidth= 5,colour="black",alpha=.5)+
  scale_fill_manual(values = c("indianred","dodgerblue",  "darkorange", "darkgreen")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") + theme_minimal() + facet_grid(company~.)

Code
four_heigths_summary |> 
  select(company, median) |> #, median, q1, q3, var, sd, IQR, skewness) |>
  kbl() |> 
  row_spec(1, color = "indianred") |>
  row_spec(2, color = "dodgerblue") |>
  row_spec(3, color = "darkorange") |>
  row_spec(4, color = "darkgreen") |>
  kable_styling(full_width = F) 
company median
A&co. 156.1525
B&co. 168.7356
C&co. 169.9263
D&co. 189.7048

Measuring spread: the variance

The variance is a measure of how far a set of numbers is spread out from its average value.

The variance is defined as:

\[ s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \] It is essentially the same formula as the mean: simply replace the values \(x_{i}\) with the squared deviations from the mean, \((x_{i} - \bar{x})^{2}\).

for counts, the formula is

\[ s^2 = \frac{1}{n} \sum_{j=1}^{k} n_{j}(x_{j} - \bar{x})^2 \]

for proportions, the formula is \[ s^2 = \sum_{j=1}^{k} f_{j}(x_{j} - \bar{x})^2 \]

just like for the mean!
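One caveat: the formula above divides by \(n\), while R's built-in `var()` (used in the summary table that follows) divides by \(n-1\), giving the sample variance. For large \(n\), such as the 1000 heights per company here, the difference is negligible. A quick check with a made-up vector:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)           # n = 8, mean = 5

sum((x - mean(x))^2) / length(x)         # divide by n:   32/8 = 4
sum((x - mean(x))^2) / (length(x) - 1)   # divide by n-1: 32/7
var(x)                                   # same as the n-1 version
```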

Measuring spread: the variance

Code
four_heigths_data |> 
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |> 
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram( aes(y=after_stat(density)), binwidth= 5,colour="black",alpha=.5)+
  scale_fill_manual(values = c("indianred","dodgerblue",  "darkorange", "darkgreen")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") + theme_minimal() + facet_grid(company~.)

Code
four_heigths_summary |> 
  select(company, var) |> #, median, q1, q3, var, sd, IQR, skewness) |>
  kbl() |> 
  row_spec(1, color = "indianred") |>
  row_spec(2, color = "dodgerblue") |>
  row_spec(3, color = "darkorange") |>
  row_spec(4, color = "darkgreen") |>
  kable_styling(full_width = F) 
company var
A&co. 224.13807
B&co. 598.23959
C&co. 79.76571
D&co. 81.11201

Measuring spread: the standard deviation

Code
four_heigths_data |> 
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |> 
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram( aes(y=after_stat(density)), binwidth= 5,colour="black",alpha=.5)+
  scale_fill_manual(values = c("indianred","dodgerblue",  "darkorange", "darkgreen")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") + theme_minimal() + facet_grid(company~.)

Code
four_heigths_summary |> 
  select(company, sd) |> #, median, q1, q3, var, sd, IQR, skewness) |>
  kbl() |> 
  row_spec(1, color = "indianred") |>
  row_spec(2, color = "dodgerblue") |>
  row_spec(3, color = "darkorange") |>
  row_spec(4, color = "darkgreen") |>
  kable_styling(full_width = F) 
company sd
A&co. 14.971241
B&co. 24.458937
C&co. 8.931165
D&co. 9.006221