“Statistics is the science of learning from data.” — David Hand
“Statistics is the science which deals with the collection, classification, and tabulation of numerical facts as the basis for explanation, description, and comparison of phenomena.” — L. R. Connor
“Statistics is the branch of scientific method which deals with the data obtained by counting or measuring the properties of populations of natural phenomena.” — Maurice G. Kendall
“Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.” — Sir Ronald A. Fisher (paraphrased)
“Statistics is the grammar of science.” — Karl Pearson
What is Statistics: data vs information
Just because one has data doesn’t mean one has information.
Data: raw facts, numbers, measurements.
Information: Data that has been collected, processed, organized, and structured in a way that provides meaningful insights and understanding about a studied phenomenon.
Descriptive statistics
summarize and describe the main features of a dataset.
frequency tables
synthetic indexes
visualization tools
Inferential statistics
make inferences about the population based on the sample data.
population: e.g. all the employees of the company; all the microchips of a production line
sample: a set of statistical units (employees/microchips) randomly selected from the population
observational vs experimental studies
Suppose a researcher wants to study the effect of the use of blue-screens (smartphone/tablet) before sleeping on the attention span.
observational studies
the researcher observes the statistical units without intervening.
assign the statistical units that use the blue-screens to one group, the remaining to another group.
measure the attention span of the two groups after two weeks.
if the group using blue-screens has a lower attention span, the researcher can conclude that the use of blue-screens is correlated to the attention span.
experimental studies
the researcher manipulates the statistical units to observe the effect of the manipulation.
split the statistical units into two groups via random assignment: one group is asked to use a blue-screen device before sleep, whereas the other group is asked NOT to use the device.
measure the attention span of the two groups after two weeks.
if the group using blue-screens has a lower attention span, the researcher can conclude that the use of blue-screens causes a lower attention span.
random sampling and random assignment
In the setup of a study, two key concepts are desirable: random sampling and random assignment.
random sampling
the statistical units are randomly selected from the population.
this is to ensure that the sample is representative of the population.
the sample is used to make inferences about the population.
it makes it possible to generalize the results of the study to the population
random assignment
the statistical units are randomly assigned to case/control groups.
it makes it possible to detect causality (e.g. between blue-screen use and attention span)
random assignment
random assignment is a key feature of experimental studies, while random sampling ensures the generalizability of the results to the population.
to have both is the ideal setup for a study, but it is not always possible (you cannot select people at random and force them to use or not use blue-screens before sleep!).
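The two concepts can be sketched in base R with `sample()`; the population of 100 employee IDs and the group sizes below are made-up numbers for illustration.

```r
set.seed(42)

# random sampling: draw 10 units from a population of 100 employee IDs
population <- 1:100
sampled <- sample(population, size = 10)

# random assignment: split the sampled units into two groups at random
treatment <- sample(sampled, size = 5)
control <- setdiff(sampled, treatment)

length(treatment)  # 5 units asked to use blue-screens
length(control)    # 5 units asked not to
```

Because both steps use `sample()`, every unit has the same chance of being drawn and of landing in either group.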
summarise data: frequency tables
raw data
this is the data set from before: it only has 8 observations and 6 variables.
Code
toy_data |> kbl()
| age | height (cm) | gender | education   | commute by | commute cost |
|-----|-------------|--------|-------------|------------|--------------|
| 25  | 175.2       | Male   | High School | Car        | 100          |
| 40  | 160.5       | Female | Bachelor's  | Bus        | 50           |
| 30  | 180.0       | Female | Master's    | Bike       | 0            |
| 50  | 170.3       | Female | PhD         | Car        | 150          |
| 22  | 165.8       | Male   | PhD         | Bike       | 0            |
| 35  | 177.4       | Female | Bachelor's  | Car        | 120          |
| 29  | 172.0       | Male   | Master's    | Bus        | 60           |
| 55  | 168.2       | Female | High School | Walk       | 0            |
raw data
What if the data set had 250 observations and 6 variables?
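With many observations, a frequency table condenses the raw data into counts and proportions; a minimal sketch using base R's `table()` on the education column, with the eight values copied from the toy table above:

```r
education <- c("High School", "Bachelor's", "Master's", "PhD",
               "PhD", "Bachelor's", "Master's", "High School")

abs_freq <- table(education)      # absolute frequencies (counts)
rel_freq <- prop.table(abs_freq)  # relative frequencies (proportions)

abs_freq  # counts per level
rel_freq  # proportions, summing to 1
```

With 250 observations the table would stay the same size: one row per level, however many rows the raw data has.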
the cumulative distribution function \(F(x)\) returns the proportion of observed values that are less than or equal to \(x\).
note: \(F(x)\) reaches 1 at the largest observed value; therefore, if one is interested in the proportion of values greater than \(x\), one can simply compute \(1 - F(x)\).
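The empirical cumulative distribution function is available in base R as `ecdf()`; a quick check using the eight ages from the toy table:

```r
ages <- c(25, 40, 30, 50, 22, 35, 29, 55)
Fx <- ecdf(ages)  # step function: proportion of values <= x

Fx(30)      # proportion of ages <= 30: 4 out of 8 = 0.5
1 - Fx(30)  # proportion of ages > 30
```

`ecdf()` returns a function, so `Fx` can be evaluated at any value of interest, not just observed ones.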
it uses a non-zero baseline for the y-axis, which exaggerates the difference between the two companies.
no tick marks are reported on the y-axis. That is cheating!
visualization for numerical variables
Code
big_toy_data |>
  ggplot(aes(x = `height (cm)`)) +
  geom_histogram(binwidth = 5, boundary = 150,
                 fill = "indianred", color = "black", alpha = .5) +
  labs(title = "Histogram of height", x = "height (cm)", y = "Frequency") +
  theme_minimal()
Barplots are the most common way to visualize categorical variables.
Code
big_toy_data |>
  ggplot(aes(x = factor(education))) +
  geom_bar(fill = "dodgerblue") +
  labs(title = "barplot", x = "education", y = "Count") +
  theme_minimal()
visualization for categorical variables
pie charts are also popular, but less easily interpretable.
Code
data_pie <- big_toy_data |>
  count(education) %>%
  mutate(perc = (n / sum(n)),
         label = paste0(education, " ", scales::percent(perc)))

# Plot
ggplot(data_pie, aes(x = "", y = perc, fill = education)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  labs(title = "Pie Chart of education", fill = "education") +
  geom_text(aes(label = label), position = position_stack(vjust = 0.5)) +
  theme_void()
visualization for categorical variables
similar to pie charts, but more easily interpretable, is the donut chart.
Code
data_pie |>
  ggplot(aes(x = 2, y = perc, fill = education)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  xlim(0.5, 2.5) +  # create the hole
  geom_text(aes(label = label), position = position_stack(vjust = 0.5)) +
  labs(title = "Donut Chart of education", fill = "education") +
  theme_void()
visualization for categorical variables
If the factor levels have no natural order (unlike education), the bars of the barplot can be re-ordered in ascending or descending order.
Code
big_toy_data |>
  ggplot(aes(x = fct_rev(fct_infreq(`commute by`)))) +
  geom_bar(fill = "dodgerblue") +
  labs(title = "barplot", x = "commute_by", y = "Count") +
  theme_minimal()
Code
big_toy_data |>
  ggplot(aes(x = fct_infreq(`commute by`))) +
  geom_bar(fill = "dodgerblue") +
  labs(title = "barplot", x = "commute_by", y = "Count") +
  theme_minimal()
This is useful for categorical variables with many levels, to highlight the most/least frequent ones.
summarise (numerical) data: indexes
four distributions
Suppose we have the distributions of employees' heights from four companies.
Code
n = 1000
set.seed(123)
library("sn")
four_heigths_data = tibble(
  `A&co.` = rsn(n = n, xi = 140, omega = 25, alpha = 15),
  `B&co.` = rnorm(n, 170, 25),
  `C&co.` = rnorm(n, 170, 9),
  `D&co.` = rnorm(n, 190, 9)
)
four_heigths_data |>
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |>
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5,
                 colour = "black", alpha = .5) +
  scale_fill_manual(values = c("indianred", "dodgerblue", "darkorange", "darkgreen")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") +
  theme_minimal() +
  facet_grid(company ~ .)
you can tell they are different from one another, but how?
company C&co. and D&co.
the two distributions look similar, except for their position
Code
n = 1000
set.seed(123)
library("sn")
four_heigths_data |>
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |>
  filter(company %in% c("C&co.", "D&co.")) |>
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5,
                 colour = "black", alpha = .5) +
  scale_fill_manual(values = c("darkorange", "darkgreen")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") +
  theme_minimal() +
  facet_grid(company ~ .)
company B&co. and C&co.
the two distributions seem to have
a similar position (centered around 170 cm)
different spread (the first one is more spread out than the second one)
Code
n = 1000
set.seed(123)
library("sn")
four_heigths_data |>
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |>
  filter(company %in% c("B&co.", "C&co.")) |>
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5,
                 colour = "black", alpha = .5) +
  scale_fill_manual(values = c("dodgerblue", "darkorange")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") +
  theme_minimal() +
  facet_grid(company ~ .)
company A&co. and D&co.
the two distributions seem to have
different position
different spread
different shape (the first one is more skewed to the right than the second one)
Code
n = 1000
set.seed(123)
library("sn")
four_heigths_data |>
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |>
  filter(company %in% c("A&co.", "D&co.")) |>
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5,
                 colour = "black", alpha = .5) +
  scale_fill_manual(values = c("indianred", "darkgreen")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") +
  theme_minimal() +
  facet_grid(company ~ .)
Measuring position: the mean
The mean is defined as:
\[
\bar{x}=\frac{1}{n}\sum_{i=1}^{n} x_{i}
\] where \(x_i\) is the \(i\)-th observation and \(n\) is the number of observations.
In case of absolute frequencies of the values (counts), we can compute the mean as:
\[
\bar{x}=\frac{\sum_{j=1}^{k} x_{j}n_{j}}{\sum_{j=1}^{k} n_{j}}=\frac{1}{n}\sum_{j=1}^{k} x_{j}n_{j}
\] where \(x_j\) is the \(j\)-th value, \(n_j\) is the number of observations with value \(x_j\), and \(k\) is the number of different values.
In case of relative frequencies of the values (proportions), we can compute the mean as:
\[
\bar{x}=\sum_{j=1}^{k} x_{j}\frac{n_{j}}{n}=\sum_{j=1}^{k} x_{j}f_{j}
\] where \(f_j\) is the relative frequency of the value \(x_j\).
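The three formulas give the same result; a quick check in R, using a small made-up set of distinct values `xj` and counts `nj`:

```r
# distinct values and their absolute frequencies (made-up numbers)
xj <- c(0, 1, 2)
nj <- c(3, 5, 2)
n <- sum(nj)

x_raw <- rep(xj, times = nj)  # the raw data implied by the counts

mean_raw    <- mean(x_raw)        # plain mean of the raw data
mean_counts <- sum(xj * nj) / n   # weighted by absolute frequencies
mean_props  <- sum(xj * (nj / n)) # weighted by relative frequencies

c(mean_raw, mean_counts, mean_props)  # all equal: 0.9
```

The frequency-weighted versions avoid expanding the counts back into raw data, which matters when \(n\) is large but \(k\) is small.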
four_heigths_data |>
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |>
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5,
                 colour = "black", alpha = .5) +
  scale_fill_manual(values = c("indianred", "dodgerblue", "darkorange", "darkgreen")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") +
  theme_minimal() +
  facet_grid(company ~ .)
Code
four_heigths_summary |>
  select(company, mean) |>  # median, q1, q3, var, sd, IQR, skewness
  kbl() |>
  row_spec(1, color = "indianred") |>
  row_spec(2, color = "dodgerblue") |>
  row_spec(3, color = "darkorange") |>
  row_spec(4, color = "darkgreen") |>
  kable_styling(full_width = F)
| company | mean     |
|---------|----------|
| A&co.   | 159.3382 |
| B&co.   | 169.4972 |
| C&co.   | 169.9176 |
| D&co.   | 189.7105 |
Measuring position: the median (a.k.a. second quartile Q2)
The median is the value that separates the higher half from the lower half of a data sample (halves refer to the sorted values): \(50\%\) of the values are below the median and \(50\%\) of the values are above it.
to identify the middle position of the sorted values distribution, one can use the following formula:
\[
\left(\frac{n}{2},\ \frac{n}{2}+1\right) \text{ if } n \text{ is even}, \qquad \frac{n+1}{2} \text{ if } n \text{ is odd.}
\]
in the example, \(n=8\) is even, so the median is the average of the 4-th and 5-th values in the sorted list: \(\frac{30+35}{2}=32.5\).
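The worked example uses the eight ages from the toy table; base R's `median()` applies the same even-\(n\) rule:

```r
ages <- c(25, 40, 30, 50, 22, 35, 29, 55)
sorted <- sort(ages)  # 22 25 29 30 35 40 50 55

# n = 8 is even: average the 4th and 5th sorted values
n <- length(ages)
by_hand <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2

by_hand       # 32.5
median(ages)  # 32.5
```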
Measuring position: the median
Code
four_heigths_data |>
  pivot_longer(cols = everything(), names_to = "company", values_to = "height") |>
  ggplot(aes(x = height, fill = company)) +
  geom_density(alpha = 0.5) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5,
                 colour = "black", alpha = .5) +
  scale_fill_manual(values = c("indianred", "dodgerblue", "darkorange", "darkgreen")) +
  labs(title = "Density of heights", x = "height (cm)", y = "Density") +
  theme_minimal() +
  facet_grid(company ~ .)
Code
four_heigths_summary |>
  select(company, median) |>  # median, q1, q3, var, sd, IQR, skewness
  kbl() |>
  row_spec(1, color = "indianred") |>
  row_spec(2, color = "dodgerblue") |>
  row_spec(3, color = "darkorange") |>
  row_spec(4, color = "darkgreen") |>
  kable_styling(full_width = F)
| company | median   |
|---------|----------|
| A&co.   | 156.1525 |
| B&co.   | 168.7356 |
| C&co.   | 169.9263 |
| D&co.   | 189.7048 |
Measuring spread: the variance
The variance is a measure of how far a set of numbers is spread out from its average value.
The variance is defined as:
\[
s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
\] it is essentially the same formula as the mean, with the values \(x_{i}\) replaced by the squared deviations from the mean \((x_{i} - \bar{x})^{2}\).
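Applying the \(1/n\) formula to the eight ages from the toy table; note that base R's `var()` instead divides by \(n-1\) (the sample variance), so the two differ slightly:

```r
x <- c(25, 40, 30, 50, 22, 35, 29, 55)  # ages from the toy table
n <- length(x)

pop_var <- sum((x - mean(x))^2) / n  # the 1/n formula above
pop_var                              # 121.9375

var(x)  # built-in: divides by n - 1, so it is slightly larger
```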