library(tidyverse)
library(tidymodels)
library(janitor)
A basic analysis
toy_data = read_csv("toy_data_phd.csv") |> select(-1) |> clean_names()
toy_data |> slice_head(n=10)
## # A tibble: 10 × 10
##    month_of_birth day_of_birth year_of_birth country_of_origin
##    <chr>          <chr>        <chr>         <chr>
##  1 July           18           1977          Italy
##  2 January        28           1999          Italy
##  3 August         1            1998          Italia
##  4 May            19           1997          Italy
##  5 April          10           1998          Argentina
##  6 October        17           1994          Italy
##  7 September      3            1998          Italy
##  8 May            23           1989          Italy
##  9 December       10           1996          TURKEY
## 10 February       08           1995          Ethiopia
## # ℹ 6 more variables:
## # field_of_your_previous_degree_e_g_engineering_psychology_economics <chr>,
## # name_of_your_phd_program <chr>, how_do_you_commute_to_campus <chr>,
## # height <chr>, foot_size <chr>,
## # how_often_do_you_practice_sport_workout <chr>
Pre-processing
Here is what needs to be done.
There are issues with the imported data because of the way the data was saved.
- Day of birth is not in the correct format: it is a character because one entry was 26/2 instead of just 26.
- Year of birth is not in the correct format: it is a character because one entry was march.
- Country of origin: one entry is Cassino, and one is Turkey in capital letters.
- Previous degree column: some levels should be merged (the ones that contain engineering).
- PhD program name: I removed it.
- Foot size: some indicated the size as 1 foot, that is 12 inches, which in EU format is 45.
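toy_data_clean = toy_data |> mutate(
  # drop the stray month from the "26/2" entry, then convert to numeric
  day_of_birth = str_replace(day_of_birth, "26/2", "26") |> parse_number(),
  # blank out the non-numeric "March" entry so that it parses as NA
  year_of_birth = str_replace(year_of_birth, "March", replacement = " ") |> parse_number(),
  # recode the city "Cassino" as Italy and normalise capitalisation (TURKEY -> Turkey)
  country_of_origin = str_to_title(str_replace(country_of_origin, "Cassino", "Italy")),
  # strip the "cm" suffix and any comma before converting to numeric
  height = str_replace(height, "cm", "") |> str_remove(",") |> parse_number(),
  # "1 foot" is taken to mean EU size 45
  foot_size = str_replace(foot_size, "1 foot", "45") |> parse_number(),
  # values above 100 are presumably missing a decimal separator
  foot_size = ifelse(foot_size > 100, foot_size / 10, foot_size),
  foot_size = ifelse(foot_size == 1, 45, foot_size),
  # heights below 2 are presumably in metres: convert to cm
  height = ifelse(height < 2, height * 100, height)
) |> select(-name_of_your_phd_program)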
library("skimr")
toy_data_clean |> skim()
Name | toy_data_clean |
Number of rows | 39 |
Number of columns | 9 |
_______________________ | |
Column type frequency: | |
character | 5 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
month_of_birth | 0 | 1 | 3 | 9 | 0 | 12 | 0 |
country_of_origin | 0 | 1 | 4 | 9 | 0 | 9 | 0 |
field_of_your_previous_degree_e_g_engineering_psychology_economics | 0 | 1 | 3 | 26 | 0 | 16 | 0 |
how_do_you_commute_to_campus | 0 | 1 | 4 | 16 | 0 | 5 | 0 |
how_often_do_you_practice_sport_workout | 0 | 1 | 5 | 26 | 0 | 6 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
day_of_birth | 0 | 1.00 | 15.38 | 9.10 | 1.0 | 9.0 | 14.0 | 22.5 | 31 | ▅▇▃▅▅ |
year_of_birth | 1 | 0.97 | 1993.34 | 6.67 | 1971.0 | 1992.0 | 1995.5 | 1998.0 | 2001 | ▁▁▂▅▇ |
height | 0 | 1.00 | 161.46 | 37.96 | 5.9 | 165.0 | 173.0 | 179.0 | 187 | ▁▁▁▁▇ |
foot_size | 0 | 1.00 | 41.32 | 2.52 | 36.0 | 39.5 | 42.0 | 43.0 | 46 | ▃▆▇▆▂ |
Note: the minimum height above is 5.9, so at least one height was entered in feet. To convert from feet to cm, 1 foot = 30.48 cm.
toy_data_clean = toy_data_clean |>
  mutate(height = ifelse(height < 10, height * 30.48, height))
Birthday problem
The “birthday paradox” is the counterintuitive result that, in a group of 23 or more people, the chance that at least two of them share a birthday is over 50%.
- Can you check whether there is a pair of people with the same birthday in the dataset?
toy_data_clean |> count(month_of_birth, day_of_birth) |> arrange(desc(n))
## # A tibble: 36 × 3
##    month_of_birth day_of_birth     n
##    <chr>                 <dbl> <int>
##  1 December                 10     2
##  2 January                  28     2
##  3 July                     11     2
##  4 April                     2     1
##  5 April                    10     1
##  6 April                    24     1
##  7 April                    30     1
##  8 August                    1     1
##  9 August                    3     1
## 10 August                    5     1
## # ℹ 26 more rows
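As a quick check of the "over 50%" figure quoted above, the probability of at least one shared birthday can be computed directly. This is a minimal sketch that assumes 365 equally likely birthdays and ignores leap years; the helper name `p_shared_birthday` is made up for illustration.

```r
# Probability that at least two of n people share a birthday,
# assuming `days` equally likely birthdays (leap years ignored).
p_shared_birthday <- function(n, days = 365) {
  if (n > days) return(1)
  # complement of "all n birthdays are distinct"
  1 - prod((days - seq_len(n) + 1) / days)
}

p_23 <- p_shared_birthday(23)  # group size quoted in the text
p_39 <- p_shared_birthday(39)  # number of rows in the dataset

# Base R provides the same computation as stats::pbirthday()
p_23_base <- pbirthday(23)
```

Under these assumptions `p_23` is just above 0.5, and the probability for a group of 39 is considerably higher, so finding repeated birthdays in the table above is not surprising.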
Visualize the data
- Create a scatterplot of height vs foot size.
- Compare the distributions of the considered quantitative variables using boxplots.
Recall that a box plot is based on the five-number summary: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. The interquartile range (IQR) is the difference between Q3 and Q1. The whiskers extend to the smallest and largest values within 1.5 times the IQR from the quartiles. Any data points outside this range are considered outliers. (Hint: of course geom_boxplot exists.)
### your code goes here
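The box plot rule recalled above can be made concrete with a few lines of base R. The sketch below uses a small vector of made-up values (not taken from the dataset) and the default quantile definition of `quantile()`; other functions may use slightly different quartile conventions.

```r
# Hypothetical values, made up for illustration (not from the dataset)
x <- c(36, 38, 39, 40, 41, 42, 42, 43, 44, 60)

# Five-number summary: min, Q1, median, Q3, max
five_num <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
iqr <- IQR(x)  # Q3 - Q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence <- five_num[["25%"]] - 1.5 * iqr
upper_fence <- five_num[["75%"]] + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

# Observations outside the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]
```

Here only the value 60 falls outside the fences, so it would be drawn as an isolated point while the upper whisker stops at 44.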
- Create a barplot of the variable associated with the question how_do_you_commute_to_campus, the bars being grouped according to the other variable, how_often_do_you_practice_sport_workout.
### your code goes here
Statistical analysis
- Measure the asymmetry of the variables' distributions where applicable. Recall that the asymmetry can be computed as \[\frac{\bar{X}-\mathrm{median}(X)}{\sigma}\]
### your code goes here
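The asymmetry formula above translates almost literally into R. This is only a sketch: the function name `asymmetry` and the example vectors are made up for illustration, and `sd()` uses the sample (n - 1) denominator as an estimate of the standard deviation.

```r
# (mean - median) / standard deviation, as in the formula above
asymmetry <- function(x) {
  x <- x[!is.na(x)]  # drop missing values first
  (mean(x) - median(x)) / sd(x)
}

# Hypothetical values, made up for illustration
symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 1, 2, 2, 3, 10)

a_sym   <- asymmetry(symmetric)     # mean and median coincide
a_right <- asymmetry(right_skewed)  # the long right tail pulls the mean up
```

A value near zero suggests a roughly symmetric distribution, a positive value a longer right tail, and a negative value a longer left tail. On the cleaned data, a call such as `asymmetry(toy_data_clean$height)` would apply the same formula to one of the numeric columns.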
- Measure the association between the two categorical variables above. Can we say they are not independent?
### your code goes here
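One common way to approach the question above is a chi-squared test of independence on the contingency table, with Cramér's V as a measure of the strength of association. This choice of method is an assumption, not something the exercise prescribes, and the answers below are made up for illustration.

```r
# Hypothetical answers, made up for illustration (not the survey data)
commute <- rep(c("car", "bus", "walk"), times = c(10, 10, 10))
sport <- c(
  rep(c("weekly", "never"), times = c(8, 2)),  # car
  rep(c("weekly", "never"), times = c(5, 5)),  # bus
  rep(c("weekly", "never"), times = c(2, 8))   # walk
)

# Contingency table of the two categorical variables
tab <- table(commute, sport)

# Chi-squared test: the null hypothesis is independence
test <- chisq.test(tab)
chi2 <- unname(test$statistic)
p_value <- test$p.value

# Cramér's V: 0 means no association, 1 a perfect association
n <- sum(tab)
cramers_v <- sqrt(chi2 / (n * (min(dim(tab)) - 1)))
```

With only 39 respondents spread over many category combinations, several expected counts in the real table are likely to be small, in which case `chisq.test(tab, simulate.p.value = TRUE)` or `fisher.test(tab)` may be more appropriate than the asymptotic test.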
Pretend that the dataset at hand is a sample of a larger population.
- Compute the confidence interval for the mean of the variable height.
### your code goes here
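A t-based confidence interval for a mean can be sketched as follows. The 95% level is an assumption (the exercise does not state one), and the vector of values is made up for illustration.

```r
# Hypothetical heights in cm, made up for illustration
x <- c(165, 170, 172, 175, 178, 180, 183)
n <- length(x)
conf_level <- 0.95  # assumed level; not specified in the exercise

# Standard error of the mean and the t critical value
se <- sd(x) / sqrt(n)
t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)

# Interval: sample mean plus or minus t_crit standard errors
ci_manual <- mean(x) + c(-1, 1) * t_crit * se

# The same interval as returned by t.test()
ci_ttest <- t.test(x, conf.level = conf_level)$conf.int
```

Both approaches give the same interval. On the cleaned data, `t.test(toy_data_clean$height)$conf.int` would return the corresponding interval for the height column.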
- Test whether the mean height is, at the population level, greater than 179 cm.
  - What is the null hypothesis?
  - What is the alternative hypothesis?
  - What is the value of the test statistic?
  - What is the p-value?
  - What is the conclusion of the test?
### your code goes here
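The questions above follow the steps of a one-sided, one-sample t-test. The sketch below walks through them on made-up values, with the null hypothesis that the population mean is at most the reference value 179 and the alternative that it is greater; using a t-test here is an assumption about the intended method.

```r
# Hypothetical values in cm, made up for illustration
x <- c(165, 170, 172, 175, 178, 180, 183)
mu0 <- 179  # reference value from the exercise

# Test statistic: distance of the sample mean from mu0 in standard errors
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))

# One-sided p-value: probability of a statistic at least this large under H0
p_value <- pt(t_stat, df = length(x) - 1, lower.tail = FALSE)

# The same test with t.test()
check <- t.test(x, mu = mu0, alternative = "greater")
```

For these made-up values the sample mean is below 179, so the statistic is negative, the p-value is large, and the null hypothesis is not rejected. The conclusion for the real data depends on its own sample mean and spread.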