A basic analysis

library(tidyverse)
library(tidymodels)
library(janitor)

toy_data = read_csv("toy_data_phd.csv") |> select(-1) |> clean_names()

toy_data |> slice_head(n=10)
## # A tibble: 10 × 10
##    month_of_birth day_of_birth year_of_birth country_of_origin
##    <chr>          <chr>        <chr>         <chr>            
##  1 July           18           1977          Italy            
##  2 January        28           1999          Italy            
##  3 August         1            1998          Italia           
##  4 May            19           1997          Italy            
##  5 April          10           1998          Argentina        
##  6 October        17           1994          Italy            
##  7 September      3            1998          Italy            
##  8 May            23           1989          Italy            
##  9 December       10           1996          TURKEY           
## 10 February       08           1995          Ethiopia         
## # ℹ 6 more variables:
## #   field_of_your_previous_degree_e_g_engineering_psychology_economics <chr>,
## #   name_of_your_phd_program <chr>, how_do_you_commute_to_campus <chr>,
## #   height <chr>, foot_size <chr>,
## #   how_often_do_you_practice_sport_workout <chr>

Pre-processing

Here’s what needs to be done.

There are issues with the imported data, because of the way the data was saved.

Day of birth: it was read as a character because one entry was 26/2 instead of just 26.
Year of birth: it was read as a character because one entry was March.
Country of origin: one entry is Cassino (a city, to be replaced with Italy), and one is Turkey in capital letters.
Previous degree: some levels should be merged (the ones that contain engineering).
PhD program name: the column is removed.
Foot size: some respondents gave the size as 1 foot, that is 12 inches, which is recoded as EU size 45.

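toy_data_clean = toy_data |>
  mutate(
    # "26/2" was entered instead of the day alone
    day_of_birth = str_replace(day_of_birth, "26/2", "26") |> parse_number(),
    # "March" is not a year: blank it so that it parses to NA
    year_of_birth = str_replace(year_of_birth, "March", " ") |> parse_number(),
    # "Cassino" becomes "Italy"; title case fixes "TURKEY"
    country_of_origin = str_to_title(str_replace(country_of_origin, "Cassino", "Italy")),
    # drop the unit and the comma before parsing
    height = str_replace(height, "cm", "") |> str_remove(",") |> parse_number(),
    foot_size = str_replace(foot_size, "1 foot", "45") |> parse_number(),
    # sizes above 100 are divided by 10; a bare 1 (foot) is recoded as 45
    foot_size = ifelse(foot_size > 100, foot_size / 10, foot_size),
    foot_size = ifelse(foot_size == 1, 45, foot_size),
    # heights below 2 are taken to be in metres and converted to cm
    height = ifelse(height < 2, height * 100, height)
  ) |>
  select(-name_of_your_phd_program)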

library("skimr")
toy_data_clean |> skim()

Data summary
Name	toy_data_clean
Number of rows	39
Number of columns	9
_______________________
Column type frequency:
character	5
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
month_of_birth	1	3	9	12
country_of_origin	1	4	9	9
field_of_your_previous_degree_e_g_engineering_psychology_economics	1	3	26	16
how_do_you_commute_to_campus	1	4	16	5
how_often_do_you_practice_sport_workout	1	5	26	6

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
day_of_birth	0	1.00	15.38	9.10	1.0	9.0	14.0	22.5	31	▅▇▃▅▅
year_of_birth	1	0.97	1993.34	6.67	1971.0	1992.0	1995.5	1998.0	2001	▁▁▂▅▇
height	0	1.00	161.46	37.96	5.9	165.0	173.0	179.0	187	▁▁▁▁▇
foot_size	0	1.00	41.32	2.52	36.0	39.5	42.0	43.0	46	▃▆▇▆▂

Note: the minimum height (5.9) was evidently entered in feet. To convert from feet to cm, recall that 1 foot = 30.48 cm.

toy_data_clean = toy_data_clean |> 
  mutate(height=ifelse(height<10, height*30.48,height))

Birthday problem

The “birthday paradox” is the counterintuitive result that, in a group of 23 or more people, the probability that at least two of them share the same birthday is over 50%.
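The 50% threshold can be checked numerically. The sketch below is a minimal illustration that assumes 365 equally likely birthdays and ignores leap years; the helper name `p_shared_birthday` is hypothetical and introduced only for this example.

```r
# Probability that at least two of n people share a birthday,
# assuming 365 equally likely birthdays (leap years ignored).
p_shared_birthday <- function(n) {
  # P(all birthdays distinct) = 365/365 * 364/365 * ... * (365 - n + 1)/365
  p_all_distinct <- prod((365 - seq_len(n) + 1) / 365)
  1 - p_all_distinct
}

p_22 <- p_shared_birthday(22)  # still below 0.5
p_23 <- p_shared_birthday(23)  # first group size above 0.5
p_39 <- p_shared_birthday(39)  # group size equal to the number of rows in the dataset

# Base R (stats) provides the same computation
p_23_builtin <- pbirthday(23)
```

Under these assumptions, a group as large as the 39 respondents is well past the 50% mark, so finding at least one shared birthday in the data should not come as a surprise.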

Can you check whether there is a pair of people with the same birthday in the dataset?

toy_data_clean |> count(month_of_birth,day_of_birth) |> arrange(desc(n))
## # A tibble: 36 × 3
##    month_of_birth day_of_birth     n
##    <chr>                 <dbl> <int>
##  1 December                 10     2
##  2 January                  28     2
##  3 July                     11     2
##  4 April                     2     1
##  5 April                    10     1
##  6 April                    24     1
##  7 April                    30     1
##  8 August                    1     1
##  9 August                    3     1
## 10 August                    5     1
## # ℹ 26 more rows

Visualize the data

Create a scatterplot of height vs foot size.

Compare the distributions of the considered quantitative variables using boxplots.

Boxplot?

Recall that a box plot is based on the five-number summary: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. The interquartile range (IQR) is the difference between Q3 and Q1. The whiskers extend to the smallest and largest observations that lie within 1.5 times the IQR below Q1 and above Q3, respectively. Any data points outside this range are considered outliers. (Hint: of course geom_boxplot exists.)
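The quantities behind a box plot can also be computed by hand. The sketch below uses a hypothetical vector `x` (illustrative values, not taken from the dataset) and the default quantile definition of `quantile()`; it is meant to show the rule described above, not to replace the plot.

```r
# Hypothetical values (illustrative only, not taken from the dataset)
x <- c(140, 165, 168, 170, 173, 175, 179, 181, 187, 210)

# Quartiles and interquartile range
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Fences: 1.5 * IQR below Q1 and above Q3
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

# Observations outside the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]
```

Here the two extreme values fall outside the fences, so the whiskers stop at the nearest observations inside them rather than at the minimum and maximum.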

### your code goes here

Create a barplot of the variable how_do_you_commute_to_campus, with the bars grouped according to the variable how_often_do_you_practice_sport_workout.

### your code goes here

Statistical analysis

Measure the asymmetry of each variable's distribution, where applicable. Recall that the asymmetry can be computed as \[\frac{\bar{X}-\operatorname{median}(X)}{\sigma}\] where \(\bar{X}\) is the mean and \(\sigma\) the standard deviation of \(X\).
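As a minimal sketch, the formula can be wrapped in a small function. The name `asymmetry` and the example vector are hypothetical; the sketch also assumes that the sample standard deviation returned by `sd()` is an acceptable stand-in for \(\sigma\) and that missing values should simply be dropped.

```r
# (mean - median) / standard deviation, after dropping missing values.
# Assumption: sd() (sample standard deviation) is used for sigma.
asymmetry <- function(x) {
  x <- x[!is.na(x)]
  (mean(x) - median(x)) / sd(x)
}

# Hypothetical right-skewed values (illustrative only, not from the dataset)
x <- c(1, 2, 2, 3, 3, 3, 4, 20, NA)
a_skewed <- asymmetry(x)              # positive: the mean exceeds the median

# A symmetric sample gives zero
a_symmetric <- asymmetry(c(1, 2, 3))
```

With dplyr loaded, the same function could be applied to every numeric column at once, for instance with `summarise(across(where(is.numeric), asymmetry))`.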

### your code goes here

Measure the association between the two categorical variables above. Can we say they are not independent?

### your code goes here

Pretend that the dataset at hand is a sample of a larger population.

Compute the confidence interval for the mean of the variable height.
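For reference, a t-based confidence interval for a mean has the form mean ± t × sd / √n. The sketch below shows the computation on a hypothetical vector `x` (illustrative values, not taken from the dataset); the 95% level is an assumption, since no level is specified here.

```r
# Hypothetical sample (illustrative only, not taken from the dataset)
x <- c(172, 181, 169, 185, 178, 190, 175, 183)

level <- 0.95                      # assumed confidence level
n <- length(x)
se <- sd(x) / sqrt(n)              # standard error of the mean

# Critical value of the t distribution with n - 1 degrees of freedom
t_crit <- qt(1 - (1 - level) / 2, df = n - 1)

# Lower and upper bounds of the interval
ci <- mean(x) + c(-1, 1) * t_crit * se

# The same interval as returned by t.test()
ci_builtin <- as.numeric(t.test(x, conf.level = level)$conf.int)
```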

### your code goes here

Test whether the mean height is, at a population level, greater than 179 cm.
What is the null hypothesis?
What is the alternative hypothesis?
What is the value of the test statistic?
What is the p-value?
What is the conclusion of the test?
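The questions above map onto a one-sided, one-sample t-test of H0: μ = 179 against H1: μ > 179. The sketch below walks through the steps on a hypothetical vector `x` (illustrative values, not taken from the dataset); `mu0` is a name introduced for this example, and the choice of a t-test assumes the sample mean is approximately normally distributed.

```r
# Hypothetical sample (illustrative only, not taken from the dataset)
x <- c(172, 181, 169, 185, 178, 190, 175, 183)
mu0 <- 179                         # value of the mean under the null hypothesis

# Test statistic: t = (mean - mu0) / (sd / sqrt(n))
n <- length(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# One-sided p-value: P(T >= t_stat) with n - 1 degrees of freedom
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)

# The same test with t.test()
res <- t.test(x, mu = mu0, alternative = "greater")
```

The null hypothesis is rejected when `p_value` falls below the chosen significance level (commonly 0.05); otherwise the data do not provide enough evidence that the mean exceeds `mu0`.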

### your code goes here