what is tidyverse?

click the figure for all things tidyverse!

This is!

A tibble (sort of a data frame)

Code
library("tidyverse")
library("gapminder")
library("kableExtra")
gapminder |> slice_sample(n=5) |>  kable(format="html") |> kable_styling(font_size=10)
country continent year lifeExp pop gdpPercap
Hungary Europe 1967 69.500 10223422 9326.6447
Nepal Asia 1997 59.426 23001113 1010.8921
Sao Tome and Principe Africa 1987 61.728 110812 1516.5255
Nepal Asia 1982 49.594 15796314 718.3731
Israel Asia 1997 78.269 5531387 20896.6092
  • the symbol %>% or |> is the so-called pipe operator: it passes what’s on its left as the first argument of the function on its right

  • One would obtain the same result by typing slice_sample(.data=gapminder,n=5)
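
The equivalence between the piped and the non-piped call can be checked directly. This is a minimal sketch, assuming the dplyr and gapminder packages are installed; the seed is set only so that the two random draws are comparable, and the object names piped and nested are illustrative.

```r
library(dplyr)
library(gapminder)

# Same seed before each call, so both draws pick the same rows
set.seed(123)
piped <- gapminder |> slice_sample(n = 5)

set.seed(123)
nested <- slice_sample(.data = gapminder, n = 5)

# TRUE: the pipe only moves the left-hand side into the first argument
identical(piped, nested)
```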

data manipulation with dplyr

dplyr verbs

In a tibble, observations are on rows and variables are on columns

by row

  • filter : retrieve the observations that meet specified conditions
  • slice : retrieve the observations by position (slice_sample is a variation)
  • arrange : sorts the observations according to one or more variables

by column

  • select : select the variables by name
  • mutate : transform existing variables or create new ones
  • summarize : create descriptive stats

filter

The filter verb just retrieves observations that meet one or more conditions.

Say we want the data for European countries after year 2000.

Code
gapminder |> 
  filter(continent=="Europe",year>=2000) |> slice(1:6) |>
  kable(format="html") |> kable_styling(font_size=10)
country continent year lifeExp pop gdpPercap
Albania Europe 2002 75.651 3508512 4604.212
Albania Europe 2007 76.423 3600523 5937.030
Austria Europe 2002 78.980 8148312 32417.608
Austria Europe 2007 79.829 8199783 36126.493
Belgium Europe 2002 78.320 10311970 30485.884
Belgium Europe 2007 79.441 10392226 33692.605

filter

filter(condition1,condition2) returns the observations that meet condition1 AND condition2

  • For more complex conditions, one can use the logical operators

Say we want the data for countries from year 2000 onward OR with a life expectancy higher than 70.

Code
gapminder |> 
  filter((year>=2000)|(lifeExp>70)) |>  slice(1:6) |> 
  kable(format="html") |> kable_styling(font_size=10)
country continent year lifeExp pop gdpPercap
Afghanistan Asia 2002 42.129 25268405 726.7341
Afghanistan Asia 2007 43.828 31889923 974.5803
Albania Europe 1982 70.420 2780097 3630.8807
Albania Europe 1987 72.000 3075321 3738.9327
Albania Europe 1992 71.581 3326498 2497.4379
Albania Europe 1997 72.950 3428038 3193.0546
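
The way conditions combine can be made explicit with a short comparison. A sketch under the assumption that dplyr and gapminder are loaded; the object names are illustrative and not part of the original slides.

```r
library(dplyr)
library(gapminder)

# Conditions separated by commas are combined with AND ...
with_comma <- gapminder |> filter(continent == "Europe", year >= 2000)
# ... which is the same as joining them explicitly with &
with_and <- gapminder |> filter(continent == "Europe" & year >= 2000)
identical(with_comma, with_and)

# | is OR, ! is NOT, %in% tests membership in a set of values
outside_eurasia_2007 <- gapminder |>
  filter(!(continent %in% c("Europe", "Asia")), year == 2007)
```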

note

no new tibble is created until an assignment is made. To keep the result, just assign the modified object to a name

Code
filtered_gapminder = gapminder |> 
  filter((year>=2000)|(lifeExp>70)) 

arrange

  • The arrange verb re-orders (ascending) the observations according to a variable.

Say we want the four observations with the lowest gdpPercap

Code
gapminder |> 
  arrange(gdpPercap) |>  slice(1:4) |> 
  kable(format="html") |> kable_styling(font_size=8)
country continent year lifeExp pop gdpPercap
Congo, Dem. Rep. Africa 2002 44.966 55379852 241.1659
Congo, Dem. Rep. Africa 2007 46.462 64606759 277.5519
Lesotho Africa 1952 42.138 748747 298.8462
Guinea-Bissau Africa 1952 32.500 580653 299.8503

and then we want to re-order the previous selection by country name

Code
gapminder |> 
  arrange(gdpPercap) |>  slice(1:4) |> arrange(country) |>
  kable(format="html") |> kable_styling(font_size=8)
country continent year lifeExp pop gdpPercap
Congo, Dem. Rep. Africa 2002 44.966 55379852 241.1659
Congo, Dem. Rep. Africa 2007 46.462 64606759 277.5519
Guinea-Bissau Africa 1952 32.500 580653 299.8503
Lesotho Africa 1952 42.138 748747 298.8462

arrange

  • To arrange in descending order, it just takes wrapping the variable in desc()
Code
gapminder |> 
  arrange(desc(gdpPercap)) |>  slice(1:4) |> 
  kable(format="html") |> kable_styling(font_size=8)
country continent year lifeExp pop gdpPercap
Kuwait Asia 1957 58.033 212846 113523.13
Kuwait Asia 1972 67.712 841934 109347.87
Kuwait Asia 1952 55.565 160000 108382.35
Kuwait Asia 1962 60.470 358266 95458.11

slice

The slice verb picks the observations up according to their position in the tibble

Code
slice1=gapminder |>  slice(1:3) |> 
  kable(format="html") |> kable_styling(font_size=8)
country continent year lifeExp pop gdpPercap
Afghanistan Asia 1952 28.801 8425333 779.4453
Afghanistan Asia 1957 30.332 9240934 820.8530
Afghanistan Asia 1962 31.997 10267083 853.1007

We can indicate some specific positions

Code
slice2=gapminder |>  slice(c(20,37,49)) |> 
  kable(format="html") |>
  kable_styling(font_size=10)
country continent year lifeExp pop gdpPercap
Albania Europe 1987 72.000 3075321 3738.933
Angola Africa 1952 30.015 4232095 3520.610
Argentina Americas 1952 62.485 17876956 5911.315

Or we can pick them up at random, using slice_sample(n= )

Code
slice3 = gapminder |>  slice_sample(n=3) |> 
  kable(format="html") |>
  kable_styling(font_size=10)
country continent year lifeExp pop gdpPercap
Iraq Asia 1977 60.413 11882916 14688.235
Nicaragua Americas 1982 59.298 2979423 3470.338
West Bank and Gaza Asia 1967 51.631 1142636 2649.715

slice_min(n= ) and slice_max(n= ) are combinations of arrange and slice. Check ?slice_min for help
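
To see that slice_min is a shortcut for arrange followed by slice, the two can be compared on the lowest gdpPercap values. A sketch, assuming dplyr and gapminder are loaded; with_ties = FALSE is used so that exactly n rows come back, and the object names are illustrative.

```r
library(dplyr)
library(gapminder)

# Two steps: sort by gdpPercap, then keep the first four rows
lowest_arrange <- gapminder |> arrange(gdpPercap) |> slice(1:4)

# One step: keep the four rows with the smallest gdpPercap
lowest_slice <- gapminder |> slice_min(gdpPercap, n = 4, with_ties = FALSE)

# Both approaches select the same four values
setequal(lowest_arrange$gdpPercap, lowest_slice$gdpPercap)

# slice_max works the same way, from the top
highest_slice <- gapminder |> slice_max(gdpPercap, n = 4, with_ties = FALSE)
```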

select

The select verb refers to variables, indicated by names

Code
gapminder |>  
  select(country,gdpPercap) |> 
  slice(1:4) |> kable(format="html") |>  kable_styling(font_size=10)
country gdpPercap
Afghanistan 779.4453
Afghanistan 820.8530
Afghanistan 853.1007
Afghanistan 836.1971

One can use the : operator in between variable names, to select a sequence of variables

Code
gapminder |>  
  select(country:lifeExp) |> 
  slice(1:4) |> kable(format="html") |>  kable_styling(font_size=10)
country continent year lifeExp
Afghanistan Asia 1952 28.801
Afghanistan Asia 1957 30.332
Afghanistan Asia 1962 31.997
Afghanistan Asia 1967 34.020

select: helper functions

Selecting variables by name becomes increasingly tedious as the number of variables to deal with increases.

  • The helper functions make it possible to select multiple variables at a time based on patterns in their names. Self-explanatory examples are

  • starts_with("abc")

  • ends_with("abc")

  • contains("abc")

  • there is more to it, ?tidyselect::language
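
Applied to gapminder, the helpers behave as follows. A small sketch, assuming dplyr and gapminder are loaded; the patterns are chosen only for illustration.

```r
library(dplyr)
library(gapminder)

# Variables whose name starts with "co": country, continent
co_vars <- gapminder |> select(starts_with("co")) |> names()

# Variables whose name ends with "p" (matching ignores case by default):
# lifeExp, pop, gdpPercap
p_vars <- gapminder |> select(ends_with("p")) |> names()

# Variables whose name contains "gdp": gdpPercap
gdp_vars <- gapminder |> select(contains("gdp")) |> names()
```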

mutate

The mutate verb can modify and/or create new variables

  • One may want to express the population in millions, and create a variable with the total GDP, not just the GDP per capita (here gdp comes out in millions too, because it is computed from the already rescaled pop)
Code
gapminder |> 
  mutate(pop=round(pop/1000000,2),
         gdp=gdpPercap*pop) |> 
  select(pop, contains("gdp")) |> 
  slice_sample(n=3) |> 
  kable(format="html") |> kable_styling(font_size = 10)
pop gdpPercap gdp
3.34 17364.2754 57996.6798
0.63 522.0344 328.8817
10.15 9786.5347 99333.3273

Depending on the name assigned to the mutated variable

  • the new variable will overwrite the existing one if it has the same name (as for pop)

  • the new variable will be added to the tibble if its name is new (as for gdp)
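
Within a single mutate call the expressions are evaluated in order, so a later expression sees a variable that was overwritten earlier. The sketch below uses a hypothetical two-row tibble (toy, not part of the slides) to show how the order changes the units of gdp.

```r
library(dplyr)

# Hypothetical toy data: two countries
toy <- tibble(pop = c(2e6, 5e6), gdpPercap = c(1000, 2000))

# pop is rescaled first, so gdp is computed from pop in millions
rescaled_first <- toy |>
  mutate(pop = pop / 1e6,
         gdp = gdpPercap * pop)

# gdp is computed first, from the original pop, and pop is rescaled afterwards
total_first <- toy |>
  mutate(gdp = gdpPercap * pop,
         pop = pop / 1e6)

rescaled_first$gdp  # 2000 10000 (millions)
total_first$gdp     # 2e+09 1e+10 (units)
```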

summarize

The summarize verb makes it very easy to compute descriptive stats of a given variable

Code
gapminder |> 
  summarize(min_gdp=min(gdpPercap),
            q1_gdp=quantile(gdpPercap,.25),
            median_gdp=quantile(gdpPercap,.5),
            mean_gdp_pc=mean(gdpPercap),
            q3_gdp=quantile(gdpPercap,.75),
            max_gdp=max(gdpPercap)
            )  |> kbl() |> kable_styling(font_size=12)
min_gdp q1_gdp median_gdp mean_gdp_pc q3_gdp max_gdp
241.1659 1202.06 3531.847 7215.327 9325.462 113523.1

group by

The group_by verb imposes a conditioning on the further operations. It works great with summarize, to have conditional descriptive statistics

Code
gapminder |> 
  group_by(continent) |> 
  summarize(min_gdp=min(gdpPercap),
            q1_gdp=quantile(gdpPercap,.25),
            median_gdp=quantile(gdpPercap,.5),
            mean_gdp_pc=mean(gdpPercap),
            q3_gdp=quantile(gdpPercap,.75),
            max_gdp=max(gdpPercap)
            )  |> kbl() |> kable_styling(font_size=12)
continent min_gdp q1_gdp median_gdp mean_gdp_pc q3_gdp max_gdp
Africa 241.1659 761.247 1192.138 2193.755 2377.417 21951.21
Americas 1201.6372 3427.779 5465.510 7136.110 7830.210 42951.65
Asia 331.0000 1056.993 2646.787 7902.150 8549.256 113523.13
Europe 973.5332 7213.085 12081.749 14469.476 20461.386 49357.19
Oceania 10039.5956 14141.859 17983.304 18621.609 22214.117 34435.37

group by

The group_by verb also accepts several variables: the descriptive statistics are then computed for each combination of their values (here, year and continent)

Code
gapminder |> 
  filter(year>2000) |> 
  group_by(year,continent) |> 
  summarize(min_gdp=min(gdpPercap),
            q1_gdp=quantile(gdpPercap,.25),
            median_gdp=quantile(gdpPercap,.5),
            mean_gdp_pc=mean(gdpPercap),
            q3_gdp=quantile(gdpPercap,.75),
            max_gdp=max(gdpPercap)
            ) |> kbl() |> kable_styling(font_size = 8)
year continent min_gdp q1_gdp median_gdp mean_gdp_pc q3_gdp max_gdp
2002 Africa 241.1659 780.5778 1215.683 2599.385 3314.887 12521.71
2002 Americas 1270.3649 4858.3475 6994.775 9287.677 8797.641 39097.10
2002 Asia 611.0000 2092.7124 4090.925 10174.090 19233.988 36023.11
2002 Europe 4604.2117 11721.8515 23674.863 21711.732 30373.363 44683.98
2002 Oceania 23189.8014 25064.2897 26938.778 26938.778 28813.266 30687.75
2007 Africa 277.5519 862.9515 1452.267 3089.033 3993.502 13206.48
2007 Americas 1201.6372 5728.3535 8948.103 11003.032 11977.575 42951.65
2007 Asia 944.0000 2452.2104 4471.062 12473.027 22316.193 47306.99
2007 Europe 5937.0295 14811.8982 28054.066 25054.482 33817.963 49357.19
2007 Oceania 25185.0091 27497.5987 29810.188 29810.188 32122.778 34435.37

distinct

The distinct verb reports the distinct values of a variable…

Code
gapminder |> 
  distinct(year) |> kbl() |> kable_styling(font_size = 10) 
year
1952
1957
1962
1967
1972
1977
1982
1987
1992
1997
2002
2007

… or distinct combinations of values from multiple variables

Code
gapminder |> 
  distinct(year, continent) |> nrow() |> 
  kbl() |> kable_styling(font_size = 10) 
x
60

The value comes from the 12 distinct years considered, times the 5 continents.
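
The count of combinations can be cross-checked with n_distinct. A minimal sketch, assuming dplyr and gapminder are loaded; the object names are illustrative.

```r
library(dplyr)
library(gapminder)

n_years      <- n_distinct(gapminder$year)       # number of distinct years
n_continents <- n_distinct(gapminder$continent)  # number of distinct continents
n_pairs      <- gapminder |> distinct(year, continent) |> nrow()

# TRUE only because every continent is observed in every year
n_pairs == n_years * n_continents
```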

count

The count verb reports the (absolute) frequency distribution of a variable…

Code
gapminder |>
  filter(year==2007) |> 
  count(continent) |> kbl() |> kable_styling(font_size = 10) 
continent n
Africa 52
Americas 25
Asia 33
Europe 30
Oceania 2

… or the joint frequency distribution of multiple variables

Code
gapminder |> 
  filter(year==2007) |> 
  mutate(high_low_lexp = ifelse(lifeExp>65,"high","low"))  |> 
  count(continent,high_low_lexp) |> 
  kbl() |> kable_styling(font_size = 10) 
continent high_low_lexp n
Africa high 9
Africa low 43
Americas high 24
Americas low 1
Asia high 25
Asia low 8
Europe high 30
Oceania high 2

count vs group_by + summarize

One could use group_by and then summarize each group by its number of rows (via n())…

Code
gapminder |> 
  filter(year==2007) |> 
  mutate(high_low_lexp = ifelse(lifeExp>65,"high","low"))  |> 
  group_by(continent,high_low_lexp) |>
  summarize(n=n()) |> 
  kbl() |> kable_styling(font_size = 10)
continent high_low_lexp n
Africa high 9
Africa low 43
Americas high 24
Americas low 1
Asia high 25
Asia low 8
Europe high 30
Oceania high 2

Or use count

Code
gapminder |> 
  filter(year==2007) |> 
  mutate(high_low_lexp = ifelse(lifeExp>65,"high","low"))  |> 
  count(continent,high_low_lexp) |> 
  kbl() |> kable_styling(font_size = 10) 
continent high_low_lexp n
Africa high 9
Africa low 43
Americas high 24
Americas low 1
Asia high 25
Asia low 8
Europe high 30
Oceania high 2

It’s the same!…not so fast…

count vs group_by + summarize

Same as before, this time compute the relative frequencies, too

Code
gapminder |> 
  filter(year==2007) |> 
  mutate(high_low_lexp = ifelse(lifeExp>65,"high","low"))  |> 
  group_by(continent,high_low_lexp) |>
  summarize(n=n()) |> 
  mutate(relative_freqs= round(n/sum(n),3)) |> 
  kbl() |> kable_styling(font_size = 10)
continent high_low_lexp n relative_freqs
Africa high 9 0.173
Africa low 43 0.827
Americas high 24 0.960
Americas low 1 0.040
Asia high 25 0.758
Asia low 8 0.242
Europe high 30 1.000
Oceania high 2 1.000
Code
gapminder |> 
  filter(year==2007) |> 
  mutate(high_low_lexp = ifelse(lifeExp>65,"high","low"))  |> 
  count(continent,high_low_lexp) |>
  mutate(relative_freqs=round(n/sum(n),3)) |> 
  kbl() |> kable_styling(font_size = 10) 
continent high_low_lexp n relative_freqs
Africa high 9 0.063
Africa low 43 0.303
Americas high 24 0.169
Americas low 1 0.007
Asia high 25 0.176
Asia low 8 0.056
Europe high 30 0.211
Oceania high 2 0.014

It’s not the same!…why ?!

count vs group_by + summarize

There is clearly something wrong with group_by + summarize, as the relative frequencies do not add up to one

continent high_low_lexp n relative_freqs
Africa high 9 0.173
Africa low 43 0.827
Americas high 24 0.960
Americas low 1 0.040
Asia high 25 0.758
Asia low 8 0.242
Europe high 30 1.000
Oceania high 2 1.000
  • this is due to the tibble being still grouped: summarize drops only the last grouping variable (high_low_lexp), so the result is still grouped by continent
  • the function sum() is still applied group-wise (continent-wise), not overall (e.g. \(0.173=\frac{9}{9+43}\) )

to fix this, one has to ungroup before computing the relative frequencies
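
Whether a tibble is still grouped, and by what, can be inspected with group_vars. A sketch, assuming dplyr and gapminder are loaded; the name counts is illustrative.

```r
library(dplyr)
library(gapminder)

counts <- gapminder |>
  filter(year == 2007) |>
  mutate(high_low_lexp = ifelse(lifeExp > 65, "high", "low")) |>
  group_by(continent, high_low_lexp) |>
  summarize(n = n())

# After summarize, only the last grouping variable has been dropped
group_vars(counts)             # "continent"

# After ungroup, no grouping is left
group_vars(ungroup(counts))    # character(0)
```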

count vs group_by + summarize

Same as before, this time with ungroup() added before computing the relative frequencies

Code
gapminder |> 
  filter(year==2007) |> 
  mutate(high_low_lexp = ifelse(lifeExp>65,"high","low"))  |> 
  count(continent,high_low_lexp) |>
  mutate(relative_freqs=round(n/sum(n),3)) |> 
  kbl() |> kable_styling(font_size = 10) 
continent high_low_lexp n relative_freqs
Africa high 9 0.063
Africa low 43 0.303
Americas high 24 0.169
Americas low 1 0.007
Asia high 25 0.176
Asia low 8 0.056
Europe high 30 0.211
Oceania high 2 0.014
Code
gapminder |> 
  filter(year==2007) |> 
  mutate(high_low_lexp = ifelse(lifeExp>65,"high","low"))  |> 
  group_by(continent,high_low_lexp) |>
  summarize(n=n()) |> 
  ungroup() |> 
  mutate(relative_freqs= round(n/sum(n),3)) |> 
  kbl() |> kable_styling(font_size = 10)
continent high_low_lexp n relative_freqs
Africa high 9 0.063
Africa low 43 0.303
Americas high 24 0.169
Americas low 1 0.007
Asia high 25 0.176
Asia low 8 0.056
Europe high 30 0.211
Oceania high 2 0.014

Now it’s the same!
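
As an alternative to a separate ungroup() step, summarize has a .groups argument; setting it to "drop" returns an ungrouped tibble directly. A sketch, assuming dplyr and gapminder are loaded; the object names are illustrative.

```r
library(dplyr)
library(gapminder)

lexp_2007 <- gapminder |>
  filter(year == 2007) |>
  mutate(high_low_lexp = ifelse(lifeExp > 65, "high", "low"))

# .groups = "drop" removes all grouping after summarizing
by_hand <- lexp_2007 |>
  group_by(continent, high_low_lexp) |>
  summarize(n = n(), .groups = "drop") |>
  mutate(relative_freqs = round(n / sum(n), 3))

with_count <- lexp_2007 |>
  count(continent, high_low_lexp) |>
  mutate(relative_freqs = round(n / sum(n), 3))

# Same relative frequencies, computed over all rows
identical(by_hand$relative_freqs, with_count$relative_freqs)
```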

data taming in the tidyverse

what is data taming?

import data

The first step in analysis is to import the data from another platform, or in another format

  • readr is the package providing functions to import common formats, e.g. csv or txt
  • readxl and haven provide import functions for other file types: Excel files (xls, xlsx) for readxl; SPSS, Stata and SAS files for haven

taming imported data

Opening a data set for the first time, one has to check that everything is as it should be, which is unlikely at best. Data taming is needed

  • Cast variable types: are continuous variables, characters, factors and dates all correctly identified?

  • Are the variable names, and the strings, consistently coded? (remember R is case sensitive, and a space is a character)

  • Are the missings coded in a consistent way? (several labels could be used in the data set: “.”, “NA”, “not available”, “ ”, …)

data taming tools

Some packages are particularly useful for data taming

  • cast types: readr import functions (e.g. read_csv) have an option to specify the variable types as the data is imported (col_types)
  • missings identification: readr import functions (e.g. read_csv) have an option na= where it is possible to specify the different labels associated with missing values (the default is na=c("","NA"), so empty cells and cells containing “NA” will be considered missing)
  • the janitor package consists of functions to clean up and homogenize the format of variable names
  • the stringr package consists of functions to manipulate, combine, select and, in general, deal with strings.
  • the lubridate package has functions to deal with dates.
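
The col_types and na options can be tried without a file on disk. The sketch below assumes readr 2.0 or later, where literal text can be passed wrapped in I(); the csv content and the labels used for missing values are made up for illustration.

```r
library(readr)

# Hypothetical csv content, with two different labels for missing values
raw_csv <- "student,presences,mode
A,11,live
B,.,online
C,not available,live
"

tab <- read_csv(
  I(raw_csv),
  # declare the type of each column at import time
  col_types = cols(
    student   = col_character(),
    presences = col_double(),
    mode      = col_factor()
  ),
  # every label listed here is read as NA
  na = c("", "NA", ".", "not available")
)

tab$presences  # 11 NA NA
```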

Check out the tidyverse website for more!
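
A short taming pass combining the three packages could look like the sketch below. The tibble messy and its values are hypothetical, and the dates are assumed to be in day/month/year order.

```r
library(dplyr)
library(janitor)
library(stringr)
library(lubridate)

# Hypothetical messy input: inconsistent names, stray spaces, dates as text
messy <- tibble(
  `Student Name` = c(" anna", "BOB "),
  `Exam Date`    = c("03/02/2023", "15/02/2023")
)

clean <- messy |>
  clean_names() |>                                   # names become snake_case
  mutate(
    student_name = str_to_title(str_trim(student_name)),  # trim, then fix case
    exam_date    = dmy(exam_date)                          # text to Date
  )

names(clean)        # "student_name" "exam_date"
clean$student_name  # "Anna" "Bob"
```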

tidy data?

tidy tables are all alike, every un-tidy table is un-tidy in its own way

Hadley Wickham

tidy data

  • Each row corresponds to a different observation

  • Each column corresponds to a different variable

  • Each cell corresponds to a unique combination observation/variable

Consider four students, and record whether they passed each of the tests during a course.

Is it tidy ?

student_name homework_1 homework_2 final_proj_3
A 1 0 1
B 1 1 1
C 1 0 1
D 0 NA 0

It is not: the values in columns 2 to 4 all record the same variable (whether a student passed a test), and the test name is stored in the column names.

Is it tidy ?

test_name A B C D
homework_1 1 1 1 0
homework_2 0 1 0 NA
final_proj_3 1 1 1 0

It is not, the variables refer to the test type, and to student names

they are both un-tidy for different reasons

tidify a table

To make this table tidy, we need a single column recording whether the test is passed or not

Un-tidy

student_name homework_1 homework_2 final_proj_3
A 1 0 1
B 1 1 1
C 1 0 1
D 0 NA 0

Tidy: the pivot_longer verb helps

Code
untidy_tab_1 |> pivot_longer(names_to = "test_name", 
                              values_to = "passed?",
                              cols=homework_1:final_proj_3) |> 
  kbl(format="html") |> kable_styling(font_size=8)
student_name test_name passed?
A homework_1 1
A homework_2 0
A final_proj_3 1
B homework_1 1
B homework_2 1
B final_proj_3 1
C homework_1 1
C homework_2 0
C final_proj_3 1
D homework_1 0
D homework_2 NA
D final_proj_3 0
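
The slides do not show how untidy_tab_1 is created. One way to rebuild it for experimenting, with the values copied from the table above, is tribble; this is an assumption about the construction, not the original code, and the column name passed is a simplification of "passed?".

```r
library(tibble)
library(tidyr)

# Values copied from the un-tidy table shown above
untidy_tab_1 <- tribble(
  ~student_name, ~homework_1, ~homework_2, ~final_proj_3,
  "A",           1,           0,           1,
  "B",           1,           1,           1,
  "C",           1,           0,           1,
  "D",           0,           NA,          0
)

# One row per student/test pair: 4 students x 3 tests = 12 rows
tidy_tab_1 <- untidy_tab_1 |>
  pivot_longer(cols = homework_1:final_proj_3,
               names_to = "test_name",
               values_to = "passed")

dim(tidy_tab_1)  # 12 3
```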

tidify a table

To make this table tidy, we need a single column recording whether the test is passed or not

Un-tidy

test_name A B C D
homework_1 1 1 1 0
homework_2 0 1 0 NA
final_proj_3 1 1 1 0

Tidy

Code
untidy_tab_2 |> pivot_longer(names_to = "student_name", 
                              values_to = "passed?",cols=A:D) |> 
  arrange(student_name) |> select(student_name,everything()) |> 
  kbl(format="html") |> kable_styling(font_size=8)
student_name test_name passed?
A homework_1 1
A homework_2 0
A final_proj_3 1
B homework_1 1
B homework_2 1
B final_proj_3 1
C homework_1 1
C homework_2 0
C final_proj_3 1
D homework_1 0
D homework_2 NA
D final_proj_3 0

tidify a table

To make this table tidy, we need different columns for different variables

Un-tidy

student variable_name value
A presences 11
B presences 10
C presences 11
D presences 10
A mode live
B mode live
C mode online
D mode live
A tests_passed 2
B tests_passed 3
C tests_passed 2
D tests_passed 0

Tidy

Code
tidy_tab=long_tab |> pivot_wider(names_from = variable_name, 
                              values_from = value) 
  
tidy_tab |> kbl(format="html") |> kable_styling(font_size=8)
student presences mode tests_passed
A 11 live 2
B 10 live 3
C 11 online 2
D 10 live 0
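
The object long_tab is not defined in the slides. A sketch of how it could be rebuilt from the table above follows; since the value column mixes numbers and text, it is assumed here to be stored as character.

```r
library(tibble)
library(tidyr)

# Values copied from the un-tidy table above; value is character
# because it holds both counts and the attendance mode
long_tab <- tribble(
  ~student, ~variable_name, ~value,
  "A", "presences",    "11",
  "B", "presences",    "10",
  "C", "presences",    "11",
  "D", "presences",    "10",
  "A", "mode",         "live",
  "B", "mode",         "live",
  "C", "mode",         "online",
  "D", "mode",         "live",
  "A", "tests_passed", "2",
  "B", "tests_passed", "3",
  "C", "tests_passed", "2",
  "D", "tests_passed", "0"
)

# Each level of variable_name becomes its own column
tidy_tab <- long_tab |>
  pivot_wider(names_from = variable_name, values_from = value)

# All new columns inherit the character type of value
sapply(tidy_tab, class)
```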

tidify a table

Code
glimpse(tidy_tab) 
Rows: 4
Columns: 4
$ student      <chr> "A", "B", "C", "D"
$ presences    <chr> "11", "10", "11", "10"
$ mode         <chr> "live", "live", "online", "live"
$ tests_passed <chr> "2", "3", "2", "0"

Note: presences and tests_passed are coded as character.

to fix this, we parse the two variables as numeric

Code
tidy_tab=tidy_tab |> mutate(across(.cols=c(presences,tests_passed), ~parse_double(.)))
glimpse(tidy_tab) 
Rows: 4
Columns: 4
$ student      <chr> "A", "B", "C", "D"
$ presences    <dbl> 11, 10, 11, 10
$ mode         <chr> "live", "live", "online", "live"
$ tests_passed <dbl> 2, 3, 2, 0

Now it works

data visualization in the tidyverse

ggplot2: the grammar of graphics in R

  • In ggplot2, graphics are made of different layers

  • the mapping operation assigns a variable in a tibble to an element of a plot. Different plots may have different elements to map: aesthetics refer to the axes (x and y), but also to color and size.

Code
pengs=palmerpenguins::penguins |> na.omit()
pengs |> ggplot(mapping=aes(x=flipper_length_mm,y=bill_length_mm,size=body_mass_g,color=island))

Nothing happens! (we just created the base layer)

ggplot2: the grammar of graphics in R

Depending on what we want to display, we can add a geom to the layer: to create a scatterplot we need to add points

Code
pengs |> ggplot(mapping=aes(x=flipper_length_mm,y=bill_length_mm,size=body_mass_g,color=island))+
  geom_point(alpha=.5)

Note

no new mapping has been specified for geom_point(): the mapping from the base layer is used (and could be used by other geoms). One characteristic that is specific to the points is alpha, their transparency. It is a setting, not a mapping: it is given a single value, not a variable.
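
The difference between mapping and setting can be seen by treating color both ways. A sketch, assuming ggplot2 and palmerpenguins are installed; the color "steelblue" and the object names are arbitrary choices.

```r
library(ggplot2)
library(palmerpenguins)

pengs <- na.omit(penguins)

# Mapping: color depends on a variable, so it goes inside aes()
# and a legend is drawn
p_mapped <- ggplot(pengs, aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = island), alpha = .5)

# Setting: color is one fixed value, so it goes outside aes()
# and no legend is needed
p_fixed <- ggplot(pengs, aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(color = "steelblue", alpha = .5)
```

Printing p_mapped or p_fixed displays the corresponding plot.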

ggplot2: the grammar of graphics in R

The geom choice depends on the nature of the mapped variables: barplots are for factors

Code
pengs |> ggplot(mapping=aes(x=species,fill=island))+
  geom_bar()

ggplot2: the grammar of graphics in R

For unordered factors, one usually wants to order the bars by frequency.

  • This is easily done using the fct_infreq and fct_rev functions from the forcats package
Code
pengs |> ggplot(mapping=aes(x=fct_infreq(species),fill=island))+
  geom_bar()

Code
pengs |> ggplot(mapping=aes(x=fct_rev(fct_infreq(species)),fill=island))+
  geom_bar()

Note

check the forcats package out for more, very useful, functions for handling factors
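
What the two functions do to the level order is easier to see on a small factor. The factor f below is a made-up example.

```r
library(forcats)

# Hypothetical factor: "b" occurs 3 times, "c" twice, "a" once
f <- factor(c("b", "a", "c", "b", "c", "b"))

levels(f)                       # "a" "b" "c"  (alphabetical by default)
levels(fct_infreq(f))           # "b" "c" "a"  (most frequent first)
levels(fct_rev(fct_infreq(f)))  # "a" "c" "b"  (least frequent first)
```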

ggplot2: the grammar of graphics in R

For single distributions, one may want to use a histogram. If one wants the density scale instead of counts (so that a density curve can be overlaid), the y aesthetic has to be specified.

Code
pengs |> ggplot(mapping=aes(x=body_mass_g))+
  geom_histogram(aes(y=after_stat(density)),bins =15,fill="indianred",color="darkgrey",alpha=.3)+
  geom_density(fill="cyan",alpha=.25)

ggplot2: the grammar of graphics in R

More advanced plots are easily obtained, e.g. by splitting the plot into panels with facet_grid

Code
pengs |> ggplot(mapping=aes(x=flipper_length_mm,y=bill_length_mm,size=body_mass_g,color=island))+
  geom_point(alpha=.5)+facet_grid(sex~species)