Data Manipulation and Visualization

Statistical Learning

Alfonso Iodice D’Enza

what is tidyverse?


This is!

A tibble (sort of a data frame)

Code

library("tidyverse")
library("gapminder")
library("kableExtra")
gapminder |> slice_sample(n=5) |>  kable(format="html") |> kable_styling(font_size=10)

country	continent	year	lifeExp	pop	gdpPercap
Hungary	Europe	1967	69.500	10223422	9326.6447
Nepal	Asia	1997	59.426	23001113	1010.8921
Sao Tome and Principe	Africa	1987	61.728	110812	1516.5255
Nepal	Asia	1982	49.594	15796314	718.3731
Israel	Asia	1997	78.269	5531387	20896.6092

The symbol |> (or %>%, from the magrittr package) is the so-called pipe operator: it passes what’s on its left as the first argument of the function on its right
One would obtain the same kind of result by typing slice_sample(.data=gapminder,n=5)
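As a small check of the equivalence just stated, the sketch below draws the same random rows twice, once with the pipe and once with the plain function call. The seed value and the object names piped and nested are arbitrary choices made here for illustration.

```r
library(dplyr)
library(gapminder)

# Fix the random seed so that both calls draw the same rows
set.seed(123)
piped <- gapminder |> slice_sample(n = 5)

set.seed(123)
nested <- slice_sample(.data = gapminder, n = 5)

# The pipe only changes how the call is written, not what it computes
identical(piped, nested)
```

With the same seed, the two objects should be identical.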

data manipulation with dplyr

dplyr verbs

In a tibble, observations are on rows and variables are on columns

by row

filter : retrieve the observations that meet specified conditions
slice : retrieve the observations by position (slice_sample is a variation)
arrange : sorts the observations according to one or more variables

by column

select : select the variables by name
mutate : transform existing variables or create new ones
summarize : create descriptive stats

filter

The filter verb just retrieves observations that meet one or more conditions.

Say we want the data for European countries after year 2000.

Code

gapminder |> 
  filter(continent=="Europe",year>=2000) |> slice(1:6) |>
  kable(format="html") |> kable_styling(font_size=10)

country	continent	year	lifeExp	pop	gdpPercap
Albania	Europe	2002	75.651	3508512	4604.212
Albania	Europe	2007	76.423	3600523	5937.030
Austria	Europe	2002	78.980	8148312	32417.608
Austria	Europe	2007	79.829	8199783	36126.493
Belgium	Europe	2002	78.320	10311970	30485.884
Belgium	Europe	2007	79.441	10392226	33692.605

filter

filter(condition1,condition2) returns the observations that meet condition1 AND condition2

For more complex conditions, one can use the logical operators: & (AND), | (OR), ! (NOT)

Say we want the data for countries from year 2000 onwards OR with a life expectancy higher than 70.

Code

gapminder |> 
  filter((year>=2000)|(lifeExp>70)) |>  slice(1:6) |> 
  kable(format="html") |> kable_styling(font_size=10)

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Asia	2002	42.129	25268405	726.7341
Afghanistan	Asia	2007	43.828	31889923	974.5803
Albania	Europe	1982	70.420	2780097	3630.8807
Albania	Europe	1987	72.000	3075321	3738.9327
Albania	Europe	1992	71.581	3326498	2497.4379
Albania	Europe	1997	72.950	3428038	3193.0546
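To make the AND/OR semantics of filter concrete, here is a minimal sketch comparing comma-separated conditions with the explicit & operator. The object names with_comma, with_and and with_or are hypothetical.

```r
library(dplyr)
library(gapminder)

# Comma-separated conditions are combined with AND ...
with_comma <- gapminder |> filter(continent == "Europe", year >= 2000)

# ... which is the same as joining them explicitly with &
with_and <- gapminder |> filter(continent == "Europe" & year >= 2000)
identical(with_comma, with_and)

# Every row kept by the AND filter is also kept by the OR filter
with_or <- gapminder |> filter(continent == "Europe" | year >= 2000)
nrow(with_and) <= nrow(with_or)
```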

note

No new tibble is created until an assignment is made. To create a new tibble, just assign the modified object to a name

Code

filtered_gapminder = gapminder |> 
  filter((year>=2000)|(lifeExp>70))

arrange

The arrange verb re-orders the observations according to a variable (in ascending order by default).

Say we want the four observations with the lowest gdpPercap

Code

gapminder |> 
  arrange(gdpPercap) |>  slice(1:4) |> 
  kable(format="html") |> kable_styling(font_size=8)

country	continent	year	lifeExp	pop	gdpPercap
Congo, Dem. Rep.	Africa	2002	44.966	55379852	241.1659
Congo, Dem. Rep.	Africa	2007	46.462	64606759	277.5519
Lesotho	Africa	1952	42.138	748747	298.8462
Guinea-Bissau	Africa	1952	32.500	580653	299.8503

and then we want to re-order the previous selection by country name

Code

gapminder |> 
  arrange(gdpPercap) |>  slice(1:4) |> arrange(country) |>
  kable(format="html") |> kable_styling(font_size=8)

country	continent	year	lifeExp	pop	gdpPercap
Congo, Dem. Rep.	Africa	2002	44.966	55379852	241.1659
Congo, Dem. Rep.	Africa	2007	46.462	64606759	277.5519
Guinea-Bissau	Africa	1952	32.500	580653	299.8503
Lesotho	Africa	1952	42.138	748747	298.8462

arrange

To arrange in descending order, just wrap the variable in desc()

Code

gapminder |> 
  arrange(desc(gdpPercap)) |>  slice(1:4) |> 
  kable(format="html") |> kable_styling(font_size=8)

country	continent	year	lifeExp	pop	gdpPercap
Kuwait	Asia	1957	58.033	212846	113523.13
Kuwait	Asia	1972	67.712	841934	109347.87
Kuwait	Asia	1952	55.565	160000	108382.35
Kuwait	Asia	1962	60.470	358266	95458.11

slice

The slice verb picks the observations up according to their position in the tibble

Code

slice1=gapminder |>  slice(1:3) |> 
  kable(format="html") |> kable_styling(font_size=8)

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Asia	1952	28.801	8425333	779.4453
Afghanistan	Asia	1957	30.332	9240934	820.8530
Afghanistan	Asia	1962	31.997	10267083	853.1007

We can indicate some specific positions

Code

slice2=gapminder |>  slice(c(20,37,49)) |> 
  kable(format="html") |>
  kable_styling(font_size=10)

country	continent	year	lifeExp	pop	gdpPercap
Albania	Europe	1987	72.000	3075321	3738.933
Angola	Africa	1952	30.015	4232095	3520.610
Argentina	Americas	1952	62.485	17876956	5911.315

Or we can pick them up at random, using slice_sample(n= )

Code

slice3 = gapminder |>  slice_sample(n=3) |> 
  kable(format="html") |>
  kable_styling(font_size=10)

country	continent	year	lifeExp	pop	gdpPercap
Iraq	Asia	1977	60.413	11882916	14688.235
Nicaragua	Americas	1982	59.298	2979423	3470.338
West Bank and Gaza	Asia	1967	51.631	1142636	2649.715

slice_min(n= ) and slice_max(n= ) are combinations of arrange and slice. Check ?slice_min for help
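The statement that slice_min combines arrange and slice can be checked with a short sketch. The option with_ties = FALSE is added here as an assumption, so that exactly n rows are returned even when values are tied; the object names are hypothetical.

```r
library(dplyr)
library(gapminder)

# The four observations with the lowest gdpPercap, in one step
lowest_min <- gapminder |> slice_min(gdpPercap, n = 4, with_ties = FALSE)

# The same selection obtained with arrange + slice
lowest_arr <- gapminder |> arrange(gdpPercap) |> slice(1:4)

# Compare the selected values
all.equal(lowest_min$gdpPercap, lowest_arr$gdpPercap)
```

slice_max works the same way, with the ordering reversed.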

select

The select verb refers to variables, indicated by names

Code

gapminder |>  
  select(country,gdpPercap) |> 
  slice(1:4) |> kable(format="html") |>  kable_styling(font_size=10)

country	gdpPercap
Afghanistan	779.4453
Afghanistan	820.8530
Afghanistan	853.1007
Afghanistan	836.1971

One can use the : operator in between variable names, to select a sequence of variables

Code

gapminder |>  
  select(country:lifeExp) |> 
  slice(1:4) |> kable(format="html") |>  kable_styling(font_size=10)

country	continent	year	lifeExp
Afghanistan	Asia	1952	28.801
Afghanistan	Asia	1957	30.332
Afghanistan	Asia	1962	31.997
Afghanistan	Asia	1967	34.020

select: helper functions

Selecting variables by name becomes increasingly tedious as the number of variables to deal with increases.

The helper functions make it possible to select multiple variables at a time based on patterns in their names. Self-explanatory examples are
starts_with("abc")
ends_with("abc")
contains("abc")
There is more to it: see ?tidyselect::language
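A minimal sketch of the three helpers applied to gapminder. The patterns "co", "p" and "gdp" are arbitrary choices for illustration, and the matching ignores case by default.

```r
library(dplyr)
library(gapminder)

# Variables whose name starts with "co"
starts_co <- gapminder |> select(starts_with("co")) |> names()
starts_co      # "country" "continent"

# Variables whose name ends with "p"
ends_p <- gapminder |> select(ends_with("p")) |> names()
ends_p         # "lifeExp" "pop" "gdpPercap"

# Variables whose name contains "gdp"
has_gdp <- gapminder |> select(contains("gdp")) |> names()
has_gdp        # "gdpPercap"
```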

mutate

The mutate verb can modify existing variables and/or create new ones

One may want to express the population in millions, and create a variable with the full GDP (here in millions too, since it is computed from the rescaled pop), not just the GDP per capita

Code

gapminder |> 
  mutate(pop=round(pop/1000000,2),
         gdp=gdpPercap*pop) |> 
  select(pop, contains("gdp")) |> 
  slice_sample(n=3) |> 
  kable(format="html") |> kable_styling(font_size = 10)

pop	gdpPercap	gdp
3.34	17364.2754	57996.6798
0.63	522.0344	328.8817
10.15	9786.5347	99333.3273

Depending on the name assigned to the mutated variable

the new variable will overwrite the existing one with the same name (as for pop)
the new variable will be added to the tibble if its name is new (as for gdp)
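The two cases above can be verified by comparing the column names before and after mutate. This is only a sketch, and the object name mutated is hypothetical. Note that the arguments of mutate are evaluated in order, so gdp is computed from the already rescaled pop.

```r
library(dplyr)
library(gapminder)

mutated <- gapminder |>
  mutate(pop = pop / 1e6,         # existing name: pop is overwritten
         gdp = gdpPercap * pop)   # new name: gdp is appended as a new column

ncol(gapminder)                             # 6 columns in the original
ncol(mutated)                               # 7 columns after mutate
setdiff(names(mutated), names(gapminder))   # the only new column: "gdp"
```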

summarize

The summarize verb makes it very easy to compute descriptive statistics of a given variable

Code

gapminder |> 
  summarize(min_gdp=min(gdpPercap),
            q1_gdp=quantile(gdpPercap,.25),
            median_gdp=quantile(gdpPercap,.5),
            mean_gdp_pc=mean(gdpPercap),
            q3_gdp=quantile(gdpPercap,.75),
            max_gdp=max(gdpPercap)
            )  |> kbl() |> kable_styling(font_size=12)

min_gdp	q1_gdp	median_gdp	mean_gdp_pc	q3_gdp	max_gdp
241.1659	1202.06	3531.847	7215.327	9325.462	113523.1

group by

The group_by verb imposes a conditioning on the further operations. It works great with summarize, to have conditional descriptive statistics

Code

gapminder |> 
  group_by(continent) |> 
  summarize(min_gdp=min(gdpPercap),
            q1_gdp=quantile(gdpPercap,.25),
            median_gdp=quantile(gdpPercap,.5),
            mean_gdp_pc=mean(gdpPercap),
            q3_gdp=quantile(gdpPercap,.75),
            max_gdp=max(gdpPercap)
            )  |> kbl() |> kable_styling(font_size=12)

continent	min_gdp	q1_gdp	median_gdp	mean_gdp_pc	q3_gdp	max_gdp
Africa	241.1659	761.247	1192.138	2193.755	2377.417	21951.21
Americas	1201.6372	3427.779	5465.510	7136.110	7830.210	42951.65
Asia	331.0000	1056.993	2646.787	7902.150	8549.256	113523.13
Europe	973.5332	7213.085	12081.749	14469.476	20461.386	49357.19
Oceania	10039.5956	14141.859	17983.304	18621.609	22214.117	34435.37

group by

Grouping by more than one variable gives the statistics for each combination of groups: here, for each year after 2000 and each continent

Code

gapminder |> 
  filter(year>2000) |> 
  group_by(year,continent) |> 
  summarize(min_gdp=min(gdpPercap),
            q1_gdp=quantile(gdpPercap,.25),
            median_gdp=quantile(gdpPercap,.5),
            mean_gdp_pc=mean(gdpPercap),
            q3_gdp=quantile(gdpPercap,.75),
            max_gdp=max(gdpPercap)
            ) |> kbl() |> kable_styling(font_size = 8)

year	continent	min_gdp	q1_gdp	median_gdp	mean_gdp_pc	q3_gdp	max_gdp
2002	Africa	241.1659	780.5778	1215.683	2599.385	3314.887	12521.71
2002	Americas	1270.3649	4858.3475	6994.775	9287.677	8797.641	39097.10
2002	Asia	611.0000	2092.7124	4090.925	10174.090	19233.988	36023.11
2002	Europe	4604.2117	11721.8515	23674.863	21711.732	30373.363	44683.98
2002	Oceania	23189.8014	25064.2897	26938.778	26938.778	28813.266	30687.75
2007	Africa	277.5519	862.9515	1452.267	3089.033	3993.502	13206.48
2007	Americas	1201.6372	5728.3535	8948.103	11003.032	11977.575	42951.65
2007	Asia	944.0000	2452.2104	4471.062	12473.027	22316.193	47306.99
2007	Europe	5937.0295	14811.8982	28054.066	25054.482	33817.963	49357.19
2007	Oceania	25185.0091	27497.5987	29810.188	29810.188	32122.778	34435.37

distinct

The distinct verb reports the distinct values of a variable…

Code

gapminder |> 
  distinct(year) |> kbl() |> kable_styling(font_size = 10)

year
1952
1957
1962
1967
1972
1977
1982
1987
1992
1997
2002
2007

… or distinct combinations of values from multiple variables

Code

gapminder |> 
  distinct(year, continent) |> nrow() |> 
  kbl() |> kable_styling(font_size = 10)

x
60

The value comes from the 12 distinct years considered, times the 5 continents.
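The arithmetic behind the value 60 can be reproduced with n_distinct, as in the sketch below (the object names are hypothetical). The product equals the number of distinct pairs only because every continent is observed in every year.

```r
library(dplyr)
library(gapminder)

n_years <- n_distinct(gapminder$year)             # 12
n_continents <- n_distinct(gapminder$continent)   # 5
n_pairs <- gapminder |> distinct(year, continent) |> nrow()   # 60

# Number of years times number of continents matches the distinct pairs
n_years * n_continents == n_pairs
```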

count

The count verb reports the (absolute) frequency distribution of a variable…

Code

gapminder |>
  filter(year==2007) |> 
  count(continent) |> kbl() |> kable_styling(font_size = 10)

continent	n
Africa	52
Americas	25
Asia	33
Europe	30
Oceania	2

… or the joint frequency distribution of multiple variables

Code

gapminder |> 
  filter(year==2007) |> 
  mutate(high_low_lexp = ifelse(lifeExp>65,"high","low"))  |> 
  count(continent,high_low_lexp) |> 
  kbl() |> kable_styling(font_size = 10)

continent	high_low_lexp	n
Africa	high	9
Africa	low	43
Americas	high	24
Americas	low	1
Asia	high	25
Asia	low	8
Europe	high	30
Oceania	high	2

count vs group_by + summarize

One could use group_by and then summarize the groups by their number of rows (via n())…

Code

gapminder |> 
  filter(year==2007) |> 
  mutate(high_low_lexp = ifelse(lifeExp>65,"high","low"))  |> 
  group_by(continent,high_low_lexp) |>
  summarize(n=n()) |> 
  kbl() |> kable_styling(font_size = 10)

continent	high_low_lexp	n
Africa	high	9
Africa	low	43
Americas	high	24
Americas	low	1
Asia	high	25
Asia	low	8
Europe	high	30
Oceania	high	2

Or use count

Code

gapminder |> 
  filter(year==2007) |> 
  mutate(high_low_lexp = ifelse(lifeExp>65,"high","low"))  |> 
  count(continent,high_low_lexp) |> 
  kbl() |> kable_styling(font_size = 10)

continent	high_low_lexp	n
Africa	high	9
Africa	low	43
Americas	high	24
Americas	low	1
Asia	high	25
Asia	low	8
Europe	high	30
Oceania	high	2

It’s the same!…not so fast…

count vs group_by + summarize

Same as before, this time compute the relative frequencies, too

Code

gapminder |> 
  filter(year==2007) |> 
  mutate(high_low_lexp = ifelse(lifeExp>65,"high","low"))  |> 
  group_by(continent,high_low_lexp) |>
  summarize(n=n()) |> 
  mutate(relative_freqs= round(n/sum(n),3)) |> 
  kbl() |> kable_styling(font_size = 10)

continent	high_low_lexp	n	relative_freqs
Africa	high	9	0.173
Africa	low	43	0.827
Americas	high	24	0.960
Americas	low	1	0.040
Asia	high	25	0.758
Asia	low	8	0.242
Europe	high	30	1.000
Oceania	high	2	1.000

Code

gapminder |> 
  filter(year==2007) |> 
  mutate(high_low_lexp = ifelse(lifeExp>65,"high","low"))  |> 
  count(continent,high_low_lexp) |>
  mutate(relative_freqs=round(n/sum(n),3)) |> 
  kbl() |> kable_styling(font_size = 10)

continent	high_low_lexp	n	relative_freqs
Africa	high	9	0.063
Africa	low	43	0.303
Americas	high	24	0.169
Americas	low	1	0.007
Asia	high	25	0.176
Asia	low	8	0.056
Europe	high	30	0.211
Oceania	high	2	0.014

It’s not the same!…why ?!

count vs group_by + summarize

There is clearly something wrong with group_by + summarize, as the relative frequencies do not add up to one

continent	high_low_lexp	n	relative_freqs
Africa	high	9	0.173
Africa	low	43	0.827
Americas	high	24	0.960
Americas	low	1	0.040
Asia	high	25	0.758
Asia	low	8	0.242
Europe	high	30	1.000
Oceania	high	2	1.000

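This is due to the tibble being still grouped: summarize only drops the last grouping variable (high_low_lexp), so the result is still grouped by continent
The function sum() is therefore still applied group-wise (continent-wise), not overall (e.g. $0.173=\frac{9}{9+43}$)

To fix this, one has to ungroup before computing the relative frequencies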
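One way to see the leftover grouping directly is to inspect the grouping variables with group_vars. This is a minimal sketch, and the object name counts is hypothetical.

```r
library(dplyr)
library(gapminder)

counts <- gapminder |>
  filter(year == 2007) |>
  mutate(high_low_lexp = ifelse(lifeExp > 65, "high", "low")) |>
  group_by(continent, high_low_lexp) |>
  summarize(n = n())

# summarize dropped high_low_lexp, but the result is still grouped by continent
group_vars(counts)

# After ungroup, no grouping variable is left
group_vars(ungroup(counts))
```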

count vs group_by + summarize

Same comparison as before, this time adding ungroup after summarize

Code

gapminder |> 
  filter(year==2007) |> 
  mutate(high_low_lexp = ifelse(lifeExp>65,"high","low"))  |> 
  count(continent,high_low_lexp) |>
  mutate(relative_freqs=round(n/sum(n),3)) |> 
  kbl() |> kable_styling(font_size = 10)

continent	high_low_lexp	n	relative_freqs
Africa	high	9	0.063
Africa	low	43	0.303
Americas	high	24	0.169
Americas	low	1	0.007
Asia	high	25	0.176
Asia	low	8	0.056
Europe	high	30	0.211
Oceania	high	2	0.014

Code

gapminder |> 
  filter(year==2007) |> 
  mutate(high_low_lexp = ifelse(lifeExp>65,"high","low"))  |> 
  group_by(continent,high_low_lexp) |>
  summarize(n=n()) |> 
  ungroup() |> 
  mutate(relative_freqs= round(n/sum(n),3)) |> 
  kbl() |> kable_styling(font_size = 10)

continent	high_low_lexp	n	relative_freqs
Africa	high	9	0.063
Africa	low	43	0.303
Americas	high	24	0.169
Americas	low	1	0.007
Asia	high	25	0.176
Asia	low	8	0.056
Europe	high	30	0.211
Oceania	high	2	0.014

Now it’s the same!
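As a possible alternative to a separate ungroup step, summarize accepts a .groups argument. The sketch below uses .groups = "drop" to remove all grouping right away, and it skips the rounding so that the frequencies can be checked to add up to one. The object name rel_freqs is hypothetical.

```r
library(dplyr)
library(gapminder)

rel_freqs <- gapminder |>
  filter(year == 2007) |>
  mutate(high_low_lexp = ifelse(lifeExp > 65, "high", "low")) |>
  group_by(continent, high_low_lexp) |>
  summarize(n = n(), .groups = "drop") |>   # drop all grouping right away
  mutate(relative_freqs = n / sum(n))

# The result is no longer grouped, so sum(n) is computed over all rows
is_grouped_df(rel_freqs)
sum(rel_freqs$relative_freqs)
```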

data taming in the tidyverse

what is data taming?

import data

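The first step of an analysis is to import the data from another platform, or from another format

readr is the package providing functions to import common formats, e.g. csv or txt

readxl and haven provide import functions for further file types: Excel files (xls, xlsx) for readxl; SPSS, Stata and SAS files for haven. Other sources, such as SQL databases or JSON files, need dedicated packages (e.g. DBI, jsonlite)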

taming imported data

Opening a data set for the first time, one has to check that everything is as it should be, which is rarely the case. Data taming is needed

Cast variable types: are continuous variables, characters, factors, dates all correctly identified?
Are the variable names, and the strings, consistently coded? (remember R is case sensitive, and a space is a character)
Are the missings coded in a consistent way? (several labels could be used in the data set: “.”, “NA”, “not available”, “ ”, …)

data taming tools

Some packages are particularly useful for data taming

cast types: readr import functions (e.g. read_csv) have an option to specify the variable types as the data is imported (col_types)

missings identification: readr import functions (e.g. read_csv) have an option na= where it is possible to specify the different labels associated with missing values (the default is na=c("","NA"), so empty cells and cells containing “NA” will be considered missing)

The janitor package consists of functions to clean up and homogenize the format of variable names, and of strings in general.

The stringr package consists of functions to manipulate, combine, select and, in general, deal with strings.

The lubridate package has functions to deal with dates.
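A minimal sketch of the two readr options mentioned above, col_types and na. The data are hypothetical and written inline (wrapped in I() so that readr treats the text as data rather than as a file path); the column names id, score and day are made up for illustration.

```r
library(readr)

# Hypothetical raw data, with two different labels for missing values
raw <- I("id,score,day
1,3.5,2024-01-02
2,.,2024-01-03
3,not available,2024-01-04
")

scores <- read_csv(
  raw,
  # cast the variable types while importing
  col_types = cols(id = col_integer(), score = col_double(), day = col_date()),
  # every label that should be read as missing
  na = c("", "NA", ".", "not available")
)

scores
sum(is.na(scores$score))   # 2 missing scores
```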

Check out the tidyverse website for more!

tidy data?

tidy tables are all alike, every un-tidy table is un-tidy in its own way

Hadley Wickham

tidy data

Each row corresponds to a different observation
Each column corresponds to a different variable
Each cell corresponds to a unique combination observation/variable

Consider four students, and record whether they passed each of the tests during a course.

Is it tidy ?

student_name	homework_1	homework_2	final_proj_3
A	1	0	1
B	1	1	1
C	1	0	1
D	0	NA	0

It is not: the names of cols 2 to 4 are values of a variable (the test), and the cells all record the same thing, i.e. whether a student passed a test.

Is it tidy ?

test_name	A	B	C	D
homework_1	1	1	1	0
homework_2	0	1	0	NA
final_proj_3	1	1	1	0

It is not: the columns A to D are student names, that is, values of a variable rather than variables

they are both un-tidy for different reasons

tidify a table

To make this table tidy, we need a single column recording whether the test is passed or not

Un-tidy

student_name	homework_1	homework_2	final_proj_3
A	1	0	1
B	1	1	1
C	1	0	1
D	0	NA	0

Tidy: the pivot_longer verb helps

Code

untidy_tab_1 |> pivot_longer(names_to = "test_name", 
                              values_to = "passed?",
                              cols=homework_1:final_proj_3) |> 
  kbl(format="html") |> kable_styling(font_size=8)

student_name	test_name	passed?
A	homework_1	1
A	homework_2	0
A	final_proj_3	1
B	homework_1	1
B	homework_2	1
B	final_proj_3	1
C	homework_1	1
C	homework_2	0
C	final_proj_3	1
D	homework_1	0
D	homework_2	NA
D	final_proj_3	0
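The code above takes untidy_tab_1 as given. A sketch of how it might be built, assuming it simply mirrors the un-tidy table shown, follows. The object name tidy_tab_1 and the column name passed (used instead of "passed?" to avoid back-ticks) are choices made here.

```r
library(tibble)
library(tidyr)

# Assumed reconstruction of the un-tidy table
untidy_tab_1 <- tribble(
  ~student_name, ~homework_1, ~homework_2, ~final_proj_3,
  "A",           1,           0,           1,
  "B",           1,           1,           1,
  "C",           1,           0,           1,
  "D",           0,           NA,          0
)

# One row per student/test combination
tidy_tab_1 <- untidy_tab_1 |>
  pivot_longer(cols = homework_1:final_proj_3,
               names_to = "test_name",
               values_to = "passed")

dim(untidy_tab_1)   # 4 rows, 4 columns
dim(tidy_tab_1)     # 12 rows (4 students x 3 tests), 3 columns
```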

tidify a table

To make this table tidy, we need a single column recording whether the test is passed or not

Un-tidy

test_name	A	B	C	D
homework_1	1	1	1	0
homework_2	0	1	0	NA
final_proj_3	1	1	1	0

Tidy

Code

untidy_tab_2 |> pivot_longer(names_to = "student_name", 
                              values_to = "passed?",cols=A:D) |> 
  arrange(student_name) |> select(student_name,everything()) |> 
  kbl(format="html") |> kable_styling(font_size=8)

student_name	test_name	passed?
A	homework_1	1
A	homework_2	0
A	final_proj_3	1
B	homework_1	1
B	homework_2	1
B	final_proj_3	1
C	homework_1	1
C	homework_2	0
C	final_proj_3	1
D	homework_1	0
D	homework_2	NA
D	final_proj_3	0

tidify a table

To make this table tidy, we need different columns for different variables

Un-tidy

student	variable_name	value
A	presences	11
B	presences	10
C	presences	11
D	presences	10
A	mode	live
B	mode	live
C	mode	online
D	mode	live
A	tests_passed	2
B	tests_passed	3
C	tests_passed	2
D	tests_passed	0

Tidy

Code

tidy_tab=long_tab |> pivot_wider(names_from = variable_name, 
                              values_from = value) 
  
tidy_tab |> kbl(format="html") |> kable_styling(font_size=8)

student	presences	mode	tests_passed
A	11	live	2
B	10	live	3
C	11	online	2
D	10	live	0
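The object long_tab is taken as given in the code above. The sketch below is an assumed reconstruction that mirrors the un-tidy table shown. Since numbers and labels share the single value column, that column has to be stored as character.

```r
library(tibble)
library(tidyr)

# Assumed reconstruction of the long table
long_tab <- tibble(
  student = rep(c("A", "B", "C", "D"), times = 3),
  variable_name = rep(c("presences", "mode", "tests_passed"), each = 4),
  value = c("11", "10", "11", "10",
            "live", "live", "online", "live",
            "2", "3", "2", "0")
)

class(long_tab$value)   # "character"

# One column per variable, one row per student
tidy_tab <- long_tab |>
  pivot_wider(names_from = variable_name, values_from = value)

dim(tidy_tab)           # 4 rows, 4 columns
```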

tidify a table

Code

glimpse(tidy_tab)

Rows: 4
Columns: 4
$ student      <chr> "A", "B", "C", "D"
$ presences    <chr> "11", "10", "11", "10"
$ mode         <chr> "live", "live", "online", "live"
$ tests_passed <chr> "2", "3", "2", "0"

Note: presences and tests_passed are coded as character, because the value column they come from also stored the labels of mode.

To fix this, we parse the two variables as numeric

Code

tidy_tab=tidy_tab |> mutate(across(.cols=c(presences,tests_passed), ~parse_double(.)))
glimpse(tidy_tab)

Rows: 4
Columns: 4
$ student      <chr> "A", "B", "C", "D"
$ presences    <dbl> 11, 10, 11, 10
$ mode         <chr> "live", "live", "online", "live"
$ tests_passed <dbl> 2, 3, 2, 0

Now it works
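As a possible alternative to parsing selected columns one by one, readr also offers type_convert, which guesses the type of every character column. The sketch below uses a small stand-in for tidy_tab (the name tidy_chr is hypothetical) and compares the two approaches.

```r
library(tibble)
library(dplyr)
library(readr)

# Small stand-in for tidy_tab, with every column stored as character
tidy_chr <- tibble(
  student = c("A", "B", "C", "D"),
  presences = c("11", "10", "11", "10"),
  mode = c("live", "live", "online", "live"),
  tests_passed = c("2", "3", "2", "0")
)

# Explicit parsing of the selected columns, as in the text
parsed <- tidy_chr |>
  mutate(across(c(presences, tests_passed), parse_double))

# Alternative: let readr guess the type of each character column
guessed <- type_convert(tidy_chr)

sapply(parsed, class)
sapply(guessed, class)
```

Explicit parsing documents the intended types, while type_convert is handy for a quick first pass over many columns.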

data visualization in the tidyverse

ggplot2: the grammar of graphics in R

In ggplot2, graphics are made of different layers
The mapping operation assigns a variable in a tibble to an element of a plot. Different plots may have different elements to map: aesthetics refer to the axes (x and y), but also to color and size.

Code

pengs=palmerpenguins::penguins |> na.omit()
pengs |> ggplot(mapping=aes(x=flipper_length_mm,y=bill_length_mm,size=body_mass_g,color=island))

No data are displayed yet! (we just created the base layer: an empty panel with the axes)

ggplot2: the grammar of graphics in R

Depending on what we want to display, we can add a geom to the layer: to create a scatterplot we need to add points

Code

pengs |> ggplot(mapping=aes(x=flipper_length_mm,y=bill_length_mm,size=body_mass_g,color=island))+
  geom_point(alpha=.5)

Note

No new mapping has been specified for geom_point(): the mapping from the base layer is used (and could be used by other geoms). One characteristic that is specific to the points is alpha, their transparency: it is not a mapping because it is set to a single value, not mapped to a variable.
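The difference between mapping an aesthetic to a variable and setting it to a single value can be sketched with color. The plot names p_mapped and p_set, and the choice of "blue", are hypothetical.

```r
library(ggplot2)
library(palmerpenguins)

pengs <- na.omit(penguins)

# color is MAPPED to a variable inside aes(): one color per island, with a legend
p_mapped <- ggplot(pengs, aes(x = flipper_length_mm, y = bill_length_mm, color = island)) +
  geom_point(alpha = .5)

# color is SET to a constant outside aes(): all points are blue, no legend
p_set <- ggplot(pengs, aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(color = "blue", alpha = .5)

p_mapped
p_set
```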

ggplot2: the grammar of graphics in R

The geom choice depends on the nature of the mapped variables: barplots are for factors

Code

pengs |> ggplot(mapping=aes(x=species,fill=island))+
  geom_bar()

ggplot2: the grammar of graphics in R

For unordered factors, one usually wants to order the bars by frequency.

This is easily done using the fct_infreq and fct_rev functions from the forcats package

Code

pengs |> ggplot(mapping=aes(x=fct_infreq(species),fill=island))+
  geom_bar()

Code

pengs |> ggplot(mapping=aes(x=fct_rev(fct_infreq(species)),fill=island))+
  geom_bar()

Note

check the forcats package out for more, very useful, functions for handling factors

ggplot2: the grammar of graphics in R

For single distributions, one may want to use a histogram. If one wants densities (relative frequencies scaled by the bin width) instead of counts on the y axis, this has to be specified in the mapping.

Code

pengs |> ggplot(mapping=aes(x=body_mass_g))+
  geom_histogram(aes(y=after_stat(density)),bins =15,fill="indianred",color="darkgrey",alpha=.3)+
  geom_density(fill="cyan",alpha=.25)

ggplot2: the grammar of graphics in R

More advanced plots are easily obtained

Code

pengs |> ggplot(mapping=aes(x=flipper_length_mm,y=bill_length_mm,size=body_mass_g,color=island))+
  geom_point(alpha=.5)+facet_grid(sex~species)
