Learn dplyr’s 5 main verbs: select, arrange, filter, mutate, summarise
Author
Yebelay Berehan
Introduction
dplyr is part of the tidyverse collection.
Provides tools to manipulate and summarize data.
Core verbs:
select()
arrange()
filter()
mutate()
summarise()
Example dataset: starwars (87 rows, 14 columns).
Why dplyr?
Simplifies data wrangling tasks.
Provides clear and concise code.
Integrates seamlessly with other tidyverse packages.
Optimized for large datasets.
dplyr and the tidyverse
tidyverse is a set of R packages designed for data science.
Main author: Hadley Wickham (Chief Scientist, RStudio).
Key features: Simplifies learning R.
What’s included in the tidyverse? - Core packages: ggplot2, tidyr, readr, purrr, tibble, stringr, and dplyr.
Extensible ecosystem with tools for data cleaning, visualization, and modeling.
# Load the tidyverselibrary(tidyverse)
Warning: package 'tidyverse' was built under R version 4.4.1
Warning: package 'ggplot2' was built under R version 4.4.2
Warning: package 'tidyr' was built under R version 4.4.1
Warning: package 'stringr' was built under R version 4.4.1
Warning: package 'forcats' was built under R version 4.4.1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Example datasethead(starwars)
# A tibble: 6 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sky… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth Va… 202 136 none white yellow 41.9 male mascu…
5 Leia Org… 150 49 brown light brown 19 fema… femin…
6 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
dplyr Syntax
Structure: verb(data, action)
Example:
# Filter species == 'Human'filter(starwars, species =="Human")
# A tibble: 35 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 Darth V… 202 136 none white yellow 41.9 male mascu…
3 Leia Or… 150 49 brown light brown 19 fema… femin…
4 Owen La… 178 120 brown, gr… light blue 52 male mascu…
5 Beru Wh… 165 75 brown light blue 47 fema… femin…
6 Biggs D… 183 84 black light brown 24 male mascu…
7 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
8 Anakin … 188 84 blond fair blue 41.9 male mascu…
9 Wilhuff… 180 NA auburn, g… fair blue 64 male mascu…
10 Han Solo 180 80 brown fair brown 29 male mascu…
# ℹ 25 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Comparison with base R:
# Base R equivalentstarwars[starwars$species =="Human", ]
# A tibble: 39 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 Darth V… 202 136 none white yellow 41.9 male mascu…
3 Leia Or… 150 49 brown light brown 19 fema… femin…
4 Owen La… 178 120 brown, gr… light blue 52 male mascu…
5 Beru Wh… 165 75 brown light blue 47 fema… femin…
6 Biggs D… 183 84 black light brown 24 male mascu…
7 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
8 Anakin … 188 84 blond fair blue 41.9 male mascu…
9 Wilhuff… 180 NA auburn, g… fair blue 64 male mascu…
10 Han Solo 180 80 brown fair brown 29 male mascu…
# ℹ 29 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Why prefer dplyr? - Cleaner syntax. - Easier to read and debug. - Chaining allows for modular pipelines.
The Pipe %>% Operator
%>% means “and then”.
Enables chaining operations for readable code.
# Example using the pipestarwars %>%filter(species =="Droid") %>%arrange(height)
# A tibble: 6 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 R2-D2 96 32 <NA> white, blue red 33 none masculi…
2 R4-P17 96 NA none silver, red red, blue NA none feminine
3 R5-D4 97 32 <NA> white, red red NA none masculi…
4 C-3PO 167 75 <NA> gold yellow 112 none masculi…
5 IG-88 200 140 none metal red 15 none masculi…
6 BB8 NA NA none none black NA none masculi…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Keyboard shortcut: Ctrl + Shift + M.
Benefits of the pipe: - Reduces intermediate variables. - Improves clarity and flow of analysis.
Selecting Columns: select()
Use select() to specify columns.
# Select columns by namestarwars %>%select(name, height, species)
# A tibble: 87 × 3
name height species
<chr> <int> <chr>
1 Luke Skywalker 172 Human
2 C-3PO 167 Droid
3 R2-D2 96 Droid
4 Darth Vader 202 Human
5 Leia Organa 150 Human
6 Owen Lars 178 Human
7 Beru Whitesun Lars 165 Human
8 R5-D4 97 Droid
9 Biggs Darklighter 183 Human
10 Obi-Wan Kenobi 182 Human
# ℹ 77 more rows
Remove columns using -:
# Remove list columnsstarwars %>%select(-films, -vehicles, -starships)
# A tibble: 87 × 11
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 2 more variables: homeworld <chr>, species <chr>
Advanced selection: - Select columns based on conditions:
# A tibble: 87 × 3
character_name species_type height
<chr> <chr> <int>
1 Luke Skywalker Human 172
2 C-3PO Droid 167
3 R2-D2 Droid 96
4 Darth Vader Human 202
5 Leia Organa Human 150
6 Owen Lars Human 178
7 Beru Whitesun Lars Human 165
8 R5-D4 Droid 97
9 Biggs Darklighter Human 183
10 Obi-Wan Kenobi Human 182
# ℹ 77 more rows
Sorting Data: arrange()
Sort rows based on column values.
# Arrange by heightstarwars %>%arrange(height)
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Yoda 66 17 white green brown 896 male mascu…
2 Ratts T… 79 15 none grey, blue unknown NA male mascu…
3 Wicket … 88 20 brown brown brown 8 male mascu…
4 Dud Bolt 94 45 none blue, grey yellow NA male mascu…
5 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
6 R4-P17 96 NA none silver, r… red, blue NA none femin…
7 R5-D4 97 32 <NA> white, red red NA none mascu…
8 Sebulba 112 40 none grey, red orange NA male mascu…
9 Gasgano 122 NA none white, bl… black NA male mascu…
10 Watto 137 NA black blue, grey yellow NA male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Yarael … 264 NA none white yellow NA male mascu…
2 Tarfful 234 136 brown brown blue NA male mascu…
3 Lama Su 229 88 none grey black NA male mascu…
4 Chewbac… 228 112 brown unknown blue 200 male mascu…
5 Roos Ta… 224 82 none grey orange NA male mascu…
6 Grievous 216 159 none brown, wh… green, y… NA male mascu…
7 Taun We 213 NA none grey black NA fema… femin…
8 Rugor N… 206 NA none green orange NA male mascu…
9 Tion Me… 206 80 none grey black NA male mascu…
10 Darth V… 202 136 none white yellow 41.9 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Multi-column sorting:
# Sort by height, then by massstarwars %>%arrange(height, mass)
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Yoda 66 17 white green brown 896 male mascu…
2 Ratts T… 79 15 none grey, blue unknown NA male mascu…
3 Wicket … 88 20 brown brown brown 8 male mascu…
4 Dud Bolt 94 45 none blue, grey yellow NA male mascu…
5 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
6 R4-P17 96 NA none silver, r… red, blue NA none femin…
7 R5-D4 97 32 <NA> white, red red NA none mascu…
8 Sebulba 112 40 none grey, red orange NA male mascu…
9 Gasgano 122 NA none white, bl… black NA male mascu…
10 Watto 137 NA black blue, grey yellow NA male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Filtering Rows: filter()
Subset rows based on conditions.
# Filter rows where species == 'Human'starwars %>%filter(species =="Human")
# A tibble: 35 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 Darth V… 202 136 none white yellow 41.9 male mascu…
3 Leia Or… 150 49 brown light brown 19 fema… femin…
4 Owen La… 178 120 brown, gr… light blue 52 male mascu…
5 Beru Wh… 165 75 brown light blue 47 fema… femin…
6 Biggs D… 183 84 black light brown 24 male mascu…
7 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
8 Anakin … 188 84 blond fair blue 41.9 male mascu…
9 Wilhuff… 180 NA auburn, g… fair blue 64 male mascu…
10 Han Solo 180 80 brown fair brown 29 male mascu…
# ℹ 25 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Handle missing values:
# Filter non-missing mass valuesstarwars %>%filter(!is.na(mass))
# A tibble: 59 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 49 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Complex filters: - Combine multiple conditions:
# Filter Humans taller than 170starwars %>%filter(species =="Human", height >170)
# A tibble: 21 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 Darth V… 202 136 none white yellow 41.9 male mascu…
3 Owen La… 178 120 brown, gr… light blue 52 male mascu…
4 Biggs D… 183 84 black light brown 24 male mascu…
5 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
6 Anakin … 188 84 blond fair blue 41.9 male mascu…
7 Wilhuff… 180 NA auburn, g… fair blue 64 male mascu…
8 Han Solo 180 80 brown fair brown 29 male mascu…
9 Boba Fe… 183 78.2 black fair brown 31.5 male mascu…
10 Lando C… 177 79 black dark brown 31 male mascu…
# ℹ 11 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Creating New Columns: mutate()
Add or transform columns.
# Calculate BMIstarwars %>%mutate(BMI = mass / (height /100)^2)
# A tibble: 87 × 15
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>, BMI <dbl>
# Summary stats for numeric columnsstarwars %>%summarise(across(where(is.numeric), mean, na.rm =TRUE))
Warning: There was 1 warning in `summarise()`.
ℹ In argument: `across(where(is.numeric), mean, na.rm = TRUE)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.
# Previously
across(a:b, mean, na.rm = TRUE)
# Now
across(a:b, \(x) mean(x, na.rm = TRUE))
# A tibble: 1 × 3
height mass birth_year
<dbl> <dbl> <dbl>
1 175. 97.3 87.6
# A tibble: 87 × 14
name height[,1] mass[,1] hair_color skin_color eye_color birth_year[,1]
<chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
1 Luke Skyw… -0.0749 -0.120 blond fair blue -0.443
2 C-3PO -0.219 -0.132 <NA> gold yellow 0.158
3 R2-D2 -2.26 -0.385 <NA> white, bl… red -0.353
4 Darth Vad… 0.788 0.228 none white yellow -0.295
5 Leia Orga… -0.708 -0.285 brown light brown -0.443
6 Owen Lars 0.0976 0.134 brown, gr… light blue -0.230
7 Beru Whit… -0.276 -0.132 brown light blue -0.262
8 R5-D4 -2.23 -0.385 <NA> white, red red NA
9 Biggs Dar… 0.241 -0.0786 black light brown -0.411
10 Obi-Wan K… 0.213 -0.120 auburn, w… fair blue-gray -0.198
# ℹ 77 more rows
# ℹ 7 more variables: sex <chr>, gender <chr>, homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
Case Study: Picking Basketball Teams
Select players based on height and mass.
# Filter tall and heavy playersstarwars %>%filter(height >180, mass >80) %>%arrange(desc(height))
# A tibble: 17 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Tarfful 234 136 brown brown blue NA male mascu…
2 Lama Su 229 88 none grey black NA male mascu…
3 Chewbac… 228 112 brown unknown blue 200 male mascu…
4 Roos Ta… 224 82 none grey orange NA male mascu…
5 Grievous 216 159 none brown, wh… green, y… NA male mascu…
6 Darth V… 202 136 none white yellow 41.9 male mascu…
7 IG-88 200 140 none metal red 15 none mascu…
8 Ki-Adi-… 198 82 white pale yellow 92 male mascu…
9 Dexter … 198 102 none brown yellow NA male mascu…
10 Kit Fis… 196 87 none green black NA male mascu…
11 Qui-Gon… 193 89 brown fair blue 92 male mascu…
12 Nute Gu… 191 90 none mottled g… red NA male mascu…
13 Bossk 190 113 none green red 53 male mascu…
14 Anakin … 188 84 blond fair blue 41.9 male mascu…
15 Mace Wi… 188 84 none dark brown 72 male mascu…
16 Gregar … 185 85 black dark brown NA <NA> <NA>
17 Biggs D… 183 84 black light brown 24 male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Additional criteria: - Filter specific species or hair color.
starwars %>%filter(height >180, mass >80, species =="Human")
# A tibble: 5 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Darth Va… 202 136 none white yellow 41.9 male mascu…
2 Biggs Da… 183 84 black light brown 24 male mascu…
3 Anakin S… 188 84 blond fair blue 41.9 male mascu…
4 Qui-Gon … 193 89 brown fair blue 92 male mascu…
5 Mace Win… 188 84 none dark brown 72 male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
# A tibble: 1 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Yarael P… 264 NA none white yellow NA male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Custom thresholds:
# Characters shorter than 120starwars %>%filter(height <120)
# A tibble: 8 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
2 R5-D4 97 32 <NA> white, red red NA none mascu…
3 Yoda 66 17 white green brown 896 male mascu…
4 Wicket S… 88 20 brown brown brown 8 male mascu…
5 Sebulba 112 40 none grey, red orange NA male mascu…
6 Ratts Ty… 79 15 none grey, blue unknown NA male mascu…
7 Dud Bolt 94 45 none blue, grey yellow NA male mascu…
8 R4-P17 96 NA none silver, r… red, blue NA none femin…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>