Data Wrangling in R with dplyr

Learn dplyr’s 5 main verbs: select, arrange, filter, mutate, summarise

Author

Yebelay Berehan

Introduction

  • dplyr is part of the tidyverse collection.
  • Provides tools to manipulate and summarize data.
  • Core verbs:
    • select()
    • arrange()
    • filter()
    • mutate()
    • summarise()
  • Example dataset: starwars (87 rows, 14 columns).

Why dplyr?

  • Simplifies data wrangling tasks.
  • Provides clear and concise code.
  • Integrates seamlessly with other tidyverse packages.
  • Optimized for large datasets.

dplyr and the tidyverse

  • tidyverse is a set of R packages designed for data science.
  • Main author: Hadley Wickham (Chief Scientist, RStudio).
  • Key features: Simplifies learning R.

What’s included in the tidyverse? - Core packages: ggplot2, tidyr, readr, purrr, tibble, stringr, and dplyr.

  • Extensible ecosystem with tools for data cleaning, visualization, and modeling.
# Load the tidyverse
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.4.1
Warning: package 'ggplot2' was built under R version 4.4.2
Warning: package 'tidyr' was built under R version 4.4.1
Warning: package 'stringr' was built under R version 4.4.1
Warning: package 'forcats' was built under R version 4.4.1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Example dataset
head(starwars)
# A tibble: 6 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
5 Leia Org…    150    49 brown      light      brown           19   fema… femin…
6 Owen Lars    178   120 brown, gr… light      blue            52   male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

dplyr Syntax

  • Structure: verb(data, action)
  • Example:
# Filter species == 'Human'
filter(starwars, species == "Human")
# A tibble: 35 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 3 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 4 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 5 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 6 Biggs D…    183    84 black      light      brown           24   male  mascu…
 7 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
 8 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
 9 Wilhuff…    180    NA auburn, g… fair       blue            64   male  mascu…
10 Han Solo    180    80 brown      fair       brown           29   male  mascu…
# ℹ 25 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
  • Comparison with base R:
# Base R equivalent
starwars[starwars$species == "Human", ]
# A tibble: 39 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 3 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 4 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 5 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 6 Biggs D…    183    84 black      light      brown           24   male  mascu…
 7 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
 8 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
 9 Wilhuff…    180    NA auburn, g… fair       blue            64   male  mascu…
10 Han Solo    180    80 brown      fair       brown           29   male  mascu…
# ℹ 29 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Why prefer dplyr? - Cleaner syntax. - Easier to read and debug. - Chaining allows for modular pipelines.


The Pipe %>% Operator

  • %>% means “and then”.
  • Enables chaining operations for readable code.
# Example using the pipe
starwars %>%
  filter(species == "Droid") %>%
  arrange(height)
# A tibble: 6 × 14
  name   height  mass hair_color skin_color  eye_color birth_year sex   gender  
  <chr>   <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>   
1 R2-D2      96    32 <NA>       white, blue red               33 none  masculi…
2 R4-P17     96    NA none       silver, red red, blue         NA none  feminine
3 R5-D4      97    32 <NA>       white, red  red               NA none  masculi…
4 C-3PO     167    75 <NA>       gold        yellow           112 none  masculi…
5 IG-88     200   140 none       metal       red               15 none  masculi…
6 BB8        NA    NA none       none        black             NA none  masculi…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
  • Keyboard shortcut: Ctrl + Shift + M.

Benefits of the pipe: - Reduces intermediate variables. - Improves clarity and flow of analysis.


Selecting Columns: select()

  • Use select() to specify columns.
# Select columns by name
starwars %>%
  select(name, height, species)
# A tibble: 87 × 3
   name               height species
   <chr>               <int> <chr>  
 1 Luke Skywalker        172 Human  
 2 C-3PO                 167 Droid  
 3 R2-D2                  96 Droid  
 4 Darth Vader           202 Human  
 5 Leia Organa           150 Human  
 6 Owen Lars             178 Human  
 7 Beru Whitesun Lars    165 Human  
 8 R5-D4                  97 Droid  
 9 Biggs Darklighter     183 Human  
10 Obi-Wan Kenobi        182 Human  
# ℹ 77 more rows
  • Remove columns using -:
# Remove list columns
starwars %>%
  select(-films, -vehicles, -starships)
# A tibble: 87 × 11
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 2 more variables: homeworld <chr>, species <chr>

Advanced selection: - Select columns based on conditions:

# Select numeric columns
starwars %>%
  select(where(is.numeric))
# A tibble: 87 × 3
   height  mass birth_year
    <int> <dbl>      <dbl>
 1    172    77       19  
 2    167    75      112  
 3     96    32       33  
 4    202   136       41.9
 5    150    49       19  
 6    178   120       52  
 7    165    75       47  
 8     97    32       NA  
 9    183    84       24  
10    182    77       57  
# ℹ 77 more rows

Renaming Columns: rename()

  • Use rename() to change column names.
# Rename columns
starwars %>%
  rename(character_name = name, species_type = species)
# A tibble: 87 × 14
   character_name  height  mass hair_color skin_color eye_color birth_year sex  
   <chr>            <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1 Luke Skywalker     172    77 blond      fair       blue            19   male 
 2 C-3PO              167    75 <NA>       gold       yellow         112   none 
 3 R2-D2               96    32 <NA>       white, bl… red             33   none 
 4 Darth Vader        202   136 none       white      yellow          41.9 male 
 5 Leia Organa        150    49 brown      light      brown           19   fema…
 6 Owen Lars          178   120 brown, gr… light      blue            52   male 
 7 Beru Whitesun …    165    75 brown      light      blue            47   fema…
 8 R5-D4               97    32 <NA>       white, red red             NA   none 
 9 Biggs Darkligh…    183    84 black      light      brown           24   male 
10 Obi-Wan Kenobi     182    77 auburn, w… fair       blue-gray       57   male 
# ℹ 77 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species_type <chr>,
#   films <list>, vehicles <list>, starships <list>

Pro tip: Combine rename() with select() to reorder and rename in one step:

starwars %>%
  select(character_name = name, species_type = species, height)
# A tibble: 87 × 3
   character_name     species_type height
   <chr>              <chr>         <int>
 1 Luke Skywalker     Human           172
 2 C-3PO              Droid           167
 3 R2-D2              Droid            96
 4 Darth Vader        Human           202
 5 Leia Organa        Human           150
 6 Owen Lars          Human           178
 7 Beru Whitesun Lars Human           165
 8 R5-D4              Droid            97
 9 Biggs Darklighter  Human           183
10 Obi-Wan Kenobi     Human           182
# ℹ 77 more rows

Sorting Data: arrange()

  • Sort rows based on column values.
# Arrange by height
starwars %>%
  arrange(height)
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Yoda         66    17 white      green      brown            896 male  mascu…
 2 Ratts T…     79    15 none       grey, blue unknown           NA male  mascu…
 3 Wicket …     88    20 brown      brown      brown              8 male  mascu…
 4 Dud Bolt     94    45 none       blue, grey yellow            NA male  mascu…
 5 R2-D2        96    32 <NA>       white, bl… red               33 none  mascu…
 6 R4-P17       96    NA none       silver, r… red, blue         NA none  femin…
 7 R5-D4        97    32 <NA>       white, red red               NA none  mascu…
 8 Sebulba     112    40 none       grey, red  orange            NA male  mascu…
 9 Gasgano     122    NA none       white, bl… black             NA male  mascu…
10 Watto       137    NA black      blue, grey yellow            NA male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
# Arrange descending
starwars %>%
  arrange(desc(height))
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Yarael …    264    NA none       white      yellow          NA   male  mascu…
 2 Tarfful     234   136 brown      brown      blue            NA   male  mascu…
 3 Lama Su     229    88 none       grey       black           NA   male  mascu…
 4 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
 5 Roos Ta…    224    82 none       grey       orange          NA   male  mascu…
 6 Grievous    216   159 none       brown, wh… green, y…       NA   male  mascu…
 7 Taun We     213    NA none       grey       black           NA   fema… femin…
 8 Rugor N…    206    NA none       green      orange          NA   male  mascu…
 9 Tion Me…    206    80 none       grey       black           NA   male  mascu…
10 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Multi-column sorting:

# Sort by height, then by mass
starwars %>%
  arrange(height, mass)
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Yoda         66    17 white      green      brown            896 male  mascu…
 2 Ratts T…     79    15 none       grey, blue unknown           NA male  mascu…
 3 Wicket …     88    20 brown      brown      brown              8 male  mascu…
 4 Dud Bolt     94    45 none       blue, grey yellow            NA male  mascu…
 5 R2-D2        96    32 <NA>       white, bl… red               33 none  mascu…
 6 R4-P17       96    NA none       silver, r… red, blue         NA none  femin…
 7 R5-D4        97    32 <NA>       white, red red               NA none  mascu…
 8 Sebulba     112    40 none       grey, red  orange            NA male  mascu…
 9 Gasgano     122    NA none       white, bl… black             NA male  mascu…
10 Watto       137    NA black      blue, grey yellow            NA male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Filtering Rows: filter()

  • Subset rows based on conditions.
# Filter rows where species == 'Human'
starwars %>%
  filter(species == "Human")
# A tibble: 35 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 3 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 4 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 5 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 6 Biggs D…    183    84 black      light      brown           24   male  mascu…
 7 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
 8 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
 9 Wilhuff…    180    NA auburn, g… fair       blue            64   male  mascu…
10 Han Solo    180    80 brown      fair       brown           29   male  mascu…
# ℹ 25 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
  • Handle missing values:
# Filter non-missing mass values
starwars %>%
  filter(!is.na(mass))
# A tibble: 59 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 49 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Complex filters: - Combine multiple conditions:

# Filter Humans taller than 170
starwars %>%
  filter(species == "Human", height > 170)
# A tibble: 21 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172  77   blond      fair       blue            19   male  mascu…
 2 Darth V…    202 136   none       white      yellow          41.9 male  mascu…
 3 Owen La…    178 120   brown, gr… light      blue            52   male  mascu…
 4 Biggs D…    183  84   black      light      brown           24   male  mascu…
 5 Obi-Wan…    182  77   auburn, w… fair       blue-gray       57   male  mascu…
 6 Anakin …    188  84   blond      fair       blue            41.9 male  mascu…
 7 Wilhuff…    180  NA   auburn, g… fair       blue            64   male  mascu…
 8 Han Solo    180  80   brown      fair       brown           29   male  mascu…
 9 Boba Fe…    183  78.2 black      fair       brown           31.5 male  mascu…
10 Lando C…    177  79   black      dark       brown           31   male  mascu…
# ℹ 11 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Creating New Columns: mutate()

  • Add or transform columns.
# Calculate BMI
starwars %>%
  mutate(BMI = mass / (height / 100)^2)
# A tibble: 87 × 15
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>, BMI <dbl>

Multiple transformations:

# Add multiple columns
starwars %>%
  mutate(BMI = mass / (height / 100)^2,
         height_m = height / 100)
# A tibble: 87 × 16
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 7 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>, BMI <dbl>, height_m <dbl>
  • Use window functions:
# Add rank by height
starwars %>%
  mutate(rank_height = min_rank(height))
# A tibble: 87 × 15
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>, rank_height <int>

Aggregating Data: summarise()

  • Compute summary statistics.
# Average height
starwars %>%
  summarise(avg_height = mean(height, na.rm = TRUE))
# A tibble: 1 × 1
  avg_height
       <dbl>
1       175.
  • Grouped summaries with group_by():
# Average height by species
starwars %>%
  group_by(species) %>%
  summarise(avg_height = mean(height, na.rm = TRUE))
# A tibble: 38 × 2
   species   avg_height
   <chr>          <dbl>
 1 Aleena           79 
 2 Besalisk        198 
 3 Cerean          198 
 4 Chagrian        196 
 5 Clawdite        168 
 6 Droid           131.
 7 Dug             112 
 8 Ewok             88 
 9 Geonosian       183 
10 Gungan          209.
# ℹ 28 more rows

Using multiple aggregations:

# Count and average height
starwars %>%
  group_by(species) %>%
  summarise(count = n(), avg_height = mean(height, na.rm = TRUE))
# A tibble: 38 × 3
   species   count avg_height
   <chr>     <int>      <dbl>
 1 Aleena        1        79 
 2 Besalisk      1       198 
 3 Cerean        1       198 
 4 Chagrian      1       196 
 5 Clawdite      1       168 
 6 Droid         6       131.
 7 Dug           1       112 
 8 Ewok          1        88 
 9 Geonosian     1       183 
10 Gungan        3       209.
# ℹ 28 more rows

Applying Functions Across Columns: across()

  • Apply a function to multiple columns.
# Summary stats for numeric columns
starwars %>%
  summarise(across(where(is.numeric), mean, na.rm = TRUE))
Warning: There was 1 warning in `summarise()`.
ℹ In argument: `across(where(is.numeric), mean, na.rm = TRUE)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))
# A tibble: 1 × 3
  height  mass birth_year
   <dbl> <dbl>      <dbl>
1   175.  97.3       87.6

Combining mutate() and across():

# Standardize numeric columns
starwars %>%
  mutate(across(where(is.numeric), scale))
# A tibble: 87 × 14
   name       height[,1] mass[,1] hair_color skin_color eye_color birth_year[,1]
   <chr>           <dbl>    <dbl> <chr>      <chr>      <chr>              <dbl>
 1 Luke Skyw…    -0.0749  -0.120  blond      fair       blue              -0.443
 2 C-3PO         -0.219   -0.132  <NA>       gold       yellow             0.158
 3 R2-D2         -2.26    -0.385  <NA>       white, bl… red               -0.353
 4 Darth Vad…     0.788    0.228  none       white      yellow            -0.295
 5 Leia Orga…    -0.708   -0.285  brown      light      brown             -0.443
 6 Owen Lars      0.0976   0.134  brown, gr… light      blue              -0.230
 7 Beru Whit…    -0.276   -0.132  brown      light      blue              -0.262
 8 R5-D4         -2.23    -0.385  <NA>       white, red red               NA    
 9 Biggs Dar…     0.241   -0.0786 black      light      brown             -0.411
10 Obi-Wan K…     0.213   -0.120  auburn, w… fair       blue-gray         -0.198
# ℹ 77 more rows
# ℹ 7 more variables: sex <chr>, gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

Case Study: Picking Basketball Teams

  • Select players based on height and mass.
# Filter tall and heavy players
starwars %>%
  filter(height > 180, mass > 80) %>%
  arrange(desc(height))
# A tibble: 17 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Tarfful     234   136 brown      brown      blue            NA   male  mascu…
 2 Lama Su     229    88 none       grey       black           NA   male  mascu…
 3 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
 4 Roos Ta…    224    82 none       grey       orange          NA   male  mascu…
 5 Grievous    216   159 none       brown, wh… green, y…       NA   male  mascu…
 6 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 7 IG-88       200   140 none       metal      red             15   none  mascu…
 8 Ki-Adi-…    198    82 white      pale       yellow          92   male  mascu…
 9 Dexter …    198   102 none       brown      yellow          NA   male  mascu…
10 Kit Fis…    196    87 none       green      black           NA   male  mascu…
11 Qui-Gon…    193    89 brown      fair       blue            92   male  mascu…
12 Nute Gu…    191    90 none       mottled g… red             NA   male  mascu…
13 Bossk       190   113 none       green      red             53   male  mascu…
14 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
15 Mace Wi…    188    84 none       dark       brown           72   male  mascu…
16 Gregar …    185    85 black      dark       brown           NA   <NA>  <NA>  
17 Biggs D…    183    84 black      light      brown           24   male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Additional criteria: - Filter specific species or hair color.

starwars %>%
  filter(height > 180, mass > 80, species == "Human")
# A tibble: 5 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
2 Biggs Da…    183    84 black      light      brown           24   male  mascu…
3 Anakin S…    188    84 blond      fair       blue            41.9 male  mascu…
4 Qui-Gon …    193    89 brown      fair       blue            92   male  mascu…
5 Mace Win…    188    84 none       dark       brown           72   male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Case Study: Finding Outliers

  • Detect unusually tall characters.
# Outliers in height
starwars %>%
  filter(height > mean(height, na.rm = TRUE) + 2 * sd(height, na.rm = TRUE))
# A tibble: 1 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Yarael P…    264    NA none       white      yellow            NA male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Custom thresholds:

# Characters shorter than 120
starwars %>%
  filter(height < 120)
# A tibble: 8 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 R2-D2         96    32 <NA>       white, bl… red               33 none  mascu…
2 R5-D4         97    32 <NA>       white, red red               NA none  mascu…
3 Yoda          66    17 white      green      brown            896 male  mascu…
4 Wicket S…     88    20 brown      brown      brown              8 male  mascu…
5 Sebulba      112    40 none       grey, red  orange            NA male  mascu…
6 Ratts Ty…     79    15 none       grey, blue unknown           NA male  mascu…
7 Dud Bolt      94    45 none       blue, grey yellow            NA male  mascu…
8 R4-P17        96    NA none       silver, r… red, blue         NA none  femin…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Reshaping Data: tidyr Pivots

  • Transform data from wide to long or vice versa.
# Pivot example
library(tidyr)

# Pivot wider
starwars %>%
  pivot_wider(names_from = species, values_from = height)
# A tibble: 87 × 50
   name   mass hair_color skin_color eye_color birth_year sex   gender homeworld
   <chr> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>  <chr>    
 1 Luke…    77 blond      fair       blue            19   male  mascu… Tatooine 
 2 C-3PO    75 <NA>       gold       yellow         112   none  mascu… Tatooine 
 3 R2-D2    32 <NA>       white, bl… red             33   none  mascu… Naboo    
 4 Dart…   136 none       white      yellow          41.9 male  mascu… Tatooine 
 5 Leia…    49 brown      light      brown           19   fema… femin… Alderaan 
 6 Owen…   120 brown, gr… light      blue            52   male  mascu… Tatooine 
 7 Beru…    75 brown      light      blue            47   fema… femin… Tatooine 
 8 R5-D4    32 <NA>       white, red red             NA   none  mascu… Tatooine 
 9 Bigg…    84 black      light      brown           24   male  mascu… Tatooine 
10 Obi-…    77 auburn, w… fair       blue-gray       57   male  mascu… Stewjon  
# ℹ 77 more rows
# ℹ 41 more variables: films <list>, vehicles <list>, starships <list>,
#   Human <int>, Droid <int>, Wookiee <int>, Rodian <int>, Hutt <int>,
#   `NA` <int>, `Yoda's species` <int>, Trandoshan <int>, `Mon Calamari` <int>,
#   Ewok <int>, Sullustan <int>, Neimodian <int>, Gungan <int>,
#   Toydarian <int>, Dug <int>, Zabrak <int>, `Twi'lek` <int>, Aleena <int>,
#   Vulptereen <int>, Xexto <int>, Toong <int>, Cerean <int>, Nautolan <int>, …

Combine dplyr and tidyr:

Calculate average height, pivot results

starwars %>% group_by(species) %>% summarise(avg_height = mean(height, na.rm = TRUE)) %>% pivot_wider(names_from = species, values_from = avg_height)