dplyr Exercises with yrbss Dataset

Exercises based on the yrbss dataset, focusing on dplyr functions.

For each question, write the appropriate dplyr code and provide a brief explanation of your approach.

library(readr)
yrbss <- read_csv("data/yrbss.csv")

Rows: 20000 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): age, sex, grade, race4, race7
dbl (3): record, bmi, stweight

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Q1: Select only the age, sex, grade, and bmi columns from the yrbss dataset.

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

yrbss_selected <- yrbss |> select(age, sex, grade, bmi)
head(yrbss_selected)

# A tibble: 6 × 4
  age                   sex    grade   bmi
  <chr>                 <chr>  <chr> <dbl>
1 15 years old          Female 10th   17.2
2 17 years old          Female 12th   20.2
3 18 years old or older Male   11th   NA  
4 15 years old          Male   10th   28.0
5 14 years old          Male   9th    24.5
6 17 years old          Male   9th    NA

Explanation: We use select() to keep only the specified columns.
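As a small illustration, select() can also reorder columns or drop them with a minus sign. The sketch below uses a made-up `toy` tibble (hypothetical data, not the real yrbss file), so it runs without the CSV.

```r
library(dplyr)

# Hypothetical toy data standing in for yrbss
toy <- tibble(
  record = c(1, 2, 3),
  age    = c("15 years old", "17 years old", "14 years old"),
  sex    = c("Female", "Male", "Male"),
  bmi    = c(17.2, NA, 24.5)
)

# Keep columns by name; the output follows the order given here
kept <- toy |> select(sex, age, bmi)

# Drop one column instead of listing the ones to keep
dropped <- toy |> select(-record)

names(kept)
names(dropped)
```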

Q2: Rename the column `stweight` to `self_reported_weight`.

yrbss_renamed <- yrbss |> rename(self_reported_weight = stweight)
head(yrbss_renamed)

# A tibble: 6 × 8
   record age                 sex   grade race4 race7   bmi self_reported_weight
    <dbl> <chr>               <chr> <chr> <chr> <chr> <dbl>                <dbl>
1  931897 15 years old        Fema… 10th  White White  17.2                 54.4
2  333862 17 years old        Fema… 12th  White White  20.2                 57.2
3   36253 18 years old or ol… Male  11th  Hisp… Hisp…  NA                   NA  
4 1095530 15 years old        Male  10th  Blac… Blac…  28.0                 85.7
5 1303997 14 years old        Male  9th   All … Mult…  24.5                 66.7
6  261619 17 years old        Male  9th   All … <NA>   NA                   NA

Explanation: rename() takes new_name = old_name pairs. It changes the column name and leaves the data and the other columns unchanged.

Q3: Filter the dataset to include only students in the 12th grade.

yrbss_12th <- yrbss |> filter(grade == "12th")
head(yrbss_12th)

# A tibble: 6 × 8
   record age                   sex    grade race4          race7   bmi stweight
    <dbl> <chr>                 <chr>  <chr> <chr>          <chr> <dbl>    <dbl>
1  333862 17 years old          Female 12th  White          White  20.2     57.2
2 1309082 17 years old          Male   12th  White          White  19.3     59.0
3  506337 18 years old or older Male   12th  Hispanic/Lati… Hisp…  33.1    123. 
4  938291 18 years old or older Female 12th  White          White  21.7     64.9
5 1316277 18 years old or older Female 12th  White          White  21.6     49.9
6 1101972 18 years old or older Male   12th  White          White  21.8     69.0

Explanation: filter() selects rows where grade is “12th”.

Q4: Filter the dataset for male students who are 17 years old.

yrbss_male_17 <- yrbss |> filter(sex == "Male", age == "17 years old")
head(yrbss_male_17)

# A tibble: 6 × 8
   record age          sex   grade race4           race7            bmi stweight
    <dbl> <chr>        <chr> <chr> <chr>           <chr>          <dbl>    <dbl>
1  261619 17 years old Male  9th   All other races <NA>            NA       NA  
2 1309082 17 years old Male  12th  White           White           19.3     59.0
3  326942 17 years old Male  11th  All other races Am Indian / A…  21.7     61.2
4  183965 17 years old Male  12th  White           White           NA       NA  
5 1309051 17 years old Male  11th  White           White           21.0     70.3
6 1307670 17 years old Male  12th  Hispanic/Latino Hispanic/Lati…  19.6     55.3

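Explanation: filter() keeps only rows that satisfy every condition; conditions separated by commas are combined with AND. Because age is a character column, the comparison is against the exact string "17 years old".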
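To see how multiple conditions combine, here is a minimal sketch on a made-up `toy` tibble (hypothetical data). It compares comma-separated conditions with an explicit `&`; both forms should keep the same rows.

```r
library(dplyr)

# Hypothetical toy data; the last row has a missing sex
toy <- tibble(
  sex = c("Male", "Male", "Female", NA),
  age = c("17 years old", "16 years old", "17 years old", "17 years old")
)

# Conditions separated by commas must all be TRUE
with_comma <- toy |> filter(sex == "Male", age == "17 years old")

# The same filter written with an explicit logical AND
with_and <- toy |> filter(sex == "Male" & age == "17 years old")

with_comma
with_and
```

The row with a missing sex is dropped in both versions, because a condition that evaluates to NA is treated as not satisfied.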

Q5: Filter the dataset to keep only students with BMI greater than 25.

yrbss_high_bmi <- yrbss |> filter(bmi > 25)
head(yrbss_high_bmi)

# A tibble: 6 × 8
   record age                   sex    grade race4          race7   bmi stweight
    <dbl> <chr>                 <chr>  <chr> <chr>          <chr> <dbl>    <dbl>
1 1095530 15 years old          Male   10th  Black or Afri… Blac…  28.0     85.7
2  506337 18 years old or older Male   12th  Hispanic/Lati… Hisp…  33.1    123. 
3  770177 16 years old          Female 10th  White          White  32.4     86.2
4 1306691 16 years old          Male   11th  White          White  28.3    102. 
5  924270 15 years old          Male   9th   All other rac… Asian  30.7     81.6
6 1105438 16 years old          Female 11th  All other rac… Mult…  30.7     81.6

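Explanation: filter(bmi > 25) keeps only rows where BMI is greater than 25. Rows with BMI ≤ 25 are dropped, and so are rows with a missing BMI, because the comparison evaluates to NA for them.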
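The handling of missing values is easy to overlook, so here is a short sketch on made-up values (the `toy` tibble is hypothetical). It shows what a comparison does with NA and how missing rows could be kept explicitly if that were wanted.

```r
library(dplyr)

# Hypothetical BMI values, including one missing value and one exactly at 25
toy <- tibble(bmi = c(17.2, 28.0, NA, 25.0, 33.1))

# NA > 25 is NA, so the missing row is dropped along with 17.2 and 25.0
above_25 <- toy |> filter(bmi > 25)

# Rows with no BMI recorded
missing_bmi <- toy |> filter(is.na(bmi))

# Keep the missing rows on purpose by naming them in the condition
above_or_missing <- toy |> filter(bmi > 25 | is.na(bmi))

nrow(above_25)
nrow(missing_bmi)
nrow(above_or_missing)
```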

Q6: Arrange the dataset by BMI in descending order.

yrbss_sorted <- yrbss |> arrange(desc(bmi))
head(yrbss_sorted)

# A tibble: 6 × 8
   record age                   sex    grade race4          race7   bmi stweight
    <dbl> <chr>                 <chr>  <chr> <chr>          <chr> <dbl>    <dbl>
1  324452 16 years old          Male   11th  Black or Afri… Blac…  53.9     91.2
2 1310082 18 years old or older Male   11th  Black or Afri… Blac…  53.5    160. 
3  328160 18 years old or older Male   <NA>  Black or Afri… Blac…  53.4    128. 
4 1315913 17 years old          Female 12th  Black or Afri… Blac…  53.3    142. 
5 1094597 13 years old          Male   9th   All other rac… Asian  52.9    181. 
6 1305503 15 years old          Male   9th   All other rac… Am I…  52.4    134.

Explanation: arrange(desc(bmi)) sorts rows from highest to lowest BMI. Rows with a missing BMI are placed at the end.
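A quick sketch on made-up values (the `toy` tibble and its `id` column are hypothetical) shows where arrange() puts missing values in both sort directions.

```r
library(dplyr)

# Hypothetical data with one missing BMI
toy <- tibble(id = 1:4, bmi = c(24.5, NA, 53.9, 17.2))

# Highest BMI first; the NA row still ends up last
sorted_desc <- toy |> arrange(desc(bmi))

# Lowest BMI first; the NA row is last here too
sorted_asc <- toy |> arrange(bmi)

sorted_desc
sorted_asc
```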

Q7: Find the average BMI for each grade.

yrbss_avg_bmi <- yrbss |> group_by(grade) |> summarize(avg_bmi = mean(bmi, na.rm = TRUE))
yrbss_avg_bmi

# A tibble: 5 × 2
  grade avg_bmi
  <chr>   <dbl>
1 10th     23.2
2 11th     23.8
3 12th     24.2
4 9th      22.8
5 <NA>     23.5

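Explanation: group_by(grade) groups the data by grade, and summarize() calculates the mean BMI within each group. na.rm = TRUE ignores missing BMI values, and students with a missing grade form their own <NA> group. Because grade is a character column, the rows are ordered alphabetically, which is why 9th appears after 12th.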
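If the grades should appear in school order rather than alphabetical order, one option is to convert grade to a factor with explicit levels before grouping. This is a sketch on made-up data (the `toy` tibble is hypothetical), not part of the original exercise.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  grade = c("9th", "10th", "10th", "12th", "9th", "11th"),
  bmi   = c(22.0, 23.0, 25.0, 24.0, NA, 23.5)
)

avg_by_grade <- toy |>
  # Factor levels define the group order used by group_by()
  mutate(grade = factor(grade, levels = c("9th", "10th", "11th", "12th"))) |>
  group_by(grade) |>
  summarize(avg_bmi = mean(bmi, na.rm = TRUE))

avg_by_grade
```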

Q8: Count how many students belong to each `race4` category.

yrbss_race_counts <- yrbss |> count(race4)
yrbss_race_counts

# A tibble: 5 × 2
  race4                         n
  <chr>                     <int>
1 All other races            4713
2 Black or African American  4093
3 Hispanic/Latino            4670
4 White                      5814
5 <NA>                        710

Explanation: count(race4) returns the number of rows for each race4 value, including a separate row for missing values (<NA>).
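count() is roughly a shortcut for group_by() followed by summarize(n = n()). The sketch below, on made-up data (the `toy` tibble is hypothetical), shows both forms and the sort = TRUE argument, which orders the result by descending count.

```r
library(dplyr)

# Hypothetical toy data with one missing race4 value
toy <- tibble(
  race4 = c("White", "White", "Hispanic/Latino", NA, "White", "Hispanic/Latino")
)

# Largest category first
counts_sorted <- toy |> count(race4, sort = TRUE)

# The longer equivalent, ordered by race4 instead of by count
counts_manual <- toy |>
  group_by(race4) |>
  summarize(n = n())

counts_sorted
counts_manual
```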

Q9: Create a new column `bmi_category` using the following rules:

BMI < 18.5 → “Underweight”
BMI 18.5 - 24.9 → “Normal weight”
BMI 25 - 29.9 → “Overweight”
BMI ≥ 30 → “Obese”

yrbss_bmi_category <- yrbss |> 
  mutate(bmi_category = case_when(
    bmi < 18.5 ~ "Underweight",
    bmi >= 18.5 & bmi < 25 ~ "Normal weight",
    bmi >= 25 & bmi < 30 ~ "Overweight",
    bmi >= 30 ~ "Obese"
  ))
head(yrbss_bmi_category)

# A tibble: 6 × 9
   record age                sex   grade race4 race7   bmi stweight bmi_category
    <dbl> <chr>              <chr> <chr> <chr> <chr> <dbl>    <dbl> <chr>       
1  931897 15 years old       Fema… 10th  White White  17.2     54.4 Underweight 
2  333862 17 years old       Fema… 12th  White White  20.2     57.2 Normal weig…
3   36253 18 years old or o… Male  11th  Hisp… Hisp…  NA       NA   <NA>        
4 1095530 15 years old       Male  10th  Blac… Blac…  28.0     85.7 Overweight  
5 1303997 14 years old       Male  9th   All … Mult…  24.5     66.7 Normal weig…
6  261619 17 years old       Male  9th   All … <NA>   NA       NA   <NA>

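Explanation: case_when() checks the conditions in order and assigns the label of the first one that is TRUE. Using bmi < 25 and bmi < 30 as upper bounds covers the stated ranges without leaving gaps between 24.9 and 25 or between 29.9 and 30. Rows with a missing BMI match no condition and get NA.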
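Because case_when() stops at the first matching condition, the lower bounds can be left out once the conditions are ordered. The sketch below uses made-up boundary values (the `toy` tibble is hypothetical) and, unlike the answer above, labels missing BMI as "Unknown"; that label is an assumption for illustration only.

```r
library(dplyr)

# Hypothetical values chosen to sit on or near the category boundaries
toy <- tibble(bmi = c(17.2, 18.5, 24.95, 25, 30, NA))

categorized <- toy |>
  mutate(bmi_category = case_when(
    is.na(bmi) ~ "Unknown",        # hypothetical label, not in the exercise
    bmi < 18.5 ~ "Underweight",
    bmi < 25   ~ "Normal weight",  # reached only when bmi >= 18.5
    bmi < 30   ~ "Overweight",     # reached only when bmi >= 25
    TRUE       ~ "Obese"           # everything left is >= 30
  ))

categorized
```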

Q10: Calculate the average BMI for each combination of `sex` and `race4`.

yrbss_avg_bmi_sex_race <- yrbss |> 
  group_by(sex, race4) |> 
  summarize(avg_bmi = mean(bmi, na.rm = TRUE))

`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by sex and race4.
ℹ Output is grouped by sex.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(sex, race4))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.

yrbss_avg_bmi_sex_race

# A tibble: 15 × 3
# Groups:   sex [3]
   sex    race4                     avg_bmi
   <chr>  <chr>                       <dbl>
 1 Female All other races              22.8
 2 Female Black or African American    24.6
 3 Female Hispanic/Latino              23.6
 4 Female White                        22.7
 5 Female <NA>                         23.5
 6 Male   All other races              23.4
 7 Male   Black or African American    24.0
 8 Male   Hispanic/Latino              24.3
 9 Male   White                        23.3
10 Male   <NA>                         23.1
11 <NA>   All other races             NaN  
12 <NA>   Black or African American   NaN  
13 <NA>   Hispanic/Latino             NaN  
14 <NA>   White                       NaN  
15 <NA>   <NA>                        NaN

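Explanation: group_by(sex, race4) groups the data by both columns before calculating the mean BMI. The NaN values in the rows where sex is <NA> arise because those groups contain no non-missing BMI values, so mean() with na.rm = TRUE is taken over zero values.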
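One way to avoid both the regrouping message and the NaN rows is to drop rows with a missing sex first and pass .groups = "drop" to summarize(). This is a sketch on made-up data (the `toy` tibble is hypothetical); whether rows with a missing sex should be excluded depends on the analysis.

```r
library(dplyr)

# Hypothetical toy data; the last row has neither sex nor BMI
toy <- tibble(
  sex   = c("Female", "Female", "Male", "Male", NA),
  race4 = c("White", "White", "White", "Hispanic/Latino", "White"),
  bmi   = c(20, 24, 23, 25, NA)
)

# The mean of zero values is NaN, which is where the NaN rows come from
empty_mean <- mean(NA_real_, na.rm = TRUE)

avg_by_sex_race <- toy |>
  filter(!is.na(sex)) |>
  group_by(sex, race4) |>
  # .groups = "drop" returns an ungrouped tibble and silences the message
  summarize(avg_bmi = mean(bmi, na.rm = TRUE), .groups = "drop")

avg_by_sex_race
```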

Q11: Find the grade with the highest average BMI.

yrbss_highest_bmi_grade <- yrbss_avg_bmi |> arrange(desc(avg_bmi)) |> head(1)
yrbss_highest_bmi_grade

# A tibble: 1 × 2
  grade avg_bmi
  <chr>   <dbl>
1 12th     24.2

Explanation: Starting from the per-grade averages computed in Q7 (yrbss_avg_bmi), arrange(desc(avg_bmi)) sorts them from highest to lowest and head(1) takes the first row.
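An alternative to sorting and taking the first row is slice_max(), which picks the row with the largest value directly. The sketch below rebuilds the Q7 result by hand from the printed output above (rounded values) and excludes the <NA> grade, which is an assumption about what "grade" should mean here.

```r
library(dplyr)

# Per-grade averages copied from the Q7 output (rounded as printed)
avg_by_grade <- tibble(
  grade   = c("10th", "11th", "12th", "9th", NA),
  avg_bmi = c(23.2, 23.8, 24.2, 22.8, 23.5)
)

top_grade <- avg_by_grade |>
  filter(!is.na(grade)) |>      # assumption: ignore students with no grade
  slice_max(avg_bmi, n = 1)     # row with the highest average BMI

top_grade
```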

Q12: Create a new dataset with only students in 11th and 12th grade with a BMI above the average BMI of the entire dataset.

avg_bmi_overall <- mean(yrbss$bmi, na.rm = TRUE)
yrbss_high_bmi_grades <- yrbss |> 
  filter(grade %in% c("11th", "12th"), bmi > avg_bmi_overall)
head(yrbss_high_bmi_grades)

# A tibble: 6 × 8
   record age                   sex    grade race4          race7   bmi stweight
    <dbl> <chr>                 <chr>  <chr> <chr>          <chr> <dbl>    <dbl>
1  506337 18 years old or older Male   12th  Hispanic/Lati… Hisp…  33.1    123. 
2 1306691 16 years old          Male   11th  White          White  28.3    102. 
3 1105438 16 years old          Female 11th  All other rac… Mult…  30.7     81.6
4 1107831 17 years old          Female 11th  All other rac… Asian  24.5     56.7
5 1313627 16 years old          Female 11th  All other rac… Asian  24.3     59.9
6  925175 16 years old          Male   11th  Black or Afri… Blac…  32.8     89.4

Explanation: We calculate the overall average BMI first, then filter for grades 11 & 12 with BMI above this value.
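The order of operations matters here. The sketch below, on made-up data (the `toy` tibble is hypothetical), contrasts a single filter() call, where mean(bmi) is computed over all rows, with two chained filter() calls, where the mean is computed only over the 11th and 12th graders.

```r
library(dplyr)

# Hypothetical toy data; overall mean BMI is (30 + 20 + 26 + 24) / 4 = 25
toy <- tibble(
  grade = c("9th", "11th", "12th", "12th", "11th"),
  bmi   = c(30, 20, 26, NA, 24)
)

# Both conditions are evaluated on the full data, so the mean is the overall mean
above_overall <- toy |>
  filter(grade %in% c("11th", "12th"), bmi > mean(bmi, na.rm = TRUE))

# Here the second filter() only sees 11th and 12th graders, so the
# threshold becomes their own mean, (20 + 26 + 24) / 3
above_subset <- toy |>
  filter(grade %in% c("11th", "12th")) |>
  filter(bmi > mean(bmi, na.rm = TRUE))

nrow(above_overall)
nrow(above_subset)
```

Computing the overall average first and storing it, as in the answer above, makes the intended threshold explicit.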

Q13: Create a Summary Table

Generate a summary table that shows:

- The total number of students
- The average BMI
- The proportion of male and female students

yrbss_summary <- yrbss |> 
  summarize(
    total_students = n(),
    avg_bmi = mean(bmi, na.rm = TRUE),
    prop_male = mean(sex == "Male", na.rm = TRUE),
    prop_female = mean(sex == "Female", na.rm = TRUE)
  )
yrbss_summary

# A tibble: 1 × 4
  total_students avg_bmi prop_male prop_female
           <int>   <dbl>     <dbl>       <dbl>
1          20000    23.5     0.515       0.485

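Explanation: summarize() without group_by() collapses the whole dataset into one row. n() counts all rows, mean(bmi, na.rm = TRUE) averages the non-missing BMI values, and mean(sex == "Male", na.rm = TRUE) gives the proportion of males among students whose sex is recorded, so the two proportions sum to 1.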
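The choice of denominator changes the result when sex is missing for some rows. This sketch on made-up data (the `toy` tibble and the column names `prop_male_known` and `prop_male_all` are hypothetical) computes the proportion both ways.

```r
library(dplyr)

# Hypothetical toy data: two males, one female, one missing
toy <- tibble(sex = c("Male", "Male", "Female", NA))

props <- toy |>
  summarize(
    total_students  = n(),
    n_missing_sex   = sum(is.na(sex)),
    # Denominator: rows with a recorded sex (2 of 3)
    prop_male_known = mean(sex == "Male", na.rm = TRUE),
    # Denominator: all rows, including the missing one (2 of 4)
    prop_male_all   = sum(sex == "Male", na.rm = TRUE) / n()
  )

props
```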

Q14: Find the Most Common Grade by Race

For each race4 category, determine the most frequent grade among students.

yrbss_common_grade <- yrbss |> 
  group_by(race4, grade) |> 
  summarize(count = n()) |> 
  arrange(race4, desc(count)) |> 
  slice(1)

`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by race4 and grade.
ℹ Output is grouped by race4.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(race4, grade))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.

yrbss_common_grade

# A tibble: 5 × 3
# Groups:   race4 [5]
  race4                     grade count
  <chr>                     <chr> <int>
1 All other races           9th    1293
2 Black or African American 9th    1096
3 Hispanic/Latino           9th    1256
4 White                     11th   1477
5 <NA>                      9th     191

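Explanation: We count the students in each race4 and grade combination. After summarize(), the result is still grouped by race4, so arrange() puts the largest count first within each race4 value and slice(1) keeps the top row of each group. If two grades tie for the highest count, only one of them is returned.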
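The same result can be written more compactly with count() and slice_max(). This is an alternative sketch on made-up data (the `toy` tibble is hypothetical), not the approach used above; with_ties = FALSE keeps exactly one row per group even when counts are tied.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  race4 = c("White", "White", "White",
            "Hispanic/Latino", "Hispanic/Latino", "Hispanic/Latino"),
  grade = c("11th", "11th", "9th", "9th", "9th", "10th")
)

common_grade <- toy |>
  count(race4, grade, name = "count") |>        # one row per race4/grade pair
  group_by(race4) |>
  slice_max(count, n = 1, with_ties = FALSE) |> # most frequent grade per race4
  ungroup()

common_grade
```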