Exercise based on the yrbss dataset, focusing on dplyr functions
For each question, write the appropriate dplyr code and provide a brief explanation of your approach.
library(readr)
yrbss <- read_csv("data/yrbss.csv")
Rows: 20000 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): age, sex, grade, race4, race7
dbl (3): record, bmi, stweight
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Q1: Select only the age, sex, grade, and bmi columns from the yrbss dataset.
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
# A tibble: 6 × 4
age sex grade bmi
<chr> <chr> <chr> <dbl>
1 15 years old Female 10th 17.2
2 17 years old Female 12th 20.2
3 18 years old or older Male 11th NA
4 15 years old Male 10th 28.0
5 14 years old Male 9th 24.5
6 17 years old Male 9th NA
Explanation: We use select() to keep only the specified columns.
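The code for this answer is not shown above, so the following is only a minimal sketch of what it might look like. It runs on a small hypothetical tibble named toy that borrows a few column names from yrbss; toy and toy_selected are illustrative names, not objects from the exercise.

```r
library(dplyr)

# Hypothetical stand-in for yrbss; only the column names mirror the real data.
toy <- tibble(
  record   = c(1, 2, 3),
  age      = c("15 years old", "17 years old", "18 years old or older"),
  sex      = c("Female", "Female", "Male"),
  grade    = c("10th", "12th", "11th"),
  bmi      = c(17.2, 20.2, NA),
  stweight = c(54.4, 57.2, NA)
)

# Keep only the four requested columns, in the order they are listed.
toy_selected <- toy |>
  select(age, sex, grade, bmi)

head(toy_selected)
```

With the real data, the same pipeline would presumably start from yrbss instead of toy.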
Q2: Rename the column stweight to self_reported_weight.
# A tibble: 6 × 8
record age sex grade race4 race7 bmi self_reported_weight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 931897 15 years old Fema… 10th White White 17.2 54.4
2 333862 17 years old Fema… 12th White White 20.2 57.2
3 36253 18 years old or ol… Male 11th Hisp… Hisp… NA NA
4 1095530 15 years old Male 10th Blac… Blac… 28.0 85.7
5 1303997 14 years old Male 9th All … Mult… 24.5 66.7
6 261619 17 years old Male 9th All … <NA> NA NA
Explanation: rename() changes the column name while keeping the rest of the dataset unchanged.
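Since the code chunk for this question is missing, here is a hedged sketch of one way to write it. The tibble toy and the result name toy_renamed are hypothetical; only the two column names come from the question. Note that rename() takes the form new_name = old_name.

```r
library(dplyr)

# Hypothetical stand-in for yrbss with just enough columns to show the rename.
toy <- tibble(
  record   = c(1, 2, 3),
  bmi      = c(17.2, 20.2, NA),
  stweight = c(54.4, 57.2, NA)
)

# rename(new_name = old_name): every other column is left untouched.
toy_renamed <- toy |>
  rename(self_reported_weight = stweight)

head(toy_renamed)
```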
Q3: Filter the dataset to include only students in the 12th grade.
# A tibble: 6 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 333862 17 years old Female 12th White White 20.2 57.2
2 1309082 17 years old Male 12th White White 19.3 59.0
3 506337 18 years old or older Male 12th Hispanic/Lati… Hisp… 33.1 123.
4 938291 18 years old or older Female 12th White White 21.7 64.9
5 1316277 18 years old or older Female 12th White White 21.6 49.9
6 1101972 18 years old or older Male 12th White White 21.8 69.0
Explanation: filter() selects rows where grade is “12th”.
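The filtering code itself is not shown, so this is a small sketch under the assumption that grade is stored as text, as the column specification above indicates. The tibble toy and the name toy_12th are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for yrbss; grade is a character column, as in the real data.
toy <- tibble(
  record = c(1, 2, 3, 4),
  grade  = c("10th", "12th", "11th", "12th"),
  bmi    = c(17.2, 20.2, NA, 21.7)
)

# Keep only the rows whose grade is exactly the string "12th".
toy_12th <- toy |>
  filter(grade == "12th")

head(toy_12th)
```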
Q4: Filter the dataset for male students who are 17 years old.
yrbss_male_17 <- yrbss |>
  filter(sex == "Male", age == "17 years old")
head(yrbss_male_17)
# A tibble: 6 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 261619 17 years old Male 9th All other races <NA> NA NA
2 1309082 17 years old Male 12th White White 19.3 59.0
3 326942 17 years old Male 11th All other races Am Indian / A… 21.7 61.2
4 183965 17 years old Male 12th White White NA NA
5 1309051 17 years old Male 11th White White 21.0 70.3
6 1307670 17 years old Male 12th Hispanic/Latino Hispanic/Lati… 19.6 55.3
Explanation: Conditions separated by commas in filter() are combined with AND, so only rows where sex is “Male” and age is “17 years old” are kept.
Q5: Filter the dataset to keep only students with BMI greater than 25.
# A tibble: 6 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1095530 15 years old Male 10th Black or Afri… Blac… 28.0 85.7
2 506337 18 years old or older Male 12th Hispanic/Lati… Hisp… 33.1 123.
3 770177 16 years old Female 10th White White 32.4 86.2
4 1306691 16 years old Male 11th White White 28.3 102.
5 924270 15 years old Male 9th All other rac… Asian 30.7 81.6
6 1105438 16 years old Female 11th All other rac… Mult… 30.7 81.6
Explanation: filter(bmi > 25) keeps only rows where BMI is greater than 25. Rows with BMI ≤ 25 are dropped, and so are rows where BMI is missing (NA).
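As a rough illustration of how this filter treats missing values, the sketch below uses a hypothetical tibble toy with one NA in bmi. The names toy and toy_high_bmi are assumptions for the example only.

```r
library(dplyr)

# Hypothetical stand-in for yrbss with one missing BMI value.
toy <- tibble(
  record = c(1, 2, 3, 4),
  bmi    = c(17.2, 28.0, NA, 33.1)
)

# bmi > 25 evaluates to NA for the missing value, and filter() drops
# rows where the condition is NA as well as rows where it is FALSE.
toy_high_bmi <- toy |>
  filter(bmi > 25)

head(toy_high_bmi)
```

If rows with missing BMI should be kept, one option would be filter(bmi > 25 | is.na(bmi)).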
Q6: Arrange the dataset by BMI in descending order.
# A tibble: 6 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 324452 16 years old Male 11th Black or Afri… Blac… 53.9 91.2
2 1310082 18 years old or older Male 11th Black or Afri… Blac… 53.5 160.
3 328160 18 years old or older Male <NA> Black or Afri… Blac… 53.4 128.
4 1315913 17 years old Female 12th Black or Afri… Blac… 53.3 142.
5 1094597 13 years old Male 9th All other rac… Asian 52.9 181.
6 1305503 15 years old Male 9th All other rac… Am I… 52.4 134.
Explanation: arrange(desc(bmi)) sorts the rows from highest to lowest BMI. Rows with a missing BMI are placed at the end.
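The sorting code is not shown above; a minimal sketch on a hypothetical tibble toy might look like this. The names toy and toy_sorted are illustrative.

```r
library(dplyr)

# Hypothetical stand-in for yrbss, including one missing BMI.
toy <- tibble(
  record = c(1, 2, 3, 4),
  bmi    = c(17.2, NA, 33.1, 28.0)
)

# desc() reverses the sort order; arrange() still places NA rows last.
toy_sorted <- toy |>
  arrange(desc(bmi))

head(toy_sorted)
```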
# A tibble: 6 × 9
record age sex grade race4 race7 bmi stweight bmi_category
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
1 931897 15 years old Fema… 10th White White 17.2 54.4 Underweight
2 333862 17 years old Fema… 12th White White 20.2 57.2 Normal weig…
3 36253 18 years old or o… Male 11th Hisp… Hisp… NA NA <NA>
4 1095530 15 years old Male 10th Blac… Blac… 28.0 85.7 Overweight
5 1303997 14 years old Male 9th All … Mult… 24.5 66.7 Normal weig…
6 261619 17 years old Male 9th All … <NA> NA NA <NA>
Explanation: case_when() assigns BMI categories based on the given conditions.
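The conditions used for this question are not shown, so the sketch below assumes the commonly used adult BMI cutoffs (18.5, 25, and 30) and the labels "Underweight", "Normal weight", "Overweight", and "Obese". Both the cutoffs and the exact label text are assumptions, as are the names toy and toy_categorised.

```r
library(dplyr)

# Hypothetical stand-in for yrbss, including one missing BMI.
toy <- tibble(
  record = c(1, 2, 3, 4, 5),
  bmi    = c(17.2, 20.2, 28.0, 33.1, NA)
)

# Assumed cutoffs: < 18.5, 18.5 to < 25, 25 to < 30, >= 30.
# Conditions are checked in order; a row that matches none of them
# (here, the NA row) receives NA rather than a category.
toy_categorised <- toy |>
  mutate(
    bmi_category = case_when(
      bmi < 18.5 ~ "Underweight",
      bmi < 25   ~ "Normal weight",
      bmi < 30   ~ "Overweight",
      bmi >= 30  ~ "Obese"
    )
  )

head(toy_categorised)
```

Writing the last branch as bmi >= 30 rather than a catch-all avoids labelling students with a missing BMI as "Obese".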
Q10: Calculate the average BMI for each combination of sex and race4.
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by sex and race4.
ℹ Output is grouped by sex.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(sex, race4))` for per-operation grouping
(`?dplyr::dplyr_by`) instead.
yrbss_avg_bmi_sex_race
# A tibble: 15 × 3
# Groups: sex [3]
sex race4 avg_bmi
<chr> <chr> <dbl>
1 Female All other races 22.8
2 Female Black or African American 24.6
3 Female Hispanic/Latino 23.6
4 Female White 22.7
5 Female <NA> 23.5
6 Male All other races 23.4
7 Male Black or African American 24.0
8 Male Hispanic/Latino 24.3
9 Male White 23.3
10 Male <NA> 23.1
11 <NA> All other races NaN
12 <NA> Black or African American NaN
13 <NA> Hispanic/Latino NaN
14 <NA> White NaN
15 <NA> <NA> NaN
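Explanation: group_by(sex, race4) groups the data by both columns before summarise() calculates the mean BMI of each group. The NaN values appear for groups that have no non-missing BMI values to average.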
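The summarising code is not shown, so the following is a sketch of one plausible version. It assumes na.rm = TRUE was used and sets .groups = "drop" to return an ungrouped result and avoid the regrouping message; toy and toy_avg_bmi are hypothetical names.

```r
library(dplyr)

# Hypothetical stand-in for yrbss; the last group has no recorded BMI at all.
toy <- tibble(
  sex   = c("Female", "Female", "Male", "Male", "Male"),
  race4 = c("White", "White", "White", "White", "Hispanic/Latino"),
  bmi   = c(20, 22, 24, NA, NA)
)

# One row per sex/race4 combination. With na.rm = TRUE, a group whose
# BMI values are all missing has nothing left to average and yields NaN.
toy_avg_bmi <- toy |>
  group_by(sex, race4) |>
  summarise(avg_bmi = mean(bmi, na.rm = TRUE), .groups = "drop")

toy_avg_bmi
```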
# A tibble: 6 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 506337 18 years old or older Male 12th Hispanic/Lati… Hisp… 33.1 123.
2 1306691 16 years old Male 11th White White 28.3 102.
3 1105438 16 years old Female 11th All other rac… Mult… 30.7 81.6
4 1107831 17 years old Female 11th All other rac… Asian 24.5 56.7
5 1313627 16 years old Female 11th All other rac… Asian 24.3 59.9
6 925175 16 years old Male 11th Black or Afri… Blac… 32.8 89.4
Explanation: We calculate the overall average BMI first, then keep only students in 11th or 12th grade whose BMI is above that value.
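A minimal sketch of this two-step approach, on a hypothetical tibble toy, could look as follows. The names toy, overall_avg_bmi, and toy_above_avg are assumptions, as is the use of na.rm = TRUE when computing the mean.

```r
library(dplyr)

# Hypothetical stand-in for yrbss.
toy <- tibble(
  grade = c("9th", "11th", "12th", "11th", "12th"),
  bmi   = c(30, 20, 28, NA, 26)
)

# Step 1: overall mean BMI across all students, ignoring missing values.
overall_avg_bmi <- mean(toy$bmi, na.rm = TRUE)

# Step 2: keep 11th and 12th graders whose BMI is strictly above that mean.
toy_above_avg <- toy |>
  filter(grade %in% c("11th", "12th"), bmi > overall_avg_bmi)

toy_above_avg
```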
Q13: Create a Summary Table
Generate a summary table that shows:
- The total number of students
- The average BMI
- The proportion of male and female students
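One way such a table could be built is sketched below on a hypothetical tibble toy. The column names total_students, avg_bmi, prop_male, and prop_female are illustrative, and the choice to compute proportions only among students with a recorded sex (na.rm = TRUE) is an assumption.

```r
library(dplyr)

# Hypothetical stand-in for yrbss, with one student missing both sex and BMI.
toy <- tibble(
  sex = c("Male", "Female", "Female", NA),
  bmi = c(20, 22, 24, NA)
)

# The mean of a logical vector is the share of TRUE values, which gives
# the proportion of each sex among students whose sex is recorded.
toy_summary <- toy |>
  summarise(
    total_students = n(),
    avg_bmi        = mean(bmi, na.rm = TRUE),
    prop_male      = mean(sex == "Male", na.rm = TRUE),
    prop_female    = mean(sex == "Female", na.rm = TRUE)
  )

toy_summary
```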
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by race4 and grade.
ℹ Output is grouped by race4.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(race4, grade))` for per-operation grouping
(`?dplyr::dplyr_by`) instead.
yrbss_common_grade
# A tibble: 5 × 3
# Groups: race4 [5]
race4 grade count
<chr> <chr> <int>
1 All other races 9th 1293
2 Black or African American 9th 1096
3 Hispanic/Latino 9th 1256
4 White 11th 1477
5 <NA> 9th 191
Explanation: We count the number of students for each combination of race4 and grade, then keep the grade with the highest count within each race4 group.
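The code behind this output is not shown. The sketch below is one possible way to get the same shape of result, using count() and slice_max() rather than the group_by() and summarise() combination that the message above suggests was used. The names toy and toy_common_grade are hypothetical, and with_ties = FALSE is an assumption that keeps a single grade per group.

```r
library(dplyr)

# Hypothetical stand-in for yrbss.
toy <- tibble(
  race4 = c("White", "White", "White",
            "Hispanic/Latino", "Hispanic/Latino", "Hispanic/Latino"),
  grade = c("11th", "11th", "9th", "9th", "9th", "12th")
)

# Count students per race4/grade pair, then keep the largest count
# within each race4 group.
toy_common_grade <- toy |>
  count(race4, grade, name = "count") |>
  group_by(race4) |>
  slice_max(count, n = 1, with_ties = FALSE) |>
  ungroup()

toy_common_grade
```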