dplyr Practice Exercises
Ethiopian Malaria Indicator Survey
0.1 How to use this worksheet
Work through each exercise in your own R session. Try it before opening the solution.
Set-up (run once):
library(tidyverse)
malaria <- read_csv("data/malaria_survey_ethiopia.csv")
glimpse(malaria)Reminder of the variables you will use most:
| Variable | Meaning |
|---|---|
region, zone, woreda, kebele |
location |
household_size, wealth_quintile, distance_health_facility, altitude, rainfall_mm, irs_received |
household |
age, sex, pregnancy_status, education_level |
person |
fever_last_2weeks, malaria_tested, rdt_result, malaria_positive, treatment_received |
malaria |
bednet_owned, bednet_used, bednet_condition |
prevention |
hemoglobin, anemia_status, bmi, nutritional_status |
health |
The five core dplyr verbs: select() (columns), filter() (rows), arrange() (sort), mutate() (new columns), summarise() (collapse to one row) — usually with group_by().
1 select() — choosing columns
Exercise 1. Keep only individual_id, region, age, and rdt_result.
malaria |> select(individual_id, region, age, rdt_result)Exercise 2. Select every column from age through education_level (a range).
malaria |> select(age:education_level)Exercise 3. Select all the bednet variables using a helper (they all start with "bednet").
malaria |> select(starts_with("bednet"))Exercise 4. Keep every column except zone, woreda, and kebele.
malaria |> select(-c(zone, woreda, kebele))Exercise 5. Select only the numeric columns.
malaria |> select(where(is.numeric))2 filter() — choosing rows
Exercise 6. Keep only individuals who live in Gambella.
malaria |> filter(region == "Gambella")Exercise 7. Keep children under 5 years old.
malaria |> filter(age < 5)Exercise 8. Keep individuals who were tested and had a positive RDT.
malaria |> filter(malaria_tested == "Yes", malaria_positive == "Yes")Exercise 9. Keep people in Gambella or Benishangul Gumuz whose hemoglobin is below 11.
malaria |> filter(region %in% c("Gambella", "Benishangul Gumuz"),
hemoglobin < 11)Exercise 10. Keep rows where malaria_positive is missing (i.e. the person was not tested).
malaria |> filter(is.na(malaria_positive))Exercise 11. How many pregnant women did not sleep under a bednet? (combine filter() with nrow()).
malaria |>
filter(pregnancy_status == "Pregnant", bednet_used == "No") |>
nrow()3 arrange() — sorting rows
Exercise 12. Show the 10 individuals with the lowest hemoglobin.
malaria |>
select(individual_id, region, hemoglobin) |>
arrange(hemoglobin) |>
head(10)Exercise 13. Sort by region (A–Z), then by hemoglobin from highest to lowest within each region.
malaria |> arrange(region, desc(hemoglobin))4 mutate() — creating columns
Exercise 14. Create a logical column anemic that is TRUE when hemoglobin < 11.
malaria |> mutate(anemic = hemoglobin < 11)Exercise 15. Create far_from_clinic = "Yes" when distance_health_facility > 10, else "No" (use if_else()).
malaria |>
mutate(far_from_clinic = if_else(distance_health_facility > 10, "Yes", "No"))Exercise 16. Create an age_group column with case_when(): "Under 5", "5-14", "15-49", "50+".
malaria |>
mutate(age_group = case_when(
age < 5 ~ "Under 5",
age < 15 ~ "5-14",
age < 50 ~ "15-49",
TRUE ~ "50+"))Exercise 17. Round hemoglobin and bmi to whole numbers in one step using across().
malaria |> mutate(across(c(hemoglobin, bmi), round))5 summarise() and group_by()
Exercise 18. What is the mean hemoglobin across all individuals (ignore missing values)?
malaria |> summarise(mean_hb = mean(hemoglobin, na.rm = TRUE))Exercise 19. Mean hemoglobin by sex.
malaria |>
group_by(sex) |>
summarise(mean_hb = mean(hemoglobin, na.rm = TRUE))Exercise 20. For each region, compute the number tested and the test positivity rate (% positive among tested). Sort from highest positivity.
malaria |>
filter(malaria_tested == "Yes") |>
group_by(region) |>
summarise(n_tested = n(),
positivity = mean(malaria_positive == "Yes") * 100) |>
arrange(desc(positivity))Exercise 21. For each wealth_quintile, report mean hemoglobin and bednet ownership rate (% owning a net).
malaria |>
group_by(wealth_quintile) |>
summarise(mean_hb = mean(hemoglobin, na.rm = TRUE),
net_own = mean(bednet_owned == "Yes") * 100)Exercise 22. Compare positivity for net users vs non-users (group by bednet_used, among the tested).
malaria |>
filter(malaria_tested == "Yes") |>
group_by(bednet_used) |>
summarise(positivity = mean(malaria_positive == "Yes") * 100)6 count(), distinct(), and frequency tables
Exercise 23. Count how many individuals are in each region.
malaria |> count(region, sort = TRUE)Exercise 24. How many distinct households are in the dataset?
malaria |> distinct(household_id) |> nrow()
# or: n_distinct(malaria$household_id)Exercise 25. Cross-tabulate bednet_used against rdt_result with count().
malaria |> count(bednet_used, rdt_result)7 Joins — two linked tables
The survey is also stored as two separate tables that share household_id:
hh <- read_csv("data/malaria_households_ethiopia.csv") # one row per household
indiv<- read_csv("data/malaria_individuals_ethiopia.csv") # one row per personExercise 26. Use left_join() to attach each person’s household region and wealth_quintile to the individual table.
indiv |>
left_join(hh |> select(household_id, region, wealth_quintile),
by = "household_id")Exercise 27. After joining, compute mean hemoglobin by wealth_quintile.
indiv |>
left_join(hh, by = "household_id") |>
group_by(wealth_quintile) |>
summarise(mean_hb = mean(hemoglobin, na.rm = TRUE))8 Putting it together (challenge)
Exercise 28. In one pipeline: among tested individuals, build age_group, then for each region and age group report the number tested and positivity (%), keeping only groups with at least 20 tested people, sorted by positivity.
malaria |>
filter(malaria_tested == "Yes") |>
mutate(age_group = case_when(
age < 5 ~ "Under 5",
age < 15 ~ "5-14",
age < 50 ~ "15-49",
TRUE ~ "50+")) |>
group_by(region, age_group) |>
summarise(n_tested = n(),
positivity = round(mean(malaria_positive == "Yes") * 100, 1),
.groups = "drop") |>
filter(n_tested >= 20) |>
arrange(desc(positivity))Exercise 29. Build a household-level summary from the hh table: for each region, the % of households that received IRS, the mean distance to a health facility, and the mean household size. (Note: bednet variables are recorded per person, so they live in the individual table, not hh.)
hh |>
group_by(region) |>
summarise(irs = mean(irs_received == "Yes") * 100,
mean_dist = mean(distance_health_facility),
mean_size = mean(household_size),
.groups = "drop") |>
arrange(desc(irs))These verbs chain with the %>% or |> pipe to form any analysis. Next: visualize a result with ggplot2, or turn one into a gtsummary table.