dplyr Practice Exercises

Ethiopian Malaria Indicator Survey

Author

C4ED · R training

Published

June 18, 2026

0.1 How to use this worksheet

Work through each exercise in your own R session. Try it before opening the solution.

Set-up (run once):

library(tidyverse)
malaria <- read_csv("data/malaria_survey_ethiopia.csv")
glimpse(malaria)

Reminder of the variables you will use most:

Variable	Meaning
`region`, `zone`, `woreda`, `kebele`	location
`household_size`, `wealth_quintile`, `distance_health_facility`, `altitude`, `rainfall_mm`, `irs_received`	household
`age`, `sex`, `pregnancy_status`, `education_level`	person
`fever_last_2weeks`, `malaria_tested`, `rdt_result`, `malaria_positive`, `treatment_received`	malaria
`bednet_owned`, `bednet_used`, `bednet_condition`	prevention
`hemoglobin`, `anemia_status`, `bmi`, `nutritional_status`	health

Tip

The five core dplyr verbs: select() (columns), filter() (rows), arrange() (sort), mutate() (new columns), summarise() (collapse to one row) — usually with group_by().

1 `select()` — choosing columns

Exercise 1. Keep only individual_id, region, age, and rdt_result.

Solution

malaria |> select(individual_id, region, age, rdt_result)

Exercise 2. Select every column from age through education_level (a range).

Solution

malaria |> select(age:education_level)

Exercise 3. Select all the bednet variables using a helper (they all start with "bednet").

Solution

malaria |> select(starts_with("bednet"))

Exercise 4. Keep every column except zone, woreda, and kebele.

Solution

malaria |> select(-c(zone, woreda, kebele))

Exercise 5. Select only the numeric columns.

Solution

malaria |> select(where(is.numeric))

2 `filter()` — choosing rows

Exercise 6. Keep only individuals who live in Gambella.

Solution

malaria |> filter(region == "Gambella")

Exercise 7. Keep children under 5 years old.

Solution

malaria |> filter(age < 5)

Exercise 8. Keep individuals who were tested and had a positive RDT.

Solution

malaria |> filter(malaria_tested == "Yes", malaria_positive == "Yes")

Exercise 9. Keep people in Gambella or Benishangul Gumuz whose hemoglobin is below 11.

Solution

malaria |> filter(region %in% c("Gambella", "Benishangul Gumuz"),
                  hemoglobin < 11)

Exercise 10. Keep rows where malaria_positive is missing (i.e. the person was not tested).

Solution

malaria |> filter(is.na(malaria_positive))

Exercise 11. How many pregnant women did not sleep under a bednet? (combine filter() with nrow()).

Solution

malaria |>
  filter(pregnancy_status == "Pregnant", bednet_used == "No") |>
  nrow()

3 `arrange()` — sorting rows

Exercise 12. Show the 10 individuals with the lowest hemoglobin.

Solution

malaria |>
  select(individual_id, region, hemoglobin) |>
  arrange(hemoglobin) |>
  head(10)

Exercise 13. Sort by region (A–Z), then by hemoglobin from highest to lowest within each region.

Solution

malaria |> arrange(region, desc(hemoglobin))

4 `mutate()` — creating columns

Exercise 14. Create a logical column anemic that is TRUE when hemoglobin < 11.

Solution

malaria |> mutate(anemic = hemoglobin < 11)

Exercise 15. Create far_from_clinic = "Yes" when distance_health_facility > 10, else "No" (use if_else()).

Solution

malaria |>
  mutate(far_from_clinic = if_else(distance_health_facility > 10, "Yes", "No"))

Exercise 16. Create an age_group column with case_when(): "Under 5", "5-14", "15-49", "50+".

Solution

malaria |>
  mutate(age_group = case_when(
    age < 5            ~ "Under 5",
    age < 15           ~ "5-14",
    age < 50           ~ "15-49",
    TRUE               ~ "50+"))

Exercise 17. Round hemoglobin and bmi to whole numbers in one step using across().

Solution

malaria |> mutate(across(c(hemoglobin, bmi), round))

5 `summarise()` and `group_by()`

Exercise 18. What is the mean hemoglobin across all individuals (ignore missing values)?

Solution

malaria |> summarise(mean_hb = mean(hemoglobin, na.rm = TRUE))

Exercise 19. Mean hemoglobin by sex.

Solution

malaria |>
  group_by(sex) |>
  summarise(mean_hb = mean(hemoglobin, na.rm = TRUE))

Exercise 20. For each region, compute the number tested and the test positivity rate (% positive among tested). Sort from highest positivity.

Solution

malaria |>
  filter(malaria_tested == "Yes") |>
  group_by(region) |>
  summarise(n_tested   = n(),
            positivity = mean(malaria_positive == "Yes") * 100) |>
  arrange(desc(positivity))

Exercise 21. For each wealth_quintile, report mean hemoglobin and bednet ownership rate (% owning a net).

Solution

malaria |>
  group_by(wealth_quintile) |>
  summarise(mean_hb  = mean(hemoglobin, na.rm = TRUE),
            net_own  = mean(bednet_owned == "Yes") * 100)

Exercise 22. Compare positivity for net users vs non-users (group by bednet_used, among the tested).

Solution

malaria |>
  filter(malaria_tested == "Yes") |>
  group_by(bednet_used) |>
  summarise(positivity = mean(malaria_positive == "Yes") * 100)

6 `count()`, `distinct()`, and frequency tables

Exercise 23. Count how many individuals are in each region.

Solution

malaria |> count(region, sort = TRUE)

Exercise 24. How many distinct households are in the dataset?

Solution

malaria |> distinct(household_id) |> nrow()
# or: n_distinct(malaria$household_id)

Exercise 25. Cross-tabulate bednet_used against rdt_result with count().

Solution

malaria |> count(bednet_used, rdt_result)

7 Joins — two linked tables

The survey is also stored as two separate tables that share household_id:

hh   <- read_csv("data/malaria_households_ethiopia.csv")    # one row per household
indiv<- read_csv("data/malaria_individuals_ethiopia.csv")   # one row per person

Exercise 26. Use left_join() to attach each person’s household region and wealth_quintile to the individual table.

Solution

indiv |>
  left_join(hh |> select(household_id, region, wealth_quintile),
            by = "household_id")

Exercise 27. After joining, compute mean hemoglobin by wealth_quintile.

Solution

indiv |>
  left_join(hh, by = "household_id") |>
  group_by(wealth_quintile) |>
  summarise(mean_hb = mean(hemoglobin, na.rm = TRUE))

8 Putting it together (challenge)

Exercise 28. In one pipeline: among tested individuals, build age_group, then for each region and age group report the number tested and positivity (%), keeping only groups with at least 20 tested people, sorted by positivity.

Solution

malaria |>
  filter(malaria_tested == "Yes") |>
  mutate(age_group = case_when(
    age < 5  ~ "Under 5",
    age < 15 ~ "5-14",
    age < 50 ~ "15-49",
    TRUE     ~ "50+")) |>
  group_by(region, age_group) |>
  summarise(n_tested   = n(),
            positivity = round(mean(malaria_positive == "Yes") * 100, 1),
            .groups = "drop") |>
  filter(n_tested >= 20) |>
  arrange(desc(positivity))

Exercise 29. Build a household-level summary from the hh table: for each region, the % of households that received IRS, the mean distance to a health facility, and the mean household size. (Note: bednet variables are recorded per person, so they live in the individual table, not hh.)

Solution

hh |>
  group_by(region) |>
  summarise(irs       = mean(irs_received == "Yes") * 100,
            mean_dist  = mean(distance_health_facility),
            mean_size  = mean(household_size),
            .groups = "drop") |>
  arrange(desc(irs))

Where to go next

These verbs chain with the %>% or |> pipe to form any analysis. Next: visualize a result with ggplot2, or turn one into a gtsummary table.

0.1 How to use this worksheet

1 select() — choosing columns

2 filter() — choosing rows

3 arrange() — sorting rows

4 mutate() — creating columns

5 summarise() and group_by()

6 count(), distinct(), and frequency tables

7 Joins — two linked tables

8 Putting it together (challenge)

1 `select()` — choosing columns

2 `filter()` — choosing rows

3 `arrange()` — sorting rows

4 `mutate()` — creating columns

5 `summarise()` and `group_by()`

6 `count()`, `distinct()`, and frequency tables