Factor Variables Management with forcats

National Data Management Center for Health (NDMC) at EPHI



What are Factor Variables?

  • Factors are used to represent categorical data in R.

  • Levels: Unique categories in the factor (e.g., “Male”, “Female”).

Data Types

  • Nominal: Categories without a specific order (e.g., colors, gender).
  • Ordinal: Categories with a meaningful order or ranking (e.g., education levels, satisfaction ratings).

Example:

Code
gender <- factor(c("Male", "Female", "Female", "Male"))
levels(gender) 
[1] "Female" "Male"  
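
The gender example above is nominal. As a minimal sketch of the ordinal case, the block below builds an ordered factor from a hypothetical `satisfaction` vector (the values are made up for illustration) and shows that the level order then differs from the alphabetical default.

```r
# Hypothetical satisfaction ratings, used only for illustration
satisfaction <- c("Medium", "Low", "High", "Low")

# Nominal: levels default to alphabetical order
sat_nominal <- factor(satisfaction)
levels(sat_nominal)   # "High" "Low" "Medium"

# Ordinal: state the order explicitly and mark the factor as ordered
sat_ordinal <- factor(
  satisfaction,
  levels  = c("Low", "Medium", "High"),
  ordered = TRUE
)
levels(sat_ordinal)   # "Low" "Medium" "High"

# Comparisons are only meaningful for ordered factors
sat_ordinal < "High"  # TRUE TRUE FALSE TRUE
```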

Why Use Factors?

Advantages:

  • Efficient storage of categorical data.
  • Essential for statistical modeling and visualization.
  • Ensures proper handling of categorical variables in analyses.

Challenges with Factors

  • Levels may not match the data (e.g., missing levels).

  • Reordering levels can be cumbersome.

  • Combining or recoding levels requires manual effort.

  • The forcats package simplifies these tasks.
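
To make the "cumbersome" point concrete, here is a small sketch on a hypothetical toy factor `f`. In base R, moving one level to the front means listing every level again, while `fct_relevel()` only needs the level being moved.

```r
library(forcats)

# Hypothetical toy data
f <- factor(c("b", "c", "a", "c"))
levels(f)   # "a" "b" "c" (alphabetical by default)

# Base R: every level must be listed again to move just one
base_version <- factor(f, levels = c("c", "a", "b"))

# forcats: name only the level to move; the rest keep their order
forcats_version <- fct_relevel(f, "c")

identical(levels(base_version), levels(forcats_version))   # TRUE
```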

What is the forcats package?

  • A tidyverse package for working with factors.
  • Provides functions to manipulate factor levels easily.

Key Functions in forcats

  • Overview:

    • fct_relevel(): Reorder levels manually.
    • fct_infreq(): Reorder levels by frequency.
    • fct_reorder(): Reorder levels by another variable.
    • fct_lump(): Collapse least/most frequent levels.
    • fct_recode(): Recode levels.
    • fct_collapse(): Combine levels.
    • fct_drop(): Drop unused levels.
    • fct_other(): Replace levels with “Other”.
    • fct_anon(): Anonymize levels (e.g., for privacy).

Let us use the gss_cat dataset

  • A sample dataset from the General Social Survey, included in the forcats package.

  • Variables:

    • year: Year of the survey.
    • age: Age of respondents.
    • marital: Marital status (factor).
    • race: Race (factor).
    • rincome: Reported income (factor).
  • Load the Dataset:

Code
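library(forcats)
library(dplyr)    # provides %>%, mutate(), count(), filter()
library(ggplot2)  # provides ggplot(), geom_bar(), geom_col()
data("gss_cat")
head(gss_cat)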
# A tibble: 6 × 9
   year marital         age race  rincome        partyid     relig denom tvhours
  <int> <fct>         <int> <fct> <fct>          <fct>       <fct> <fct>   <int>
1  2000 Never married    26 White $8000 to 9999  Ind,near r… Prot… Sout…      12
2  2000 Divorced         48 White $8000 to 9999  Not str re… Prot… Bapt…      NA
3  2000 Widowed          67 White Not applicable Independent Prot… No d…       2
4  2000 Never married    39 White Not applicable Ind,near r… Orth… Not …       4
5  2000 Divorced         25 White Not applicable Not str de… None  Not …       1
6  2000 Married          25 White $20000 - 24999 Strong dem… Prot… Sout…      NA

Exploring the Dataset

Code
str(gss_cat) 
tibble [21,483 × 9] (S3: tbl_df/tbl/data.frame)
 $ year   : int [1:21483] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
 $ marital: Factor w/ 6 levels "No answer","Never married",..: 2 4 5 2 4 6 2 4 6 6 ...
 $ age    : int [1:21483] 26 48 67 39 25 25 36 44 44 47 ...
 $ race   : Factor w/ 4 levels "Other","Black",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ rincome: Factor w/ 16 levels "No answer","Don't know",..: 8 8 16 16 16 5 4 9 4 4 ...
 $ partyid: Factor w/ 10 levels "No answer","Don't know",..: 6 5 7 6 9 10 5 8 9 4 ...
 $ relig  : Factor w/ 16 levels "No answer","Don't know",..: 15 15 15 6 12 15 5 15 15 15 ...
 $ denom  : Factor w/ 30 levels "No answer","Don't know",..: 25 23 3 30 30 25 30 15 4 25 ...
 $ tvhours: int [1:21483] 12 NA 2 4 1 NA 3 NA 0 3 ...
Code
summary(gss_cat) 
      year               marital           age                    race      
 Min.   :2000   No answer    :   17   Min.   :18.00   Other         : 1959  
 1st Qu.:2002   Never married: 5416   1st Qu.:33.00   Black         : 3129  
 Median :2006   Separated    :  743   Median :46.00   White         :16395  
 Mean   :2007   Divorced     : 3383   Mean   :47.18   Not applicable:    0  
 3rd Qu.:2010   Widowed      : 1807   3rd Qu.:59.00                         
 Max.   :2014   Married      :10117   Max.   :89.00                         
                                      NA's   :76                            
           rincome                   partyid            relig      
 $25000 or more:7363   Independent       :4119   Protestant:10846  
 Not applicable:7043   Not str democrat  :3690   Catholic  : 5124  
 $20000 - 24999:1283   Strong democrat   :3490   None      : 3523  
 $10000 - 14999:1168   Not str republican:3032   Christian :  689  
 $15000 - 19999:1048   Ind,near dem      :2499   Jewish    :  388  
 Refused       : 975   Strong republican :2314   Other     :  224  
 (Other)       :2603   (Other)           :2339   (Other)   :  689  
              denom          tvhours      
 Not applicable  :10072   Min.   : 0.000  
 Other           : 2534   1st Qu.: 1.000  
 No denomination : 1683   Median : 2.000  
 Southern baptist: 1536   Mean   : 2.981  
 Baptist-dk which: 1457   3rd Qu.: 4.000  
 United methodist: 1067   Max.   :24.000  
 (Other)         : 3134   NA's   :10146   

fct_relevel(): Reordering Factor Levels

Code
ggplot(gss_cat, aes(y = fct_relevel(race, "Other", "White", "Black"))) +
  geom_bar()

Code
gss_cat %>%
  mutate(race = fct_relevel(race, "Other", "White",  "Black")) %>%
  ggplot(aes(y = race)) +
  geom_bar()

  • ggplot2 draws the first level at the bottom of the y axis, so the levels run from bottom to top.
Code
gss_cat %>%
  mutate(race = fct_relevel(race, "White", "Black", "Other")) %>%
  ggplot(aes(y = race)) +
  geom_bar()
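
`fct_relevel()` moves the named levels to the front by default. As a small sketch on a hypothetical `answer` factor, its `after` argument places them elsewhere instead, which is handy for pushing a catch-all category to the end.

```r
library(forcats)

# Hypothetical survey answers, used only for illustration
answer <- factor(c("Yes", "No", "Don't know", "Yes"))
levels(answer)   # "Don't know" "No" "Yes"

# Move "Don't know" to the end
at_end <- fct_relevel(answer, "Don't know", after = Inf)
levels(at_end)   # "No" "Yes" "Don't know"

# Move "Yes" to the second position (after the first level)
second <- fct_relevel(answer, "Yes", after = 1)
levels(second)   # "Don't know" "Yes" "No"
```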

fct_infreq()

  • Reorder levels by frequency.
  • The order is defined by the number of respondents in each race category.
  • The order is descending, from most frequent to least frequent.
Code
gss_cat %>%
  mutate(race = fct_infreq(race)) %>%
  ggplot(aes(y = race)) + geom_bar()

Code
gss_cat %>%
  mutate(race = fct_rev(fct_infreq(race))) %>%
  ggplot(aes(y = race)) + geom_bar()
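
The effect on the levels themselves can be checked without a plot. This sketch uses a hypothetical `blood` factor to show the default order, the frequency order, and the reversed frequency order.

```r
library(forcats)

# Hypothetical blood groups: O appears 3 times, A twice, B once
blood <- factor(c("O", "A", "O", "B", "O", "A"))

levels(blood)                        # "A" "B" "O" (alphabetical)
levels(fct_infreq(blood))            # "O" "A" "B" (most frequent first)
levels(fct_rev(fct_infreq(blood)))   # "B" "A" "O" (least frequent first)
```

Because ggplot2 places the first level at the bottom of a y axis, the reversed version is the one that puts the most frequent category at the top of a horizontal bar chart.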

fct_reorder()

  • Reorder levels by the values of another variable, such as a count or a numeric summary.
Code
gss_cat %>% count(race)
# A tibble: 3 × 2
  race      n
  <fct> <int>
1 Other  1959
2 Black  3129
3 White 16395
Code
gss_cat %>%
  count(race) %>%
  mutate(race = fct_reorder(race, n)) %>%
  ggplot(aes(n, race)) + geom_col()
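
When the second variable has several values per level, `fct_reorder()` summarises them first (by median, unless told otherwise). The sketch below uses hypothetical clinic waiting times to show the default behaviour and the `.fun` and `.desc` arguments.

```r
library(forcats)

# Hypothetical waiting times in minutes for three clinics
clinic <- factor(c("A", "A", "B", "B", "C", "C"))
wait   <- c(30, 50, 10, 20, 60, 80)

# Default: order levels by the median of `wait`, ascending
by_median <- fct_reorder(clinic, wait)
levels(by_median)      # "B" "A" "C"

# Use the mean instead and reverse the direction
by_mean_desc <- fct_reorder(clinic, wait, .fun = mean, .desc = TRUE)
levels(by_mean_desc)   # "C" "A" "B"
```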

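Compare the two approaches

  • Both order the bars from least frequent (bottom) to most frequent (top).
  • The first reorders pre-counted data with fct_reorder(); the second reorders the raw rows with fct_rev(fct_infreq()).
Code
gss_cat %>%
  count(race) %>%
  mutate(race = fct_reorder(race, n)) %>%
  ggplot(aes(n, race)) + geom_col()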

Code
gss_cat %>% mutate(race = fct_rev(fct_infreq(race))) %>% 
  ggplot(aes(y = race)) + geom_bar()

fct_recode()

  • Recode (rename) factor levels using new = old pairs.
Code
# Syntax is "new name" = "existing level"; levels not listed stay unchanged
gss_cat %>%
  mutate(marital1 = fct_recode(marital,
                               "Single"   = "Never married",
                               "Divorced" = "Separated")) %>%
  ggplot(aes(x = marital1)) +
  geom_bar()
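
`fct_recode()` only matches level names that actually exist in the factor, and it warns about names it cannot find, so it helps to inspect the levels before recoding. A minimal sketch of that check, using the `marital` column of `gss_cat` (the object name `marital_new` is just for illustration):

```r
library(forcats)

marital <- forcats::gss_cat$marital

# Inspect the existing level names first
levels(marital)
# "No answer" "Never married" "Separated" "Divorced" "Widowed" "Married"

# Rename one level; all others are left untouched
marital_new <- fct_recode(marital, "Single" = "Never married")
levels(marital_new)
# "No answer" "Single" "Separated" "Divorced" "Widowed" "Married"

# Counts per level after recoding
fct_count(marital_new)
```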

fct_collapse()

  • Combine multiple levels into one.
Code
# Each new level lists the existing levels it absorbs
gss_cat %>%
  mutate(marital2 = fct_collapse(marital,
                                 "Single"    = c("Never married", "No answer"),
                                 "Married"   = "Married",
                                 "Separated" = c("Widowed", "Divorced", "Separated"))) %>%
  ggplot(aes(x = marital2)) +
  geom_bar()
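
When only some groups need names, the `other_level` argument of `fct_collapse()` sweeps every unlisted level into a single catch-all. The sketch below applies this to `partyid`; the group names and the object name `partyid4` are illustrative choices, not part of the dataset.

```r
library(forcats)

partyid <- forcats::gss_cat$partyid

partyid4 <- fct_collapse(
  partyid,
  "Republican"  = c("Strong republican", "Not str republican"),
  "Democrat"    = c("Not str democrat", "Strong democrat"),
  "Independent" = c("Ind,near rep", "Independent", "Ind,near dem"),
  other_level   = "Other"   # everything not listed above
)

nlevels(partyid)     # 10 original levels
levels(partyid4)     # three named groups plus "Other"
fct_count(partyid4)
```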

fct_drop()

  • Drop (remove) unused levels.
  • marital has no unused levels, so all six levels remain in the output below.
Code
gss_cat$marital <- fct_drop(gss_cat$marital) 
levels(gss_cat$marital)
[1] "No answer"     "Never married" "Separated"     "Divorced"     
[5] "Widowed"       "Married"      
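
The effect is easier to see on a factor that does carry an unused level. In `gss_cat`, `race` has a "Not applicable" level with zero respondents, so a minimal sketch (working on a copy rather than overwriting the dataset) looks like this:

```r
library(forcats)

race <- forcats::gss_cat$race

# "Not applicable" is a level but has no observations
fct_count(race)

# Dropping unused levels removes it
race_used <- fct_drop(race)
levels(race)        # "Other" "Black" "White" "Not applicable"
levels(race_used)   # "Other" "Black" "White"
```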

fct_explicit_na()

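  • Handle missing values.
  • Make missing values (NA) explicit by turning them into a named level.
  • rincome contains no NA values, so no “Not specified” level is added in the output below.
  • In forcats 1.0.0 and later, fct_explicit_na() is deprecated in favour of fct_na_value_to_level().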
Code
gss_cat$rincome <- fct_explicit_na(gss_cat$rincome, 
                                   na_level = "Not specified") 
levels(gss_cat$rincome)
 [1] "No answer"      "Don't know"     "Refused"        "$25000 or more"
 [5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999" 
 [9] "$7000 to 7999"  "$6000 to 6999"  "$5000 to 5999"  "$4000 to 4999" 
[13] "$3000 to 3999"  "$1000 to 2999"  "Lt $1000"       "Not applicable"
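
To see missing values actually become a level, the factor needs to contain some. This sketch assumes forcats 1.0.0 or later, where `fct_na_value_to_level()` is the current replacement for `fct_explicit_na()`, and uses a hypothetical `income` vector with two NA entries.

```r
library(forcats)

# Hypothetical income categories with two missing values
income <- factor(c("Low", NA, "High", "Low", NA))
sum(is.na(income))   # 2

# Turn NA into an explicit, named level
income_explicit <- fct_na_value_to_level(income, level = "Not specified")

levels(income_explicit)      # now includes "Not specified"
fct_count(income_explicit)   # the two NA values are counted under it
```

An explicit level shows up in tables and plots, whereas NA values are often dropped silently.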

fct_lump()

  • Collapse rare levels into “Other”.
Code
gss_cat$partyid <- fct_lump(gss_cat$partyid, n = 3) # Keep the 3 most frequent levels
levels(gss_cat$partyid)
[1] "Independent"      "Not str democrat" "Strong democrat"  "Other"           
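
Recent forcats versions also provide more specific lumping helpers. The sketch below, on a hypothetical `x` factor, assumes a forcats version that includes `fct_lump_n()` and `fct_lump_min()` and shows two ways of deciding which levels count as rare.

```r
library(forcats)

# Hypothetical data: A appears 5 times, B 3 times, C and D once each
x <- factor(c(rep("A", 5), rep("B", 3), "C", "D"))

# Keep the 2 most frequent levels, lump the rest
top2 <- fct_lump_n(x, n = 2)
levels(top2)    # "A" "B" "Other"

# Keep levels that appear at least 3 times, lump the rest
min3 <- fct_lump_min(x, min = 3)
levels(min3)    # "A" "B" "Other"

fct_count(top2)
```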

fct_anon()

  • Replace levels with random codes.
Code
gss_cat$race <- fct_anon(gss_cat$race) 
levels(gss_cat$race)
[1] "1" "2" "3" "4"
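
`fct_anon()` also accepts a `prefix` argument, which makes the anonymised codes easier to recognise in output. A small sketch on a copy of `race` (the prefix `"race_"` is an arbitrary choice); since the assignment of codes is random, only the counts, not the labels, can be compared with the original.

```r
library(forcats)

race <- forcats::gss_cat$race

# Replace level names with prefixed numeric codes in random order
race_anon <- fct_anon(race, prefix = "race_")
levels(race_anon)

# The number of levels and the set of counts are preserved
nlevels(race_anon) == nlevels(race)
sort(as.integer(table(race_anon)))
sort(as.integer(table(race)))
```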

fct_other()

  • Replace levels with “Other”: keep lists the levels to retain, and every other level becomes “Other”.
Code
gss_cat$relig <- fct_other(gss_cat$relig, keep = c("Protestant", "Catholic"))
levels(gss_cat$relig)
[1] "Catholic"   "Protestant" "Other"     
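
`fct_other()` works in the opposite direction as well. As a sketch on a hypothetical `x` factor, `drop` names the levels to replace, and `other_level` changes the label of the catch-all (here the arbitrary name "Rest").

```r
library(forcats)

# Hypothetical toy data
x <- factor(c("A", "B", "C", "D", "A"))

# keep: everything except A and B becomes "Other"
kept <- fct_other(x, keep = c("A", "B"))
levels(kept)      # "A" "B" "Other"

# drop: only A and B are replaced, with a custom label
dropped <- fct_other(x, drop = c("A", "B"), other_level = "Rest")
levels(dropped)   # "C" "D" "Rest"
```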

Practical Example

  • Analyze the rincome variable (reported income).
  • Steps:
    1. Recode income levels for clarity.
    2. Reorder levels by frequency.
    3. Visualize using ggplot2.
Code
library(ggplot2) 
gss_cat$rincome <- fct_recode(gss_cat$rincome,
                              "Low" = "Lt $1000",
                              "Medium" = "$1000 to 2999",
                              "High" = "$3000 to 3999")
gss_cat$rincome <- fct_infreq(gss_cat$rincome) 
ggplot(gss_cat, aes(rincome)) + geom_bar() + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
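
The same three steps can also be written as one pipeline that leaves `gss_cat` itself unchanged. This is only a sketch: the new label "Less than $1000" and the object name `income_data` are illustrative, and the bars are drawn horizontally so that the long income labels stay readable.

```r
library(dplyr)
library(forcats)
library(ggplot2)

income_data <- forcats::gss_cat %>%
  mutate(
    # Step 1: recode a terse label for clarity
    rincome = fct_recode(rincome, "Less than $1000" = "Lt $1000"),
    # Step 2: reorder levels by frequency (most frequent first)
    rincome = fct_infreq(rincome)
  )

levels(income_data$rincome)[1]   # "$25000 or more"

# Step 3: visualize; fct_rev() puts the most frequent level at the top
ggplot(income_data, aes(y = fct_rev(rincome))) +
  geom_bar() +
  labs(x = "Respondents", y = "Reported income")
```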

Ordering other plot elements

The gapminder dataset: Life expectancy data

Code
library(gapminder)
gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

Life expectancy in African countries in 2007

Code
gapminder %>%
  filter(
    year == 2007,
    continent == "Africa"
  ) %>%
  ggplot(aes(lifeExp, country)) + 
  geom_point()

  • The default order is alphabetical, from bottom to top.

Life expectancy, ordered from highest to lowest

Code
gapminder %>%
  filter(
    year == 2007,
    continent == "Africa"
  ) %>%
  mutate(
    country = fct_reorder(country, lifeExp)
  ) %>%
  ggplot(aes(lifeExp, country)) + 
  geom_point()

  • The order is ascending from bottom to top, so the country with the highest life expectancy appears at the top.