Yebelay Berehan – Factor Variables Management with forcats

What are Factor Variables?

Factors are used to represent categorical data in R.
Levels: Unique categories in the factor (e.g., “Male”, “Female”).

Data Types

- Nominal: Categories without a specific order (e.g., colors, gender).
- Ordinal: Categories with a meaningful order or ranking (e.g., education levels, satisfaction ratings).

Example:

Code

gender <- factor(c("Male", "Female", "Female", "Male"))
levels(gender)

[1] "Female" "Male"
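
Levels are sorted alphabetically by default, which is why "Female" comes first above. For ordinal data the order has to be stated explicitly. A minimal sketch, using made-up satisfaction ratings (the `ratings` vector is invented for illustration):

```r
# Hypothetical ratings, invented for illustration
ratings <- c("Low", "High", "Medium", "Low")

# levels = fixes the order; ordered = TRUE marks the factor as ordinal
satisfaction <- factor(ratings,
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE)

levels(satisfaction)     # "Low" "Medium" "High"
satisfaction < "High"    # TRUE FALSE TRUE TRUE
```

Comparisons such as `<` are only meaningful for ordered factors.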

Why Use Factors?

Advantages:

Efficient storage of categorical data.
Essential for statistical modeling and visualization.
Ensures proper handling of categorical variables in analyses.

Challenges with Factors

Levels may not match the data (e.g., unused levels that no observation takes).
Reordering levels can be cumbersome.
Combining or recoding levels requires manual effort.
The forcats package simplifies these tasks.
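
The first challenge can be seen in base R: subsetting a factor keeps levels that no longer occur in the data. A small sketch with an invented vector (`size` is illustrative only):

```r
# Hypothetical data, invented for illustration
size <- factor(c("small", "large", "medium", "small"))

# Remove all "medium" observations
subset_size <- size[size != "medium"]

levels(subset_size)              # "large" "medium" "small" (unused level kept)
table(subset_size)               # "medium" appears with a count of 0
levels(droplevels(subset_size))  # "large" "small"
```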

What is the forcats package?

A tidyverse package for working with factors.
Provides functions to manipulate factor levels easily.

Key Functions in `forcats`

Overview:
- fct_relevel(): Reorder levels manually.
- fct_infreq(): Reorder levels by frequency.
- fct_reorder(): Reorder levels by another variable.
- fct_lump(): Lump the least (or most) frequent levels into “Other”.
- fct_recode(): Recode levels.
- fct_collapse(): Combine levels.
- fct_drop(): Drop unused levels.
- fct_other(): Replace levels with “Other”.
- fct_anon(): Anonymize levels (e.g., for privacy).

Let us use the `gss_cat` dataset

A sample dataset from the General Social Survey, included in the forcats package.
Variables:
- year: Year of the survey.
- age: Age of respondents.
- marital: Marital status (factor).
- race: Race (factor).
- rincome: Reported income (factor).
Load the Dataset:

Code

library(forcats) 
data("gss_cat") 
head(gss_cat)

# A tibble: 6 × 9
   year marital         age race  rincome        partyid     relig denom tvhours
  <int> <fct>         <int> <fct> <fct>          <fct>       <fct> <fct>   <int>
1  2000 Never married    26 White $8000 to 9999  Ind,near r… Prot… Sout…      12
2  2000 Divorced         48 White $8000 to 9999  Not str re… Prot… Bapt…      NA
3  2000 Widowed          67 White Not applicable Independent Prot… No d…       2
4  2000 Never married    39 White Not applicable Ind,near r… Orth… Not …       4
5  2000 Divorced         25 White Not applicable Not str de… None  Not …       1
6  2000 Married          25 White $20000 - 24999 Strong dem… Prot… Sout…      NA

Code

str(gss_cat)

tibble [21,483 × 9] (S3: tbl_df/tbl/data.frame)
 $ year   : int [1:21483] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
 $ marital: Factor w/ 6 levels "No answer","Never married",..: 2 4 5 2 4 6 2 4 6 6 ...
 $ age    : int [1:21483] 26 48 67 39 25 25 36 44 44 47 ...
 $ race   : Factor w/ 4 levels "Other","Black",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ rincome: Factor w/ 16 levels "No answer","Don't know",..: 8 8 16 16 16 5 4 9 4 4 ...
 $ partyid: Factor w/ 10 levels "No answer","Don't know",..: 6 5 7 6 9 10 5 8 9 4 ...
 $ relig  : Factor w/ 16 levels "No answer","Don't know",..: 15 15 15 6 12 15 5 15 15 15 ...
 $ denom  : Factor w/ 30 levels "No answer","Don't know",..: 25 23 3 30 30 25 30 15 4 25 ...
 $ tvhours: int [1:21483] 12 NA 2 4 1 NA 3 NA 0 3 ...

Code

summary(gss_cat)

      year               marital           age                    race      
 Min.   :2000   No answer    :   17   Min.   :18.00   Other         : 1959  
 1st Qu.:2002   Never married: 5416   1st Qu.:33.00   Black         : 3129  
 Median :2006   Separated    :  743   Median :46.00   White         :16395  
 Mean   :2007   Divorced     : 3383   Mean   :47.18   Not applicable:    0  
 3rd Qu.:2010   Widowed      : 1807   3rd Qu.:59.00                         
 Max.   :2014   Married      :10117   Max.   :89.00                         
                                      NA's   :76                            
           rincome                   partyid            relig      
 $25000 or more:7363   Independent       :4119   Protestant:10846  
 Not applicable:7043   Not str democrat  :3690   Catholic  : 5124  
 $20000 - 24999:1283   Strong democrat   :3490   None      : 3523  
 $10000 - 14999:1168   Not str republican:3032   Christian :  689  
 $15000 - 19999:1048   Ind,near dem      :2499   Jewish    :  388  
 Refused       : 975   Strong republican :2314   Other     :  224  
 (Other)       :2603   (Other)           :2339   (Other)   :  689  
              denom          tvhours      
 Not applicable  :10072   Min.   : 0.000  
 Other           : 2534   1st Qu.: 1.000  
 No denomination : 1683   Median : 2.000  
 Southern baptist: 1536   Mean   : 2.981  
 Baptist-dk which: 1457   3rd Qu.: 4.000  
 United methodist: 1067   Max.   :24.000  
 (Other)         : 3134   NA's   :10146
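
Factor columns can also be inspected one at a time. A short sketch using `fct_count()`, which tabulates every level, including levels with zero observations (the variable names `race` and `race_counts` are chosen here for illustration):

```r
library(forcats)

# Use the packaged copy of the data so the example does not depend on earlier changes
race <- forcats::gss_cat$race

levels(race)                    # all defined levels, used or not
race_counts <- fct_count(race)  # tibble with columns f (level) and n (count)
race_counts
```

This makes the unused "Not applicable" level of race easy to spot.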

`fct_relevel()`: Reordering Factor Levels

Code

library(ggplot2)  # ggplot(), geom_bar()
library(dplyr)    # %>% and mutate(), used in later examples
ggplot(gss_cat, aes(y = fct_relevel(race, "Other", "White", "Black"))) +
  geom_bar()

Code

gss_cat %>%
  mutate(race = fct_relevel(race, "Other", "White",  "Black")) %>%
  ggplot(aes(y = race)) +
  geom_bar()

On the y axis, levels are drawn from bottom to top, so the first level sits at the bottom.

Code

gss_cat %>%
  mutate(race = fct_relevel(race, "White", "Black", "Other")) %>%
  ggplot(aes(y = race)) +
  geom_bar()
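
By default, fct_relevel() moves the named levels to the front. Its `after` argument places them elsewhere. A minimal sketch on a toy factor (`f` is invented for illustration):

```r
library(forcats)

# Toy factor with alphabetical levels a, b, c, d
f <- factor(c("a", "b", "c", "d"))

levels(fct_relevel(f, "c"))               # "c" "a" "b" "d" (moved to the front)
levels(fct_relevel(f, "a", after = 2))    # "b" "c" "a" "d" (after the 2nd level)
levels(fct_relevel(f, "a", after = Inf))  # "b" "c" "d" "a" (moved to the end)
```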

`fct_infreq()` and `fct_rev()`

fct_infreq() reorders levels by frequency, from most frequent to least frequent.
Here the order is defined by the number of respondents in each race category.
fct_rev() reverses the order of the levels.

Code

gss_cat %>%
  mutate(race = fct_infreq(race)) %>%
  ggplot(aes(y = race)) + geom_bar()

Code

gss_cat %>%
  mutate(race = fct_rev(fct_infreq(race))) %>%
  ggplot(aes(y = race)) + geom_bar()
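
The effect on the levels themselves is easiest to see without a plot. A small sketch on a toy factor (`f` is invented for illustration):

```r
library(forcats)

# Toy factor: "c" occurs 3 times, "a" twice, "b" once
f <- factor(c("b", "a", "c", "a", "c", "c"))

levels(f)                       # "a" "b" "c" (alphabetical default)
levels(fct_infreq(f))           # "c" "a" "b" (most frequent first)
levels(fct_rev(fct_infreq(f)))  # "b" "a" "c" (least frequent first)
```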

`fct_reorder()`

Reorder based on numeric values: fct_reorder()

Code

gss_cat %>% count(race)

# A tibble: 3 × 2
  race      n
  <fct> <int>
1 Other  1959
2 Black  3129
3 White 16395

Code

gss_cat %>%
  count(race) %>%
  mutate(race = fct_reorder(race, n))%>%
  ggplot(aes(n, race)) + geom_col()
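
When a level has several values, fct_reorder() summarises them first (the median by default). A sketch of the `.fun` and `.desc` arguments on an invented data frame (`df`, `group` and `value` are illustrative names):

```r
library(forcats)

# Hypothetical data: medians are a = 10.5, b = 5.5, c = 8.5
df <- data.frame(
  group = factor(c("a", "a", "b", "b", "c", "c")),
  value = c(1, 20, 5, 6, 8, 9)
)

levels(fct_reorder(df$group, df$value))                # "b" "c" "a" (by median)
levels(fct_reorder(df$group, df$value, .fun = min))    # "a" "b" "c" (by minimum)
levels(fct_reorder(df$group, df$value, .desc = TRUE))  # "a" "c" "b" (median, descending)
```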

Compare the two approaches: fct_reorder() on the counted data and fct_rev(fct_infreq()) on the raw data give the same bar order.

Code

gss_cat %>%
  count(race) %>%
  mutate(race = fct_reorder(race, n))%>%
  ggplot(aes(n, race)) + geom_col()

Code

gss_cat %>% mutate(race = fct_rev(fct_infreq(race))) %>% 
  ggplot(aes(y = race)) + geom_bar()

`fct_recode()`

Rename (recode) factor levels. The syntax is new name = old level, and the old level must exist in the factor.

Code

gss_cat %>% 
  mutate(marital1 = fct_recode(marital, 
                               "Single" = "Never married",    # new name = old level
                               "Divorced" = "Separated")) %>% # merge into the existing "Divorced" level
  ggplot(aes(x = marital1)) +  
  geom_bar()
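
The direction of the mapping (new = old) is easy to get wrong. A minimal sketch on a toy factor, where two old levels are given the same new name and therefore merge (`f` is invented for illustration):

```r
library(forcats)

# Toy factor with levels apple, banana, bear, dear
f <- factor(c("apple", "bear", "banana", "dear"))

# new name on the left, existing level on the right
recoded <- fct_recode(f, fruit = "apple", fruit = "banana")

levels(recoded)        # "fruit" "bear" "dear"
as.character(recoded)  # "fruit" "bear" "fruit" "dear"
```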

`fct_collapse()`

Combine multiple levels into one.

Code

gss_cat %>% 
  mutate(marital2 = fct_collapse(marital, 
                                 # each vector must list existing levels of marital
                                 "Single" = c("Never married", "No answer"),
                                 "Married" = "Married", 
                                 "Separated" = c("Widowed", "Divorced", "Separated"))) %>%  
  ggplot(aes(x = marital2)) +  
  geom_bar()
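
fct_collapse() takes one vector of old levels per new level, which is more compact than repeating the new name in fct_recode(). A small sketch on a toy factor (`animals` is invented for illustration):

```r
library(forcats)

# Toy factor with levels cat, dog, salmon, trout
animals <- factor(c("cat", "dog", "trout", "salmon", "dog"))

grouped <- fct_collapse(animals,
                        mammal = c("cat", "dog"),
                        fish   = c("trout", "salmon"))

levels(grouped)  # "mammal" "fish"
table(grouped)   # mammal: 3, fish: 2
```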

`fct_drop()`

Drop (remove) unused levels. marital has no unused levels, so its levels are unchanged below.

Code

gss_cat$marital <- fct_drop(gss_cat$marital) 
levels(gss_cat$marital)

[1] "No answer"     "Never married" "Separated"     "Divorced"     
[5] "Widowed"       "Married"
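
The effect is visible on race, whose "Not applicable" level has no observations in this dataset. A short sketch that works on a copy, so the original column is not modified:

```r
library(forcats)

# Packaged copy of the column, independent of any earlier reassignment
race <- forcats::gss_cat$race

levels(race)            # "Other" "Black" "White" "Not applicable"
levels(fct_drop(race))  # "Other" "Black" "White"
```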

`fct_explicit_na()`

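Handling missing values
Make missing values (NA) an explicit factor level. rincome contains no NA values, so no new level appears below.
Note: fct_explicit_na() is deprecated as of forcats 1.0.0; fct_na_value_to_level() is its replacement.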

Code

gss_cat$rincome <- fct_explicit_na(gss_cat$rincome, 
                                   na_level = "Not specified") 
levels(gss_cat$rincome)

 [1] "No answer"      "Don't know"     "Refused"        "$25000 or more"
 [5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999" 
 [9] "$7000 to 7999"  "$6000 to 6999"  "$5000 to 5999"  "$4000 to 4999" 
[13] "$3000 to 3999"  "$1000 to 2999"  "Lt $1000"       "Not applicable"
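
On a factor that does contain NA values, the new level becomes visible. A minimal sketch using fct_na_value_to_level(), assuming forcats 1.0.0 or later is installed (`f` is a toy factor invented for illustration):

```r
library(forcats)

# Toy factor with two missing values
f <- factor(c("a", NA, "b", NA))

f_explicit <- fct_na_value_to_level(f, level = "Not specified")

levels(f_explicit)  # "a" "b" "Not specified"
table(f_explicit)   # "Not specified" is now counted like any other level
```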

`fct_lump()`

Collapse rare levels into “Other”.

Code

gss_cat$partyid <- fct_lump(gss_cat$partyid, n = 3) # Keep top 3 levels 
levels(gss_cat$partyid)

[1] "Independent"      "Not str democrat" "Strong democrat"  "Other"
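
forcats (0.5.0 and later) also provides more specific variants of fct_lump(). A sketch of two of them, fct_lump_n() and fct_lump_min(), on a toy factor (`f` is invented for illustration):

```r
library(forcats)

# Toy factor: a occurs 5 times, b 3 times, c and d once each
f <- factor(c(rep("a", 5), rep("b", 3), "c", "d"))

levels(fct_lump_n(f, n = 2))      # "a" "b" "Other" (keep the 2 most frequent)
levels(fct_lump_min(f, min = 3))  # "a" "b" "Other" (lump levels seen fewer than 3 times)
levels(fct_lump_n(f, n = 2, other_level = "Rare"))  # custom name for the lumped level
```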

`fct_anon()`

Replace levels with arbitrary numeric codes, assigned in random order.

Code

gss_cat$race <- fct_anon(gss_cat$race) 
levels(gss_cat$race)

[1] "1" "2" "3" "4"
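
The `prefix` argument makes the anonymised codes easier to recognise. A small sketch on a toy factor; which original level receives which code is random (`f` is invented for illustration):

```r
library(forcats)

# Toy factor: "x" occurs twice, "y" and "z" once each
f <- factor(c("x", "y", "z", "x"))

anon <- fct_anon(f, prefix = "id_")

levels(anon)  # "id_1" "id_2" "id_3"
table(anon)   # the counts survive, the original labels do not
```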

`fct_other()`

Replace levels with “Other”. With keep =, every level that is not listed is replaced.

Code

gss_cat$relig <- fct_other(gss_cat$relig, keep = c("Protestant", "Catholic"))
levels(gss_cat$relig)

[1] "Catholic"   "Protestant" "Other"
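
fct_other() also accepts `drop =`, the mirror image of `keep =`: only the listed levels are replaced. A minimal sketch on a toy factor (`f` is invented for illustration):

```r
library(forcats)

# Toy factor with levels a, b, c, d
f <- factor(c("a", "b", "c", "d", "a"))

levels(fct_other(f, keep = c("a", "b")))  # "a" "b" "Other" (everything else replaced)
levels(fct_other(f, drop = c("a", "b")))  # "c" "d" "Other" (only a and b replaced)
```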

Practical Example

Analyze the rincome variable (reported income).

Steps:
1. Recode income levels for clarity.
2. Reorder levels by frequency.
3. Visualize using ggplot2.

Code

library(ggplot2) 
gss_cat$rincome <- fct_recode(gss_cat$rincome,"Low" = "Lt $1000", 
                              "Medium" = "$1000 to 2999", 
                              "High" = "$3000 to 3999") 
gss_cat$rincome <- fct_infreq(gss_cat$rincome) 
ggplot(gss_cat, aes(rincome)) + geom_bar() + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
