Factor Variables Management with forcats
National Data Management Center for Health (NDMC) at EPHI
Factors are used to represent categorical data in R.
Levels: Unique categories in the factor (e.g., “Male”, “Female”).

Data Types
- Nominal: Categories without a specific order (e.g., colors, gender).
- Ordinal: Categories with a meaningful order or ranking (e.g., education levels, satisfaction ratings).
Example:
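A minimal base R sketch of the two kinds of factor, written for illustration (the vectors `sex` and `education` are made-up toy data, not taken from the slides):

```r
# Nominal factor: categories have no inherent order
sex <- factor(c("Male", "Female", "Female", "Male"))
levels(sex)   # levels are sorted alphabetically by default

# Ordinal factor: levels carry a meaningful order
education <- factor(
  c("Secondary", "Primary", "Tertiary", "Primary"),
  levels  = c("Primary", "Secondary", "Tertiary"),
  ordered = TRUE
)
levels(education)
education < "Tertiary"   # comparisons are defined for ordered factors
```

Setting `levels` explicitly is what fixes the order; without it, R would sort the education labels alphabetically.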
Challenges of working with factors in base R:
- Levels may not match the data (e.g., missing levels).
- Reordering levels can be cumbersome.
- Combining or recoding levels requires manual effort.
We therefore use the forcats package, which simplifies these tasks.
What is the forcats package?
forcats is a tidyverse package of tools for working with factors (categorical variables) in R.
Overview:
- fct_relevel(): Reorder levels manually.
- fct_infreq(): Reorder levels by frequency.
- fct_reorder(): Reorder levels by another variable.
- fct_lump(): Collapse least/most frequent levels.
- fct_recode(): Recode levels.
- fct_collapse(): Combine levels.
- fct_drop(): Drop unused levels.
- fct_other(): Replace levels with “Other”.
- fct_anon(): Anonymize levels (e.g., for privacy).
gss_cat dataset
A sample dataset from the General Social Survey, included in the forcats package.
Variables:
- year: Year of the survey.
- age: Age of respondents.
- marital: Marital status (factor).
- race: Race (factor).
- rincome: Reported income (factor).
Load the Dataset:
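A sketch of the loading step, assuming the forcats package is installed. Calls like these would produce a preview, a structure listing, and a summary of the kind printed next; the exact code used on the original slides is not preserved.

```r
library(forcats)   # provides gss_cat and the fct_*() helpers

head(gss_cat)      # first six rows
str(gss_cat)       # column types and factor levels
summary(gss_cat)   # per-column summaries, including level counts
```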
# A tibble: 6 × 9
year marital age race rincome partyid relig denom tvhours
<int> <fct> <int> <fct> <fct> <fct> <fct> <fct> <int>
1 2000 Never married 26 White $8000 to 9999 Ind,near r… Prot… Sout… 12
2 2000 Divorced 48 White $8000 to 9999 Not str re… Prot… Bapt… NA
3 2000 Widowed 67 White Not applicable Independent Prot… No d… 2
4 2000 Never married 39 White Not applicable Ind,near r… Orth… Not … 4
5 2000 Divorced 25 White Not applicable Not str de… None Not … 1
6 2000 Married 25 White $20000 - 24999 Strong dem… Prot… Sout… NA
tibble [21,483 × 9] (S3: tbl_df/tbl/data.frame)
$ year : int [1:21483] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
$ marital: Factor w/ 6 levels "No answer","Never married",..: 2 4 5 2 4 6 2 4 6 6 ...
$ age : int [1:21483] 26 48 67 39 25 25 36 44 44 47 ...
$ race : Factor w/ 4 levels "Other","Black",..: 3 3 3 3 3 3 3 3 3 3 ...
$ rincome: Factor w/ 16 levels "No answer","Don't know",..: 8 8 16 16 16 5 4 9 4 4 ...
$ partyid: Factor w/ 10 levels "No answer","Don't know",..: 6 5 7 6 9 10 5 8 9 4 ...
$ relig : Factor w/ 16 levels "No answer","Don't know",..: 15 15 15 6 12 15 5 15 15 15 ...
$ denom : Factor w/ 30 levels "No answer","Don't know",..: 25 23 3 30 30 25 30 15 4 25 ...
$ tvhours: int [1:21483] 12 NA 2 4 1 NA 3 NA 0 3 ...
year marital age race
Min. :2000 No answer : 17 Min. :18.00 Other : 1959
1st Qu.:2002 Never married: 5416 1st Qu.:33.00 Black : 3129
Median :2006 Separated : 743 Median :46.00 White :16395
Mean :2007 Divorced : 3383 Mean :47.18 Not applicable: 0
3rd Qu.:2010 Widowed : 1807 3rd Qu.:59.00
Max. :2014 Married :10117 Max. :89.00
NA's :76
rincome partyid relig
$25000 or more:7363 Independent :4119 Protestant:10846
Not applicable:7043 Not str democrat :3690 Catholic : 5124
$20000 - 24999:1283 Strong democrat :3490 None : 3523
$10000 - 14999:1168 Not str republican:3032 Christian : 689
$15000 - 19999:1048 Ind,near dem :2499 Jewish : 388
Refused : 975 Strong republican :2314 Other : 224
(Other) :2603 (Other) :2339 (Other) : 689
denom tvhours
Not applicable :10072 Min. : 0.000
Other : 2534 1st Qu.: 1.000
No denomination : 1683 Median : 2.000
Southern baptist: 1536 Mean : 2.981
Baptist-dk which: 1457 3rd Qu.: 4.000
United methodist: 1067 Max. :24.000
(Other) : 3134 NA's :10146
fct_relevel(): Reordering Factor Levels
fct_reorder()
# A tibble: 3 × 2
race n
<fct> <int>
1 Other 1959
2 Black 3129
3 White 16395
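The reordering helpers can be sketched on gss_cat as follows. This is an illustration, not the original slide code: the choice of `race`, `relig`, and `tvhours` as example variables is an assumption, and rows with missing `tvhours` are dropped first so that the summary function never sees `NA`.

```r
library(forcats)

# fct_relevel(): move the named levels to the front, in the order given
race_manual <- fct_relevel(gss_cat$race, "White", "Black", "Other")
levels(race_manual)

# fct_infreq(): order levels by how often they occur (most frequent first)
race_by_freq <- fct_infreq(gss_cat$race)
levels(race_by_freq)

# fct_reorder(): order the levels of one factor by a summary of another variable
# (here: religion ordered by mean TV hours per day)
tv <- gss_cat[!is.na(gss_cat$tvhours), ]
relig_by_tv <- fct_reorder(tv$relig, tv$tvhours, .fun = mean)
head(levels(relig_by_tv))
```

None of these functions changes the data values; only the order of the levels changes, which in turn controls the order of categories in tables and plots.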
fct_recode()
fct_collapse()
fct_drop()
fct_explicit_na()
 [1] "No answer" "Don't know" "Refused" "$25000 or more"
[5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999"
[9] "$7000 to 7999" "$6000 to 6999" "$5000 to 5999" "$4000 to 4999"
[13] "$3000 to 3999" "$1000 to 2999" "Lt $1000" "Not applicable"
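A hedged sketch of the four recoding helpers named above, applied to gss_cat. The new labels ("Single", "Unknown", "Under $5000") and the toy vector `age_band` are illustrative choices, not part of the original material.

```r
library(forcats)

# fct_recode(): rename levels with new = old pairs; unmentioned levels are kept
marital_short <- fct_recode(gss_cat$marital,
  "Single"  = "Never married",
  "Unknown" = "No answer"
)
levels(marital_short)

# fct_collapse(): merge several old levels into one new level
income_grouped <- fct_collapse(gss_cat$rincome,
  "Unknown"     = c("No answer", "Don't know", "Refused", "Not applicable"),
  "Under $5000" = c("Lt $1000", "$1000 to 2999", "$3000 to 3999", "$4000 to 4999")
)
levels(income_grouped)

# fct_drop(): remove levels that never occur in the data
# ("Not applicable" has a count of 0 in race)
race_used <- fct_drop(gss_cat$race)
levels(race_used)

# fct_explicit_na(): turn NA into a visible level
# (deprecated since forcats 1.0.0 in favour of fct_na_value_to_level())
age_band <- factor(c("18-39", NA, "40+"))   # made-up toy vector
age_explicit <- fct_explicit_na(age_band, na_level = "(Missing)")
levels(age_explicit)
```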
fct_lump()
fct_anon()
fct_other()
rincome variable (reported income).
ggplot2.
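For fct_lump(), fct_other(), and fct_anon(), a short sketch on gss_cat may help. The variables and the value `n = 3` are assumptions chosen for illustration.

```r
library(forcats)

# fct_lump(): keep the n most frequent levels and pool the rest into "Other"
relig_top3 <- fct_lump(gss_cat$relig, n = 3)
levels(relig_top3)

# fct_other(): name the levels to keep; every other level becomes "Other"
marital_simple <- fct_other(gss_cat$marital, keep = c("Married", "Never married"))
levels(marital_simple)

# fct_anon(): replace level labels with arbitrary numeric IDs
denom_anon <- fct_anon(gss_cat$denom)
head(levels(denom_anon))
```

The difference between the first two is who decides what is kept: fct_lump() decides from the frequencies, while fct_other() takes an explicit list from the analyst.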
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
Order is ascending from bottom to top
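One way to obtain such a bottom-to-top ordering is sketched below, assuming the gapminder and ggplot2 packages are installed. The restriction to Asia in 2007 and the use of `lifeExp` as the ordering variable are assumptions made to keep the example small; the original plot code is not preserved.

```r
library(forcats)
library(ggplot2)
library(gapminder)

# Keep one year and one continent so the plot stays readable
asia_2007 <- subset(gapminder, year == 2007 & continent == "Asia")

# Drop the unused country levels, then reorder countries by life expectancy.
# ggplot2 draws the first level at the bottom of a discrete y axis,
# so values increase from bottom to top.
asia_2007$country <- fct_reorder(fct_drop(asia_2007$country), asia_2007$lifeExp)

p <- ggplot(asia_2007, aes(x = lifeExp, y = country)) +
  geom_point()
p
```

Passing `.desc = TRUE` to fct_reorder() would reverse the order, placing the largest value at the bottom.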