Reshaping data

National Data Management Center for Health (NDMC) at EPHI



Converting data from wide to long or long to wide

Reshaping data using tidyr package

  • The following four tidyr functions are the key tools for reshaping (tidying) data:
  • gather() transforms data from wide to long: gather(data, key, value, na.rm = FALSE)
  • spread() transforms data from long to wide: spread(data, key, value)
  • separate() splits one variable into two: separate(data, col, into, sep = "_", remove = TRUE)
  • unite() unites two or more variables into one: unite(data, col, ..., sep = "_", remove = TRUE)

gather()

  • The objective of the gather() function is to transform the data from wide to long.

Syntax

Code
gather(data, key, value, na.rm = FALSE)

Arguments:

  • data: The data frame used to reshape the dataset

  • key: Name of the new column that will hold the old column names

  • value: Name of the new column that will hold the cell values

  • na.rm: Remove missing values. FALSE by default

Example

  • Below, we can visualize the concept of reshaping wide to long.

  • We want to create a single column named growth, filled with the values of the quarter columns.

Code
library(tidyr)
data <- data.frame(   # Create a dataset
  Treatment = c("A", "B", "C"),
  q1_2021 = c(0.03, 0.05, 0.01),
  q2_2021 = c(0.05, 0.07, 0.02),
  q3_2021 = c(0.04, 0.05, 0.01),
  q4_2021 = c(0.03, 0.02, 0.04),
  q1_2022 = c(0.06, 0.07, 0.04),
  q2_2022 = c(0.08, 0.07, 0.08),
  q3_2022 = c(0.10, 0.08, 0.08),
  q4_2022 = c(0.09, 0.09, 0.10))
data
  Treatment q1_2021 q2_2021 q3_2021 q4_2021 q1_2022 q2_2022 q3_2022 q4_2022
1         A    0.03    0.05    0.04    0.03    0.06    0.08    0.10    0.09
2         B    0.05    0.07    0.05    0.02    0.07    0.07    0.08    0.09
3         C    0.01    0.02    0.01    0.04    0.04    0.08    0.08    0.10

Reshape the data

  • In the gather() call, we create two new variables, quarter and growth, because our original dataset has one grouping variable (Treatment) plus the key-value pairs.
Code
long <- data |>
  gather(quarter, growth, q1_2021:q4_2022)
head(long, 12)
   Treatment quarter growth
1          A q1_2021   0.03
2          B q1_2021   0.05
3          C q1_2021   0.01
4          A q2_2021   0.05
5          B q2_2021   0.07
6          C q2_2021   0.02
7          A q3_2021   0.04
8          B q3_2021   0.05
9          C q3_2021   0.01
10         A q4_2021   0.03
11         B q4_2021   0.02
12         C q4_2021   0.04

spread()

  • The spread() function does the opposite of gather(): it transforms the data from long to wide.

Syntax

Code
spread(data, key, value)

Arguments:

  • data: The data frame used to reshape the dataset

  • key: The column whose values become the new column names

  • value: The column whose values fill the new columns

Example

  • We can reshape the long dataset back to wide format with spread().

Reshape the data

Code
wide <- long |>
  spread(quarter, growth) 
wide
  Treatment q1_2021 q1_2022 q2_2021 q2_2022 q3_2021 q3_2022 q4_2021 q4_2022
1         A    0.03    0.06    0.05    0.08    0.04    0.10    0.03    0.09
2         B    0.05    0.07    0.07    0.07    0.05    0.08    0.02    0.09
3         C    0.01    0.04    0.02    0.08    0.01    0.08    0.04    0.10
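Note that gather() and spread() still work but have been superseded in tidyr 1.0.0 by pivot_longer() and pivot_wider(). A minimal sketch of the same reshaping with the newer functions, using a small toy data frame rather than the full dataset:

```r
library(tidyr)

# a small toy data frame mirroring the structure used above
df <- data.frame(Treatment = c("A", "B"),
                 q1_2021   = c(0.03, 0.05),
                 q2_2021   = c(0.05, 0.07))

# wide -> long: the modern equivalent of gather()
long2 <- df |>
  pivot_longer(q1_2021:q2_2021, names_to = "quarter", values_to = "growth")

# long -> wide: the modern equivalent of spread()
wide2 <- long2 |>
  pivot_wider(names_from = quarter, values_from = growth)
```

The argument names (names_to/values_to instead of key/value) are more explicit, which is the main reason the pivot functions replaced the older pair.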

separate()

  • The separate() function splits a column into two according to a separator.

  • This function is helpful in situations where a variable stores a date.

  • If our analysis requires focusing on month and year separately, we can split the column into two new variables.

Syntax

Code
separate(data, col, into, sep= "", remove = TRUE)

Arguments:

  • data: The data frame used to reshape the dataset
  • col: The column to split
  • into: The names of the new variables
  • sep: The symbol that separates the values, e.g. "-", "_", "&"
  • remove: Remove the old column. TRUE by default.

Example

  • We can split the quarter from the year in the tidier dataset by applying the separate() function.
Code
separate_tidier <- long |>
  separate(quarter, c("Qrt", "year"), sep = "_")
head(separate_tidier, 12)
   Treatment Qrt year growth
1          A  q1 2021   0.03
2          B  q1 2021   0.05
3          C  q1 2021   0.01
4          A  q2 2021   0.05
5          B  q2 2021   0.07
6          C  q2 2021   0.02
7          A  q3 2021   0.04
8          B  q3 2021   0.05
9          C  q3 2021   0.01
10         A  q4 2021   0.03
11         B  q4 2021   0.02
12         C  q4 2021   0.04

unite()

  • The unite() function concatenates two or more columns into one.

Syntax

Code
unite(data, col, ..., sep = "_", remove = TRUE)

Arguments:

  • data: The data frame used to reshape the dataset
  • col: Name of the new column
  • ...: Names of the columns to concatenate
  • sep: The symbol used to join the values, e.g. "-", "_", "&"
  • remove: Remove the old columns. TRUE by default.

Example

  • In the above example, we separated quarter from year.

  • What if we want to merge them back together?

  • We use the following code:

Code
unite_tidier <- separate_tidier |>
  unite(Quarter, Qrt, year, sep = "_")
head(unite_tidier)
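If you want to keep the original columns alongside the combined one, unite() accepts remove = FALSE. A small sketch on a toy data frame:

```r
library(tidyr)

df <- data.frame(Qrt = c("q1", "q2"), year = c("2021", "2021"))

# remove = FALSE keeps Qrt and year next to the new Quarter column
df |> unite(Quarter, Qrt, year, sep = "_", remove = FALSE)
```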

Dealing with Missing Data

Missing data types

  • One of the most important aspects of data cleaning is missing values.

Types of “missing” data:

  • NA - general missing data
  • NaN - stands for “Not a Number”, happens when you do 0/0.
  • Inf and -Inf - Infinity, happens when you divide a positive number (or negative number) by 0.

Finding Missing data

  • is.na() - looks for NA and NaN
  • is.nan() - looks for NaN
  • is.infinite() - looks for Inf or -Inf
Code
test <- c(0, NA, -1)
test/0
[1]  NaN   NA -Inf
Code
test <- test/0
is.na(test)
[1]  TRUE  TRUE FALSE
Code
is.nan(test)
[1]  TRUE FALSE FALSE
Code
is.infinite(test)
[1] FALSE FALSE  TRUE

Useful checking functions

  • any() returns TRUE if ANY element is TRUE
    • any(is.na(x)) - do we have any NAs in x?
Code
A = c(1, 2, 4, NA)
B = c(1, 2, 3, 4)
any(is.na(A)) # are there any NAs - YES/TRUE
[1] TRUE
Code
any(is.na(B)) # are there any NAs- NO/FALSE
[1] FALSE
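Beyond a yes/no answer, is.na() combined with sum() or colSums() counts missing values, since TRUE is treated as 1:

```r
A <- c(1, 2, 4, NA)

sum(is.na(A))    # how many NAs? [1] 1
mean(is.na(A))   # proportion missing: [1] 0.25

# per-column NA counts for a data frame (airquality is built into R)
colSums(is.na(airquality))
```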

naniar

  • Sometimes you need to examine missing values across lots of data, though; the naniar package is a good option for this.
  • The pct_complete() function shows the percentage of a given data object (vector or data frame) that is complete.
Code
#install.packages("naniar")
library(naniar)
pct_complete(airquality)
[1] 95.20697
Code
library(janitor)
library(dplyr)
airquality |> select(Ozone) |>
pct_complete()
[1] 75.81699

naniar plots

  • The gg_miss_var() function creates a plot of the number of missing values for each variable (it requires a data frame).
Code
gg_miss_var(airquality)

Missing Data Issues

Recall that mathematical operations with NA often result in NAs.

Code
sum(c(1,2,3,NA))
[1] NA
Code
mean(c(2,4,NA))
[1] NA
Code
median(c(1,2,3,NA))
[1] NA
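Most summary functions take an na.rm argument that drops NAs before computing, which is usually what you want once you have decided the missingness is ignorable:

```r
sum(c(1, 2, 3, NA), na.rm = TRUE)     # [1] 6
mean(c(2, 4, NA), na.rm = TRUE)       # [1] 3
median(c(1, 2, 3, NA), na.rm = TRUE)  # [1] 2
```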

filter() and missing data

  • Be careful with missing data using subsetting!

  • filter() removes rows with missing values by default, because R can't tell for sure whether an NA value meets the condition.

  • To keep them, add an is.na() condition with |.

Code
df <- tibble(Dog = c(0, NA, 2, 3, 1, 1), 
            Cat = c(NA, 8, 6, NA, 2, NA))
Code
df 
# A tibble: 6 × 2
    Dog   Cat
  <dbl> <dbl>
1     0    NA
2    NA     8
3     2     6
4     3    NA
5     1     2
6     1    NA
Code
df |> filter(Dog < 3)
# A tibble: 4 × 2
    Dog   Cat
  <dbl> <dbl>
1     0    NA
2     2     6
3     1     2
4     1    NA

filter() and missing data

Code
df |> filter(Dog < 3 | is.na(Dog))
# A tibble: 5 × 2
    Dog   Cat
  <dbl> <dbl>
1     0    NA
2    NA     8
3     2     6
4     1     2
5     1    NA

To remove rows with NA values for a variable, use drop_na()

  • A function from the tidyr package (it needs a data frame to start).

  • Don't do this unless you have thought about whether dropping NA values makes sense, based on what those values mean in your data.

Code
df |> drop_na(Dog)
# A tibble: 5 × 2
    Dog   Cat
  <dbl> <dbl>
1     0    NA
2     2     6
3     3    NA
4     1     2
5     1    NA
Code
df |> drop_na()
# A tibble: 2 × 2
    Dog   Cat
  <dbl> <dbl>
1     2     6
2     1     2

Change a value to be NA

  • The na_if() function of dplyr can be helpful for this. Let’s say we think that all 0 values should be NA.
Code
df 
# A tibble: 6 × 2
    Dog   Cat
  <dbl> <dbl>
1     0    NA
2    NA     8
3     2     6
4     3    NA
5     1     2
6     1    NA
Code
df |> 
  mutate(Dog = na_if(x = Dog, y = 0))
# A tibble: 6 × 2
    Dog   Cat
  <dbl> <dbl>
1    NA    NA
2    NA     8
3     2     6
4     3    NA
5     1     2
6     1    NA
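The reverse operation, replacing NA with a value, is handled by tidyr's replace_na(). A minimal sketch on a small tibble:

```r
library(dplyr)
library(tidyr)

df <- tibble(Dog = c(0, NA, 2), Cat = c(NA, 8, 6))

# replace every NA in Cat with 0
df |> mutate(Cat = replace_na(Cat, 0))
```

As with drop_na(), think about whether filling NAs with a constant makes sense for your data before doing this.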

Summary

  • is.na(), any(is.na()), count(), and naniar functions like gg_miss_var() can help determine whether we have NA values

  • filter() automatically removes NA values, since it can't confirm whether the condition is met (add | is.na() to keep them)

  • drop_na() can help you remove NA values from a variable or an entire data frame

  • NA values can change your calculation results

  • think about what NA values represent - don’t drop them if you shouldn’t

Data cleaning

  • Cleaning data is one of the most essential parts of data analysis.

  • Data cleaning is the process of converting messy data into reliable data that can be analyzed in R.

  • Data cleaning improves data quality and your productivity in R.

    • Format ugly data frame column names in R

    • Delete all blank rows in R

    • Remove duplicate rows in R

    • Converting numeric to date format

    • Checking consistency using the tabyl() function

clean_names()

  • This function is used to change and clean up names of columns in data frames.

  • It can be used to ensure consistency.

  • You can change all names to lowercase separated by underscores, camel case variants, title case, or other styles.

  • It can also be used to remove parts of names and any special characters, including replacing % symbols with the word percent.

  • The recommended style for variable names in R is a single lowercase word without special characters.

  • The clean_names() function, available in the janitor package, will be used to clean the column names.

Code
library(janitor); library(report)
data<-clean_names(iris)
data |> head()
  sepal_length sepal_width petal_length petal_width species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
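janitor also exports make_clean_names(), which applies the same cleaning rules to an arbitrary character vector, useful when you want cleaned names outside a data frame:

```r
library(janitor)

# the same rules clean_names() applies to data frame columns
make_clean_names(c("Sepal.Length", "first NAME"))  # "sepal_length" "first_name"

# special characters are handled too; % is spelled out as "percent"
make_clean_names("% Missing")
```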

remove_empty() & remove_constant()

  • If you want to remove rows and/or columns that are completely empty, you can use the remove_empty() function from the janitor package.
  • remove_empty() removes both empty rows and empty columns.
Code
data3<- remove_empty(data, which = c("rows","cols"), quiet = F)
data3 |> head()
  sepal_length sepal_width petal_length petal_width species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
  • Let's add two constant columns to the dataset and see how we can remove this junk with janitor.
Code
# add two constant columns
data4 <- data3 |> 
  mutate(constant_column   = 42, constant_column_2 = "text") |> head()
Code
library(kableExtra)
data4 |> 
  remove_constant() |> remove_empty() |> kbl() |>
  kable_classic_2(full_width = F) 
sepal_length sepal_width petal_length petal_width
5.1 3.5 1.4 0.2
4.9 3.0 1.4 0.2
4.7 3.2 1.3 0.2
4.6 3.1 1.5 0.2
5.0 3.6 1.4 0.2
5.4 3.9 1.7 0.4

get_dupes()

  • This function retrieves any duplicates in the dataset so that they can be examined during data clean-up operations.

  • One challenge is determining whether a duplicate is truly a duplicate.

  • The function returns a data frame that includes a dupe_count column giving the number of copies of each duplicated value.

Code
data3 |> get_dupes() 
  sepal_length sepal_width petal_length petal_width   species dupe_count
1          5.8         2.7          5.1         1.9 virginica          2
2          5.8         2.7          5.1         1.9 virginica          2
  • We can also get duplicate values by column names
Code
data3 |> get_dupes(sepal_length, sepal_width) |> head()
  sepal_length sepal_width dupe_count petal_length petal_width    species
1          5.8         2.7          4          4.1         1.0 versicolor
2          5.8         2.7          4          3.9         1.2 versicolor
3          5.8         2.7          4          5.1         1.9  virginica
4          5.8         2.7          4          5.1         1.9  virginica
5          5.1         3.8          3          1.5         0.3     setosa
6          5.1         3.8          3          1.9         0.4     setosa
  • We can also use the distinct() function from the dplyr package to remove duplicate rows.
Code
library(dplyr)
data_cleaned <- data3 |> distinct()
data_cleaned |> head()
  sepal_length sepal_width petal_length petal_width species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

If we want to remove duplicate rows with respect to a specific variable, we can use distinct() function again. For example, we remove duplicate rows with respect to petal_length.

Code
data_cleaned2 <- data_cleaned |>  distinct(petal_length, .keep_all = TRUE)
data_cleaned2 |> head()
  sepal_length sepal_width petal_length petal_width species
1          5.1         3.5          1.4         0.2  setosa
2          4.7         3.2          1.3         0.2  setosa
3          4.6         3.1          1.5         0.2  setosa
4          5.4         3.9          1.7         0.4  setosa
5          4.8         3.4          1.6         0.2  setosa
6          4.3         3.0          1.1         0.1  setosa

tabyl()

  • This function is used to produce frequency tables and contingency tables, i.e. counts of each category or combination of categories of data.

  • Unlike the base R table() function, tabyl() returns a data frame which makes results easier to work with.

  • The code below creates a data frame showing the number of rows of data (n) for each species in the dataset.

Code
data_cleaned2 |> tabyl(species)
    species  n   percent
     setosa  9 0.2093023
 versicolor 19 0.4418605
  virginica 15 0.3488372
  • Also returned is a percent column, showing the percentage of rows in each species.
Code
data_cleaned2 |> tabyl(species)|>
  adorn_pct_formatting(digits = 2)
    species  n percent
     setosa  9  20.93%
 versicolor 19  44.19%
  virginica 15  34.88%
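tabyl results can be decorated further with janitor's adorn_* helpers; for example, adorn_totals() appends a totals row. A sketch on the built-in iris data (rather than data_cleaned2, to stay self-contained):

```r
library(janitor)

iris |>
  tabyl(Species) |>
  adorn_totals("row") |>
  adorn_pct_formatting(digits = 1)
```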

Explore missing values

Code
# plot missing data (using raw data)
DataExplorer::plot_missing(
  title   = "% of Missing Data (filtered to cols w/missing data)",
  data    = airquality,
  ggtheme = tidyquant::theme_tq(), 
  missing_only = F) 

convert_to_date()

  • Modern Excel always tries to automate things, and I hate it!

  • For instance, you write a number into a cell and Excel sometimes immediately converts it to a date. Then you try to put a date in a cell, and it returns a number.

  • Moreover, Excel has some strange date encoding systems, which can be confused with normal numeric columns.

  • Luckily, our dirty dataset has the word “date” in the column name “hire_date”; otherwise we wouldn’t know that it is a date:

Code
convert_to_date(d$hire_date)

excel_numeric_to_date()

  • Ever load data from Excel and see a value like 42223 where a date should be?

  • This function converts those serial numbers to class Date, with options for different Excel date encoding systems, preserving fractions of a date as time (in which case the returned value is of class POSIXlt), and specifying a time zone.

Code
excel_numeric_to_date(41103)
[1] "2012-07-13"
Code
excel_numeric_to_date(41103.01) # ignores decimal places, returns Date object
[1] "2012-07-13"
Code
excel_numeric_to_date(41103.01, include_time = TRUE) # returns POSIXlt object
[1] "2012-07-13 00:14:24 EAT"

Convert a mix of date and datetime formats to date

  • Building on excel_numeric_to_date(), the new functions convert_to_date() and convert_to_datetime() are more robust to a mix of inputs.

  • Handy when reading many spreadsheets that should have the same column formats, but don’t.

For instance, here a vector with a date and an Excel datetime sees both values successfully converted to Date class:

Code
convert_to_date(c("2020-02-29", "40000.1"))
[1] "2020-02-29" "2009-07-06"

Dates

  • Before dates in a dataset are loaded into R, typically they are stored as a column of character (string) values.
  • However, dates are inherently numeric and we lose this information when they are stored as strings.
Code
# string dates have no numeric value, so this errors
"2018-03-05" + 1
  • In R, we want to convert our string dates to R class Date, which preserves the dates’ numeric values and allows us to take advantage of the many date-related functions in R.

  • Once converted to class Date, the numeric value for the date represents the number of days since January 1, 1970 (1970-01-01).

as.Date()

  • Base R provides as.Date() to convert strings to dates, but it can be intimidating and unwieldy to use, particularly if your dates are not stored in one of two default formats (see ?strptime for more about date formats).
Code
# as.Date only accepts a couple of formats by default
# good
as.Date("2015-02-14")
# bad
as.Date("02/14/2014")
# specify a format to fix
as.Date("02/14/2014", format="%m/%d/%Y")
  • Once our dates are class Date, we can perform date arithmetic.
Code
a <- as.Date("1971-01-01")
class(a)
[1] "Date"
Code
# days since 1970-01-01
as.numeric(a)
[1] 365
Code
# date arithmetic
a - as.Date("1970/12/31")
Time difference of 1 days
Code
a + 2
[1] "1971-01-03"
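Class Date also works with base helpers such as seq() and difftime():

```r
a <- as.Date("1971-01-01")

# regular sequences of dates
seq(a, by = "month", length.out = 3)  # "1971-01-01" "1971-02-01" "1971-03-01"

# explicit differences, with units spelled out
difftime(a, as.Date("1970-01-01"), units = "days")  # Time difference of 365 days
```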
  • We just need something easier to use to convert strings of various formats to class Date!

Package lubridate

  • tidyverse provides the amusingly named package lubridate to help R users convert their string dates to R Date format more easily as well as functions to process those dates.

  • The Date conversion functions in lubridate accept a wide variety of date formats, removing the need for us to remember all of those format specifications.

  • Instead, just take the letters y, m, and d and place them in the order of the year, month, and day as they appear in the date column. That ordering gives the name of the function that converts the column to Date (e.g. ymd(), mdy(), dmy()).

Code
library(lubridate)

Using lubridate

  • We’ll first load a dataset with various date formats to demonstrate the flexibility of lubridate functions.
Code
library(readr)
d <- read_csv("data/dates.csv")
d
# A tibble: 3 × 5
  fmt1     fmt2             fmt3           fmt4 decision_time             
  <chr>    <chr>            <date>        <dbl> <chr>                     
1 01/15/89 December 8, 2015 2015-02-27 20090101 Approved 10-10-15 07:15:55
2 02/13/92 January 22, 2012 2016-11-15 20080819 Denied 09-27-11 14:57:23  
3 03/15/84 March 3, 2010    2017-12-25 20071011 Approved 04-24-15 02:03:03
  • Currently, the first four columns are stored as chr, chr, date, and dbl.

  • The third column is read in as Date by read_csv() because its particular formatting signals a date.

  • The first pair of columns use the order month-day-year, while the second pair of columns use year-month-day.

  • So, we want mdy() and ymd().

Code
# no format specifications needed
# just ordering y,m, and d
dates <- data.frame(f1=mdy(d$fmt1), f2=mdy(d$fmt2),
                    f3=ymd(d$fmt3), f4=ymd(d$fmt4))
dates
          f1         f2         f3         f4
1 1989-01-15 2015-12-08 2015-02-27 2009-01-01
2 1992-02-13 2012-01-22 2016-11-15 2008-08-19
3 1984-03-15 2010-03-03 2017-12-25 2007-10-11

Date-time variables

  • If your date column additionally contains time information (i.e. is date-time), you can add one or more of h, m, and s to y, m, and d to form the function name to convert the string to class POSIXct.

  • Class POSIXct stores date-time variables as a number representing the number of seconds since the beginning of 1970.

  • The fifth column of our dates dataset, decision_time, is a date-time variable.

  • (It also contains an “Approved/Denied” string at the beginning, which the lubridate parsing function will ignore!)

  • Specifically, the date-time is recorded as month, day, year, hour, minute, second, so we want mdy_hms():

Code
# month day year hour minute second
mdy_hms(d$decision_time)
[1] "2015-10-10 07:15:55 UTC" "2011-09-27 14:57:23 UTC"
[3] "2015-04-24 02:03:03 UTC"
Code
# POSIXct is a standard way of representing calendar time
class(mdy_hms(d$decision_time))
[1] "POSIXct" "POSIXt" 
  • The ‘time zone’ standard UTC (Coordinated Universal Time) is used as the default zone. Use the argument, tz= to assign a zone.
  • To see a list of valid time zone specifications, run OlsonNames().
Code
# we'll use this for our dates variable
dates$decision_time <- mdy_hms(d$decision_time, tz="US/Pacific")
dates$decision_time
[1] "2015-10-10 07:15:55 PDT" "2011-09-27 14:57:23 PDT"
[3] "2015-04-24 02:03:03 PDT"
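Two related lubridate helpers are worth knowing: with_tz() displays the same instant in another zone, while force_tz() keeps the clock time and reassigns the zone (producing a different instant). A sketch:

```r
library(lubridate)

t <- mdy_hms("10-10-15 07:15:55", tz = "US/Pacific")

with_tz(t, "UTC")    # same instant shown in UTC (07:15 PDT is 14:15 UTC)
force_tz(t, "UTC")   # same clock time, now labelled UTC (a different instant)
```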

Extracting information from Date variables

  • lubridate provides several functions to extract specific information from Date variables including:
    • day(): day of the month
    • wday(): weekday
    • yday(): day of the year
    • month(): month of the year
    • year(): year
  • Some examples of extracting information from Date variables.
Code
# we'll use the first column of our dates dataset
dates$f1
[1] "1989-01-15" "1992-02-13" "1984-03-15"
Code
# day of the month
day(dates$f1)
[1] 15 13 15
Code
# day of the year
yday(dates$f1)
[1] 15 44 75
  • Some more date information extraction:
Code
# weekday as numbers
wday(dates$f1)
[1] 1 5 5
Code
# weekday with labels
wday(dates$f1, label=TRUE)
[1] Sun Thu Thu
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
Code
# month of the year
month(dates$f1)
[1] 1 2 3

Extracting information from POSIXct variables

  • lubridate also has functions to extract time information from POSIXct date-time variables.

    • hour()
    • minute()
    • second()
Code
with(dates,  ## with() tells R to look for variables in object "dates"
     data.frame(time=decision_time, h=hour(decision_time),
                m=minute(decision_time), s=second(decision_time)))
                 time  h  m  s
1 2015-10-10 07:15:55  7 15 55
2 2011-09-27 14:57:23 14 57 23
3 2015-04-24 02:03:03  2  3  3
Two sets of functions for date-time arithmetic

  • If you need to add or subtract time from your date variables, lubridate provides two sets of functions.

  • One set gives “intuitive” results, and will ignore conventions like leap year.

  • These include seconds(), minutes(), hours(), days(), weeks(), years().

  • The other set will adhere to those conventions, and are named by adding d to the names of the previous functions, dseconds(), dminutes(), dhours(), ddays(), dweeks(), dyears().

Code
#2016 is a leap year
# the intuitive result
ymd("2015-02-14") + years(2) 
[1] "2017-02-14"
Code
# the exact result
ymd("2015-02-14") + dyears(2)
[1] "2017-02-13 12:00:00 UTC"
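One pitfall with the “intuitive” functions: adding months() can land on a day that doesn't exist, which yields NA. lubridate's %m+% operator rolls back to the last valid day of the month instead:

```r
library(lubridate)

ymd("2015-01-31") + months(1)     # NA: 2015-02-31 does not exist
ymd("2015-01-31") %m+% months(1)  # "2015-02-28": rolled back to month end
```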

Easy data cleaning with the janitor package

Clean column names

  • add some points about
    • count()
    • tabyl()
    • percent (for making quick graphs)

https://www.r-bloggers.com/2024/05/easy-data-cleaning-with-the-janitor-package/