Reshaping data
National Data Management Center for Health (NDMC) at EPHI
tidyr package

| Function | Objective | Arguments |
|---|---|---|
| gather() | Transform the data from wide to long | (data, key, value, na.rm = FALSE) |
| spread() | Transform the data from long to wide | (data, key, value) |
| separate() | Split one variable into two | (data, col, into, sep = " ", remove = TRUE) |
| unite() | Unite two variables into one | (data, col, conc, sep = " ", remove = TRUE) |
gather()
The gather() function transforms the data from wide to long.
Syntax: gather(data, key, value, na.rm = FALSE)
Arguments:
- data: The data frame to reshape
- key: Name of the new column that will hold the old column names
- value: Name of the new column that will hold the cell values
- na.rm: Remove missing values. FALSE by default
Example
Below, we can visualize the concept of reshaping wide to long.
We want to create a single column named growth, filled by the rows of the quarter variables.
library(tidyr)
data <- data.frame( # Create a dataset
Treatment = c("A", "B", "C"),
q1_2021 = c(0.03, 0.05, 0.01),
q2_2021 = c(0.05, 0.07, 0.02),
q3_2021 = c(0.04, 0.05, 0.01),
q4_2021 = c(0.03, 0.02, 0.04),
q1_2022 = c(0.06, 0.07, 0.04),
q2_2022 = c(0.08, 0.07, 0.08),
q3_2022 = c(0.10, 0.08, 0.08),
q4_2022 = c(0.09, 0.09, 0.10))
data
  Treatment q1_2021 q2_2021 q3_2021 q4_2021 q1_2022 q2_2022 q3_2022 q4_2022
1 A 0.03 0.05 0.04 0.03 0.06 0.08 0.10 0.09
2 B 0.05 0.07 0.05 0.02 0.07 0.07 0.08 0.09
3 C 0.01 0.02 0.01 0.04 0.04 0.08 0.08 0.10
With gather(), we create two new variables, quarter and growth, because our original dataset has one grouping variable (Treatment); the remaining columns supply the key-value pairs. The first rows of the result:
   Treatment quarter growth
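A sketch of the gather() call that produces this long table; selecting the quarter columns with the range q1_2021:q4_2022 is one of several equivalent spellings:

```r
library(tidyr)

# Reshape wide to long: the quarter column names become the values of
# the new "quarter" key column, and their cells fill "growth".
long_data <- gather(data, key = "quarter", value = "growth",
                    q1_2021:q4_2022)
head(long_data, 12)
```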
1 A q1_2021 0.03
2 B q1_2021 0.05
3 C q1_2021 0.01
4 A q2_2021 0.05
5 B q2_2021 0.07
6 C q2_2021 0.02
7 A q3_2021 0.04
8 B q3_2021 0.05
9 C q3_2021 0.01
10 A q4_2021 0.03
11 B q4_2021 0.02
12 C q4_2021 0.04
spread()
The spread() function does the opposite of gather(): it transforms the data from long to wide.
Syntax: spread(data, key, value)
Arguments:
- data: The data frame to reshape
- key: The column whose values become the new column names
- value: The column whose values fill the new columns
Example
We can spread the long data back out to recover the original wide format.
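A sketch, assuming the long_data object created in the gather() example:

```r
library(tidyr)

# Reshape long to wide: each distinct value of "quarter" becomes its
# own column, filled with the matching "growth" values.
wide_data <- spread(long_data, key = quarter, value = growth)
wide_data
```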
The separate() function splits a column into two according to a separator.
This function is helpful in some situations where the variable is a date.
Our analysis can require focusing on month and year, and we may want to separate the column into two new variables.
Syntax: separate(data, col, into, sep = " ", remove = TRUE)
Arguments:
- data: The data frame to reshape
- col: The column to split
- into: Names of the new variables to create
- sep: The separator between values, e.g. "_"
- remove: Remove the old column. TRUE by default
Example
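A sketch, again assuming the long_data object from the gather() example; we split quarter (e.g. "q1_2021") on the underscore:

```r
library(tidyr)

# Split "quarter" into a quarter part (Qrt) and a year part
separated <- separate(long_data, col = quarter,
                      into = c("Qrt", "year"), sep = "_")
head(separated)
```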
The result of the separate() call:
   Treatment Qrt year growth
1 A q1 2021 0.03
2 B q1 2021 0.05
3 C q1 2021 0.01
4 A q2 2021 0.05
5 B q2 2021 0.07
6 C q2 2021 0.02
7 A q3 2021 0.04
8 B q3 2021 0.05
9 C q3 2021 0.01
10 A q4 2021 0.03
11 B q4 2021 0.02
12 C q4 2021 0.04
unite()
The unite() function concatenates two columns into one.
Syntax: unite(data, col, conc, sep = " ", remove = TRUE)
Arguments:
- data: The data frame used to reshape the dataset
- col: Name of the new column
- conc: Names of the columns to concatenate
- sep: The symbol used to join the variables, e.g. "-", "_", "&"
- remove: Remove the old columns. TRUE by default
Example
In the above example, we separated quarter from year.
What if we want to merge them back together?
We use the following code:
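The code itself is missing from the notes; a plausible reconstruction, assuming the separated object from the separate() example:

```r
library(tidyr)

# Join Qrt and year back into a single "quarter" column, e.g. "q1_2021"
united <- unite(separated, col = quarter, Qrt, year, sep = "_")
head(united)
```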
Special values in R:
- NA: general missing data
- NaN: stands for "Not a Number"; happens when you do 0/0
- Inf and -Inf: infinity; happens when you divide a positive (or negative) number by 0

Functions to detect them:
- is.na() looks for NaN and NA
- is.nan() looks for NaN
- is.infinite() looks for Inf or -Inf

For an example vector, these return:
is.na: [1] TRUE TRUE FALSE
is.nan: [1] TRUE FALSE FALSE
is.infinite: [1] FALSE FALSE TRUE

any() will be TRUE if ANY element is TRUE, so any(is.na(x)) asks: do we have any NAs in x?

For exploring missingness, the naniar package is a good option. Its pct_complete() function shows the percentage of a data object (vector or data frame) that is complete, and gg_miss_var() creates a nice plot of the number of missing values for each variable (it needs a data frame). Recall that mathematical operations with NA often result in NA.
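The three logical vectors above are consistent with an input like the following (a reconstruction; the original input vector is not shown):

```r
x <- c(NaN, NA, Inf)

is.na(x)        # [1] TRUE  TRUE  FALSE  (NaN also counts as missing)
is.nan(x)       # [1] TRUE  FALSE FALSE
is.infinite(x)  # [1] FALSE FALSE TRUE
any(is.na(x))   # [1] TRUE: at least one element of x is missing
```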
To drop the NA values for a variable, use drop_na(), a function from the tidyr package (you need a data frame to start!). Don't do this unless you have thought about whether dropping NA values makes sense, based on knowing what those values mean in your data.
Recoding values to NA
The na_if() function from dplyr can be helpful for this. Let's say we think that all 0 values should be NA. Functions such as is.na(), any(is.na()), count(), and naniar functions like gg_miss_var() can help determine whether we have NA values.
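A minimal sketch of na_if() on a hypothetical vector where 0 codes for missing:

```r
library(dplyr)

ages <- c(34, 0, 27, 0, 51)  # hypothetical data; 0 means "not recorded"
ages <- na_if(ages, 0)       # replace every 0 with NA
ages                         # 34 NA 27 NA 51
any(is.na(ages))             # TRUE
```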
filter() automatically removes NA values, since it can't confirm or deny that the condition is met (you need | is.na() to keep them)
drop_na() can help you remove NA values from a variable or an entire data frame
NA values can change your calculation results
Think about what NA values represent; don't drop them if you shouldn't
Cleaning data is one of the most essential parts of data analysis.
Data cleaning is the process of converting messy data into reliable data that can be analyzed in R.
Data cleaning improves data quality and your productivity in R.
Format ugly data frame column names in R
Delete all blank rows in R
Remove duplicate rows in R
Converting numeric to date format
Checking consistency by using the tabyl() function
clean_names()
This function is used to change and clean up the names of columns in data frames.
It can be used to ensure consistency.
You can choose to change all names to lower cases, separated by underscores, variations on internal capital letters between words, title case or other styles.
It can also be used to remove parts of names and any special characters, including replacing % symbols with the word percent.
The most recommended style for variable names in R is a single lowercase word without special characters. The clean_names() function, available in the janitor R package, will be used to clean the column names.
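A hedged sketch with a hypothetical messy data frame:

```r
library(janitor)

messy <- data.frame(`Sepal Length` = 1:3,
                    `Petal.Width %` = 4:6,
                    check.names = FALSE)
clean <- clean_names(messy)
names(clean)  # "sepal_length" "petal_width_percent"
```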
remove_empty() & remove_constant()
The remove_empty() function, available in the janitor R package, removes both empty rows and empty columns; remove_constant() removes columns that hold a single constant value. Applied to the iris data (after clean_names()):
  sepal_length sepal_width petal_length petal_width species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
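A minimal sketch of remove_empty() and remove_constant() on a small hypothetical data frame rather than iris:

```r
library(janitor)

df <- data.frame(a = c(1, NA, 3),
                 b = c(NA, NA, NA),  # entirely empty column
                 c = c(7, NA, 9))    # row 2 is entirely NA

remove_empty(df, which = c("rows", "cols"))  # drops row 2 and column b

df2 <- data.frame(id = 1:3, site = "A")     # site never varies
remove_constant(df2)                        # keeps only id
```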
get_dupes()
This function retrieves any duplicates in the dataset so that they can be examined during data clean-up operations.
One trick is determining if a duplicate is indeed a duplicate.
The function returns a data frame which includes a dupe_count column containing the number of duplicates of that value.
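A sketch of the call that could produce the table below, assuming the iris data with janitor-cleaned names and duplicates judged on sepal_length and sepal_width:

```r
library(janitor)
library(dplyr)

# Rows sharing the same sepal_length and sepal_width, with a dupe_count
iris %>%
  clean_names() %>%
  get_dupes(sepal_length, sepal_width)
```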
sepal_length sepal_width dupe_count petal_length petal_width species
1 5.8 2.7 4 4.1 1.0 versicolor
2 5.8 2.7 4 3.9 1.2 versicolor
3 5.8 2.7 4 5.1 1.9 virginica
4 5.8 2.7 4 5.1 1.9 virginica
5 5.1 3.8 3 1.5 0.3 setosa
6 5.1 3.8 3 1.9 0.4 setosa
We use the distinct() function, available in the dplyr R package, to remove the duplicate rows. If we want to remove duplicate rows with respect to a specific variable, we can use distinct() again with that variable. For example, we remove duplicate rows with respect to petal_length.
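A sketch, assuming the cleaned iris data:

```r
library(dplyr)
library(janitor)

iris_clean <- clean_names(iris)

distinct(iris_clean)  # drop rows that are duplicated in full

# keep one row per petal_length value, retaining the other columns
distinct(iris_clean, petal_length, .keep_all = TRUE)
```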
sepal_length sepal_width petal_length petal_width species
1 5.1 3.5 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.4 3.9 1.7 0.4 setosa
5 4.8 3.4 1.6 0.2 setosa
6 4.3 3.0 1.1 0.1 setosa
tabyl()
This function is used to produce frequency tables and contingency tables, i.e. counts of each category or combination of categories of data.
Unlike the base R table() function, tabyl() returns a data frame which makes results easier to work with.
The code below creates a data frame showing the number of rows of data (n) for each location in the dataset.
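The snippet itself is missing from the notes; a sketch with a hypothetical data frame and location column:

```r
library(janitor)

# hypothetical data
sites <- data.frame(location = c("Addis Ababa", "Adama", "Addis Ababa",
                                 "Hawassa", "Adama", "Addis Ababa"))

tabyl(sites, location)  # one row per location, with n and percent columns
```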
convert_to_date()
A modern Excel always tries to automate things, and I hate it!
For instance you write a number into a cell and it sometimes immediately converts it into date. Then you try to have a date in a cell, and it returns a number.
Moreover, Excel also has some strange date encoding systems, which can be confused with a normal numeric columns.
Luckily, our dirty dataset has a “date” word in the name of a column “hire_date”, otherwise we wouldn’t know that it is a date:
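The snippet is missing here; a sketch, assuming hire_date holds Excel serial numbers:

```r
library(janitor)

hire_date <- c(42223, 42224)  # Excel serial numbers, not dates
convert_to_date(hire_date)
# [1] "2015-08-07" "2015-08-08"
```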
excel_numeric_to_date()
Ever load data from Excel and see a value like 42223 where a date should be?
This function converts those serial numbers to class Date, with options for different Excel date encoding systems, preserving fractions of a date as time (in which case the returned value is of class POSIXlt), and specifying a time zone.
Building on excel_numeric_to_date(), the new functions convert_to_date() and convert_to_datetime() are more robust to a mix of inputs.
Handy when reading many spreadsheets that should have the same column formats, but don’t.
For instance, here a vector with a date and an Excel datetime sees both values successfully converted to Date class:
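A sketch of such a mixed vector, adapted from the example values in the janitor documentation:

```r
library(janitor)

mixed <- c("2020-02-29", "40000.1")  # a date string and an Excel datetime
convert_to_date(mixed)
# [1] "2020-02-29" "2009-07-06"
```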
In R, we want to convert our string dates to R class Date, which preserves the dates’ numeric values and allows us to take advantage of the many date-related functions in R.
Once converted to class Date, the numeric value for the date represents the number of days since January 1, 1970 (1970-01-01).
as.Date()
We can use as.Date() to convert strings to dates, but it can be intimidating and unwieldy to use, particularly if your dates are not stored in one of the 2 default formats (see ?strptime for more about date formats).
[1] "Date"
[1] 365
Time difference of 1 days
[1] "1971-01-03"
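The code that produced the four outputs above is not shown; calls like these would reproduce them (a reconstruction):

```r
class(as.Date("1970-01-01"))                   # [1] "Date"
as.numeric(as.Date("1971-01-01"))              # [1] 365 (days since 1970-01-01)
as.Date("1971-01-01") - as.Date("1970-12-31")  # Time difference of 1 days
as.Date("1971-01-01") + 2                      # [1] "1971-01-03"
```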
tidyverse provides the amusingly named package lubridate to help R users convert their string dates to R Date format more easily as well as functions to process those dates.
The Date conversion functions in lubridate accept a wide variety of date formats, removing the need for us to remember all of those format specifications.
Instead, just take the letters y, m, and d and place them in the order of the year, month, and day, respectively, as they are stored in the date column. That ordering produces the name of the function that converts the column to Date (e.g. ymd(), mdy(), dmy()).
# A tibble: 3 × 5
fmt1 fmt2 fmt3 fmt4 decision_time
<chr> <chr> <date> <dbl> <chr>
1 01/15/89 December 8, 2015 2015-02-27 20090101 Approved 10-10-15 07:15:55
2 02/13/92 January 22, 2012 2016-11-15 20080819 Denied 09-27-11 14:57:23
3 03/15/84 March 3, 2010 2017-12-25 20071011 Approved 04-24-15 02:03:03
Currently, the first 4 columns are stored as chr, chr, date, and int.
The third column is actually read in as Date by read_csv(), because its particular formatting signals a date to read_csv().
The first pair of columns use the order month-day-year, while the second pair of columns use year-month-day.
So, we want mdy() and ymd().
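A sketch of the conversions, assuming the tibble above is named dates (an assumption) and renaming the results f1 to f4 as in the output below:

```r
library(dplyr)
library(lubridate)

dates_clean <- dates %>%
  transmute(f1 = mdy(fmt1),  # "01/15/89"         : month-day-year
            f2 = mdy(fmt2),  # "December 8, 2015" : month-day-year
            f3 = fmt3,       # already read in as class Date
            f4 = ymd(fmt4))  # 20090101           : year-month-day
```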
f1 f2 f3 f4
1 1989-01-15 2015-12-08 2015-02-27 2009-01-01
2 1992-02-13 2012-01-22 2016-11-15 2008-08-19
3 1984-03-15 2010-03-03 2017-12-25 2007-10-11
If your date column additionally contains time information (i.e. is date-time), you can add one or more of h, m, and s to y, m, and d to form the function name to convert the string to class POSIXct.
Class POSIXct stores date-time variables as a number representing the number of seconds since the beginning of 1970.
The fifth column of our dates dataset, decision_time, is a date-time variable.
(It also contains an "Approved/Denied" string at the beginning, which the lubridate parser will ignore!)
Specifically, the date-time is recorded as month, day, year, hour, minute, second, so we want mdy_hms():
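A sketch, again assuming the tibble is named dates:

```r
library(lubridate)

# e.g. "Approved 10-10-15 07:15:55" -> month, day, year, hour, min, sec
decision <- mdy_hms(dates$decision_time)
class(decision)
# [1] "POSIXct" "POSIXt"
```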
[1] "POSIXct" "POSIXt"
UTC (Coordinated Universal Time) is used as the default time zone. Use the tz= argument to assign a zone; see OlsonNames() for the valid zone names.
lubridate functions to extract date information:
- day(): day of the month
- wday(): weekday
- yday(): day of the year
- month(): month of the year
- year(): year
lubridate also has functions to extract time information from POSIXct date-time variables:
- hour()
- minute()
- second()
Two sets of functions for date-time arithmetic
If you need to add or subtract time to your date variables, lubridate provides 2 sets of functions. One set, the periods, gives "intuitive" calendar results and respects conventions like leap years. These include seconds(), minutes(), hours(), days(), weeks(), years(). The other set, the durations, measures exact elapsed time and ignores those conventions; they are named by adding d to the names of the previous functions: dseconds(), dminutes(), dhours(), ddays(), dweeks(), dyears().
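A small example of the difference, using a leap year so the two sets disagree:

```r
library(lubridate)

start <- ymd("2020-01-01")  # 2020 is a leap year (366 days)

start + years(1)   # period:   "2021-01-01", one calendar year later
start + dyears(1)  # duration: exactly 365.25 days (31,557,600 s) later
```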
Easy data cleaning with the janitor package
https://www.r-bloggers.com/2024/05/easy-data-cleaning-with-the-janitor-package/