Introduction to R and RStudio
National Data Management Center for Health (NDMC) at EPHI
Take a few minutes to introduce ourselves.
Please share …
01. Module 1
02. Module 2
03. Module 3
04. Module 4
Be able to set up and use R & RStudio
Navigating R & Rmarkdown scripts and RStudio projects
Basic operations in R/RStudio
Be able to import and inspect data sets in R
Understand data structures
Know how to get help
Manipulating data through filtering, summarizing, transforming, and joining
Visualizing data using the renowned ggplot2 package
It is free, versatile, fast, and modern.
It has a large and friendly community of users.
With more than 22,000 add-on packages available, R offers more functions for data analysis than any other statistical software.
R makes it easy to construct reproducible analyses and workflows that allow you to easily repeat the same analysis more than once.
It is flexible enough to be used to create interactive web pages (e.g. my draft website) and automated reports.
Arguably one of the best tools currently available for data analysis.
Advantages
Drawbacks of R
Best Integrated Development Environment (IDE) for R.
Powerful and makes using R easier
RStudio can:
User-friendly interfaces
R is like a car’s engine
RStudio is like a car’s dashboard
Go to cran.rstudio.com to access the R installation page. Then click the download link for Windows:
Choose the base sub-directory.
Then click on the download link at the top of the page to download the latest version of R:
To open a new script, go to File -> New File -> R Script, or click on the icon with the + sign and select R Script, or simply press Ctrl+Shift+N.
The size and position of the panes can be customized.
On the top right of each pane, there are buttons to adjust the pane size.
Also, place your mouse pointer/cursor on the borderline between panes and when the pointer changes its shape, click and drag to adjust the pane size.
For more options, go to View > Panes on the menu bar.
Alternatively, try Tools > Global Options > Pane Layout.
The overall appearance can be customized as well.
Go to Tools > Global Options > Appearance on the menu bar to change themes, fonts, and more.
By default, RStudio is arranged into four window panes.
If you only see three panes, open a new script with File > New File > R Script. This should reveal one more pane.
Before we go any further, we will rearrange these panes to improve the usability of the interface.
In the RStudio menu, select Tools > Global Options.
Then, under Pane Layout, adjust the pane arrangement. The arrangement we recommend is as follows.
At the top left pane is the Source tab, and at the top right pane, you should have the Console tab.
Then at the bottom left pane, no tab options should be checked; this section should be left empty, with the drop-down saying just “TabSet”.
Finally, at the bottom right pane, you should check the following tabs: Environment, History, Files, Plots, Packages, Help and Viewer.
First, open a new script under the File menu if one is not yet open: File > New File > R Script. In the script, type the following:
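The lines originally shown at this point are not preserved in this text. As a stand-in, any two short statements will do; the object names `a` and `b` below are purely illustrative.

```r
# Two hypothetical lines to practise running code from a script
a <- 2 + 2      # stores the result of a calculation in an object called a
b <- a * 10     # uses the object created on the previous line
```

Running only the second line before the first gives an "object 'a' not found" error, which is a quick way to see that the Console executes exactly what you send to it.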
To run code, place your cursor anywhere in the code, then hit Control + Enter on Windows.
This should send the code to the Console and run it.
You can also run multiple lines at once.
Now drag your cursor to highlight both lines and press Control + Enter.
To run the entire script, you can use Control + A to select all code, then press Control + Enter.
To put the window back, click on the same button on the now-external window.
Next, save the script. Hit Control + S to bring up the Save dialog box.
The console, at the bottom left, is where code is executed. You can type code directly here, but it will not be saved.
Type a random piece of code (maybe a calculation like 3 + 3) and press ‘Enter’.
If you place your cursor on the last line of the console, and you press the up arrow, you can go back to the last code that was run. Keep pressing it to cycle to the previous lines.
To run any of these previous lines, press Enter.
At the top right of the RStudio Window, you should see the Environment tab.
The Environment tab shows datasets and other objects that are loaded into R’s working memory, or “workspace”.
To explore this tab, let’s import a dataset into your environment from R.
Type the code below into your script and run it:
You have now imported the dataset and stored it in an object named data. (You could have named the object anything you want.)
Click on the data dataset from the Environment tab. This opens it in a ‘View’ tab.
You can remove an object from the workspace with the rm() function.
Notice that the data object no longer shows up in your environment after having run that code.
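The exact import code used on the original slide is not preserved here. A minimal sketch of the same steps, assuming the built-in `women` dataset as a stand-in (the helper names `n_rows` and `still_there` are illustrative only):

```r
# Store a built-in dataset in an object named data (stand-in for the original import)
data <- women
n_rows <- nrow(data)          # number of rows now held in the workspace

# Open the spreadsheet-style viewer; only meaningful in an interactive RStudio session
if (interactive()) View(data)

# Remove the object from the workspace
rm(data)
still_there <- "data" %in% ls()   # FALSE once the object has been removed
```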
You can click a line to highlight it, then send it to the console or to your script with the “To Console” and “To Source” icons at the top of this tab.
To select multiple lines, use the “Shift-click” method: click the first item you want to select, then hold down the “Shift” key and click the last item you want to select.
Finally, notice that there is a search bar at the top right of the History pane where you can search for past commands that you have run.
The Files tab allows you to interact with your computer’s file system.
Try playing with some of the buttons here, to see what they do. You should try at least the following:
Make a new folder
Delete that folder
Make a new R Script
Rename that script
That code creates a plot of the two variables in the women dataset.
You should see this figure in the Plots tab.
Now, test out the buttons at the top of this tab to explore what they do.
In particular, try to export a plot to your computer.
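The plotting code itself is not preserved in this text. A minimal sketch that produces a plot of the two variables in `women`, and exports it from code rather than with the Export button, could look like this (the file name `women_plot.pdf` and the axis labels are assumptions):

```r
# Scatter plot of the two variables in the built-in women dataset
plot(women)

# Export the same plot to a PDF file in a temporary folder (hypothetical file name)
pdf_path <- file.path(tempdir(), "women_plot.pdf")
pdf(pdf_path, width = 6, height = 4)      # open a PDF graphics device
plot(women$height, women$weight,
     xlab = "Height", ylab = "Weight")
dev.off()                                 # close the device so the file is written
```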
Packages are collections of R code that extend the functionality of R.
It is important to know that to use a package, you need to install it and then load it.
Packages need to be installed only once, but must be loaded in each new R session.
All the package names you see (in blue font) are packages that are installed on your system. And packages with a checkmark are packages which are loaded in the current session.
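The install-once, load-every-session rule can be sketched as follows. The install line is commented out because it needs an internet connection; `splines` is used for loading only because it ships with R, so the example runs on any installation.

```r
# Install once per machine (needs an internet connection):
# install.packages("ggplot2")

# Load in every new R session before using the package.
# "splines" ships with R, so this line runs without installing anything.
library(splines)

# search() lists what is currently attached; a loaded package appears as "package:<name>"
loaded <- "package:splines" %in% search()
```

In the Packages tab, ticking the box next to a package name runs the same `library()` call for you.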
Best resource showing how to adjust panels, personalize how RStudio looks, etc.
1. Base Packages: Providing the basic functionality, maintained by the R Core Development group. Currently, there are 14 packages, these are
[1] "base" "compiler" "datasets" "graphics" "grDevices" "grid"
[7] "methods" "parallel" "splines" "stats" "stats4" "tcltk"
[13] "tools" "utils"
2. Recommended Packages: also installed by default, mainly providing additional, more complex statistical procedures. There are 15 of these packages.
3. Contributed packages: Due to the open nature of R, anyone can contribute new packages at any time.
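You can check these groups on your own machine. A minimal sketch, assuming a standard R installation (the exact counts depend on your R version and on how R was installed; the object names are illustrative):

```r
# Packages flagged with priority "base" and "recommended" in the local library
base_pkgs        <- rownames(installed.packages(priority = "base"))
recommended_pkgs <- rownames(installed.packages(priority = "recommended"))

base_pkgs                  # names of the base packages
length(recommended_pkgs)   # how many recommended packages are installed here
```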
Go to CRAN and download new version
More efficient (Windows): install the installr package, load it, and run updateR() from the basic Rgui
Version should update automatically in RStudio
Then update the R packages with the code:
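The code for this step is not preserved in this text. One common way to do it in base R is `update.packages()`; the sketch below wraps it in `interactive()` as an assumption, so that nothing is downloaded when the script is run in batch mode.

```r
# Update all installed packages without being asked about each one.
# checkBuilt = TRUE also reinstalls packages built under an older R version.
if (interactive()) {
  update.packages(ask = FALSE, checkBuilt = TRUE)
}
```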
To update RStudio: Go to RStudio and download the new version
Click on Help > Check for Updates, follow menu prompts
Notice that the histogram above shows up in a Viewer tab. This tab allows you to preview HTML files and interactive objects.
Lastly, the Help tab shows the documentation for different R objects. Try typing out and running each line below to see what this documentation looks like.
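The lines originally shown here are not preserved. As a stand-in, these calls open documentation for functions that appear elsewhere in this material; which functions to look up is of course your choice.

```r
# Open the documentation page for a function
?hist
help("read.csv")     # the same request written as a function call

# Show just the argument list of a function
args(read.csv)
```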
RStudio has a number of useful options for changing its look and functionality. Let’s try these.
You may not understand all the changes made for now. That’s fine.
In the RStudio menu at the top of the screen, select Tools > Global Options to bring up RStudio’s options.
Now, under Appearance, choose your ideal theme. (We like the “Crimson Editor” and “Tomorrow Night” themes.)
Under Code > Display, check “Highlight R function calls”.
What this does is give your R functions a unique color, improving readability.
Also under Code > Display, check “Rainbow parentheses”.
What this does is make your “nested parentheses” easier to read by giving each pair a unique color.
Finally, under General > Basic, uncheck the box that says “Restore .RData into workspace at startup”.
You don’t want to restore any data to your workspace (or environment) when you start RStudio.
Starting with a clean workspace each time is less likely to lead to errors.
This also means that you never want to “save your workspace to .RData on exit”, so set this to Never.
The RStudio command palette gives instant, searchable access to many of the RStudio menu options and settings that we have seen so far.
The palette can be invoked with the keyboard shortcut Ctrl + Shift + P.
It’s also available on the Tools menu (Tools -> Show Command Palette).
Try using it to:
Create a new script (Search “new script” and click on the relevant option)
Rename a script (Search “rename” and click on the relevant option)
The workspace is a working environment where R will store and remember user-defined objects: vectors, matrices, data frames, lists, variables, etc.
Use projects to keep everything together
Create an RStudio project for each data analysis project, for each homework assignment, etc.
A project is associated with a directory (folder)
Only use relative paths, never absolute paths
Good: read.csv("data/mydata.csv")
Avoid: read.csv("/home/yourname/Documents/stuff/mydata.csv")
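To see why relative paths travel well, the sketch below builds a throwaway project folder in a temporary directory and reads a file from it with the relative path from the example above. The folder name `my_project` is hypothetical, and `setwd()` is used here only to imitate what opening an RStudio project does for you.

```r
# Build a small, hypothetical project folder with a data/ sub-folder
project_dir <- file.path(tempdir(), "my_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
write.csv(women, file.path(project_dir, "data", "mydata.csv"), row.names = FALSE)

# Opening an .Rproj file sets the working directory to the project folder;
# setwd() imitates that here and returns the previous directory
old_wd <- setwd(project_dir)

# The relative path works no matter where the project folder sits on disk
mydata <- read.csv("data/mydata.csv")

setwd(old_wd)   # restore the previous working directory
```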
Advantages of using projects
File > New Project and then choose: New Directory > Name for the directory > Click on Create Project
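In R, values are stored in objects.
Assign a value with variable <- value or variable = value or value -> variable.
An object name cannot start with a digit, or with a dot followed by a digit (e.g. .4you is illegal).
The assignment operator is <- (Alt + -), which can be read as assign or place into or read in, etc.
A vector is a set of scalars arranged in a one-dimensional array.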
All values in a vector share the same mode (data type), but a vector can hold any mode.
Vectors can be created using the following functions:
c() function to combine individual values: x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
seq() to create more complex sequences: seq(from=1, to=10, by=2) or seq(1, 10)
rep() to create replicates of values: rep(1:4, times=2, each=2)
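The three functions above, together with the different assignment forms, can be tried out as follows. Object names other than `x` are illustrative.

```r
# Three ways to assign; <- is the conventional one
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
y = 3
4 -> z                     # rightward assignment: the value comes first

# Sequences
odd  <- seq(from = 1, to = 10, by = 2)   # 1 3 5 7 9
ints <- seq(1, 10)                       # 1 2 3 ... 10

# Replicates: each element twice, then the whole pattern twice
reps <- rep(1:4, times = 2, each = 2)    # 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4

# A vector has a single mode: mixing numbers and text turns everything into text
mixed <- c(1, "a")
class(mixed)                             # "character"
```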
class(x): returns the class/type of vector x
length(x): returns the total number of elements
x[length(x)]: returns the last value of vector x
rev(x): returns the reversed vector
sort(x): returns the sorted vector
unique(x): returns the vector without duplicate elements
range(x): range (minimum and maximum) of x
quantile(x): quantiles of x for the given probabilities
which.max(x): index of the maximum
which.min(x): index of the minimum
Factors are created with factor(); their categories are listed with levels().
Matrices are created with matrix().
dim() can also be used to retrieve the dimensions of an object!
Assign names to rows and columns of a matrix with rownames() and colnames().
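A short run through the functions listed above, followed by a factor and a matrix. The factor values and the row and column names are made up for illustration.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

class(x)          # "numeric"
length(x)         # 5
x[length(x)]      # 21.7, the last value
rev(x)            # reversed order
sort(x)           # 3.1 5.6 6.4 10.4 21.7
unique(c(1, 2, 2, 3))                     # 1 2 3
range(x)          # 3.1 21.7
quantile(x, probs = c(0.25, 0.5, 0.75))   # quartiles
which.max(x)      # 5
which.min(x)      # 3

# A factor stores categories; levels() lists them in sorted order
sex <- factor(c("female", "male", "female"))
levels(sex)       # "female" "male"

# A matrix is filled column by column unless byrow = TRUE is given
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)            # 2 3
rownames(m) <- c("r1", "r2")
colnames(m) <- c("a", "b", "c")
m["r2", "b"]      # 4
```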
A data set in R is stored as a data frame.
Two-dimensional, arranged in rows and columns created using the function: data.frame()
e.g.
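The original example is not preserved in this text. A minimal sketch with made-up values (the object name `patients` and its columns are hypothetical):

```r
# Each argument becomes a column; all columns must have the same length
patients <- data.frame(
  id  = 1:4,
  age = c(34, 51, 27, 45),
  sex = factor(c("F", "M", "F", "M"))
)

patients        # print the data frame
dim(patients)   # 4 rows, 3 columns
```

Unlike a matrix, each column of a data frame can have its own type: here an integer, a numeric and a factor column.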
Use head() and tail() to view the first (and last) six rows
Use View() to view an entire data frame
Use str() to view the structure of a data frame
Use tables() (from the data.table package) to show all loaded data.table objects
Use colnames() or names() to look at variable names
Use colSums(is.na()) to count missing values in each column
Use subset() to subset data
Use attributes() to look at the attributes of the data frame
Use dim() or ncol() and nrow() to see the dimensions of the data frame
Use summary() to see basic statistics for each variable
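Most of these functions can be tried on the built-in `iris` data frame. The subset condition below is only an illustration.

```r
head(iris)                 # first six rows
tail(iris, 3)              # last three rows
str(iris)                  # structure: column types and first values
names(iris)                # variable names
colSums(is.na(iris))       # number of missing values in each column
dim(iris)                  # 150 rows, 5 columns
summary(iris)              # basic statistics for each variable

# Keep only some rows and columns
setosa_long <- subset(iris, Species == "setosa" & Sepal.Length > 5.5,
                      select = c(Sepal.Length, Species))
```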
iris[] # the whole data frame
iris[1, 1] # 1st element in the 1st column
iris[1, 5] # 1st element in the 5th (last) column; iris has only 5 columns
iris[, 1] # first column, returned as a vector
iris[1] # first column, returned as a one-column data frame
iris[1:3, 3] # rows 1 to 3 of the 3rd column
iris[3, ] # the 3rd row
iris[1:6, ] # the 1st to 6th rows
iris[c(1,4), ] # rows 1 and 4 only
iris[c(1,4), c(1,3)] # rows 1 and 4, columns 1 and 3
iris[, -1] # everything except the first column
iris$Sepal.Length # also extracts the first column, by name
iris[, c("Petal.Length", "Petal.Width")] # extract columns by name
Importing data into R is usually easy, although the details depend on the nature of the data and the format it comes from.
Most data are in tabular form such as a spreadsheet or a comma-separated file (.csv).
Base R has a series of read functions to import tabular data from plain text files with columns delimited by: space, tab, and comma, with or without a header containing the column names.
With an added package it is also possible to import directly from a Microsoft Excel spreadsheet format or other foreign formats from various sources.
In base R the standard commands to read text files are based on the read.table() function.
The following table lists the collection of the base R read functions.
For more details use the help command help(read.table), which displays help for all of these functions.
Function name | Assumes header | Separator | Decimal | File type |
---|---|---|---|---|
read.table() | No | "" (white space) | . | .txt |
read.csv() | Yes | "," | . | .csv |
read.csv2() | Yes | ";" | , | .csv |
read.delim() | Yes | "\t" (tab) | . | .txt |
read.delim2() | Yes | "\t" (tab) | , | .txt |
Text files can be read using read.table() and comma separated files using read.csv() functions.
The file.choose() function can be used to select the file interactively.
Useful arguments - Check these arguments carefully when you load your data
header = TRUE argument tells R that the first row of your file contains the variable names
sep = "," argument tells R that fields are separated by commas
strip.white = TRUE argument removes white space before or after values that was mistakenly inserted during data entry.
na.strings = " " argument replaces empty cells by NA (missing data in R)
Stata to R: Different packages for Stata version >=13 vs. <13
Excel to R: There are several packages
CSV to R: There are several packages
Text file to R: Available in R base, used for text and csv files
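The arguments described above can be seen at work on a small file. The sketch below writes a made-up CSV to a temporary file and reads it back with base R; the file contents and object names are hypothetical. Because `strip.white = TRUE` removes stray spaces before missing values are matched, a cell holding only a space ends up empty, which is why `""` is listed in `na.strings` here.

```r
# Write a small, made-up CSV with stray spaces and one empty cell
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,region,weight",
             "1, Addis Ababa ,61.5",
             "2,Oromia, ",
             "3, Amhara,58.2"),
           csv_path)

dat <- read.csv(csv_path,
                header      = TRUE,          # first row holds the variable names
                sep         = ",",           # fields are separated by commas
                strip.white = TRUE,          # drop spaces around each value
                na.strings  = c("", "NA"))   # treat empty cells as missing

dat
str(dat)
```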
The data import wizard is a quick and easy way to import your data
It’s actually way better to follow the reproducible steps – and hardly any more effort – below…
The data import wizard will help you find the proper package for importing your data. For example, use…
library(readxl) for Excel data
library(haven) for SPSS, SAS, Stata
library(readr) for CSV or other delimiters
Just start with File > Import Dataset to get started composing that code, then paste your code into a script.
R to Stata: Use the libraries haven or readstata13
R to Excel: Note the package readxl does not work here
R to csv: Use readr package
R to a text file:
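For CSV and plain text files, base R can also do the export without any extra package. A minimal sketch using `write.csv()` and `write.table()` (base R functions, shown here instead of readr as an assumption; the output files go to a temporary folder):

```r
out_csv <- tempfile(fileext = ".csv")
out_txt <- tempfile(fileext = ".txt")

# Comma-separated file, without the row numbers as an extra column
write.csv(women, out_csv, row.names = FALSE)

# Tab-delimited text file
write.table(women, out_txt, sep = "\t", row.names = FALSE, quote = FALSE)

# Read the text file back to confirm the round trip
back <- read.delim(out_txt)
```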