Foundations: what data visualization is and what to ask before you plot
The grammar of graphics: the building blocks of ggplot2
Hands-on: building a plot layer by layer with the palmerpenguins data
Customization: axes, titles, legends, themes, and backgrounds
Color: palettes for discrete and continuous data, colorblind-friendly choices
Annotation: reference lines, text, and labels
Coordinates, scales, and smoothing
Facets and multi-panel figures
A gallery of geoms: bar, histogram, boxplot, violin, line
Interactive graphics and saving your plots
From beginner to advanced: principles, chart choice, and a guided learning path
1. Foundations
What is data visualization?
Data visualization is the presentation of data in a pictorial or graphical format; a data visualization tool is the software that generates this presentation.
Effective data visualization gives users intuitive means to:
interactively explore and analyze data,
identify interesting patterns,
infer correlations and causalities, and
support sense-making activities.
Good visual presentation enhances the message: the same data can inform or mislead depending on how it is shown.
Ask before you plot
The effectiveness of a visualization depends on a few key questions:
What would you like to communicate?
Who is your audience? Researchers? Journalists? The general public? Grant reviewers?
How is your message best represented?
Is it through a box plot or a scatter plot?
Should you use blue or red?
What scale should you use?
Should you add or remove information?
These choices drive the key principles, methods, and concepts needed to visualize data for publications, reports, or presentations.
ggplot2 is a system for declaratively creating graphics, based on the Grammar of Graphics: a grammar used to describe and create a wide range of statistical graphics.
You provide the data, tell ggplot2 how to map variables to aesthetics, and which graphical primitives to use; it takes care of the details.
Graphs are composed of layers, so it is easy to add elements to existing graphs.
Plots are easy to manage, reproduce, and save.
Less work is needed to make beautiful, eye-catching, publication-quality graphics.
The building blocks of a ggplot
Data: the raw data that you want to plot.
Geometries (geom_*()): the geometric shapes that represent the data.
Aesthetics (aes()): properties of the geometric and statistical objects, such as position, color, size, shape, and transparency.
Scales (scale_*()): maps between the data and the aesthetic dimensions, such as data range to plot width or factor values to colors.
Statistical transformations (stat_*()): statistical summaries of the data, such as quantiles, fitted curves, and sums.
Coordinate system (coord_*()): the transformation used for mapping data coordinates into the plane of the data rectangle.
Facets (facet_*()): the arrangement of the data into a grid of plots.
Visual themes (theme()): the overall visual defaults of a plot, such as background, grids, axes, default typeface, sizes, and colors.
Components of the layered grammar
Layer
Data
Mapping
Statistical transformation (stat)
Geometric object (geom)
Position adjustment (position)
Scale
Coordinate system (coord)
Faceting (facet)
“Source: BloggoType”
Data and aesthetics
Code
# data and aesthetics
ggplot(data, mapping = aes(x, y, ...))
We will practice with the palmerpenguins data set: size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.
This data set is often used to replace the iris data set, which has some problems for teaching data science, including its ties to eugenics.
Code
library(palmerpenguins)
data(penguins)
species, island, and sex are factor variables.
The bill measurements (bill_length_mm and bill_depth_mm) are numeric variables.
Flipper length and body mass are integer variables.
ggplot2 requires the data as an object of class data.frame or tibble (common in the tidyverse). More complex plots also require the long data frame format.
ggplot(data = penguins, aes(x = bill_length_mm, y = bill_depth_mm))
Code
ggplot(data = penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()
Code
ggplot(data = penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() +
  facet_wrap(~species) +
  coord_trans(x = "log10", y = "log10")
Code
ggplot(data = penguins,                   # Data
       aes(x = bill_length_mm,            # Your X-value
           y = bill_depth_mm,             # Your Y-value
           col = species)) +              # Aesthetics
  geom_point(size = 5, alpha = 0.8) +     # Point
  geom_smooth(method = "lm")              # Linear regression
A reusable base plot
To keep the customization examples short, we store two base plots and add layers to them throughout the rest of the training:
Code
# color mapped to a categorical variable (species)
p <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  labs(x = "Bill length (mm)", y = "Bill depth (mm)")

# color mapped to a continuous variable (body mass)
p2 <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = body_mass_g)) +
  geom_point() +
  labs(x = "Bill length (mm)", y = "Bill depth (mm)")
Every slide that follows builds on p (discrete color) or p2 (continuous color), so you can focus on the one new function being demonstrated.
4. Customization: axes, titles, legends, themes
Anatomy of a themed plot
Customizing a plot means adjusting its elements to improve readability, presentation, and informativeness. Titles and axis text can change in size, color, and font face; element_text() inside theme() sets these properties:
Code
p + theme(axis.title = element_text(size = 15, face = "italic"))
The face argument can be plain, bold, italic, or bold.italic:
Code
p + theme(axis.title = element_text(color = "sienna", size = 15, face = "bold"),
          axis.title.y = element_text(face = "bold.italic"))
vjust controls the vertical alignment, typically ranging between 0 and 1, but it can extend beyond this range:
Code
p + theme(axis.title.x = element_text(vjust = 0, size = 15),
          axis.title.y = element_text(vjust = 2, size = 15))
Use margin() with parameters t (top) and r (right) to add distance between the axis and its title. For the y-axis, change the right margin, not the bottom margin:
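The margin adjustment described above might look like the following sketch. The base plot p is rebuilt here so the snippet runs on its own, and the 10-point margins are illustrative values chosen for this example, not settings from the original slides.

```r
library(ggplot2)
library(palmerpenguins)

# Rebuild the base plot p so this snippet is self-contained
p <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  labs(x = "Bill length (mm)", y = "Bill depth (mm)")

# Illustrative values: 10 pt above the x-axis title (t = top),
# 10 pt to the right of the y-axis title (r = right)
p_margin <- p +
  theme(
    axis.title.x = element_text(margin = margin(t = 10), size = 15),
    axis.title.y = element_text(margin = margin(r = 10), size = 15)
  )
p_margin
```

margin() takes its values in the order top, right, bottom, left, so naming the arguments (t, r, b, l) keeps the intent readable.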
axis.text (and axis.text.x / axis.text.y) modify the appearance of the axis tick labels:
Code
p + theme(axis.text = element_text(color = "dodgerblue", size = 12),
          axis.text.x = element_text(face = "italic"))
angle, hjust, and vjust rotate and position any text element (for unrotated text, hjust: 0 = left, 1 = right; vjust: 0 = bottom, 1 = top):
Code
p + theme(axis.text.x = element_text(angle = 50, vjust = 1, hjust = 1, size = 12))
element_blank() removes axis text and ticks entirely:
Code
p + theme(axis.ticks.y = element_blank(),
          axis.text.y = element_blank())
Remove axis titles by setting them to NULL or empty quotes in labs():
Code
p + labs(x = NULL, y = "")
Tip: NULL removes the element, while empty quotes "" keep the space for the axis title but print nothing.
Axis limits
xlim() and ylim() limit the axis range:
Code
p + ylim(c(0, 20))
Alternatively, use scale_y_continuous(limits = c(0, 20)) or coord_cartesian(ylim = c(0, 20)). The former removes data points outside the range, while the latter zooms without removing data points.
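The difference between the two approaches can be checked on the built plot data. The sketch below uses a small made-up data frame (toy, not part of the training data) with one value far outside the limits, and counts how many y values each approach turns into NA.

```r
library(ggplot2)

# Hypothetical toy data: nine values inside the range 0 to 20 and one far outside it
toy <- data.frame(x = 1:10, y = c(2, 4, 6, 8, 10, 12, 14, 16, 18, 100))

# Scale limits: the out-of-range value is censored (set to NA) before drawing
p_scale <- ggplot(toy, aes(x, y)) +
  geom_point() +
  scale_y_continuous(limits = c(0, 20))

# Coordinate limits: all values are kept, only the visible window changes
p_coord <- ggplot(toy, aes(x, y)) +
  geom_point() +
  coord_cartesian(ylim = c(0, 20))

# Count the y values that became NA in the built plot data
n_dropped_scale <- sum(is.na(layer_data(p_scale)$y))
n_dropped_coord <- sum(is.na(layer_data(p_coord)$y))
```

The distinction matters most for layers that compute statistics, such as geom_smooth() or geom_boxplot(): with scale limits, the statistic is computed only on the data that remain inside the limits.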
Plot titles
Three functions work together to customize titles:
ggtitle(): sets the text for the main title, for example ggtitle("Main Title").
labs(): sets title, subtitle, caption, and tag in one call.
theme(): styles each element via plot.title, plot.subtitle, plot.caption, and plot.tag, using element_text(face, size, family, hjust, vjust, margin, lineheight).
p + labs(title = "Relationship between bill length and depth",
         subtitle = "for different penguin species",
         caption = "scatter plot",
         tag = "Fig. 1")
Code
p + labs(title = "Relationship between bill length and depth") +
  theme(plot.title = element_text(face = "bold",
                                  margin = margin(10, 0, 10, 0),
                                  size = 14))
Code
library(showtext)
font_add_google("Playfair Display", "Playfair")
font_add_google("Bangers", "Bangers")
showtext_auto()

p + labs(title = "Relationship between bill length and depth") +
  theme(plot.title = element_text(family = "Bangers", hjust = 0.5, size = 25),
        plot.subtitle = element_text(family = "Playfair", hjust = 0.5, size = 15))
Code
p + ggtitle("Relationship between bill length and depth across different \n species using scatter plot") +
  theme(plot.title = element_text(lineheight = 0.8, size = 16))
Legends: the default
ggplot2 adds a legend by default when a variable is mapped to an aesthetic. The default legend title is the variable specified in the color argument:
theme(legend.title = element_blank()) removes the legend title; setting the name to NULL via scale_color_discrete(name = NULL) or labs(color = NULL) achieves the same:
Code
p + theme(legend.title = element_blank())
There are three equivalent ways to change the legend title: labs(color = "new title"), scale_color_discrete(name = "new title"), or guides(color = guide_legend("new title")):
Code
p + labs(color = "species\nindicated\nby colors:") +
  theme(legend.title = element_text(family = "Playfair", color = "blue",
                                    size = 14, face = "bold"))
theme(legend.title = element_text(family, color, size, face)) styles the title:
Code
p + theme(legend.title = element_text(family = "Playfair", color = "chocolate",
                                      size = 14, face = "bold"))
The default theme is theme_gray() (also spelled theme_grey()). Like the other built-in themes, it accepts a base font size (base_size, a number) and a font family (base_family, a string such as "serif", "sans", or "mono"):
Code
p + theme_grey() + labs(title = "Default: Grey")
plot + theme_gray()
plot + theme_bw()
plot + theme_linedraw()
plot + theme_light()
plot + theme_dark()
plot + theme_minimal()
plot + theme_classic()
plot + theme_void()
The ggthemes package offers additional predefined themes.
theme() has many arguments to modify individual components of a plot, including:
all line, rectangle, text, and title elements,
the aspect ratio of the panel,
axis title, text, ticks, and lines,
legend background, margin, text, title, position, and more.
element_blank() removes grid lines selectively (panel.grid.minor) or entirely (panel.grid):
Code
p + theme(panel.grid = element_blank())
Specify the spacing of grid lines through axis breaks with scale_*_continuous():
Code
p + scale_y_continuous(breaks = seq(0, 30, 5), minor_breaks = seq(0, 60, 2.5))
5. Color
color and fill
The color argument sets the outline color of plot elements, and fill sets their interior color:
geom_point(color = "steelblue", size = 2)
For point shapes 21 to 24, both apply:
Code
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(shape = 21, size = 2, stroke = 1, color = "#3cc08f", fill = "#c08f3c") +
  labs(x = "Bill length (mm)", y = "Bill depth (mm)")
scale_color_* and scale_fill_* functions modify colors mapped to variables; they differ for categorical (qualitative) and continuous (quantitative) variables.
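As a sketch of that distinction, the example below applies one scale of each kind to the same scatter plot. The specific choices (the ColorBrewer "Dark2" palette for species and the viridis color map for body mass) are illustrative picks for this example, not recommendations taken from the slides; both are commonly listed as colorblind-friendly options.

```r
library(ggplot2)
library(palmerpenguins)

base <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  labs(x = "Bill length (mm)", y = "Bill depth (mm)")

# Categorical variable: a qualitative ColorBrewer palette
p_discrete <- base +
  geom_point(aes(color = species)) +
  scale_color_brewer(palette = "Dark2")

# Continuous variable: the viridis color map
p_continuous <- base +
  geom_point(aes(color = body_mass_g)) +
  scale_color_viridis_c(option = "viridis")

p_discrete
p_continuous
```

The same pattern applies to fill: swap scale_color_* for scale_fill_* when the variable is mapped to the fill aesthetic.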
library(ggtext)
lab_md <- "This plot shows **Bill length** in *mm* versus **Bill depth** in *mm* across species type"
p + geom_richtext(aes(x = 45, y = 22.5, label = lab_md), stat = "unique")
geom_textbox() provides dynamic wrapping for longer annotations:
Code
lab_long <- "**Association**<br><i style='font-size:8pt;color:black;'>This graph is a scatter plot showing the association between bill length and bill depth for each species type, so we can see that there is a clear association.</i>"
p + geom_textbox(aes(x = 45, y = 20, label = lab_long),
                 width = unit(25, "lines"), stat = "unique")
p + geom_point(color = "gray40", alpha = .3) +
  geom_smooth(method = "lm",
              formula = y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5),
              color = "black", fill = "firebrick")
8. Facets and multi-panel figures
facet_grid() vs facet_wrap()
facet_grid() lays the panels out in a grid defined by row and/or column variables; with a single variable, the panels run in one direction (horizontal or vertical).
facet_wrap() places the facets next to each other and wraps them according to the provided number of columns and/or rows.
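A minimal sketch of the two functions on the penguins data follows. Dropping the rows without a recorded sex and the choice of two columns for facet_wrap() are decisions made for this example only.

```r
library(ggplot2)
library(palmerpenguins)

# Drop rows without a recorded sex so that no NA panel appears
penguins_sexed <- subset(penguins, !is.na(sex))

base <- ggplot(penguins_sexed, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()

# facet_grid(): rows defined by sex, columns defined by species
p_grid <- base + facet_grid(sex ~ species)

# facet_wrap(): one panel per species, wrapped into two columns
p_wrap <- base + facet_wrap(~species, ncol = 2)

p_grid
p_wrap
```

facet_grid() suits two crossed variables, because every row and column shares a label; facet_wrap() suits a single variable with many levels, because it fills the available space row by row.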
patchwork: combine multiple plots with simple syntax such as p1 + p2, p1 / p2, or (g + p2) / p1, and define complex layouts with a design matrix via plot_layout(design = layout).
cowplot: another package for combining plots, for example plot_grid(plot_grid(g, p1), p2, ncol = 1).
How many penguins of each species are in this data set?
Code
ggplot(penguins, aes(x = species, fill = species)) +
  geom_bar() +
  labs(title = "Number of Penguins by Species",
       x = "Species", y = "Count", fill = "Species") +
  theme_minimal()
Number of penguins of each species on each island:
Code
ggplot(penguins) +
  geom_bar(aes(x = island, fill = species)) +
  labs(title = "Population of penguin species on each island",
       y = "count of species") +
  theme(text = element_text(size = 14))
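Body mass by species and sex. With stat = "identity", every row is drawn as its own bar starting at zero; the bars within each species and sex group overlap, so the visible height is the largest body mass in that group, not the mean:
Code
ggplot(penguins, aes(x = species, y = body_mass_g, fill = sex)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Body Mass by Species and Sex",
       x = "Species", y = "Body Mass (g)", fill = "Sex") +
  theme_minimal()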
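If the goal is the average body mass per group rather than the individual values, one option is to let ggplot2 compute the summary. The sketch below assumes that rows with a missing sex or body mass should be dropped first; that filtering step and the object names are choices made for this example.

```r
library(ggplot2)
library(palmerpenguins)

# Drop rows with missing sex or body mass so every bar has a defined mean
penguins_complete <- subset(penguins, !is.na(sex) & !is.na(body_mass_g))

# stat = "summary" with fun = "mean" draws one bar per species and sex,
# with its height set to the group mean
p_mean <- ggplot(penguins_complete, aes(x = species, y = body_mass_g, fill = sex)) +
  geom_bar(stat = "summary", fun = "mean", position = "dodge") +
  labs(title = "Mean Body Mass by Species and Sex",
       x = "Species", y = "Mean body mass (g)", fill = "Sex") +
  theme_minimal()
p_mean
```

Summarising the data beforehand and plotting the result with geom_col() is an equivalent route when the summary table is also needed elsewhere.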
Histograms: geom_histogram()
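A histogram shows the distribution of a numeric variable by splitting its range into bins and counting the observations in each bin. Only one aesthetic is required: the x variable. The picture depends on the bins, and geom_histogram() uses 30 bins unless you set bins or binwidth.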
Code
ggplot(penguins, aes(x = bill_length_mm)) +
  geom_histogram() +
  ggtitle("Histogram of penguin bill length")
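Because the shape of a histogram depends on its bins, it is usually worth setting them explicitly. The sketch below uses a bin width of 1 mm, an illustrative value rather than a recommended one; trying a few widths is the usual way to find a faithful picture.

```r
library(ggplot2)
library(palmerpenguins)

# binwidth = 1 gives bins that are 1 mm wide (an illustrative choice);
# the white outline separates neighbouring bars
p_hist <- ggplot(penguins, aes(x = bill_length_mm)) +
  geom_histogram(binwidth = 1, color = "white") +
  labs(title = "Histogram of penguin bill length",
       x = "Bill length (mm)", y = "Count")
p_hist
```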
ggplot(penguins, aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot() +
  labs(title = "Body Mass Distribution of Penguins by Species",
       x = "Species", y = "Body Mass (g)", fill = "Species") +
  theme_minimal()
Boxplot with annotations via geom_signif() from ggsignif:
Code
library(ggsignif)
ggplot(penguins, aes(x = species, y = bill_length_mm, fill = species)) +
  geom_boxplot() +
  # specify the comparison we are interested in
  geom_signif(comparisons = list(c("Adelie", "Gentoo")), map_signif_level = TRUE)
Violin plots: geom_violin()
A violin plot visualizes the distribution of a numeric variable for one or several groups. It resembles a boxplot but also shows the shape of the distribution as a mirrored density curve, which reveals features such as multiple peaks that a boxplot hides.
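A minimal sketch of a violin plot on the penguins data is shown below. Overlaying a narrow boxplot inside each violin is a common combination, added here as an illustration; the transparency and width values are arbitrary.

```r
library(ggplot2)
library(palmerpenguins)

p_violin <- ggplot(penguins, aes(x = species, y = body_mass_g, fill = species)) +
  geom_violin(alpha = 0.6) +
  # A narrow boxplot inside each violin adds the median and quartiles
  geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA) +
  labs(title = "Body Mass Distribution of Penguins by Species",
       x = "Species", y = "Body Mass (g)", fill = "Species") +
  theme_minimal()
p_violin
```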
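geom_line() connects observations in order of the x variable and is typically used to display values over time. When x is discrete (a factor or character), it needs a group aesthetic: use group = 1 for a single line, or group = variable_name to draw one line per level of a variable.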
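The sketch below illustrates the grouping rule with the year column of the penguins data. Averaging body mass per species and year is a summary chosen for this example, and year is converted to a factor so that the group aesthetic is actually needed.

```r
library(ggplot2)
library(palmerpenguins)

# Mean body mass per species and study year
# (aggregate() drops rows with a missing body mass by default)
yearly <- aggregate(body_mass_g ~ year + species, data = penguins, FUN = mean)

# x is a factor, so group tells geom_line() which points belong to the same line
p_line <- ggplot(yearly, aes(x = factor(year), y = body_mass_g,
                             group = species, color = species)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Mean body mass (g)", color = "Species")
p_line
```

With a single series, group = 1 plays the same role: it tells geom_line() to connect all points into one line.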
What separates an adequate chart from an excellent one is rarely the code; it is the design decisions behind it:
Show the data honestly. Use sensible axis ranges, never truncate or distort to exaggerate an effect, and show uncertainty where it matters.
Match the chart to the question. Choose the geometry from the question you are answering, not from habit.
Reduce clutter. Remove grid lines, borders, and legends that do not carry information; every element should earn its place.
Use color with intent. Color should encode meaning, not decorate. Limit categorical palettes to 5 to 7 colors and use colorblind-safe palettes such as viridis by default.
Label directly where possible. Direct labels beat legends; titles should state the finding, not just the variables.
Design for the audience. A figure for a donor report needs larger text, fewer panels, and a clearer takeaway than a figure for exploratory analysis.
Always cite the data source under the figure, and label tables and figures with descriptions rather than variable names.
Two interactive chart choosers: from Data to Viz (a decision tree from your data type to the right chart, with R code) and the R Graph Gallery (hundreds of charts with copy-paste code).
Rebuild charts you admire. Take a figure from a journal or news outlet and reproduce it in ggplot2; you will learn more from one reproduction than from ten tutorials.
Iterate in public. Share drafts with colleagues, ask what they read from the chart in five seconds, and revise until the takeaway is immediate.
Build a personal theme and palette so every figure you produce is consistent and branded by default.
Critique before you decorate. First ask whether the chart answers the question, then polish.
Practice on a schedule. One #TidyTuesday figure per week compounds quickly into an expert portfolio.
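A personal theme and palette, as suggested in the list above, can start as small as one function and one vector. Everything in this sketch is a placeholder to adapt: the name theme_report, its settings, and the three colors (taken from the Okabe-Ito colorblind-safe palette) are assumptions made for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Hypothetical house theme: the name and every setting are placeholders to adapt
theme_report <- function(base_size = 14, base_family = "sans") {
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      plot.title = element_text(face = "bold"),
      panel.grid.minor = element_blank(),
      legend.position = "bottom"
    )
}

# Hypothetical palette: three colors from the Okabe-Ito colorblind-safe set
report_colors <- c("#E69F00", "#56B4E9", "#009E73")

# Usage: the theme and palette are added like any other layer
p_report <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  scale_color_manual(values = report_colors) +
  labs(title = "Bill depth against bill length",
       x = "Bill length (mm)", y = "Bill depth (mm)", color = "Species") +
  theme_report()
p_report
```

Keeping the theme in a function rather than repeating theme() calls means that a single change propagates to every figure that uses it.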
Thank you. Questions and discussion are welcome. Part 2 of this training (graph_part2.qmd) continues with the advanced chart gallery: ridgeline, lollipop, dumbbell, diverging bars, heatmaps, slope charts, uncertainty, storytelling, and a reusable C4ED theme.