Chapter 9 Data Visualization

9.1 Data

ACTIVE (Advanced Cognitive Training for Independent and Vital Elderly), 1999-2001 [United States] was a multisite randomized controlled trial conducted at six field sites with New England Research Institutes (NERI) as the coordinating center. The field sites included the University of Alabama at Birmingham, Hebrew Rehabilitation Center for the Aged in Boston, Indiana University, Johns Hopkins University in Baltimore, Pennsylvania State University, and the University of Florida/Wayne State University (Detroit). The primary aim of the trial was to test the effects of three distinct cognitive interventions – previously found to be successful in improving elders’ performance on basic measures of cognition under laboratory or small-scale field conditions – on measures of cognitively demanding daily activities. Trainings consisted of an initial series of ten group sessions followed by four-session booster trainings at one and three years. The three cognitive interventions focused on memory, executive reasoning, and speed of processing. The design included a no-contact control group. Participants were assessed at baseline, immediately after training, and annually thereafter. A total of 2,832 older adults were enrolled in the trial, and 2,802 were included in the analytical sample. Twenty-six percent of the participants were African American.

There are 13 variables in the data set active.csv.

  • site: a total of 6 sites
  • age
  • edu: years of education
  • group: there are four groups - a control group and three other training groups (memory, reasoning, and speed)
  • booster: whether received booster training
  • sex: 1 Male 2 Female
  • reason: reasoning ability
  • ufov: useful field of view variable
  • hvltt: Hopkins Verbal Learning Test total score at time 1
  • hvltt2: Hopkins Verbal Learning Test total score at time 2
  • hvltt3: Hopkins Verbal Learning Test total score at time 3
  • hvltt4: Hopkins Verbal Learning Test total score at time 4
  • mmse: Mini-mental state examination total score
active <- read.csv('data/active.csv')
str(active)
## 'data.frame':    1575 obs. of  14 variables:
##  $ site   : int  1 1 6 5 4 1 4 6 6 5 ...
##  $ age    : int  76 67 67 72 69 70 77 79 73 78 ...
##  $ edu    : int  12 10 13 16 12 13 13 12 13 13 ...
##  $ group  : int  1 1 3 1 4 1 1 4 4 4 ...
##  $ booster: int  1 1 1 1 0 1 1 0 0 0 ...
##  $ sex    : int  1 2 2 2 2 1 2 2 2 2 ...
##  $ reason : int  28 13 24 33 30 35 32 38 17 14 ...
##  $ ufov   : int  16 20 16 16 16 23 23 20 16 26 ...
##  $ hvltt  : int  28 24 24 35 35 29 18 25 24 22 ...
##  $ hvltt2 : int  28 22 24 34 29 27 16 26 17 19 ...
##  $ hvltt3 : int  17 20 28 32 34 26 27 25 20 21 ...
##  $ hvltt4 : int  22 27 27 34 34 29 30 29 11 26 ...
##  $ mmse   : int  27 25 27 30 28 23 26 30 26 27 ...
##  $ id     : int  1 2 3 4 5 6 7 8 9 10 ...

9.2 Data visualization using base R functions

R is very well known for its ability in visualizing data. R base already provides many useful functions for plotting data.

For example, to get a scatterplot, we can use the plot function.

attach(active)
plot(x = hvltt, y = reason)

Each point represents a pair of data from a subject. If two subjects have the same data, the points will overlap and become one. To see where the data are clustered, we can add a small noise to the data. jitter function is used here and the option factor specifies how much noise to add.

plot(x = jitter(hvltt, factor=10), y = reason)

We can change the labels for x-axis and y-axis.

plot(x = jitter(hvltt, factor=2), y = reason,
     xlab="Verbal ability", ylab="Reasoning ability")

A title can be added.

plot(x = jitter(hvltt, factor = 2), y = reason,
     xlab = "Verbal ability", ylab = "Reasoning ability",
     main = "Relationship between verbal and reasoning ability")

9.2.1 Save a plot

Different ways can be used to save a figure. One is to plot it first and save it to the computer. The other way is to directly plot to a file. The most common file formats include pdf, svg, png, and jpeg.

For example, to generate a pdf plot, we can use:

pdf('data/active.pdf')
plot(x = jitter(hvltt, factor = 2), y = reason,
     xlab = "Verbal ability", ylab = "Reasoning ability",
     main = "Relationship between verbal and reasoning ability")
dev.off()
## png 
##   2

jpeg('data/active.jpg')
plot(x = jitter(hvltt, factor = 2), y = reason,
     xlab = "Verbal ability", ylab = "Reasoning ability",
     main = "Relationship between verbal and reasoning ability")
dev.off()
## png 
##   2

The function pdf() opens a pdf file/device for plotting. Then, any plot process used in R can be applied after. At the end of the plot, one closes the file/device using the function dev.off().

For other types of plot, the same procedure can be followed. For example, svg() is for svg plot, png() for PNG plot, jpeg() for jpg plot, bmp() for bmp plot.

9.2.2 More complex plots

The ACTIVE data can be grouped into male and female participants. Suppose we want to visualize the two groups of data.

One way to do it is to generate two plots. The two plots can also be put together into one figure. Note that the function par is used to specify the plot parameters. Particularly, we use mfrow to set the layout of the two plots as one row with two columns.

par(mfrow=c(nr=1,nc=2))
plot(x = jitter(hvltt[sex==1], factor = 2), y = reason[sex==1],
     xlab = "Verbal ability", ylab = "Reasoning ability",
     main = "For Male")

plot(x = jitter(hvltt[sex==2], factor = 2), y = reason[sex==2],
     xlab = "Verbal ability", ylab = "Reasoning ability",
     main = "For Female")

par(mfrow=c(1,1))

The function layout provides more control of the layout of multiple figures. It uses a matrix to tell the location of the figure. In the matrix, each number from 1 to K to indicate plot 1 to plot K. Zero (0) can be used to skip a figure.

layout <- layout(matrix(c(1,2), 1, 2, byrow = TRUE))
layout.show(layout)

layout <- layout(matrix(c(1,1,0,2), 2, 2, byrow = TRUE))
layout.show(layout)


layout <- layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE))
layout.show(layout)


layout <- layout(matrix(c(1,2,3,4), 2, 2, byrow = TRUE), 
                 widths=c(1,3), heights=c(3,1))
layout.show(layout)


layout <- layout(matrix(c(1,2,3,3), 2, 2, byrow = TRUE), 
                 widths=c(1,3), heights=c(3,1))
layout.show(layout)

Now the active data

layout <- layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE),
                 widths=c(1,1), heights=c(2,1))

plot(x = jitter(hvltt, factor = 2), y = reason,
     xlab = "Verbal ability", ylab = "Reasoning ability",
     main = "Overall")

plot(x = jitter(hvltt[sex==1], factor = 2), y = reason[sex==1],
     xlab = "Verbal ability", ylab = "Reasoning ability",
     main = "For Male")

plot(x = jitter(hvltt[sex==2], factor = 2), y = reason[sex==2],
     xlab = "Verbal ability", ylab = "Reasoning ability",
     main = "For Female")

Practice

layout <- layout(matrix(c(1,2,3,4), 2, 2, byrow = TRUE))

plot(x = jitter(hvltt[group==1], factor = 2), y = reason[group==1],
     xlab = "Verbal ability", ylab = "Reasoning ability",
     main = "For Male")

plot(x = jitter(hvltt[group==2], factor = 2), y = reason[group==2],
     xlab = "Verbal ability", ylab = "Reasoning ability",
     main = "For Female")

plot(x = jitter(hvltt[group==3], factor = 2), y = reason[group==3],
     xlab = "Verbal ability", ylab = "Reasoning ability",
     main = "For Male")

plot(x = jitter(hvltt[group==4], factor = 2), y = reason[group==4],
     xlab = "Verbal ability", ylab = "Reasoning ability",
     main = "For Female")

Another way to visualize the same data is to plot them in the same plot.

par(mfrow=c(1,1))
plot(x = jitter(hvltt[sex==1], factor = 2), y = reason[sex==1],
     xlab = "Verbal ability", ylab = "Reasoning ability",
     xlim=c(0, max(hvltt)), ylim=c(0, max(reason)), col='red', pch=1)

points(x = jitter(hvltt[sex==2], factor = 2), y = reason[sex==2],
       col='blue', pch=2)

legend('topleft', c('Male', 'Female'), col=c('red','blue'), pch = c(1,2))

We can further add regression lines to figure.

  • plot first plots the data for the male participants.
    • xlim: the range of x-axis
    • ylim: the range of y-axis
    • col: the color of the plot
  • points can add another layer of plot to the existing plot. It cannot be used without an existing plot already.
    • pch: the character to represent a data point
  • abline: add a line with given intercept and slope
    • lty: type of lines
    • lwd: width of lines
  • legend: add a legend to the plot.
plot(x = jitter(hvltt[sex==1], factor = 2), y = reason[sex==1],
     xlab = "Verbal ability", ylab = "Reasoning ability",
     xlim=c(0, max(hvltt)), ylim=c(0, max(reason)), col='red')

points(x = jitter(hvltt[sex==2], factor = 2), y = reason[sex==2],
       col='blue', pch=2)

abline(lm(reason[sex==1] ~ hvltt[sex==1]), col='red', lty=1, lwd=3)

abline(lm(reason[sex==2] ~ hvltt[sex==2]), col='blue', lty=2, lwd=2)

legend('topleft', c('Male', 'Female'), col=c('red','blue'), 
       pch = c(1,2), lty=c(1,2))

9.3 ggplot2 - The Layered Grammar of Graphics

Wickham (2010) proposed some general rules to build a statistical graph that serves as the basis of the popular R package ggplot2 and many other visualization packages.

Each plot can have several layers and each layer has the following components, explicitly or implicitly.

  • data and aesthetic mappings,
  • geometric objects,
  • the coordinate system
  • scales,
  • statistical transformations, and
  • facet specification.

9.3.1 An example

We now plot the ACTIVE data again.

ggplot() + 
  layer(
    data = active, geom = 'point', stat = 'identity', 
    mapping = aes(x = hvltt, y = reason), position = 'identity'
  ) +
  scale_y_continuous() +
  scale_x_continuous() +
  coord_cartesian()

To generate the above plot:

  • ggplot: starts the plot area
  • layer: creates a new layer of plot. A layer is a combination of data, stat, geom, and aesthetic mappings with a potential position adjustment.
    • data: a data frame
    • geom: the geometric object to display the data such as a point, a droplet, a line, etc.
    • stat: how to use or transform the data. identity is to use the data as is.
    • mapping: connect the variables in the data with the plotting element.
    • position: how to arrange geoms in the plot, e.g., just put where it is or permutate its position or stack them together.
  • scale_y_continuous and scale_x_continuous: how to map data from the data set to the aesthetic attributes of a plot.
  • coord_cartesian: coord_ specify the type of coordinate system to be used.

The full options can be found in the document of ggplot2. A fast reference is given below.

We can also change the title and labels of a plot.

ggplot() + 
  layer(
    data = active, geom = 'point', stat = 'identity', 
    mapping = aes(x = hvltt, y = reason), position = 'identity'
  ) +
  scale_y_continuous() +
  scale_x_continuous() +
  coord_cartesian() + 
  ggtitle('Relationship between verbal and reasoning ability') +
  xlab('Verbal ability') +
  ylab('Reasoning ability')

Note that by default, the title is left-justified. To center the title, one needs to change the “theme” of a plot.

ggplot() + 
  layer(
    data = active, geom = 'point', stat = 'identity', 
    mapping = aes(x = hvltt, y = reason), position = 'identity'
  ) +
  scale_y_continuous() +
  scale_x_continuous() +
  coord_cartesian() + 
  ggtitle('Relationship between verbal and reasoning ability') +
  xlab('Verbal ability') +
  ylab('Reasoning ability') +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

We can also permutate the data.

ggplot() + 
  layer(
    data = active, geom = 'point', stat = 'identity', 
    mapping = aes(x = hvltt, y = reason), position = 'jitter'
  ) +
  scale_y_continuous() +
  scale_x_continuous() +
  coord_cartesian() + 
  ggtitle('Relationship between verbal and reasoning ability') +
  xlab('Verbal ability') +
  ylab('Reasoning ability')

More layers can be added to a plot. For example, we can add the regression line to the scatter plot.

ggplot() + 
  layer(
    data = active, geom = 'point', stat = 'identity', 
    mapping = aes(x = hvltt, y = reason), position = 'jitter'
  ) +
  layer(
    data = active, geom = 'smooth', stat = 'smooth', params = list(method = "lm"),
    mapping = aes(x = hvltt, y = reason), position = 'identity'
  ) +
  scale_y_continuous() +
  scale_x_continuous() +
  coord_cartesian() + 
  ggtitle('Relationship between verbal and reasoning ability') +
  xlab('Verbal ability') +
  ylab('Reasoning ability')

ggplot2 provides a unified way to generate all kinds of plots. But comparing the code of ggplot2 with the plot function we used earlier, it seemed to be more tedious or verbose to use ggplot2. However, in a plot, an intelligent guess can be used to avoid the setup of many parameters to simplify the plotting process. For example, for the same plot above, we can use

ggplot(active, aes(hvltt, reason)) + 
  geom_point() +
  geom_smooth(method=lm) + 
  labs(title = 'Relationship between verbal and reasoning ability',
       x = 'Verbal ability', y = 'Reasoning ability')

  • The two layers share the same data so we put them together in ggplot.
  • The function geom_point is defined to generate a scatterplot. Nothing is specified in the function because all the defaults are used. One can change the default if needed.
  • The function geom_smooth is used to plot the regression line. By default, the smoothing method will be used. We change the default to method=lm to produce a regression line instead.
  • We also combine the title and labels together using the function labs.

ggplot2 also provides a quick plot function qplot similiar to plot function in the R base package. For example,

qplot(hvltt, reason)


qplot(hvltt, reason, geom=c('jitter','smooth'), method='lm')

9.3.2 Grouping

Previously, we plotted the data according to a third variable. For example, we can have different colors and shapes for different groups. This can be done easily using ggplot2.

ggplot(data = active) + 
  geom_jitter(aes(hvltt, reason, color = sex))


ggplot(data = active) + 
  geom_jitter(aes(hvltt, reason, color = age))

In most situations, it is beneficial to convert the grouping variable to a factor and label each category informatively.

active$sex2 <- factor(active$sex, levels=1:2, labels=c('Male', 'Female'))
active$group2 <- factor(active$group, levels=1:4,
                        labels = c('Memory', 'Reasoning', 'Speed', 'Control'))
ggplot(data = active) + 
  geom_jitter(aes(hvltt, reason, color = sex2))


ggplot(data = active) + 
  geom_jitter(aes(hvltt, reason, color = group2, shape = group2)) +
  scale_color_discrete(name="Treatment\nGroups") + 
  scale_shape_discrete(name="Treatment\nGroups")

Multiple layers can be created, too.

ggplot(data = active, aes(hvltt, reason, color = group2, shape = group2)) + 
  geom_jitter() +
  geom_smooth(method=lm, se=FALSE) + 
  scale_color_discrete(name="Treatment\nGroups") + 
  scale_shape_discrete(name="Treatment\nGroups")

9.3.3 Faceting

We can also generate a plot for each subset of data. This kind of plot is called conditioned or trellis plots. In ggplot2, it is called faceting. By faceting, we plot make plots for a subset of data.

ggplot(active, aes(hvltt, reason)) + 
  geom_point() +
  facet_wrap(~ sex)

Again, it is often beneficial to convert the grouping variable to a factor and label each category informatively.

active$sex2 <- factor(active$sex, levels=1:2, labels=c('Male', 'Female'))
active$group2 <- factor(active$group, levels=1:4,
                        labels = c('Memory', 'Reasoning', 'Speed', 'Control'))

ggplot(active, aes(hvltt, reason)) + 
  geom_point() +
  facet_wrap(~ sex2)


ggplot(active, aes(hvltt, reason)) + 
  geom_point() +
  facet_wrap(~ sex2, labeller = label_both)

The faceting can be done on more than one variable.

ggplot(active, aes(hvltt, reason)) + 
  geom_point() +
  facet_wrap(~ sex2*group2, nrow = 2)


ggplot(active, aes(hvltt, reason)) + 
  geom_point() +
  facet_grid(sex2 ~ group2)

9.4 Specific types of plot

9.4.1 Bar plot - geom_bar

A bar plot or bar graph is a chart of rectangular bars with their lengths proportional to the values or proportions that they represent. It is good for displaying the distribution of a categorical variable.

9.4.1.1 A simple barpot

ggplot(data = active) +
  geom_bar(aes(group2, fill = group2)) +
  theme(legend.position = "none")

9.4.1.2 Grouped barplot

We can stack the barplot together

ggplot(data = active) +
  geom_bar(aes(group2, fill = sex2), position = "stack")


ggplot(data = active) +
  geom_bar(aes(group2, fill = sex2), position = "fill")

Or put them side by side.

ggplot(data = active) +
  geom_bar(aes(group2, fill = sex2), position = "dodge")

9.4.2 Boxplot

9.4.2.1 A simple boxplot

A box plot or boxplot (also known as a box-and-whisker diagram or plot) is a convenient way of graphically displaying summaries of a variable. Often times, the five-number summary is used: the smallest observation, lower quartile (Q1), median (Q2), upper quartile (Q3), and largest observation.

ggplot(data = active) +
  geom_boxplot(aes(y = hvltt))

9.4.2.2 Grouped boxplot

ggplot(data = active) +
  geom_boxplot(aes(x = group2, y = hvltt))


ggplot(data = active) +
  geom_boxplot(aes(x = group2, y = hvltt)) +
  coord_flip()

9.4.3 Histogram

A histogram is a graphical display of frequencies over a set of continuous intervals for a continuous variable. The range of a variable is divided into a list of equal intervals. Within each interval, the number of participants, frequency, is counted. Then, the frequencies can be plotted with attached bars. Heights of the bars stand for frequencies or relative frequencies.

9.4.3.1 A simple histogram

ggplot(data = active, aes(x = hvltt)) +
  geom_histogram(bins = 20)

Plot the density instead of count

ggplot(data = active, aes(x = hvltt)) +
  geom_histogram(bins = 20, aes(y = ..density..))

9.4.3.2 Histogram with a density curve

ggplot(data = active, aes(x = hvltt)) +
  geom_histogram(bins = 20, aes(y = ..density..)) + 
  geom_density(size=1.5, color='red')

9.4.3.3 Multiple-group histogram

ggplot(active, aes(x=hvltt, fill=sex2)) +
    geom_histogram(alpha=.5, position="identity")


ggplot(active, aes(x=hvltt, fill=sex2)) +
    geom_histogram(position="identity")

ggplot(active, aes(x=hvltt, fill=sex2)) +
    geom_density(alpha = 0.3)


ggplot(active, aes(x=hvltt, color=sex2)) +
    geom_density(alpha = 0.3)


ggplot(active, aes(x=hvltt, group=sex2)) +
    geom_density(alpha = 0.3)

9.4.4 Line plot

9.4.4.1 A simple line plot

data.example <- data.frame(time=1:4, hvltt=c(28,28,17,22))

ggplot(data=data.example, aes(x=time, y=hvltt)) +
  geom_line()+
  geom_point()


ggplot(data=data.example, aes(x=time, y=hvltt)) +
  geom_line(color='red', linetype="dashed", size=1.1)+
  geom_point(size=3, color='blue')

As an example, we plot the active data.

active.long <- active %>%
                  select(id, sex, starts_with('hvlt')) %>% 
                  gather(starts_with('hvlt'), 
                         key='item', value='hvltt') %>%
                  arrange(id)

ggplot(data=active.long, aes(x=item, y=hvltt, group=id)) +
  geom_line() +
  xlab('Time')


ggplot(data=active.long, aes(x=item, y=hvltt, color=id)) +
  geom_line() +
  xlab('Time')

ggplot(data=active.long, aes(x=item, y=hvltt, group=id)) +
  geom_line() +
  xlab('Time') + 
  stat_summary(aes(group = 1, color='red'), geom = "point", 
               fun.y = mean, size = 3) +
  stat_summary(aes(group = 1, color='red'), geom = "line", 
               fun.y = mean, size = 2)


ggplot(data=active.long, aes(x=item, y=hvltt, group=id)) +
  geom_line() +
  xlab('Time') + 
  stat_summary(aes(group = 1), color='red', geom = "point", 
               fun.y = mean, size = 3) +
  stat_summary(aes(group = 1), color='red', geom = "line", 
               fun.y = mean, size = 2)

9.4.5 Maps

Every location on earth has a global address that can be specified by two numbers - latitude and longitude. latitude is a geographic coordinate that specifies the north-south position of a point on the Earth’s surface. Latitude ranges from 0° at the Equator to 90° (North or South) at the poles. A positive number is used for a position on the Northern Hemisphere and a negative number is used for the Southern Hemisphere. Longitude is a geographic coordinate that specifies the east-west position of a point on the Earth’s surface. By convention, the Prime Meridian, which passes through the Royal Observatory, Greenwich, England, was allocated the position of 0° longitude. The longitude of other places is measured as the angle east or west from the Prime Meridian, ranging from 0° at the Prime Meridian to +180° eastward and −180° westward.

9.4.5.1 Plot simple map outlines

Plotting data on a map is interesting but not an easy job. ggplot2 and some R packages built based on it has made it easy. A map is simply a polygon. Therefore, with the particular shape data, a map can be generated.

The package maps includes the map outline data. For example, for the world map, we have:

world <- map_data("world")
str(world)
## 'data.frame':    99338 obs. of  6 variables:
##  $ long     : num  -69.9 -69.9 -69.9 -70 -70.1 ...
##  $ lat      : num  12.5 12.4 12.4 12.5 12.5 ...
##  $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ region   : chr  "Aruba" "Aruba" "Aruba" "Aruba" ...
##  $ subregion: chr  NA NA NA NA ...

ggplot() + geom_polygon(data = world, 
                        aes(x=long, y = lat, group = group)) + 
  coord_fixed(1.3)


ggplot() + geom_polygon(data = world, 
                        aes(x=long, y = lat)) + 
  coord_fixed(1.3)

  • Data
    • long is longitude. Things to the west of the prime meridian are negative.
    • lat is latitude.
    • order. This just shows in which order ggplot should “connect the dots”
    • region and subregion tell what region or subregion a set of points surrounds.
    • group. This is very important! ggplot2’s functions can take a group argument which controls (amongst other things) whether adjacent points should be connected by lines. If they are in the same group, then they get connected, but if they are in different groups then they don’t. Essentially, having two points in different groups means that ggplot “lifts the pen” when going between them.
  • Plot
    • Maps in this format can be plotted with the polygon geom. i.e. using geom_polygon.
    • geom_polygon drawn lines between points and “closes them up” (i.e. draws a line from the last point back to the first point)
    • You have to map the group aesthetic to the group column
    • Of course, x = long and y = lat are the other aesthetics.
    • coord_fixed fixes the relationship between one unit in the y direction and one unit in the x direction. It is basically the width/height ratio

We now focus on the US map. The outline of the map can be obtained using the data below.

usa <- map_data("usa")
str(usa)
## 'data.frame':    7243 obs. of  6 variables:
##  $ long     : num  -101 -101 -101 -101 -101 ...
##  $ lat      : num  29.7 29.7 29.7 29.6 29.6 ...
##  $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ region   : chr  "main" "main" "main" "main" ...
##  $ subregion: chr  NA NA NA NA ...
ggplot() + geom_polygon(data = usa, fill=NA, color='blue',
                        aes(x=long, y = lat, group = group)) + 
  coord_fixed(1.3) 

We now turn to state maps. The data come with maps package only include the United States mainland generated from US Department of the Census data.

usstates <- map_data('state')
head(usstates)
##        long      lat group order  region subregion
## 1 -87.46201 30.38968     1     1 alabama      <NA>
## 2 -87.48493 30.37249     1     2 alabama      <NA>
## 3 -87.52503 30.37249     1     3 alabama      <NA>
## 4 -87.53076 30.33239     1     4 alabama      <NA>
## 5 -87.57087 30.32665     1     5 alabama      <NA>
## 6 -87.58806 30.32665     1     6 alabama      <NA>

ggplot(data = usstates) + 
  geom_polygon(aes(x = long, y = lat, fill = region, group = group), color = "white") + 
  coord_fixed(1.3) +
  guides(fill=FALSE) 


ggplot(data = usstates) + 
  geom_polygon(aes(x = long, y = lat, fill = region, group = group), color = "white") + 
  coord_fixed(1.3)

The county-level data for the US are also available.

uscounty <- map_data('county')
head(uscounty)
##        long      lat group order  region subregion
## 1 -86.50517 32.34920     1     1 alabama   autauga
## 2 -86.53382 32.35493     1     2 alabama   autauga
## 3 -86.54527 32.36639     1     3 alabama   autauga
## 4 -86.55673 32.37785     1     4 alabama   autauga
## 5 -86.57966 32.38357     1     5 alabama   autauga
## 6 -86.59111 32.37785     1     6 alabama   autauga

ggplot(data = uscounty) + 
  geom_polygon(aes(x = long, y = lat, fill = region, group = group), color = "white") + 
  coord_fixed(1.3) +
  guides(fill=FALSE) 

We now turn to a particular state, Indiana.

indiana <- subset(uscounty, region == "indiana")
ggplot(data = indiana) + 
  geom_polygon(aes(x = long, y = lat, fill = subregion, group = group), color = "white") + 
  coord_fixed(1.3) +
  guides(fill=FALSE)

We can remove some of the information to keep the map only.

map_theme <- theme(axis.line=element_blank(),
      axis.text.x=element_blank(),
      axis.text.y=element_blank(),
      axis.ticks=element_blank(),
      axis.title.x=element_blank(),
      axis.title.y=element_blank(),
      legend.position="none",
      panel.background=element_blank(),
      panel.border=element_blank(),
      panel.grid.major=element_blank(),
      panel.grid.minor=element_blank(),
      plot.background=element_blank())

ggplot(data = indiana) + 
  geom_polygon(aes(x = long, y = lat, fill = subregion, group = group), color = "white") + 
  coord_fixed(1.3) +
  guides(fill=FALSE) +
  map_theme

9.4.5.2 Displaying data on maps

We can display data on maps. Here, we will display some county-level data of the state of Indiana. The information we will use includes education data and population data. The education data are available on this webpage: http://www.stats.indiana.edu/web/county/edattain00.asp. The population data are available here: http://www.stats.indiana.edu/population/PopTotals/historic_counts_counties.asp.

Instead of typing the data manually, we will use the R package htmltab to directly get the data into R.

## education data
edu <- htmltab(doc="http://www.stats.indiana.edu/web/county/edattain00.asp")

## population data
population <- htmltab(doc="http://www.stats.indiana.edu/population/PopTotals/historic_counts_counties.asp")

demo.data <- cbind(edu[, c(6, 8)], population[, c(1, 12:13)])

## change variable names
names(demo.data) <- c('edu2010', 'edu1990', 'subregion', 'pop2000', 'pop2010') 

## remove the first line of data
demo.data <- demo.data[-1, ]

## convert to numeric
demo.data$edu1990 <- as.numeric(demo.data$edu1990)
demo.data$edu2010 <- as.numeric(demo.data$edu2010)

## first replace comma and then change to numeric
demo.data$pop2000 <- as.numeric(gsub(",", "", demo.data$pop2000))
demo.data$pop2010 <- as.numeric(gsub(",", "", demo.data$pop2010))

## convert uppper case to lower case letters
demo.data$subregion <- tolower(demo.data$subregion)

## calculate a difference score
demo.data$edudiff <- demo.data$edu2010 - demo.data$edu1990
demo.data$popdiff <- demo.data$pop2010 - demo.data$pop2000

We can then combine the information with the map data.

indiana <- inner_join(indiana, demo.data, by = "subregion")
str(indiana)
## 'data.frame':    1501 obs. of  12 variables:
##  $ long     : num  -84.8 -85.1 -85.1 -84.8 -84.8 ...
##  $ lat      : num  40.6 40.6 40.9 40.9 40.7 ...
##  $ group    : num  663 663 663 663 663 663 664 664 664 664 ...
##  $ order    : int  23681 23682 23683 23684 23685 23686 23688 23689 23690 23691 ...
##  $ region   : chr  "indiana" "indiana" "indiana" "indiana" ...
##  $ subregion: chr  "adams" "adams" "adams" "adams" ...
##  $ edu2010  : num  10.7 10.7 10.7 10.7 10.7 10.7 22.7 22.7 22.7 22.7 ...
##  $ edu1990  : num  10.7 10.7 10.7 10.7 10.7 10.7 19 19 19 19 ...
##  $ pop2000  : num  33625 33625 33625 33625 33625 ...
##  $ pop2010  : num  34387 34387 34387 34387 34387 ...
##  $ edudiff  : num  0 0 0 0 0 0 3.7 3.7 3.7 3.7 ...
##  $ popdiff  : num  762 762 762 762 762 ...

We now plot the population data on maps.

map_theme <- theme(
  axis.text = element_blank(),
  axis.line = element_blank(),
  axis.ticks = element_blank(),
  panel.border = element_blank(),
  panel.grid = element_blank(),
  axis.title = element_blank()
  )

ggplot(data = indiana) + 
  geom_polygon(aes(x = long, y = lat, fill = pop2010, group = group), color = "white") + 
  coord_fixed(1.3) +
  map_theme

Notice the missing pieces.

indiana <- subset(uscounty, region == "indiana")
name1 <- unique(indiana$subregion) 
name2 <- unique(demo.data$subregion) 
cbind(name1,name2)
##       name1         name2        
##  [1,] "adams"       "adams"      
##  [2,] "allen"       "allen"      
##  [3,] "bartholomew" "bartholomew"
##  [4,] "benton"      "benton"     
##  [5,] "blackford"   "blackford"  
##  [6,] "boone"       "boone"      
##  [7,] "brown"       "brown"      
##  [8,] "carroll"     "carroll"    
##  [9,] "cass"        "cass"       
## [10,] "clark"       "clark"      
## [11,] "clay"        "clay"       
## [12,] "clinton"     "clinton"    
## [13,] "crawford"    "crawford"   
## [14,] "daviess"     "daviess"    
## [15,] "dearborn"    "dearborn"   
## [16,] "decatur"     "decatur"    
## [17,] "de kalb"     "dekalb"     
## [18,] "delaware"    "delaware"   
## [19,] "dubois"      "dubois"     
## [20,] "elkhart"     "elkhart"    
## [21,] "fayette"     "fayette"    
## [22,] "floyd"       "floyd"      
## [23,] "fountain"    "fountain"   
## [24,] "franklin"    "franklin"   
## [25,] "fulton"      "fulton"     
## [26,] "gibson"      "gibson"     
## [27,] "grant"       "grant"      
## [28,] "greene"      "greene"     
## [29,] "hamilton"    "hamilton"   
## [30,] "hancock"     "hancock"    
## [31,] "harrison"    "harrison"   
## [32,] "hendricks"   "hendricks"  
## [33,] "henry"       "henry"      
## [34,] "howard"      "howard"     
## [35,] "huntington"  "huntington" 
## [36,] "jackson"     "jackson"    
## [37,] "jasper"      "jasper"     
## [38,] "jay"         "jay"        
## [39,] "jefferson"   "jefferson"  
## [40,] "jennings"    "jennings"   
## [41,] "johnson"     "johnson"    
## [42,] "knox"        "knox"       
## [43,] "kosciusko"   "kosciusko"  
## [44,] "lagrange"    "lagrange"   
## [45,] "lake"        "lake"       
## [46,] "la porte"    "laporte"    
## [47,] "lawrence"    "lawrence"   
## [48,] "madison"     "madison"    
## [49,] "marion"      "marion"     
## [50,] "marshall"    "marshall"   
## [51,] "martin"      "martin"     
## [52,] "miami"       "miami"      
## [53,] "monroe"      "monroe"     
## [54,] "montgomery"  "montgomery" 
## [55,] "morgan"      "morgan"     
## [56,] "newton"      "newton"     
## [57,] "noble"       "noble"      
## [58,] "ohio"        "ohio"       
## [59,] "orange"      "orange"     
## [60,] "owen"        "owen"       
## [61,] "parke"       "parke"      
## [62,] "perry"       "perry"      
## [63,] "pike"        "pike"       
## [64,] "porter"      "porter"     
## [65,] "posey"       "posey"      
## [66,] "pulaski"     "pulaski"    
## [67,] "putnam"      "putnam"     
## [68,] "randolph"    "randolph"   
## [69,] "ripley"      "ripley"     
## [70,] "rush"        "rush"       
## [71,] "st joseph"   "st. joseph" 
## [72,] "scott"       "scott"      
## [73,] "shelby"      "shelby"     
## [74,] "spencer"     "spencer"    
## [75,] "starke"      "starke"     
## [76,] "steuben"     "steuben"    
## [77,] "sullivan"    "sullivan"   
## [78,] "switzerland" "switzerland"
## [79,] "tippecanoe"  "tippecanoe" 
## [80,] "tipton"      "tipton"     
## [81,] "union"       "union"      
## [82,] "vanderburgh" "vanderburgh"
## [83,] "vermillion"  "vermillion" 
## [84,] "vigo"        "vigo"       
## [85,] "wabash"      "wabash"     
## [86,] "warren"      "warren"     
## [87,] "warrick"     "warrick"    
## [88,] "washington"  "washington" 
## [89,] "wayne"       "wayne"      
## [90,] "wells"       "wells"      
## [91,] "white"       "white"      
## [92,] "whitley"     "whitley"
name1 == name2
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [13]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## [49]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [61]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## [73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [85]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

demo.data$subregion <- name1

indiana <- inner_join(indiana, demo.data, by = "subregion")


ggplot(data = indiana) + 
  geom_polygon(aes(x = long, y = lat, fill = pop2010, group = group), color = "white") + 
  coord_fixed(1.3) +
  scale_fill_gradient(trans = "log10") +
  map_theme


ggplot(data = indiana) + 
  geom_polygon(aes(x = long, y = lat, fill = pop2010, group = group), color = "white") + 
  coord_fixed(1.3) +
  map_theme

We now visualize the change of population for Indiana counties.

pop.change.plot <- ggplot(data = indiana) + 
  geom_polygon(aes(x = long, y = lat, fill = cut(popdiff,breaks=c(-3500, -100, 100, 100000)), group = group), color = "white") + 
  coord_fixed(1.3) +
  map_theme +
  scale_fill_discrete(name="Population", labels=c('Move out', 'No change', 'Move in'))

pop.change.plot


pop.change.plot <- ggplot(data = indiana) + 
  geom_polygon(aes(x = long, y = lat, fill =popdiff, group = group), color = "white") + 
  coord_fixed(1.3) +
  map_theme 

pop.change.plot

We now label each county using its name.

cnames <- aggregate(cbind(long, lat) ~ subregion, data=indiana, FUN=mean)

pop.change.plot +
  geom_text(data=cnames, aes(long, lat, label = substr(subregion, 1, 2)))

9.4.5.3 R package usmap

usa.mapdata <- us_map('states')
str(usa.mapdata)
## 'data.frame':    12999 obs. of  9 variables:
##  $ long : num  1091779 1091268 1091140 1090940 1090913 ...
##  $ lat  : num  -1380695 -1376372 -1362998 -1343517 -1341006 ...
##  $ order: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ hole : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ piece: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ group: chr  "01.1" "01.1" "01.1" "01.1" ...
##  $ fips : chr  "01" "01" "01" "01" ...
##  $ abbr : chr  "AL" "AL" "AL" "AL" ...
##  $ full : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...

ggplot(data = usa.mapdata) + 
  geom_polygon(aes(x = long, y = lat, fill = full, group = group), color = "white") + 
  guides(fill=FALSE) 

9.4.5.4 R package ggmap

ggmap is an R package that makes it easy to retrieve raster map tiles from popular online mapping services like Google Maps and Stamen Maps and plot them using the ggplot2 framework.

The function get_stamenmap can download a map tile from a tile server for Stamen Maps. It can be used for without any limitation but Stamen maps don’t cover the entire world. The use of the function get_stamenmap is to cut a tile from a world map based on latitude and longitude. We can specify the area using a vector with the left longitude, bottom latitude, right longitude, and top latitude. Different ways can be used to find such information.

  • Use Google Maps
    • On your computer, open Google Maps. If you’re using Maps in Lite mode, you’ll see a lightning bolt at the bottom and you won’t be able to get the coordinates of a place.
    • Right-click the place or area on the map.
    • Select What’s here?
    • At the bottom, you’ll see a card with the coordinates.
  • Use OpenStreetMap
    • On your computer, open OpenStreetMap.
    • Right-click the place or area on the map.
    • Select Show address?
    • The coordinates are on the left panel.
    • You can also use “Export” tab to find the coordinate for an area.

For example, to get the area map around Notre Dame, we can use the following:

ND <- c(left = -86.29, bottom = 41.68, right = -86.19, top = 41.72)
NDdata <- get_stamenmap(ND, zoom=12) 
str(NDdata)
##  'ggmap' chr [1:156, 1:290] "#DBDCD4" "#E5E6DE" "#E6E6E1" "#E5E6DE" ...
##  - attr(*, "bb")='data.frame':   1 obs. of  4 variables:
##   ..$ ll.lat: num 41.7
##   ..$ ll.lon: num -86.3
##   ..$ ur.lat: num 41.7
##   ..$ ur.lon: num -86.2
##  - attr(*, "source")= chr "stamen"
##  - attr(*, "maptype")= chr "terrain"
##  - attr(*, "zoom")= num 12

NDmap <- ggmap(NDdata)  ## zoom can be a number from 1 to 18 to specify the details of the map.
NDmap

We can add the information to the map, too.

txt <- "
lat, long, place
41.698462, -86.233917, ND Stadium
41.698118, -86.232704, Corbett
41.704477, -86.237384, Haggar
"

nd.places <- read.csv(text = txt)

NDmap + 
  geom_point(data=nd.places, aes(long, lat), color='blue') + 
  geom_text(data=nd.places, aes(long, lat, label = place))

The following example is adapted from the help document of ggmap to visualize the crime data in Houston. First of all, the data crime include crime information from January 2010 to August 2010 geocoded with Google Maps. Note that qmplot provides a function for quick map visualization.

str(crime)
## 'data.frame':    86314 obs. of  17 variables:
##  $ time    : POSIXct, format: "2010-01-01 01:00:00" "2010-01-01 01:00:00" ...
##  $ date    : chr  "1/1/2010" "1/1/2010" "1/1/2010" "1/1/2010" ...
##  $ hour    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ premise : chr  "18A" "13R" "20R" "20R" ...
##  $ offense : Factor w/ 7 levels "aggravated assault",..: 4 6 1 1 1 3 3 3 3 3 ...
##  $ beat    : chr  "15E30" "13D10" "16E20" "2A30" ...
##  $ block   : chr  "9600-9699" "4700-4799" "5000-5099" "1000-1099" ...
##  $ street  : chr  "marlive" "telephone" "wickview" "ashland" ...
##  $ type    : chr  "ln" "rd" "ln" "st" ...
##  $ suffix  : chr  "-" "-" "-" "-" ...
##  $ number  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ month   : Ord.factor w/ 8 levels "january"<"february"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ day     : Ord.factor w/ 7 levels "monday"<"tuesday"<..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ location: chr  "apartment parking lot" "road / street / sidewalk" "residence / house" "residence / house" ...
##  $ address : chr  "9650 marlive ln" "4750 telephone rd" "5050 wickview ln" "1050 ashland st" ...
##  $ lon     : num  -95.4 -95.3 -95.5 -95.4 -95.4 ...
##  $ lat     : num  29.7 29.7 29.6 29.8 29.7 ...

crime.downtown <- crime %>% 
  filter(
    -95.39681 <= lon & lon <= -95.34188,
     29.73631 <= lat & lat <=  29.78400
  )

qmplot(lon, lat, data = crime.downtown, maptype = "toner-lite", color = offense)

Since ggmap’s built on top of ggplot2, all ggplot2 method will work.

robberies <- crime.downtown %>% filter(offense == "robbery")

qmplot(lon, lat, data = robberies, geom = "blank", 
  zoom = 14, maptype = "toner-background", darken = .7, legend = "topleft"
) +
  stat_density_2d(aes(fill = ..level..), geom = "polygon", alpha = .3, color = NA) +
  scale_fill_gradient2("Robbery\nPropensity", low = "white", mid = "yellow", high = "red", midpoint = 650)

Faceting works, too:

qmplot(lon, lat, data = crime.downtown, maptype = "toner-background", color = offense) + 
  facet_wrap(~ offense)


timeofday <- rep('Night', nrow(crime.downtown))
timeofday[crime.downtown$hour >= 6 & crime.downtown$hour <= 12] <- 'Morning'
timeofday[crime.downtown$hour >= 13 & crime.downtown$hour <= 20] <- 'Afternoon'

crime.downtown$timeofday <- timeofday

qmplot(lon, lat, data = crime.downtown, maptype = "toner-background", color = offense) + 
  facet_wrap(~ timeofday)

Google Maps can be used as well for map visualization. However, since Google Maps use a center/zoom specification, the input is a bit different. Instead of specifying a tile, we provide the center of the location and the zoom level. In addition, the name of the place can be used directly.

To use Google Maps, we need to get an API key first.

## Use the API key
api_key <- "AIzaSyBq6CMm6jW3XJN1ZAc5aoUm57oVnCXz1Mg"
register_google(key = api_key)

## download Google Maps data
## map zoom; an integer from 3 (continent) to 21 (building), default value 10 (city)
map.nd <- get_googlemap("Notre Dame, in 46556", zoom = 15, maptype = "satellite")
str(map.nd)
##  'ggmap' chr [1:1280, 1:1280] "#526049" "#526049" "#526049" "#59624C" ...
##  - attr(*, "source")= chr "google"
##  - attr(*, "maptype")= chr "satellite"
##  - attr(*, "zoom")= num 15
##  - attr(*, "bb")=Classes 'tbl_df', 'tbl' and 'data.frame':   1 obs. of  4 variables:
##   ..$ ll.lat: num 41.7
##   ..$ ll.lon: num -86.3
##   ..$ ur.lat: num 41.7
##   ..$ ur.lon: num -86.2

ggmap(map.nd) +
  geom_point(data=nd.places, aes(long, lat), color='blue')

Different styles of Google Maps can be plotted, too.

get_googlemap("Notre Dame, in 46556", zoom = 12, maptype = "terrain") %>% ggmap()
get_googlemap("Notre Dame, in 46556", zoom = 12, maptype = "hybrid") %>% ggmap()
get_googlemap("Notre Dame, in 46556", zoom = 12, maptype = "roadmap") %>% ggmap()

Google’s geocoding and reverse geocoding API’s are available through geocode() and revgeocode(), respectively.

geocode("390 Corbett Family Hall, Notre Dame, IN 46556")
## # A tibble: 1 x 2
##     lon   lat
##   <dbl> <dbl>
## 1 -86.2  41.7
revgeocode(c(lon = -86.24188, lat = 41.69803))
## [1] "O'Neill Hall, Notre Dame, IN 46556, USA"

A very useful feature is to get the geocode of a location and mutate it with data through the function mutate_geocode(). For example,

tibble(address = c("Notre Dame, IN 46556", "Mishawaka, IN 46545", "Granger, IN 46530")) %>% 
  mutate_geocode(address)
## # A tibble: 3 x 3
##   address                lon   lat
##   <chr>                <dbl> <dbl>
## 1 Notre Dame, IN 46556 -86.2  41.7
## 2 Mishawaka, IN 46545  -86.1  41.7
## 3 Granger, IN 46530    -86.1  41.7

Treks use Google’s routing API to give you routes (route() and trek() give slightly different results; the latter hugs roads):

trek_df <- trek("Notre Dame, IN 46556", "Granger, IN 46530")

qmap("Notre Dame, IN 46556", zoom = 11) +
  geom_path(
    aes(x = lon, y = lat),  colour = "blue",
    size = 1.5, alpha = .5,
    data = trek_df, lineend = "round"
  )

Map distances, in both length and anticipated time, can be computed with mapdist()). Moreover the function is vectorized:

mapdist(c("houston, texas", "dallas"), "waco, texas")
## # A tibble: 2 x 9
##   from           to               m    km miles seconds minutes hours mode   
##   <chr>          <chr>        <int> <dbl> <dbl>   <int>   <dbl> <dbl> <chr>  
## 1 houston, texas waco, texas 298227  298. 185.    10270   171.   2.85 driving
## 2 dallas         waco, texas 152480  152.  94.8    5375    89.6  1.49 driving