Chapter 5: Graphs and Maps

“The ideal situation occurs when the things that we regard as beautiful are also regarded by other people as useful.”—Donald Knuth

require(lubridate)
require(maps)
require(classInt)
require(maptools)
require(rgdal)
require(mapdata)
require(ggplot2)

Graphs and maps help you reason about your data. They also help you communicate your results. A good graph gives you the most information in the shortest time, with the least ink in the smallest space (Tufte 1997).

con = "http://hurricaneclimatology.squarespace.com/storage/chapter-5/SOI.txt"
SOI = read.table(con, header = TRUE)
con = "http://hurricaneclimatology.squarespace.com/storage/chapter-5/NAO.txt"
NAO = read.table(con, header = TRUE)
con = "http://hurricaneclimatology.squarespace.com/storage/chapter-5/SST.txt"
SST = read.table(con, header = TRUE)
con = "http://hurricaneclimatology.squarespace.com/storage/chapter-5/ATL.txt"
A = read.table(con, header = TRUE)
con = "http://hurricaneclimatology.squarespace.com/storage/chapter-5/H.txt"
US = read.table(con, header = TRUE)

Graphs

With the standard (base) graphics environment you can produce a variety of plots with fine details. Most of the figures in my book are created using the standard graphics environment. The grid graphics environment is more flexible. It allows you to design complex layouts with nested graphs where scaling is maintained when you resize the figure.

The lattice and ggplot2 packages use grid graphics to create specialized graphing functions and methods. The spplot() function is an example of a plot method built with grid graphics that we use to create maps of our spatial data. The ggplot2 package is an implementation of the grammar of graphics combining advantages from the standard and lattice graphic environments.

Box plot

A box plot is a graph of the five-number data summary: the minimum, first quartile, median, third quartile, and maximum. The summary() method applied to a data vector computes these five statistics along with the mean. You create a box plot with the boxplot() function. For example, to create a box plot of your October SOI data, type

boxplot(SOI$Oct, ylab = "October SOI (s.d.)")
ggplot(SOI, aes(x = factor(0), y = Oct)) + geom_boxplot() + ylab("October SOI (s.d.)") + 
    xlab("")

To annotate the plot with the five summary values, type

boxplot(SOI$Aug, ylab = "August SOI [s.d.]")
f = fivenum(SOI$Aug)
text(rep(1.25, 5), f, labels = c("minimum", "lower", "median", "upper", "maximum"), 
    adj = c(0, 0), cex = 0.75)

plot of chunk boxPlot2

The line inside the box is the median value. The bottom of the box (lower hinge) is the first quartile value and the top of the box (upper hinge) is the third quartile. The whiskers extend from the box toward the minimum and maximum values; exactly how far they extend is described below.

Hinge values equal the quartiles exactly when there is an odd number of observations. Otherwise a hinge is the middle value of the lower (or upper) half of the observations when that half contains an odd number of values, and the average of the two middle values when it contains an even number.

The fivenum() function gives the five numbers used by boxplot(). The height of the box is essentially the interquartile range (IQR()) and the range is the distance from the minimum to the maximum.

By default the whiskers are drawn as dashed lines extending from the box toward the minimum and maximum data values, but no farther than 1.5 times the height of the box (the interquartile range) beyond the hinges. Outliers, data values beyond this range, are marked separately with points, and the text labels identify the values. Here with the default options we see one data value more than 1.5 times the interquartile range above the upper hinge, so the upper whisker extends only to the largest data value inside this range.

For example, if you type

Q1 = fivenum(SOI$Aug)[2]
Q3 = fivenum(SOI$Aug)[4]
Q3 + (Q3 - Q1) * 1.5
## [1] 3.105

One observation is larger than 3.105. In this case, the upper whisker ends at the highest observation value less than 3.105. Observations above and below the whiskers are considered outliers. You can find the value of the single outlier of the August SOI by typing

sort(SOI$Aug)
##   [1] -2.740 -2.660 -2.550 -2.440 -2.310 -2.200 -2.140 -2.110 -2.030 -1.890
##  [11] -1.840 -1.750 -1.730 -1.670 -1.620 -1.460 -1.420 -1.410 -1.410 -1.400
##  [21] -1.390 -1.370 -1.240 -1.200 -1.180 -1.130 -1.120 -1.090 -1.090 -1.080
##  [31] -1.070 -1.060 -1.050 -1.020 -1.010 -1.000 -0.970 -0.960 -0.940 -0.920
##  [41] -0.920 -0.900 -0.900 -0.900 -0.880 -0.850 -0.820 -0.750 -0.750 -0.710
##  [51] -0.710 -0.690 -0.680 -0.640 -0.640 -0.640 -0.630 -0.590 -0.590 -0.550
##  [61] -0.500 -0.430 -0.420 -0.390 -0.380 -0.370 -0.350 -0.280 -0.230 -0.210
##  [71] -0.200 -0.170 -0.160 -0.150 -0.110 -0.110 -0.110 -0.110 -0.060 -0.030
##  [81] -0.010  0.000  0.010  0.050  0.060  0.100  0.120  0.130  0.190  0.200
##  [91]  0.200  0.260  0.300  0.310  0.330  0.330  0.340  0.360  0.380  0.400
## [101]  0.440  0.440  0.480  0.480  0.530  0.560  0.600  0.640  0.660  0.660
## [111]  0.690  0.720  0.720  0.750  0.750  0.800  0.800  0.830  0.880  0.890
## [121]  0.890  0.900  0.940  0.980  0.990  1.050  1.060  1.080  1.120  1.140
## [131]  1.180  1.240  1.260  1.300  1.320  1.330  1.360  1.380  1.420  1.430
## [141]  1.510  1.771  2.030  2.200  3.480

Note the value 2.2 is the largest observation in the data less than 3.105, so the upper whisker ends there and the value 3.48 is plotted as an outlier.

Your observations are said to be symmetric if the median is near the middle of the box and the length of the two whiskers are about equal. A symmetric set of observations will also have about the same number of high and low outliers.

To summarize, 25% of all your observations are below the lower quartile (below the box), 50% are below (and above) the median, and 25% are above the upper quartile. The box contains 50% of all your data. The upper whisker extends from the upper quartile toward the maximum and the lower whisker extends from the lower quartile toward the minimum, but neither extends more than 1.5 times the interquartile range beyond the box; values outside this range (outliers) are plotted as points. This outlier option is turned off by setting the range argument to zero.

The box plot is an efficient graphical summary of your data, but it can be designed better. For example, the box sides provide redundant information. By removing the box lines altogether, the same information is available with less ink.

The figure below is a series of box plots representing the SOI for each month. The dot represents the median; the ends of the lines toward the dot are the lower and upper quartiles, respectively; the ends of the lines toward the bottom and top of the graph are the minimum and maximum values, respectively.

plot(c(1, 12), c(-5, 5), type = "n", xaxt = "n", bty = "n", xlab = "", ylab = "SOI [s.d.]")
axis(1, at = seq(1, 12, 1), labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", 
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), cex = 0.5)
for (i in 1:12) {
    points(i, fivenum(SOI[, i + 1])[3], pch = 19)
    lines(c(i, i), c(fivenum(SOI[, i + 1])[1], fivenum(SOI[, i + 1])[2]))
    lines(c(i, i), c(fivenum(SOI[, i + 1])[4], fivenum(SOI[, i + 1])[5]))
}

plot of chunk TufteboxplotFig

Here the space between the whisker and the point is used to show information.

Histogram

A histogram is a graph of the distribution of your observations. It shows where the values tend to cluster and where they tend to be sparse. The histogram is similar to a bar plot. The histogram uses bars to indicate frequency (or proportion) in data intervals, whereas a bar plot uses bars to indicate the frequency of data by categories. The hist() function creates a histogram.

As an example, consider NOAA's annual values of accumulated cyclone energy (ACE) for the North Atlantic and July SOI values. Annual ACE is calculated by squaring the maximum wind speed for each six-hour tropical cyclone observation and summing over all cyclones in the season.

The values obtained from NOAA (http://www.aoml.noaa.gov/hrd/tcfaq/E11.html) are expressed in units of knots squared \( \times 10^4 \). You create the two histograms and plot them side-by-side. Details on plotting options are given in Murrell (2005). After your histogram is plotted the function rug() adds tick marks along the horizontal axis at the location of each observation.

hist(A$ACE)
rug(A$ACE)

plot of chunk histogramACE

hist(SOI$Jul)
rug(SOI$Jul)

plot of chunk histogramACE

Plot titles are useful for presentations, but should not be used for publication. The default horizontal axis label is the name of the data vector. The default vertical axis is frequency and is labeled accordingly.

To create a histogram using the ggplot2 package, type

ggplot(SOI, aes(Jul)) + geom_histogram()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.

plot of chunk ggplotHist

The hist() function has lots of options. Default values for these options provide a good starting point, but you might want to make adjustments. Thus it is good to know how the histogram is assembled. First a contiguous collection of disjoint intervals, called bins (or classes), is chosen that covers the range of data values. The default for the number of bins is \( \lceil \log_2(n)+1 \rceil \), where \( n \) is the sample size and \( \lceil \, \rceil \) indicates the ceiling value (next largest integer). If you type

n = length(SOI$Jul)
ceiling(log(n, base = 2) + 1)
## [1] 9

you can see that adjustments are made to this number so that the cut points correspond to whole number data values. In the case of ACE the adjustment results in 7 bins and in the case of the SOI it results in 11 bins. Thus the computed number of bins is a suggestion that gets modified to make for nice breaks.

The bins are contiguous and disjoint so the intervals look like \( (a, b] \) or \( [a, b) \) where the interval \( (a, b] \) means from \( a \) to \( b \) including \( b \) but not \( a \). Next, the number of data values in each of the intervals is counted. Finally, a bar is drawn above each interval so that the bar height is the number of data values (frequency). A useful argument to make your histogram more visually understandable is prob=TRUE, which sets the bar height to the density, where the sum of the densities times the bar interval widths equals one.

You conclude that ACE is positively skewed with a relatively few years having very large values. Clearly ACE does not follow a normal distribution. In contrast, the SOI appears quite symmetric with rather short tails as you would expect from a normal distribution.

Density plot

A histogram outlines the general shape of your data. Usually that is sufficient. You can adjust the number of bins (or bin width) to get more or less detail on the shape. An alternative is a density plot. A density plot captures the distribution shape by essentially smoothing the histogram. Instead of specifying the bin width you specify the amount (and type) of smoothing.

There are two steps to produce a density plot. First you need to use the density() function to obtain a set of kernel density estimates from your observations. Second you need to plot these estimates, typically by using the plot() method.

A kernel density is a function that provides an estimate of the average number of values at any location in the space defined by your data. This is illustrated below, where the October SOI values in the period 2005–2010 are indicated as a rug and a kernel density function is shown as the black curve. The height of the function, representing the local density, is a sum of the heights of the individual kernels shown in red.

x = tail(SOI$Oct)
xlims = c(-3, 3)
bandw = 0.5
plot(density(x, bw = bandw), xlim = xlims, type = "n", main = "", xlab = "October SOI [s.d.]")
rug(x, lwd = 3)
for (i in seq_along(x)) {
    d = density(x[i], bw = bandw)
    lines(d$x, d$y/length(x), col = "red")
}
lines(density(x, bw = bandw), lwd = 2)

plot of chunk kernelDensity

The kernel is a Gaussian (normal) distribution centered at each data value. The width of the kernel, called the bandwidth, controls the amount of smoothing. The bandwidth is the standard deviation of the kernel in the density() function. This means the inflection points on the kernel are separated by one bandwidth in units of the data values. Here with the SOI in units of standard deviation the bandwidth equals .5 s.d.

A larger bandwidth produces a smoother density plot for a fixed number of observations because the kernels have greater overlap. The next figure shows the density plot of June NAO values from the period 1851–2010 using bandwidths of .1, .2, .5, and 1.

par(mfrow = c(2, 2), mar = c(5, 5, 4, 2) + 0.1, mex = 0.7, las = 1, mgp = c(3, 
    0.4, 0), tcl = -0.3)
x = NAO$Jun
xlims = range(x)
bws = c(0.1, 0.2, 0.5, 1)
for (i in seq_along(bws)) {
    plot(density(x, bw = bws[i]), xlim = xlims, main = "", xlab = "June NAO [s.d.]", 
        lwd = 2)
    rug(x)
    mtext(letters[i], side = 3, line = 1, adj = 0, cex = 1.1)
}

plot of chunk kerneldensityplot

par(mfrow = c(1, 1))

The smallest bandwidth produces a density plot with spikes as it captures the fine-scale variability in the distribution of values. As the bandwidth increases the spikes disappear and the density gets smoother. The largest bandwidth produces a smooth symmetric density centered on the value of zero.

To create a density plot for the NAO values with a histogram overlay, type

par(las = 1, mgp = c(2.3, 0.4, 0), tcl = -0.3)
d = density(NAO$Jun, bw = 0.5)
plot(d, main = "", xlab = "June NAO [s.d.]", lwd = 2, col = "red")
hist(NAO$Jun, prob = TRUE, add = TRUE)
rug(NAO$Jun)

plot of chunk histogramDensity

The density() function takes your vector of data values as input and allows you to specify a bandwidth using the bw argument. Here you are using the vector of June NAO values and a bandwidth of .5 s.d. The bandwidth units are the same as the units of your data, here s.d. for the NAO. The output is saved as a density object, here called d. The object is then plotted using the plot() method. You turn off the default plot title with main = "" and you specify a label for the values to be plotted below the horizontal axis. You specify the line width as 2 and the line color as red.

You then add the histogram over the density line using the hist() function. You use prob=TRUE to make the bar height proportional to the density. The argument add=TRUE is needed so that the histogram plots on the same device.

A reason for plotting the histogram or density is to see whether your data have a normal distribution. The Q-Q plot provides another way to make this assessment.

Q-Q plot

A Q-Q plot graphically compares distributions. It does this by plotting quantile (Q) values of one distribution against the corresponding quantile (Q) values of the other distribution. In the case of assessing whether or not your data are normally distributed, the sample quantiles are plotted on the vertical axis and quantiles from a standard normal distribution are plotted along the horizontal axis. In this case it is called a Q-Q normal plot.

That is, the \( k \)th smallest observation is plotted against the expected value of the \( k \)th smallest random value from a \( N(0, 1) \) sample of size \( n \). The pattern of points in the plot is then used to compare your data against a normal distribution. If your data are normally distributed then the points should align along a straight line (the \( y=x \) line when your data are in standard units).

This is done using the qqnorm() function. To make side-by-side Q-Q normal plots for the ACE values and the July SOI values you type

c = 0.5144^2
par(mfrow = c(1, 2), pty = "s", las = 1, mgp = c(2, 0.4, 0), tcl = -0.3)
qqnorm(A$ACE * c, main = "", ylab = expression(paste("Sample quantiles (ACE [x", 
    10^4, m^2, s^-2, "])")), xlab = "Theoretical quantiles")
qqline(A$ACE * c, col = "red")
mtext("a", side = 3, line = 1, adj = 0, cex = 1.1)
qqnorm(SOI$Jul, main = "", ylab = "Sample quantiles (SOI [s.d.])", xlab = "Theoretical quantiles")
qqline(SOI$Jul, col = "red")
mtext("b", side = 3, line = 1, adj = 0, cex = 1.1)

plot of chunk qqnormplot

par(mfrow = c(1, 1))

The quantiles are non-decreasing. The reference line through the first and third quartiles is added to the plot using the qqline() function. Additionally you adjust the vertical axis label and turn off the default title.

The plots show that July SOI values appear to have a normal distribution while the seasonal ACE does not. For observations that have a positive skew, like the ACE, the pattern of points on a Q-Q normal plot is concave upward. For observations that have a negative skew the pattern of points is concave downward. For values that have a symmetric distribution but with fatter tails than the normal (e.g., the \( t \)-distribution), the pattern of points resembles an inverse sine function.

The Q-Q normal plot is useful in checking the residuals from a regression model. The assumption is that the residuals are independent and identically distributed from a normal distribution centered on zero. Previously you created a multiple linear regression model for August SST using March SST and year as the explanatory variables. To examine the assumption of normally distributed residuals with a Q-Q normal plot type

model = lm(Aug ~ Year + Mar, data = SST)
qqnorm(model$residuals)
qqline(model$residuals, col = "red")

plot of chunk qqnormResiduals

The points align along the reference line indicating that the residuals have a normal distribution.

Scatter plot

The plot() method creates a scatter plot. The values of one variable are plotted against the values of the other variable as points in a Cartesian plane. The values named in the first argument are plotted along the horizontal axis.

This pairing of two variables is useful in generating and testing hypotheses about a possible relationship. In the context of correlation which variable gets plotted on which axis is irrelevant. Either way, the scatter of points illustrates the amount of correlation. However, in the context of a statistical model, by convention the dependent variable (the variable you are interested in explaining) is plotted on the vertical axis and the explanatory variable is plotted on the horizontal axis.

For example, if your interest is whether pre-hurricane season ocean warmth (e.g., June SST) is related to ACE, your model is

ace = A$ACE * 0.5144^2
sst = SST$Jun
model = lm(ace ~ sst)

and you plot ACE on the vertical axis. Since your slope and intercept coefficients are saved as part of the object model, you first create a scatter plot then use the abline() function to add the linear regression line. Here the function extracts the intercept and slope coefficient values from the model object and draws the straight line using the point-intercept formula.

Note the model formula syntax (ace ~ sst) as the first argument in the plot() method.

ci = predict(model, data.frame(sst = sort(sst)), level = 0.95, interval = "confidence")
ci.lwr = ci[, 2]
ci.upr = ci[, 3]
par(mfrow = c(1, 1), pty = "s", las = 1, mgp = c(2, 0.4, 0), tcl = -0.3)
plot(ace ~ sst, type = "n", ylab = "ACE [times 10^4 m^2 s^-2]", xlab = "SST [deg C]")
grid()
xx = c(sort(sst), rev(sort(sst)))
yy = c(ci.upr, rev(ci.lwr))
polygon(xx, yy, col = "gray", border = "gray")
points(sst, ace, pch = 19, cex = 0.8)
abline(model, col = "red", lwd = 2)

plot of chunk acesstscatterplot

The relationship between ACE and SST is summarized by the linear regression model shown by the straight line. The slope of the line indicates that for every 1\( ^\circ \)C increase in SST the average value of ACE increases by 27 \( \times 10^4 \) m\( ^2 \)/s\( ^2 \) (type coef(model)[2]).

Since the regression line is based on a sample of data you should display it inside a band of uncertainty. Recall there are two types of uncertainty bands: a confidence band (narrow) and a prediction band (wide). The confidence band reflects the uncertainty about the line itself, which, like the standard error of the mean, indicates the precision by which you know the mean. In regression, the mean is not constant but rather a function of the explanatory variable.

The width of the band is inversely related to the sample size. In a large sample of data, the confidence band will be narrow, reflecting a well-determined line. Note that it's impossible to draw a horizontal line that fits completely within the band. This indicates that there is a significant relationship between ACE and SST.

The band is narrowest in the middle which is understood by the fact that the predicted value at the mean SST will be the mean of ACE, whatever the slope, and thus the standard error of the predicted value at this point is the standard error of the mean of ACE. At other values of SST we need to add the variability associated with the estimated slope. This variability is larger for values of SST farther from the mean, which is why the band looks like a bow tie.

The prediction band adds another layer of uncertainty; the uncertainty about future values of ACE. The prediction band captures the majority of the observed points in the scatter plot. Unlike the confidence band, the width of the prediction band depends strongly on the assumption of normally distributed errors with a constant variance across the values of the explanatory variable.

Things are simpler with ggplot().

ace.df = data.frame(ace = ace, sst = sst)
ggplot(ace.df, aes(x = sst, y = ace)) + geom_point() + geom_smooth(method = "lm")
## Warning: Removed 5 rows containing missing values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).

plot of chunk ACEvsSSTggplot

Conditional scatter plot

Scatter plots conditional on the values of a third variable can be informative. This is done with the coplot() function. The syntax is the same as above except you add the name of the conditioning variable after a vertical bar.

For example, as you saw above, there is a positive relationship between ACE and SST. The conditioning plot answers the question: does the relationship change depending on the values of a third variable? Here you use August SOI values as the conditioning variable and type

soi = SOI$Aug
coplot(ace ~ sst | soi, panel = panel.smooth)

plot of chunk coplot

## 
##  Missing rows :  1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

The syntax is read 'conditioning plot of ACE versus SST given values of SOI.' The function divides the range of the conditioning variable (SOI) into six intervals with each interval having approximately the same number of years. The range of SOI values in each interval overlaps by 50%. The conditioning intervals are plotted in the top panel as horizontal bars (shingles).

The scatter plots of ACE and SST are arranged in a matrix of panels below the shingles. The panels are arranged from lower left to upper right. The lower left panel corresponds to the lowest range of SOI values (less than about -1 s.d.) and the upper right panel corresponds to the highest range of SOI values (greater than about +.5 s.d.). Half of the data points in a panel are shared with the panel to the left and half of the data points are shared with the panel to the right. This is indicated by the amount of shingle overlap.

Results show a positive, nearly linear, relationship between ACE and SST for all ranges of SOI values. However over the SOI range between -1.5 and 0 the relationship is nonlinear. ACE is least sensitive to SST when SOI is the most negative (El Niño years) as indicated by the nearly flat line in the lower-left panel. The panel argument adds a local smoother (red line) through the set of points in each plot.

Time series

Hurricane data often take the form of a time series. A time series is a sequence of data values measured at successive times and spaced at uniform intervals. You can treat a time series as a vector and use structured data functions to generate time series.

However, additional functions are available for data that are converted to time-series objects. Time-series objects are created using the ts() function. You do this with the monthly NAO data frame as follows. First create a matrix of the monthly values, skipping the year column in the data frame. Second take the transpose of this matrix (switch the rows with the columns) using the t() function and save the result as a vector. Finally, create a time-series object, specifying the frequency of values and the start month. Here the first value is from January 1851.

nao.m = as.matrix(NAO[, 2:13])
nao.v = as.vector(t(nao.m))
nao.ts = ts(nao.v, frequency = 12, start = c(1851, 1))

Also create a time-series object for the cumulative sum of the monthly NAO values. This is done with the cumsum() function applied to your data vector.

nao.cts = ts(cumsum(nao.v), frequency = 12, start = c(1851, 1))

This results in objects of class ts, which is used for time series having numeric time information. Additional classes for working with time series data that can handle dates and other types of time information are available in other packages. For example, the fts package implements regular and irregular time series based on POSIXct time stamps and the zoo package provides functions for most time series classes.

Time-series graph

The objects of class ts are easy to plot as a time series. For instance, you plot the cumulative sum of the NAO values using the plot() method. The method recognizes the object as a time series and plots it accordingly eliminating the need to specify a separate time variable.

par(las = 1, mgp = c(2, 0.4, 0), tcl = -0.3)
plot(nao.cts, xlab = "Year", ylab = "Cumulative NAO [s.d.]")
abline(h = 0, col = "red")
grid()
lines(nao.cts, lwd = 2)

plot of chunk plotTS

The cumulative sum indicates a pattern typical of a random walk. That is, over the long term there is a tendency for more positive-value months leading to a 'wandering' of the cumulative sum away from the zero line. This tendency begins to reverse in the late 20th century.

Autocorrelation

Autocorrelation is correlation between values from a single variable. For time data it refers to a single series correlated with itself as a function of temporal lag. For spatial data it refers to a single variable correlated with itself as a function of spatial lag, which can be a vector of distance and orientation. In both cases the term 'autocorrelation function' is used, but with spatial data the term is often qualified with 'spatial.'

As an example, save 30 random values from a standard normal distribution in a vector where the elements are considered ordered in time. First create a time-series object. Then use the lag.plot() function to create a scatter plot of the time series against a lagged copy where the lagged copy starts one time interval earlier.

t0 = ts(rnorm(30))
lag.plot(t0, lags = 3)

plot of chunk lagPlot

With N values, the plot for lag one contains N-1 points. The points are plotted using a text number indicating the temporal order, so the first point, labeled '1', is given by the coordinates (t0[1], t0[2]). The correlation at lag one can be inferred from the scatter of points. The plot can be repeated for any number of lags, but with higher lags the number of points decreases, and a separate plot is drawn for each lag.

You use the autocorrelation function acf() to quantify the correlation at various temporal lags. The function accepts univariate and multivariate numeric time-series objects and produces a plot of the autocorrelation values as a function of lag.

To create a plot of the autocorrelation and partial autocorrelation functions for the NAO time series object created in the previous section, type

par(mfrow = c(1, 2), pty = "s", las = 1)
acf(nao.ts, ci.col = "red", xlab = "Lag [years]", main = "", ylab = "Autocorrelation")
mtext("a", side = 3, line = 1, adj = 0, cex = 1.1)
pacf(nao.ts, ci.col = "red", xlab = "Lag [years]", main = "", ylab = "Partial autocorrelation")
mtext("b", side = 3, line = 1, adj = 0, cex = 1.1)

plot of chunk NAOacfpcf

The lag values plotted on the horizontal axis are in units of time rather than numbers of observations. Dashed lines are the 95% confidence limits. Here the time-series object is created using monthly frequency, so the lags are given in fractions of 12 with 1.0 corresponding to a year. The maximum lag is calculated as 10 * log10(N), where N is the number of observations. This is changed with the argument lag.max.

The lag-zero autocorrelation is fixed at 1 by convention. The non-zero lag autocorrelations are all less than .1 in absolute value, indicative of an uncorrelated process. By default the plot includes 95% confidence limits (ci) computed as +/-1.96/sqrt(N).

The partial autocorrelation function pacf() computes the autocorrelation at lag k after the linear dependencies between lags 1 to k-1 are removed. The partial autocorrelation determines the extent of the lag in an autoregressive model. Here the partial autocorrelation vacillates between positive and negative values indicative of a moving-average process.

A moving-average process is one in which the expectation of the current value of the series is linearly related to previous white-noise errors.

If your regression model uses time-series data it is important to examine the autocorrelation in the model residuals. If residuals from your regression model have significant autocorrelation then the assumption of independence is violated. This violation does not bias the coefficient estimates, but with positive autocorrelation the standard errors on the coefficients tend to be too small giving you unwarranted confidence in your inferences.

Dates and times

You have various options for working with date and time data in R. The as.Date() function gives you flexibility in handling dates through the format argument. The default format is a four-digit year, a month, then a day, separated by dashes or slashes. For example, the character string "1992-8-24" will be accepted as a date by typing

Andrew = as.Date("1992-8-24")

Although the print method displays it as a character string, the object is a Date class stored as the number of days since January 1, 1970, with negative numbers for earlier dates.

If your input dates are not in the standard year, month, day order, a format string can be composed using the elements shown in the table below. For instance, if your date is specified as August 29, 2005 then you type

Katrina = as.Date("August 29, 2005", format = "%B %d, %Y")

Format codes for dates.

Code Value
%d Day of the month (decimal number)
%m Month (decimal number)
%b Month (abbreviated, e.g., Jan)
%B Month (full name)
%y Year (2 digit)
%Y Year (4 digit)

Without needing to count the leap years between Hurricanes Andrew and Katrina, you can find the number of days between them by typing

difftime(Katrina, Andrew, units = "days")
## Time difference of 4753 days

Or you can obtain the number of days from today since Andrew by typing

difftime(Sys.Date(), Andrew, units = "days")
## Time difference of 7493 days

The function Sys.Date() gives the current day in year-month-day format as a Date object.

The portable operating system interface (POSIX) has formats for dates and times, with functionality for converting between time zones (Spector 2008). The POSIX date/time classes store times to the nearest second. There are two such classes differing only in the way the values are stored internally. The POSIXct class stores date/time values as the number of seconds since January 1, 1970, while the POSIXlt class stores them as a list. The list contains elements for second, minute, hour, day, month, and year among others.

The default input format for POSIX dates consists of the year, month, and day, separated by slashes or dashes, with time information following after white space.

The time information is in the format hour:minutes:seconds or simply hour:minutes. For example, according to the U.S. National Hurricane Center, Hurricane Andrew hit Homestead Air Force Base at 9:05 UTC on August 24, 1992. You add time information to your Andrew date object and convert it to a POSIXct object.

Andrew = as.POSIXct(paste(Andrew, "09:05"), tz = "GMT")

You then retrieve your local time from your operating system as a character string and use the date-time conversion strptime() function to convert the string to a POSIXlt class.

mytime = strptime(Sys.time(), format = "%Y-%m-%d %H:%M:%S", tz = "EST5EDT")

Our time zone is U.S. Eastern time, so we use tz = "EST5EDT". You then find the number of hours since Andrew's landfall by typing,

difftime(mytime, Andrew, units = "hours")
## Time difference of 179841 hours

Additional functionality for working with times is available in the chron and lubridate packages. In particular, lubridate (great package name) makes it easy to work with dates and times by providing functions to identify and parse date-time data, extract and modify components (years, months, days, hours, minutes, and seconds), perform accurate math on date-times, handle time zones and Daylight Savings Time (Grolemund and Wickham 2011).

For example, to return the day of the week from your object Andrew you use the wday() function in the package by typing,

require(lubridate)
wday(Andrew, label = TRUE, abbr = FALSE)
## [1] Monday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

Other useful functions in the package applied to the Andrew time object include the year, whether it was a leap year, the week of the year, and the local time. Finally, what is the current time in Chicago?

year(Andrew)
## [1] 1992
leap_year(Andrew)
## [1] TRUE
week(Andrew)
## [1] 34
with_tz(Andrew, tz = "America/New_York")
## [1] "1992-08-24 05:05:00 EDT"
now(tz = "America/Chicago")
## [1] "2013-02-28 12:02:42 CST"

Maps

Maps are among the most compelling graphics as the space they map is the space in which hurricanes occur. Various packages are available for creating maps.

Boundaries

Sometimes all that is needed is a reference map to show your study location. This can be created using state and country boundaries. For example, the maps package is used to draw country and state borders. To draw a map of the United States with state boundaries, type

require(maps)
map("state")

The function map() creates the country outline and adds the state boundaries. The package contains outlines for countries around the world (e.g., type map()). To draw only the national outline and then add the interior state boundaries in gray, type

map("state", interior = FALSE)
map("state", boundary = FALSE, col = "gray", add = TRUE)

plot of chunk mapexample

The coordinate system is latitude and longitude, so you can overlay other spatial data. As an example, first input the track of Hurricane Ivan (2004) as it approached the U.S. Gulf coast. Then list the first six rows of data.

con = "http://hurricaneclimatology.squarespace.com/storage/chapter-5/Ivan.txt"
Ivan = read.table(con, header = TRUE)
head(Ivan)
##   Year Mo Da Hr    Lon   Lat  Wind WindS Pmin Rmw   Hb Speed L Lhr
## 1 2004  9 15  8 -87.56 25.94 117.9 118.3  917  37 1.27 11.62 0 -24
## 2 2004  9 15  9 -87.65 26.12 117.7 117.3  917  37 1.27 12.05 0 -23
## 3 2004  9 15 10 -87.74 26.31 117.5 116.4  917  37 1.26 12.38 0 -22
## 4 2004  9 15 11 -87.82 26.50 117.2 115.5  918  37 1.26 12.60 0 -21
## 5 2004  9 15 12 -87.90 26.70 116.7 115.0  918  37 1.26 12.74 0 -20
## 6 2004  9 15 13 -87.97 26.90 116.2 114.8  919  37 1.26 12.78 0 -19

The data frame Ivan contains the latitude and longitude position of the hurricane every hour from 24 hours before landfall until 12 hours after landfall.

Here your geographic domain is the southeast, so first create a character vector of state names.

cs = c("texas", "louisiana", "mississippi", "alabama", "florida", "georgia", 
    "south carolina")

Next use map() to plot the state boundaries and fill the state polygons with a gray shade. Finally connect the hourly location points with the lines() function and add an arrow head.

map("state", region = cs, boundary = FALSE, col = "gray", fill = TRUE)
Lo = Ivan$Lon
La = Ivan$Lat
n = length(Lo)
lines(Lo, La, lwd = 2.5, col = "red")
arrows(Lo[n - 1], La[n - 1], Lo[n], La[n], lwd = 2.5, length = 0.1, col = "red")

plot of chunk IvanTrackMap

Hurricane Ivan moved northward from the central Gulf of Mexico and made landfall in the western panhandle region of Florida before moving into southeastern Alabama.

The boundary data in the maps package is adequate for small-scale maps, but the number of boundary points is insufficient for close-up (high-resolution) views. Higher-resolution boundary data are available in the mapdata package.

Spatial data types

Point data

Consider the set of events defined by the location at which a hurricane first reaches its lifetime maximum intensity. The data are available in the file LMI.txt and are input by typing

con = "http://hurricaneclimatology.squarespace.com/storage/chapter-5/LMI.txt"
LMI.df = read.table(con, header = TRUE)
LMI.df$WmaxS = LMI.df$WmaxS * 0.5144
head(LMI.df[, c(4:10, 11)])
##               name   Yr Mo Da hr    lon   lat   Wmax
## 30861.5 DENNIS     1981  8 20 23 -70.82 36.97  70.45
## 30891.4 EMILY      1981  9  6 10 -58.14 40.63  80.60
## 30930.2 FLOYD      1981  9  7  2 -69.06 26.76 100.44
## 30972.2 GERT       1981  9 11 14 -71.66 29.40  90.46
## 31003.5 HARVEY     1981  9 14 23 -62.63 28.26 115.11
## 31054.4 IRENE      1981  9 28 16 -56.41 27.93 105.49

The Wmax column is a spline interpolated maximum wind speed and WmaxS is first smoothed then spline interpolated to allow time derivatives to be computed.

The raw wind speed values are given in 5 kt increments. Although knots (kt) are the operational unit used for reporting tropical cyclone intensity to the public in the United States, here you use the SI units of m/s.

We use the term 'intensity' as shorthand for 'maximum wind speed,' where maximum wind speed refers to the estimated fastest wind velocity somewhere in the core of the hurricane. Lifetime maximum refers to the highest maximum wind speed during the life of the hurricane.

You draw a map of the event locations with the plot() method using the longitude coordinate as the x variable and the latitude coordinate as the y variable by typing

with(LMI.df, plot(lon, lat, pch = 19))
map("world", col = "gray", add = TRUE)
grid()

plot of chunk locationMap

Adding country borders and latitude/longitude grid lines [grid() function] enhances the geographic information. The argument pch specifies a point character using an integer code. Here 19 refers to a solid circle (type ?points for more information). The with() function allows you use the column names from the data frame in the plot() method.

Note the order of function calls. By plotting the events first, then adding the country borders, the borders are clipped to the plot window. The dimensions of the plot window default to slightly larger than the range of the longitude and latitude coordinates. The function chooses a reasonable number of axis ticks and places them at reasonable intervals along the range of coordinate values.

Since the events are marked by storm intensity it is informative to add this information to the map. Hurricane intensity, as indexed by an estimate of the wind speed maximum at 10 m height somewhere inside the hurricane, is a continuous variable. You can choose a set of discrete intensity intervals and group the events by these class intervals. For example, you might want to choose the Saffir/Simpson hurricane intensity scale.

To efficiently communicate differences in intensities with colors, you should limit the number of classes to six or fewer. The package classInt is a collection of functions for choosing class intervals. Here you require the package and create a vector of lifetime maxima. You then obtain class boundaries using the classIntervals() function.

Here the number of class intervals is set to five and the method of determining the interval breaks is based on Jenks optimization (style=“jenks”). Given the number of classes, the optimization minimizes the variance of the values within the intervals while maximizing the variance between the intervals.

require(classInt)
lmi = LMI.df$WmaxS
q5 = classIntervals(lmi, n = 5, style = "jenks", dataPrecision = 1)

The dataPrecision argument sets the number of digits to the right of the decimal place.

Next you choose a palette of colors. This is best left to someone with an understanding of hues and color separation schemes. The palettes described and printed in Brewer et al. (2003) for continuous, diverging, and categorical variables can be examined on maps at http://colorbrewer2.org/. Select the HEX radio button for a color palette of your choice and then copy and paste the hex code into a character vector preceded by the pound symbol.

For example, here you create a character vector (cls) of length 5 containing the hex codes from the color brewer website from a sequential color ramp ranging between yellow, orange, and red.

cls = c("#FFFFB2", "#FECC5C", "#FD8D3C", "#F03B20", "#BD0026")

To use your own set of colors simply modify this list. A character vector of color hex codes is generated automatically with functions in the colorRamps package.

The empirical cumulative distribution function of cyclone intensities with the corresponding class intervals and colors is then plotted by typing

plot(q5, pal = cls, main = "", xlab = expression(paste("Wind Speed [m ", s^-1, 
    "]")), ylab = "Cumulative Frequency")

plot of chunk plotCumFreq

The points (with horizontal dashes) are the lifetime maximum intensity wind speeds in rank order from lowest to highest.

Once you are satisfied with the class intervals and color palette, you can plot the events on a map. First you need to assign a color for each event depending on its wind speed value. This is done with the findColours() function as

q5c = findColours(q5, cls)

Now, instead of black dots with a color bar, each value is assigned a color corresponding to the class interval. For convenience you create the axis labels and save them as an expression object. You do this with the expression() and paste() functions to get the degree symbol.

xl = expression(paste("Longitude [", {}^o, "E]"))
yl = expression(paste("Latitude [", {}^o, "N]"))

Since the degree symbol is not attached to a character you use {} in front of the superscript symbol. You again use the plot() method on the location coordinates, but this time set the color argument to the corresponding vector of colors saved in q5c.

plot(LMI.df$lon, LMI.df$lat, xlab = xl, ylab = yl, col = q5c, pch = 19)
points(LMI.df$lon, LMI.df$lat)
map("world", add = TRUE)
axis(3)
axis(4)
grid()
legend("bottomright", bg = "white", fill = attr(q5c, "palette"), legend = names(attr(q5c, 
    "table")), title = expression(paste("Wind Speed [m ", s^-1, "]")))

plot of chunk LocationPlot

You add country boundaries, place axis labels in the top and right margins, and add a coordinate grid. To complete the map you add a legend.

Note that fill colors and names for the legend are obtained using the attr() function on the q5c object. The function retrieves the table attribute of the object.

The spatial distribution of lifetime maxima is fairly uniform over the ocean for locations west of 40W longitude. Fewer events are noted over the eastern Caribbean Sea and southwestern Gulf of Mexico. Events over the western Caribbean tend to have the highest intensities. Also, as you might expect, there is a tendency for a hurricane that reaches its lifetime maximum at lower latitudes to have a higher intensity.

Areal data

A shapefile stores geometry and attribute information for spatial data. The geometry for a feature is stored as a shape consisting of a set of vector coordinates. Shapefiles support point, line, and area data.

Area data are represented as closed loop polygons. Each attribute record has a one-to-one relationship with the associated shape record. For example, a shapefile might consist of the set of polygons for the counties in Florida and an attribute might be population. Associated with each county population record (attribute) is an associated shape record.

The shapefile is a set of several files in a single directory. The three individual files with extensions *.shp (file of geometries), *.shx (index file to the geometries), and *.dbf (file for storing attribute data) form the core of the directory. Note there is no standard for specifying missing attribute values. The *.prj file, if present, contains the coordinate reference system (CRS).

Information in a shapefile format makes it easy to map. As an example, consider the U.S. Census Bureau boundary file for the state of Florida http://www.census.gov/cgi-bin/geo/shapefiles/national-files.

Browse to Current State and Equivalent, Select State, then Florida. Download the zipped file. Unzip it to your R working directory folder. To make things a bit easier for typing, rename the directory and the shapefiles to FL.

The readShapeSpatial() function from the maptools package reads in the polygon shapefile consisting of the boundaries of the 67 Florida counties.

require(maptools)
tmp = download.file("http://hurricaneclimatology.squarespace.com/storage/chapter-5/FL.zip", 
    "FL.zip", mode = "wb")
unzip("FL.zip")
FLpoly = readShapeSpatial("FL/FL")
class(FLpoly)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"

Note the shapefiles are in directory FL with file names the same as the directory name. The object FLpoly is a SpatialPolygonsDataFrame. It extends the class data.frame by adding geographic information (see Bivand et al. 2008).

You can use the plot() method to produce a map of the polygon borders.

For instance, demographic data at the county level are important for emergency managers. First read in a table of the percentage change in population over the ten year period 2000 to 2010.

con = "http://hurricaneclimatology.squarespace.com/storage/chapter-5/FLPop.txt"
FLPop = read.table(con, header = TRUE)
names(FLPop)
## [1] "Order"   "County"  "Pop2010" "Pop2000" "Diff"    "Change"

Here the table rows are arranged in the order of the polygons. You assign the column Change to the data slot of the spatial data frame by typing

FLpoly$Change = FLPop$Change

Then use the function spplot() to create a choropleth map of the attribute Change.

spplot(FLpoly, "Change")

plot of chunk spplotFLpop

Or with a different color ramp and labels.

al = colorRampPalette(c("yellow", "green", "blue"), space = "Lab")
spplot(FLpoly, "Change", col.regions = al(6), at = seq(-20, 100, 20), colorkey = list(space = "bottom", 
    cex = 0.7, labels = paste(seq(-20, 100, 20))), sub = list("Population [% change]", 
    cex = 0.8))

plot of chunk spplotFLpop2

The spplot() method is an example of a lattice plot method (Sarkar 2008) for spatial data with attributes. The function returns an object of class trellis.

If spplot() does not automatically bring up a graphics device you need to wrap it in the print() method. Missing values are not allowed.

Field data

Climate data are often presented as a grid of values. For example, NOAA-CIRES 20th Century Reanalysis version 2 provides monthly sea-surface temperature values at latitude-longitude intersections.

A portion of these data are available in the file JulySST2005.txt. The data are the SST values on a 2 degrees latitude-longitude grid for the month of July 2005. The grid is bounded by -100 and 10E longitudes and the equator and 70N latitude.

First input the data and convert the column of SST values to a matrix using the matrix() function specifying the number of columns as the number of longitudes. The number of rows is inferred based on the length of the vector. By default the matrix is filled by columns.

Next create two structured vectors, one of the meridians and the other of the parallels, using the seq() function. Specify the geographic limits and an interval of 2 degrees in both directions.

con = "http://hurricaneclimatology.squarespace.com/storage/chapter-5/JulySST2005.txt"
sst.df = read.table(con, header = TRUE)
sst = matrix(sst.df$SST, ncol = 36)
lo = seq(-100, 10, 2)
la = seq(0, 70, 2)

To create a map of the SST field first choose a set of colors. Since the values represent temperature you want the colors to go from blue (cool) to red (warm).

R provides a number of preset color palettes including rainbow(), heat.colors(), cm.colors(), topo.colors(), grey.colors(), and terrain.colors(). The palettes are functions that generate a sequence of color codes interpolated between two or more colors. The cm.colors() is the default palette in spplot() and the colors diverge from white to cyan and magenta.

The RColorBrewer package has palettes for continuous, diverging, and categorical variables and for choices of print and screen projection. The sp package has the bpy.colors() function that produces a range of colors from blue to yellow that work for color and black-and-white print. You can create your own function using the colorRampPalette() function.

Here you save the function as bwr() and use a set of three colors. The number of colors is the argument.

bwr = colorRampPalette(c("blue", "white", "red"))
bwr(6)
## [1] "#0000FF" "#6666FF" "#CCCCFF" "#FFCBCB" "#FF6565" "#FF0000"

The image() function creates a grid of rectangles with colors corresponding to the values in the third argument as specified by the palette and the number of colors set here at 20. The first two arguments correspond to the two dimensional location of the rectangles. The x and y labels use the expression() and paste() functions to get the degree symbol.

To complete the map you add country boundaries and place axis labels in the top and right margins (margins 3 and 4).

image(lo, la, sst, col = bwr(20), xlab = xl, ylab = yl)
map("world", add = TRUE)
axis(3)
axis(4)
r = round(range(sst, na.rm = TRUE))
levs = seq(r[1], r[2], 2)
cl = paste(levs, "C")
contour(lo, la, sst, levels = levs, labels = cl, add = TRUE)

plot of chunk imageFunction

Note that image() interprets the matrix of SST values as a table with the x-axis corresponding to the row number and the y-axis to the column number, with column one at the bottom. This is a counter-clockwise rotation (90 degree) of the conventional matrix layout.

You overlay a contour plot of the SST data using the contour() function. First determine the range of the SST values and round to the nearest whole integer. Note there are missing values (over land) so you need to use the na.rm argument in the range() function.

Next create a sequence of temperature values at equal intervals over this range. Contours are drawn at these values. Then paste the character string 'C' onto the interval labels. The corresponding list is used as contour labels.

Ocean temperatures above about 28C are warm enough to support the development of hurricanes. This covers a large area from the west coast of Africa westward through the Caribbean and Gulf of Mexico and northward toward Bermuda.

Coordinate Reference Systems

For data covering a large geographic area you need a map with a projected coordinate reference system (CRS). A geographic CRS includes a model for the shape of the earth (oblate spheroid) plus latitudes and longitudes.

Longitudes and latitudes can be used to create a two-dimensional coordinate system for plotting hurricane data but this framework is for a sphere rather than a flat map.

A projected CRS is a two-dimensional approximation of the earth as a flat surface. It includes a model for the earth's shape plus a geometric model for projecting coordinates to the plane.

The PROJ.4 Cartographic Projections library uses a +tag=value representation of a CRS, with a tag and value pair within a single character string and the Geospatial Data Abstraction Library (GDAL) contains code for translating between different CRSs. Both the PROJ.4 and GDAL libraries are available in the rgdal package (Keitt et al. 2012).

Here you specify a geographic CRS and save it in a CRS object called ll_crs (lat-lon coordinate reference system).

require(rgdal)
require(mapdata)
ll_crs = CRS("+proj=longlat +ellps=WGS84")

The only default values for the tags in CRS objects are whether the string is a character NA (missing) for an unknown CRS, and whether it contains the string longlat, in which case the CRS is geographic coordinates (Bivand et al. 2008).

Tags begin with '+' and are separated from the value with an equal sign. A space separates the tags. Here you specify the earth's shape using the World Geodetic System (WGS) 1984, which is the reference coordinate system used by the Global Positioning System to reference the earth's center of mass.

As an example, you create a SpatialPoints object called LMI_ll by combining the matrix of event coordinates (location of lifetime maximum intensity) in native longitude and latitude degrees with the CRS object defined above.

LMI_mat = cbind(LMI.df$lon, LMI.df$lat)
LMI_ll = SpatialPoints(LMI_mat, proj4string = ll_crs)
summary(LMI_ll)
## Object of class SpatialPoints
## Coordinates:
##              min    max
## coords.x1 -97.13 -6.866
## coords.x2  11.89 48.053
## Is projected: FALSE 
## proj4string : [+proj=longlat +ellps=WGS84]
## Number of points: 173

Here you are interested in transforming the geographic CRS into a Lambert conformal conic (LCC) planar projection. The projection superimposes a cone over the sphere of the earth, with two reference parallels secant to the globe and intersecting it. The LCC projection is used for aeronautical charts. It is used for hurricane tracking by the U.S. National Hurricane Center.

Other projections, ellipsoids, and datums are available and a list of the various tag options can be generated by typing

projInfo(type = "proj")

Besides the projection tag (lcc) you need to specify the two secant parallels and a meridian. The NHC maps use the parallels 30 and 60N and a meridian of 60W. First save the CRS as a character string then use the spTransform() function to convert the latitude-longitude coordinates to coordinates of a LCC planar projection.

lcc_crs = CRS("+proj=lcc +lat_1=60 +lat_2=30 +lon_0=-60")
LMI_lcc = spTransform(LMI_ll, lcc_crs)

This transforms the set of longitude/latitude event coordinates to a set of projected event coordinates. You need to repeat this transformation for each of the map components.

To transform the country borders, first you save them from a call to the map() function. The function includes arguments to specify a longitude/latitude bounding box. Second, you convert the returned map object to a spatial lines object with the map2SpatialLines() function using a geographic CRS. Finally, you transform the coordinates of the spatial lines object to the LCC coordinates.

brd = map("world", xlim = c(-100, 0), ylim = c(5, 50), interior = FALSE, plot = FALSE)
brd_ll = map2SpatialLines(brd, proj4string = ll_crs)
brd_lcc = spTransform(brd_ll, lcc_crs)

To include longitude/latitude grid lines you need to use the gridlines() function on the longitude/latitude borders and then transform them to LCC coordinates. Similarly to include grid labels you need to convert the locations in longitude/latitude space to LCC space.

grd_ll = gridlines(brd_ll)
grd_lcc = spTransform(grd_ll, lcc_crs)
at_ll = gridat(brd_ll)
at_lcc = spTransform(at_ll, lcc_crs)

Finally plot the events on a projected map. First plot the grid then add the country borders and event locations. Use the text() function to add the grid labels and include a box around the plot.

labs = c("100W", "80W", "60W", "40W", "20W", "0", "10N", "20N", "30N", "40N", 
    "50N")
plot(grd_lcc, col = "grey60", lty = "dotted")
plot(brd_lcc, col = "grey60", add = TRUE)
plot(LMI_lcc, pch = 20, add = TRUE)
text(coordinates(at_lcc), pos = at_lcc$pos, labels = labs, cex = 0.6)
box()

plot of chunk plotMaps

Conformal maps preserve angles and the shapes of small figures, but not size. The size distortion is zero at the two reference latitudes, which makes the projection well suited to hurricane-tracking maps.

The spplot() method for points, lines, and polygons has some advantages over successive calls to plot(), but the syntax is somewhat different.
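
As a minimal sketch, and assuming the projected spatial points data frame LMI_sdf2 created in the Export section below, a single call draws the events colored by intensity.

spplot(LMI_sdf2, "WmaxS", col.regions = rev(heat.colors(20)))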

Export

The rgdal package contains drivers (software components loaded on demand) for reading and writing spatial vector data using OGR. Historically, OGR was an abbreviation for 'OpenGIS Simple Features Reference Implementation.' However, since OGR is not fully compliant with the OpenGIS Simple Features specification and is not approved as a reference implementation, the name was changed to 'OGR Simple Features Library.'

OGR is the prefix used throughout the library source for class names, file names, and so on. The library is modeled on the OpenGIS simple features data model supported by the Open Geospatial Consortium, Inc. If the data have a CRS, it is read (and written). The availability of OGR drivers depends on your computing platform; to get a list of the drivers available on your machine, type ogrDrivers().
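
The function returns a data frame that includes a logical write column, so (assuming a recent version of rgdal) you can list just the drivers capable of writing data.

subset(ogrDrivers(), write)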

Here you consider two examples. First you export the lifetime maximum intensity events as a Keyhole Markup Language (KML) file for overlay in Google Earth; then you export the events as an ESRI shapefile suitable for input to ArcMap, the main component of ESRI's Geographic Information System (GIS).

First create a spatial points data frame from the spatial points object. This is done with the SpatialPointsDataFrame() function. The first argument is the coordinates of the spatial points object. The underlying CRS for Google Earth is geographic with the WGS84 datum, so you use the LMI_ll object defined above and specify the proj4string argument as the CRS object ll_crs, also defined above.

LMI_sdf = SpatialPointsDataFrame(coordinates(LMI_ll), proj4string = ll_crs, 
    data = as(LMI.df, "data.frame")[c("WmaxS")])
class(LMI_sdf)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"

The resulting spatial points data frame (LMI_sdf) contains a data slot with a single variable WmaxS from the LMI.df non-spatial data frame, which was specified by the data argument.

To compactly display the structure of the object, type

str(LMI_sdf, max.level = 3)
## Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
##   ..@ data       :'data.frame':  173 obs. of  1 variable:
##   .. ..$ WmaxS: num [1:173] 34.3 40.8 51.2 46.2 58.2 ...
##   ..@ coords.nrs : num(0) 
##   ..@ coords     : num [1:173, 1:2] -70.8 -58.1 -69.1 -71.7 -62.6 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   ..@ bbox       : num [1:2, 1:2] -97.13 11.89 -6.87 48.05
##   .. ..- attr(*, "dimnames")=List of 2
##   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slots

The argument max.level specifies the level of nesting to display (e.g., lists containing sublists). By default all nesting levels are shown, which can produce too much output for spatial objects. Note there are five slots with names data, coords.nrs, coords, bbox, and proj4string. The data slot contains the single variable.
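
Individual slots are accessed with the @ operator or, where available, with accessor functions. For example, to list the slot names and peek at the data slot, type

slotNames(LMI_sdf)
head(LMI_sdf@data)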

The writeOGR() function takes as input the spatial data frame object and the name of the data layer, and it outputs a file in R's working directory with a name given by the dsn argument and in a format given by the driver argument.

writeOGR(LMI_sdf, layer = "WmaxS", dsn = "LMI.kml", driver = "KML", overwrite_layer = TRUE)

The resulting file can be viewed in Google Earth with pushpins marking the event locations. Selecting a pin reveals the layer value. Later we will see how to create an overlay image.
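
You can verify the export by reading the file back with readOGR(). This sketch assumes LMI.kml is in R's working directory; chk is an illustrative name.

chk = readOGR(dsn = "LMI.kml", layer = "WmaxS")
summary(chk)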

You can also export to a shapefile. Since shapefiles can have an arbitrary CRS, first transform your spatial data frame to the Lambert conformal conic projection used by the NHC.

LMI_sdf2 = spTransform(LMI_sdf, lcc_crs)
str(LMI_sdf2, max.level = 2)
## Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
##   ..@ data       :'data.frame':  173 obs. of  1 variable:
##   ..@ coords.nrs : num(0) 
##   ..@ coords     : num [1:173, 1:2] -938432 152192 -913050 -1130462 -260413 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   ..@ bbox       : num [1:2, 1:2] -3905663 1802776 3553780 6873067
##   .. ..- attr(*, "dimnames")=List of 2
##   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slots

The coordinate values are no longer longitude and latitude, and neither are the dimensions of the bounding box (bbox slot), but the slot names do not change.

You export using the driver ESRI Shapefile. The argument dsn is a folder name.

drv = "ESRI Shapefile"
writeOGR(LMI_sdf2, layer = "WmaxS", dsn = "WmaxS", driver = drv, overwrite_layer = TRUE)

The output is a set of four files in the WmaxS folder, including a .prj file with the fully specified coordinate reference system. The data can be imported as a layer into ArcMap.
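
The same round-trip check works for the shapefile. Here dsn is the folder and layer is the file name without the extension (chk2 is again an illustrative name).

chk2 = readOGR(dsn = "WmaxS", layer = "WmaxS")
proj4string(chk2)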

Other Graphic Packages

R's traditional (standard) graphics offer a nice set of tools for making statistical plots, including box plots, histograms, and scatter plots. These basic plot types can be produced with a single function call. Yet some plots require a lot of work, and even simple changes can be tedious.

This is particularly true when you want to make a series of related plots for different groups. Two alternatives are worth mentioning.

lattice

The lattice package (Sarkar 2008) contains functions for creating trellis graphs for a wide variety of plot types. A trellis graph displays a variable or the relationship between variables, conditioned on one or more other variables.

In simple usage, lattice functions work like traditional graphics functions. For example, to produce a density plot of the June NAO values, type

require(lattice)
densityplot(~Jun, data = NAO)

plot of chunk latticeExample

The syntax includes the name of the variable and the name of the data frame. The variable is preceded by the tilde symbol. By default the density plot includes the values as points jittered above the horizontal axis.

It is easy to create a series of plots with the same axes (a trellis), as you did with the coplot() function. For instance, in an exploratory analysis you might want to see whether the annual U.S. hurricane count is related to the NAO.

You first create a variable that splits the NAO into four groups.

steer = equal.count(NAO$Jun, number = 4, overlap = 0.1)

The grouping variable has class shingle, and the number of years in each group is the same. The overlap argument specifies the fraction of overlap between adjacent groups; use a negative fraction to leave gaps between them. To see the range of values for each group, type plot(steer).
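
You can also print the interval endpoints directly, since levels() applied to a shingle returns them.

levels(steer)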

Next use the histogram() function to plot the percentage of hurricanes by count conditional on your grouping variable.

histogram(~US$All | steer, breaks = seq(0, 8))

plot of chunk histogramFunction

The vertical line in the formula indicates that what follows is the conditioning variable. The breaks argument applies to the hurricane counts. The resulting four-panel graph is arranged from lower left to upper right in order of increasing values of the grouping variable. Each panel contains a histogram of U.S. hurricane counts drawn on an identical scale for the corresponding range of NAO values. The relative range is shown above each panel as a strip (shingle).

Lattice functions produce an object of class trellis that contains a description of the plot. The print method for objects of this class does the actual drawing of the plot. For example, the following code does the same as above.

dplot = densityplot(~Jun, data = NAO)
print(dplot)

plot of chunk latticeExample2

But now you can use the update() function to modify the plot design. For example, to add an axis label you type

update(dplot, xlab = "June NAO (s.d.)")

plot of chunk update

To save the modified plot for additional changes you will need to reassign it.
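
For example:

dplot = update(dplot, xlab = "June NAO (s.d.)")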

ggplot2

The ggplot2 package (Wickham 2009) contains plotting functions that are more flexible than the traditional R graphics. The gg stands for the 'Grammar of Graphics,' a theory of how to create a graphics system (Wilkinson 2005). The grammar specifies how a graphic maps data to attributes of geometric objects. The attributes are things like color, shape, and size and the geometric objects are things like points, lines, bars, and polygons.

The plot is drawn on a specific coordinate system (which can be geographic), and it may contain statistical manipulations of the data. Faceting can be used to replicate the plot for different subsets of your data.

The application of the grammar provides greater power to help you devise graphics specific to your needs, which can help you better understand your data. Here we give a few examples to help you get started.

Returning to your October SOI values, to create a histogram with a bin width of one standard deviation (units of SOI), type

require(ggplot2)
ggplot(SOI, aes(Oct)) + geom_histogram(binwidth = 1)

plot of chunk histogramqplot

The plots are created in layers. The data layer is specified first; then the geom_histogram() function (geom is short for geometric object) generates what you actually see on the plot.

Note the default grid lines and gray background. These can be changed with the theme_bw() function, a predefined theme.

ggplot(SOI, aes(Oct)) + geom_histogram(binwidth = 1) + theme_bw()

plot of chunk changeDefaultTheme

To create a scatter plot, type

p1 = ggplot(SOI, aes(x = Aug, y = Sep)) + geom_point()

You add a smoothed line (an example of a statistical manipulation of your data) to the plot using the geom_smooth() function.

p1 + geom_smooth() + xlab("August SOI [s.d.]") + ylab("September SOI [s.d.]")
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 15 rows containing missing values (stat_smooth).
## Warning: Removed 15 rows containing missing values (geom_point).

plot of chunk scatterplotqplotSmooth

A 95% confidence band about the smoothed line is also included.

Plots are built layer by layer. Layers are regular R objects and so can be stored as variables. This makes it easy for you to write clean code with a minimal amount of duplication. A set of plots can be enhanced by adding new data as a separate layer.

data = ggplot(SOI, aes(x = Aug, y = Sep))
pts = geom_point()
bestfit = geom_smooth(method = "lm", color = "red")
data + pts + bestfit
## Warning: Removed 15 rows containing missing values (stat_smooth).
## Warning: Removed 15 rows containing missing values (geom_point).

plot of chunk plotsAsLayers

The layers are defined separately then added and rendered to your graphics device in the last line.

The ggplot() function accepts only data frames. The philosophy is that your data are important, and it is better to be explicit about what is done with them before creating a graph. The functions in the plyr and reshape2 packages help you create data frames from other data objects (Teetor 2011).
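
For example, the melt() function from reshape2 reshapes the monthly SOI columns into the long format that faceting expects. This is a sketch that assumes your SOI data frame includes a Year column alongside the monthly columns.

require(reshape2)
SOI_long = melt(SOI, id.vars = "Year", variable.name = "Month", value.name = "soi")
ggplot(SOI_long, aes(soi)) + geom_histogram(binwidth = 1) + facet_wrap(~Month)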

ggmap

The ggmap package (Kahle and Wickham 2012) extends the grammar of graphics to maps. The get_map() function queries the Google Maps or OpenStreetMap server for a map at a specified location and zoom level. For example, to grab a map of Tallahassee, Florida, type

require(ggmap)
Tally = get_map(location = "Tallahassee FL", zoom = 13)
str(Tally)
##  chr [1:1280, 1:1280] "#C5C5C1" "#CED5C4" "#D8DECD" "#D9E0CE" ...
##  - attr(*, "class")= chr [1:2] "ggmap" "raster"
##  - attr(*, "bb")='data.frame':   1 obs. of  4 variables:
##   ..$ ll.lat: num 30.4
##   ..$ ll.lon: num -84.3
##   ..$ ur.lat: num 30.5
##   ..$ ur.lon: num -84.2

The result is an object of class ggmap with a matrix (1280 by 1280) of character strings specifying the fill color of each raster cell.

The zoom level ranges from 0 for the entire world to 19 for the finest detail; the default is 10. The default map type (maptype) is terrain, with options for roadmap, satellite, hybrid, and others.
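
For example, to grab a satellite view of the same area, type (Tally_sat is an illustrative name)

Tally_sat = get_map(location = "Tallahassee FL", zoom = 13, maptype = "satellite")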

To plot the map on your graphics device, type

ggmap(Tally)

plot of chunk plotMap

To determine a center for your map you can use the geocode() function to get a location from Google Maps. For example, to find the location of Florida State University, type

geocode("Florida State University")
##      lon   lat
## 1 -84.29 30.44
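
The returned coordinates can serve as the center of a new map. This sketch uses illustrative names (loc, FSU).

loc = geocode("Florida State University")
FSU = get_map(location = c(loc$lon, loc$lat), zoom = 15)
ggmap(FSU)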